Big Technology Podcast - Is ChatGPT A Step Toward Human-Level AI? — With Yann LeCun
Episode Date: January 25, 2023. Yann LeCun is the chief AI scientist at Meta, a professor of computer science at NYU, and a pioneer of deep learning. He joins Big Technology Podcast to put Generative AI in context, discussing whether ChatGPT and the like are a step toward human-level artificial intelligence, or something completely different. Join us for a fun, substantive discussion about this technology, the makeup of OpenAI, and where the field heads next. Stay tuned for the second half, where we discuss the ethics of using others' work to train AI models. Enjoying Big Technology Podcast? Please rate us five stars ⭐⭐⭐⭐⭐ in your podcast app of choice. For weekly updates on the show, sign up for the pod newsletter on LinkedIn: https://www.linkedin.com/newsletters/6901970121829801984/ Questions? Feedback? Write to: bigtechnologypodcast@gmail.com
Transcript
LinkedIn Presents
Welcome to Big Technology podcast,
a show for cool-headed, nuanced conversation of the tech world and beyond.
Yann LeCun is the chief AI scientist at Meta.
He's also a professor of computer science at NYU,
and he's a pioneer of deep learning, which is the AI methodology behind
many of the jaw-dropping advances you're seeing today.
And Yann is here today to add what I think is some needed context to the discussion around
generative AI.
Obviously, we're all seeing this giant technological leap firsthand because we can now literally
talk to the technology through advances like ChatGPT.
But I wonder, is this stuff getting us any closer to human-level intelligence?
And how smart is it, actually?
I think these are the key questions we need to ask about this new wave of generative AI and where it's heading, and we're going to ask those questions today.
Yann and I go way back, all the way back to 2016, which is about seven years ago, when I profiled him and his interest in artificial intelligence that can both think and perceive the world like humans in a story for BuzzFeed News.
It's always fun to speak with him, and I think you're going to love this episode.
All that and more is coming up right after this.
Yann, welcome to the show.
Awesome to be here.
Great to have you back.
This time we get to do it one-on-one, which is nice.
It was great having you and Daniel Kahneman on.
And you and I have been speaking for a long time since 2015, 2016,
about the possibilities for artificial intelligence.
And I think there was like a quick moment where everyone started talking about crypto,
but now I think we're focused on the right stuff again.
That was a distraction.
So let's talk a little bit about.
the wave of generative artificial intelligence that we're seeing and how that might relate to
general intelligence. So, look, I think when we were talking for the first time, we were talking
about how can we make a machine that looks like and that thinks like a human. And you had told me,
okay, there's going to be a time where they start to predict. And if they can predict, they can plan.
And that's how we're going to get close to artificial intelligence. And I said, okay, that sounds nice,
yeah, but that's never going to happen. At least that's what I was thinking.
now we're starting to chat with some of these advances like ChatGPT,
and I'm starting to think, oh, okay, maybe it wasn't that far off.
So where are we now in terms of the pursuit for artificial general intelligence?
And is this a big step forward towards that, or is it, again, maybe a distraction?
The short answer is it's not a particularly big step towards, you know, human-level intelligence.
I don't like the term AGI, artificial general intelligence,
because I think human intelligence is very specialized.
So if you want to designate the type of intelligence
that we observe in humans by general,
that's a complete misnomer.
So I know maybe that ship has sailed,
but I want to make that point that human intelligence
is actually very specialized.
So no, the short answer is,
So first of all, from the scientific point of view, ChatGPT is not a particularly interesting
scientific advance. Of course, we don't know all the details because OpenAI has not published any.
Which is interesting for a company called OpenAI.
Yeah, right. No, it's the least open that you can imagine.
But they started out as wanting to be open and then they realized they couldn't fund their
research unless they went slightly secretive.
So, by the way, can we just pause on that?
Because that's interesting.
Yeah.
What does that say about the AI research world, that if you want to start out as open, you can't, you have to go for-profit?
Now they're, like, capped-profit.
Is that, it's interesting that it's impossible to fund this type of research without that.
And does something structurally need to change because of that?
Well, there are a lot of non-profit AI research organizations, right, called universities.
But there are also other non-profits like, you know, the Allen Institute for AI, for example, is non-profit in Seattle.
OpenAI initially was non-profit and then switched to for-profit, and originally was publishing everything they were doing, and now they kind of basically keep everything secret.
They've become sort of a contract research house for Microsoft to some extent.
And it's because the funding model is not clear.
There's a reverse phenomenon that occurred at Google when I created FAIR nine years ago.
We had a big drum roll about the fact that we're going to be completely open, et cetera, and we still are.
We are holding that line.
And as a result, it caused Google Brain at the time to become much more open than they were, because that's what the scientists wanted.
You know, if you tell a scientist, you can come work for us, but you can't say a word about what you're doing,
you're killing their career. So, you know, you have to basically buy their life, right? And open
research is much more efficient. You just get more stuff out of it. You get things that are
more reliable, you know, and you attract better people. You have, you know, a better intellectual
impact, which means people have kind of more respect for you, want to work for you, and
things like that, right? So there's a lot of advantages to this. But there has to be an economic model,
and the economic model, the only one I know outside of universities and philanthropy,
is industry research lab inside of a large company that is profitable
and is sufficiently well-established in its market that it can think for the long term
and invest in fundamental research.
So that's the case for certain corners of Google.
That's the case for FAIR at Meta.
And not quite for DeepMind.
So DeepMind was an interesting thing because they started out as a startup.
And you absolutely cannot do research in a startup.
You just can't because you just don't have the funds, right?
Or the ability to wait long term.
That's right.
For a return.
So, you know, you can do it for two or three years, but then you basically have to focus
your entire attention to, you know, building products and getting revenue and getting
the company to survive.
So what allowed them to
do what they're doing is that they got bought by Google.
But then, still then, after that, their economic model was not obvious because they were
kind of sort of an ivory tower separated from Google, and to some extent they still are.
And so, you know, Google had the foresight to fund them regardless of whether they were producing
something that was useful.
But in the current context of, you know, more efficiency and money saving in the tech industry,
that model might have to change.
it's not clear, right?
So the economic return,
after nine or ten years, that Google has gotten
from DeepMind, it's not clear
it's worth their investment.
So they're banking on like
bigger, longer investment.
At FAIR, the business model is very clear.
Like, FAIR has had a huge impact
on the company, mostly indirectly
through other groups, right?
Because FAIR does not work on products,
but there's been, like, a huge amount of impact.
So, long answer, sorry. OpenAI could not keep doing what they were doing unless they were commercial, essentially, and it caused them also to make kind of wild promises.
And so what they have to do to be able to raise enough money from Microsoft and others is to make very flashy demos.
And they're very good at that.
So they make very flashy demos.
Their technology is not particularly innovative
from the scientific point of view.
It's very well engineered.
So they put together, you know, a large-scale, scaled-up system,
trained with very well-curated data.
I mean, they know what they're doing.
But in terms of advance, there's nothing.
Well, not much.
So let me take this from, and by the way,
thank you for that little diversion.
I think it's good we went into that discussion
of how this stuff gets funded and where it goes.
But going back to our discussion of the march towards
general intelligence, or, sorry, I'll use your term, human-level intelligence. You can use AGI, it's
okay, I'm not going to press the point. From a layperson's point of view, it does feel like, oh, okay, now I'm
talking to AI, now AI understands what I type and can actually draw it, now it can take my voice
and start talking on its own. So why isn't that a step towards this human-level intelligence?
Because the understanding that the current systems have of, you know, the underlying reality that language expresses is extremely shallow.
So those systems have only been trained with text, a huge amount of text.
So they can regurgitate texts that they've seen and, you know, interpolate for new situations, things like that.
They can, you know, even produce code and stuff like that.
But they do not understand, they have no knowledge of the underlying reality.
They have no, they've never had any contact with, you know, the physical world.
You know, if I take a piece of paper, let's say, looking for a piece of paper.
And I, you know, I hold it like this, right?
And I tell you, I'm going to lift my hand from one side.
You can exactly predict what's going to happen.
Right.
And for listeners, that paper is being held horizontally.
And so when you take your hand off, half of it drops.
And I'm sort of, you know, with my two hands.
And then one hand moves away.
So one part of the paper kind of droops because of gravity.
And you know exactly how it droops because, you know, you know the properties of papers and stuff like that.
Right.
So this type of knowledge that all of us have learned in the first few months of life, none of those systems have any of this.
Hold on.
But I could chat with ChatGPT and say, what happens if I'm holding a paper with two hands and I let go with one?
And it will tell you it will droop.
No.
I mean, it might.
It will answer that.
But you think it just won't understand it.
No, it might actually not tell you, because whether it tells you or not
depends on whether there were kind of similar
situations in the text that it's been trained on.
But it's not at all.
And I can come up with, you know, a huge stack of similar situations
that each one of them will not have been described in any text.
So then the question you want to ask is, you know,
how much of human knowledge is present and described in text?
And my answer to this is a tiny portion.
Like most of human knowledge is not actually language related.
It's completely non-linguistic.
I don't know, you do carpentry, right, build a piece of furniture.
Your ability to predict what the piece of furniture is going to look like as a consequence of how you build it is completely non-linguistic.
And everything that involves any kind of motor control planning, things like that, that there's basically no linguistic knowledge.
So now think about the entire collection of knowledge in every animal, which
is obviously non-linguistic because they don't have language,
or at least not human-type languages.
You might include a few species like dolphins and stuff, right?
Now, you know, dogs and cats know a lot of stuff
about how the world works, and all of that knowledge,
humans have it too, to some extent, not to the same degree,
nor in all domains, because we're all specialized,
but none of that knowledge is captured by any current AI system, essentially.
That's a lot.
Okay, let's build on that.
I'm just going to read the response.
I asked ChatGPT, if I'm holding a paper horizontally with two hands and let go with one hand, what will happen?
Now, I'm not going to say that you're wrong.
Obviously, you're right, but I'm just going to read it to you for the sake of discussion.
ChatGPT's response: if you are holding a paper horizontally
with two hands and let go with one hand, the paper will tilt or rotate in the direction of the
hand that is no longer holding it due to the unbalanced forces acting on the paper. If the paper was
initially still, it will also move in the direction of the hand that let go due to the force of gravity
acting on it. If the paper was moving in a certain direction before you let go, it will continue
in that direction, but may also be affected by air resistance and other external factors.
That's a pretty cool response. It sounds correct, and it's completely wrong.
It's exactly wrong. It's actually the exact opposite that's happening, right? It's not moving.
I mean, it's the part that you let go that droops, right? And this is saying the opposite.
Oh, that's true. Yeah, yeah. Yep.
So, right. So it sounds correct. It's grammatically correct. The general theme is correct because there probably were descriptions of similar situations that the system was trained on and it kind of memorized it.
And it tries to adapt the text so that it is relevant to the current situation.
But it gets it completely wrong.
And it gets wrong things like comparing numbers.
So you tell it, you know, you do a prompt.
You say, you know for a fact that seven is smaller than five or bigger numbers.
Like 250 is smaller than, you know, 100.
And then you start kind of, you know, telling a story with numbers.
And it will assume that what you prompted it with is right, even though it's false.
But you don't even have to do this, right?
I mean, there's very, a lot of situations like this where the system will just not say things that are factually correct.
Now, why is that?
It's because large language models are trained to predict the next word in a text.
and they're trained on enormous amounts of text
and they have enormous amounts of memory.
But they basically
probabilistically generate the next word
and then re-inject that word into their context of a few thousand
previous words that they've said, or the prompt,
and then generate the next word again
and then re-inject that in the input, etc.
There are various ways to do this more efficiently,
but that's the basic idea.
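To make that generation loop concrete, here is a minimal sketch in Python. The toy probability table and the function names are invented for illustration only; a real large language model replaces the lookup table with a trained transformer over tens of thousands of tokens and a context window of thousands of previous words.

```python
import random

# Stand-in "model": a hand-written table of next-word probabilities.
# A real LLM computes these scores with a trained transformer.
def toy_next_word_probs(context):
    if context and context[-1] == "the":
        return {"cat": 0.5, "mouse": 0.3, "kitchen": 0.2}
    if context and context[-1] == "cat":
        return {"chases": 0.9, "<eos>": 0.1}
    if context and context[-1] == "chases":
        return {"the": 1.0}
    return {"<eos>": 1.0}

def generate(prompt, max_words=10):
    context = list(prompt)                    # the prompt is the only way to specify the task
    for _ in range(max_words):
        probs = toy_next_word_probs(context)  # scores over possible next words
        words, weights = zip(*probs.items())
        next_word = random.choices(words, weights)[0]  # sample one word probabilistically
        if next_word == "<eos>":
            break
        context.append(next_word)             # re-inject the word and predict again
    return " ".join(context)

print(generate(["the"]))  # e.g. "the cat chases the mouse"
```

The point this illustrates is the one being made here: each word is sampled and fed back in as context, with no explicit objective or plan beyond predicting the next word.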
So now, the issue with this is that there is no way to specify a task that the system has to accomplish other than by specifying that task inside of the prompt, which is a very circuitous, inefficient, and complicated way of specifying a task.
It's not very controllable, okay?
Okay, that's the first thing.
The second thing is that that system
is not, like, optimizing an objective, if you want,
like trying to satisfy an objective, right?
It's just kind of generating one word after the other.
And because it's generating one word after the other,
it's not doing any planning.
So it's not like planning to tell a story or an answer
that is, you know, kind of aligned to a storyline
or, you know, a set of facts and things like this.
it just generates one word after another.
It has no capability of generating commands for a tool,
like, say, a calculator or anything like that,
or a physics simulator, for example.
It could have simulated that piece of paper and then observed the result
and then kind of told you what the result was.
That's what we do in our head.
When we are being described this kind of situation,
we have our own internal mental simulator.
And because we've learned how the world works,
we can simulate what goes on,
and then describe the result, right?
LLMs do not have that.
They don't have any internal model of the world
that allows them to predict.
And then in addition to this,
you would like when the system produces a statement,
you'd like to be able to verify that that statement
is factually correct or does not break any kind of logic
of any kind compared to another statement that was made before.
And there is no way in the current architecture
of those systems to do this.
Right.
Until we build systems that have some internal model of the world
that allows them to kind of simulate the world, if you want,
some way of generating actions on the world to use tools like a calculator
or something like that, or to interrogate a database or a search engine,
an objective that they have to satisfy for the tasks that we ask them to accomplish,
and a method by which they can plan an answer that satisfies the objective, is factually correct
or not depending on the desired behavior, and perhaps interrogates the right sources of information,
we're not going to have anything resembling human-level intelligence.
Okay, and I definitely want to get to the type of research and the models that might get us there.
But first I want to talk a little bit about this hallucination that ChatGPT just had in my interaction with it.
Because hallucination is definitely a big issue.
And I'll be honest, and this is embarrassing to admit as a journalist.
But as I read it, I was ready to believe it because it was like, oh, here's AI answering a question with a somewhat plausible answer and stating it so confidently.
And that is an issue with these models, right, is that they do hallucinate.
That's probably why we haven't seen Google bring it into search.
Well, that's why you haven't seen any kind of system of this type from either Google or Meta, despite the fact that they have the technology.
Okay.
Okay.
So, I mean, you have to realize that most of the underlying techniques used in ChatGPT were invented at Google and Meta.
Right.
And the whole thing has been built with PyTorch, which is, you know, made by Meta.
It's not owned by Meta anymore, but it originated there.
So, for example, you know, it uses large transformers, transformer architectures.
those were originally invented at Google.
Transformer themselves use something called associative memory.
I mean, it's called self-attention, but it's basically the same principle.
Those were basically proposed by Meta many years ago.
They use self-supervised pretraining by removing words.
That's a denoising autoencoder.
Those are techniques that go back to the 1990s, or even 1980s in some cases.
They've been popularized by the BERT-style language models, again,
proposed at Google, and then a number of techniques for dialogue systems.
So there is a very active dialogue system research group at Meta that has proposed lots and
lots of methods, which inevitably, whether they say it or not, OpenAI must have been
influenced by.
And then ChatGPT uses a technique that is
called reinforcement learning from human feedback, RLHF, which was proposed by DeepMind, actually.
So, you know, they've done a good job at sort of integrating a lot of things that, you know, have been proposed in the literature and sort of engineering a system that kind of produces impressive demos. And they have to produce impressive demos because that's the economic model. That's how they're going to raise money from Microsoft and others. Whereas if you are Meta or Google, you could think about, like, you know, putting out a system of this type that you know is going to spew nonsense.
And, you know, because you are a large company, you have a lot to lose when, you know, people are kind of making fun of you for that.
And it's not clear what the benefits are, okay?
Right.
So we're still working on those things to make them useful.
But didn't Meta put out a system of its own, Galactica?
And this was something that was...
So, yeah, talk us through what happened there because this was a system that was supposed to summarize scientific literature and do lots of other cool things.
It comes out.
And then three days later, it goes back behind closed doors.
That's right.
So what happened?
There was a previous system also called BlenderBot,
and there's another story about that that we'll talk about.
Oh, right.
That's the thing that started talking about how Mark Zuckerberg is a sort of money-hungry capitalist.
You didn't get called into Zuck's office and be like, hey, what are you guys telling it about me?
Yeah.
No, no, it just, I mean, it just trained on, you know, the general conversations
you read in the media. You know, Mark Zuckerberg is
very often painted as, you know, some sort of money-hungry bad guy, which is not at all the case,
but that's the way he's painted. So, yeah, so let's start with BlenderBot then. So BlenderBot
was put out several months ago, and it's a dialogue system, a bit like ChatGPT. It's designed
to be entertaining more than anything else. And it's capable of having multiple personalities,
so it can talk in, like, several styles and things like that. And it has provisions
to somewhat verify factual correctness,
although not particularly well-developed.
But it does have a lot of kind of guardrails
and kind of systems to prevent it from saying things
that might be offensive or whatever, or even objectionable
or even controversial, right?
So if you try to get it to talk about anything related to politics or religion or anything like that, it will change the topic.
And it won't tell you that it's because it doesn't want to talk about this.
It would just change the topic, right?
And so people thought this thing was really stupid and boring because it doesn't want to talk about anything that's kind of controversial or fun, which is the kind of stuff you want to talk about.
You know, everybody wants to talk about.
And it's frustrating because it would change topic.
any time you wanted to do that.
So it was not nearly as convincing.
So you could say that the reason it was not that impressive in the end
was because it was made to be safe, essentially.
And if it's too safe, it's boring.
So now let's go to Galactica.
So Galactica is a different animal.
It's also a large language model.
And that large language model has been trained
on the entire scientific literature.
So this is something like, you know, millions of scientific papers.
And the purpose of it, what it's used for, is to help scientists write papers.
So think of it.
So it's not going to write a scientific paper for you.
It's not going to answer scientific questions, although you can try to use it for that.
But sometimes it might make stuff up.
But it's designed to be essentially a, you know, predictive keyboard on steroids.
Right.
So you start typing a paragraph
about something.
And it will, you know,
complete the text, the entire paragraph.
It will insert relevant citations.
If you say, you know, the state of the art
in object recognition on the ImageNet database is,
it will find the correct reference.
It will actually, you know, build a table of results
with links to the references and stuff like that, right?
But the same way driving assistance systems for cars
are just that,
driving assistance. This is just writing assistance, right? So in the end, your hands have to be
on the wheel at all times. You are responsible for the text that, in the end, is finished.
It just helps you. It's a tool that helps you write more efficiently, particularly if you are
not a, you know, native English speaker, which, you know, most scientists aren't. Right. I mean,
I even use ChatGPT that way. I put in the beginning of the paragraph and say, hey, which ways
could this go, understanding that it might not be accurate.
And that's the way you should treat it, really.
Right.
As, you know, as a predictive keyboard on steroids, and something that just helps you write,
but it's not going to, you know, write, invent new things, answer questions, do science, blah, blah, blah.
So what happened was that when we put out Galactica, people tried to break it.
So people who are not scientists, like, didn't understand what the use of it was going to be,
and they would prompt it with things like, you know,
what are the benefits of eating crushed glass or something like that?
And, of course, that's kind of a leading question.
So the system will kind of make a story of like why it's good to, you know,
to eat crushed glass.
And then the reaction on Twitter was, oh, my God, you know,
people are going to eat crushed glass because they're going to listen to it, right?
Which is, you know, insane.
I mean, it's stupid.
People are not that dumb.
Well, Jan, I think you overestimate people a little bit, but sorry, continue.
Well, there might be a tiny proportion, but, like, you know, are they likely to use things like this?
It's not clear.
You know, particularly Galactica, which really ultimately was designed to be, you know, integrated into tools that scientists use to write papers, right?
And then, you know, others more seriously said, oh, this is going to destroy scientific publication because now, you know, anybody can just
generate a nice-sounding scientific paper and then submit it to a conference, and this
will completely flood and overwhelm the reviewing system that we have in science and destroy science.
And I thought that was a completely ridiculous argument, because the reason why you might want
to submit a paper is because you want to prop up your CV. And so you have to put your own name on it.
Otherwise, what's the point? And if you put your own name and it's garbage, it's bad for you.
It's bad for your career.
Like, you know, if you, you know, send 100 papers that are complete nonsense to a conference with your name on it, it's not going to be good for your career.
Like, absolutely not.
So, so I don't, so I think this kind of knee-jerk reaction was completely unwarranted.
And it really mirrors a lot of knee-jerk reactions that have happened in the past when new tools or new communication technologies have appeared, where, you know, each of them was going to destroy society.
So I think it's the same kind of knee-jerk reaction that we're observing with AI today.
You know, this is not to say that there is no danger, but it's not like the horrible things that people make it out to be.
Right. So then why not keep it up?
Well, so what happened was the team that worked on it, which within FAIR is called Papers with Code,
they were so distraught by the reaction that they just couldn't take it.
It's like, you know, we're just going to take it down.
Like, this was not a high-level decision.
This was not a decision by communications departments or the management.
This was them.
Okay.
It's a small team, a few people.
And they felt like really good about what they had produced.
They wrote a very long paper.
They open-source their code.
They took down the demo, but they left the code.
So anybody can download Galactica and run it on their own computer.
Interesting.
And this is sort of what we're going to see.
I just wrote this story,
and we've talked about it on the podcast, about how the battle over, you know, AI ethics and applications of AI is just going to be extremely intense.
And I think we're starting to see some of that.
So we've covered hallucination.
Good.
We got to that.
Let's talk about the type of models that you think can get us there.
So you talk about our move to artificial, or to human-level intelligence, needing an understanding of the environment, things that can't be expressed with words.
Now when you tell me that, I'm back to kind of where I was in the
early days of our conversations, saying there's no way technology is going to be able to do that.
But it sounds like you think that there is a chance that it can.
How do we get there?
And what type of advances are we seeing today that might lead us to think that we do have a chance?
So I actually wrote a position paper about this, which is rather long, but the intro is
easily readable by non-specialists.
And the title of it is A Path Towards Autonomous Machine Intelligence, where basically
I lay out a plan or path forward to address those questions, to build AI systems that are
capable of planning whose behavior is controlled by objectives that need to be satisfied,
which can be specified by hand or learned. So things like, you know, factual correctness and
blah, blah, blah, and answer this question and, you know, don't spew offensive stuff and things
like that, and have internal models of the world or the thing they are interacting with,
which could be a person. If there are dialogue systems, then when you're talking to someone,
you need to have a mental model of what that person knows and can understand. If you,
if you speak in terms that the person cannot understand, then, you know, communication fails, right?
So when you talk to someone, you have something to say and you may have to tell them, you know, some background information, depending on what you think they know, and then sort of take them to the stage where you think they've absorbed the information that you think would be useful to them.
So to talk to someone, you need an internal model of what that person, of how that person will react to what you tell them.
And if it's another type of agent that generates actions in the physical world, or even in a digital world, like a domestic robot that, you know, needs to cook or fill up the dishwasher, that system needs to have an internal model of the world that allows it to predict what the state of the world is going to be as a consequence of its own actions, because that's what you need to be able to plan.
If you're going to plan a sequence of action to arrive at a goal, you need to be able to imagine, if I take this action, this will happen, and then if I take that action, this will happen, et cetera, et cetera.
And so you can optimize your sequence of action so that the resulting sequence of states that the world is going to follow is going to satisfy your ultimate goal.
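A rough sketch of that planning loop, under invented assumptions: a one-dimensional toy "world" stands in for a learned world model, the objective is just distance to a goal state, and the planner brute-forces short action sequences. A real system would learn the world model from observation and optimize the action sequence far more efficiently.

```python
import itertools

# Hypothetical toy world model: the state is a number, actions nudge it.
# In the architecture described here, this predictor would be learned.
def world_model(state, action):
    return state + {"left": -1, "stay": 0, "right": +1}[action]

# Objective the system has to satisfy: end up close to a goal state.
def objective(final_state, goal=3):
    return abs(final_state - goal)

def plan(start_state, horizon=4):
    """Imagine every action sequence, roll each through the world model,
    and keep the one whose predicted outcome best satisfies the objective."""
    best_seq, best_cost = None, float("inf")
    for seq in itertools.product(["left", "stay", "right"], repeat=horizon):
        state = start_state
        for action in seq:
            state = world_model(state, action)  # predict the consequence of the action
        cost = objective(state)
        if cost < best_cost:
            best_seq, best_cost = seq, cost
    return best_seq

print(plan(start_state=0))  # e.g. ('right', 'right', 'right', 'stay')
```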
And then an issue with this is that, you know, how do we get a machine to learn models of the world?
baby humans and animals do this in the first few months of life,
mostly by observation,
and understand a huge amount of background knowledge about the world,
basically just by observation.
This is the type of learning that we need to reproduce in machines
that we currently cannot do except for simple things like text,
because you have to understand that text is actually simple compared to the real world.
Right. So how do you do it?
Okay, so...
Sure, it's simple.
There's one way to not do it, which is the way large language models are trained.
So the way you pre-train a large language model is that you don't train it to just predict the next word in the text.
You take a long piece of text, a few thousand words typically, and you remove some of the words.
You blank them out. You replace them by a blank marker, or you substitute another word, or you do various corruption things.
And then you train some gigantic neural net to predict the words that are missing.
Okay. So this is called self-supervised learning.
And this particular instance is what's called a generative model,
which is much more general than the usual kind of colloquial use of the term generative model.
It's generative in the sense that it produces signals that are the same as the input.
It tries to fill in the blanks, if you want.
Okay. So it generates the missing information, if you want.
And this particular instance
is called a denoising autoencoder.
Why autoencoder?
Because you give it an input
that is corrupted and you ask it to produce
the clean version of that input.
That's called a denoising autoencoder.
And again, that concept goes back to the 1980s.
Nothing new there.
Except the application of this idea
of denoising auto encoder to text
using a transformer architecture,
which is those very large neural nets
that we use nowadays,
with 40, 90 layers or whatever
and hundreds of billions or at least tens of billions of parameters,
works amazingly well, like surprisingly well.
So in the process of being trained to filling in the blanks,
those models basically learn to represent language as a series of numbers
that represents basically the meaning of the input sentence
to the extent that this meaning is useful to fill in the blanks.
Okay.
So inevitably, by being trained to do this, those systems understand a little bit about the real world, but not much.
It's very superficial.
Now, if I train the system with sentences of the type, you know, the cat chases the blank in the kitchen.
You know, the blank there can only be a few things, right?
It's either a mouse or an insect of some type, or maybe a laser spot that someone is, you know, using to play with the cat,
or something like that, right?
There's only a few options.
And so those systems are trained to produce essentially a long list of numbers
that are scores for each of the possible words in the dictionary,
you know, for how likely they are to appear at that location.
And to be able to do a good job at this,
the system has to understand a little bit about, you know,
what's a kitchen and what's a cat, and, you know,
cats chase mice and things like that.
And so it learns that, but it's very superficial.
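As an illustration of "scores for each of the possible words in the dictionary," here is a deliberately tiny sketch that fills in a blank by counting co-occurrences in a made-up three-sentence corpus. The corpus and helper names are invented, and a count table is of course not what a transformer does internally; it only shows the shape of the training signal: context in, a score per candidate word out.

```python
from collections import Counter, defaultdict

# Tiny toy corpus standing in for "a huge amount of text".
corpus = [
    "the cat chases the mouse in the kitchen",
    "the cat chases a laser spot in the kitchen",
    "the dog chases the cat in the garden",
]

# "Training": count which word fills the blank between a left and right context word.
# A real denoising autoencoder learns this with a large transformer, not a count table.
fill_counts = defaultdict(Counter)
for sentence in corpus:
    words = sentence.split()
    for i in range(1, len(words) - 1):
        fill_counts[(words[i - 1], words[i + 1])][words[i]] += 1

def predict_blank(left, right):
    """Return a score for every candidate word in the 'dictionary' for this blank."""
    counts = fill_counts[(left, right)]
    total = sum(counts.values()) or 1
    return {word: count / total for word, count in counts.items()}

# "the cat chases the ____ in the kitchen" -> scores over possible fillers
print(predict_blank("the", "in"))  # e.g. {'mouse': 0.5, 'cat': 0.5}
```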
And then what you do, once you've trained the system
to do this, you chop off the top layers, and then you can use the internal representation
as input to a subsequent downstream task that you can train supervised, like, for example,
translation, or hate speech detection, for example. So this technique has become completely
mainstream in natural language processing. And so much so that companies like Meta and Alphabet
have, in their various services, deployed those things massively
for doing things like content moderation, right?
So hate speech detection, for example,
has made enormous progress over the last three, four years.
And it's essentially entirely due to the fact
that we're now using those techniques.
We can do hate speech detection in multiple languages,
basically with a single model, which we don't
have to train with a lot of data in each language,
because we don't have a lot of data in every language,
We have a lot of data in English and French and German and blah, blah, blah.
You know, maybe the main languages in India, but India has, you know, like an enormously large number of languages that people use, including online, you know, and local dialects and stuff.
So, you know, how do you make hate speech detection work in 500 languages? And even 500 would not cover everything.
So you need those techniques, and, you know, this has made a huge amount of progress.
Now, one thing you can do with those things is you can fine-tune them
to just predict the next word.
And that's what large language models are.
Okay.
Now, coming back to this question of planning,
last time we talked with Danny Kahneman,
which was an amazing discussion.
Danny is famous for, you know, this book,
Thinking, Fast and Slow,
and for this idea that we have basically two ways of acting,
system one and system two, he calls them. So system one corresponds to tasks that
you accomplish subconsciously. You don't need to plan anything. It's completely reactive. So think
about, like, you know, you're an experienced driver and you're driving on the highway.
You're barely paying attention, and, you know, you're not calling on your sort of
high-level cognitive functions to do that.
You can do it more or less automatically.
Anything that involves real-time reaction,
like playing tennis or something like that,
that's mostly subconscious.
You just don't have time to plan.
So it has to be built into your muscle memory, as we call it.
But even complex acts, like, you know, you play chess.
And you are a chess grandmaster.
You're playing a simultaneous game against 50 bad players like me.
You don't have to think.
You can just look at the board and just immediately play.
You don't have to plan anything because it's so easy for you.
You know, it's become kind of a subconscious task.
Now, all of those things, though, all of those tasks,
when you learn them or when you're not very good at them,
you have to use your entire cognitive resources.
You learn to drive and you pay attention to everything.
You imagine all kinds of catastrophe scenarios.
You drive slowly.
you're using your frontal cortex, your model of the world
that tells you I'm driving next to a cliff
and I know that if I turn my steering wheel to the right
I'm going to fall off the cliff and nothing good is going to happen.
You don't need to try it to know that something bad is going to happen, right?
Because you have this model that you've built in your head
for the last 17 years if you are a 17-year-old.
So that model of the world allows you to predict the consequences
of your actions and allows you to learn extremely quickly any new skill.
You know, same with chess.
If you're a bad chess player, you will have to think for, you know, 15 minutes or more
when you play against a challenging player and plan all kinds of strategies.
So what characterizes intelligence is the ability to predict, first of all,
And then the ability to use those predictions as a tool to plan by predicting the consequences of actions you might take.
Prediction is the essence of intelligence.
Okay, so now here's the problem.
This idea of denoising autoencoder that we use to pre-train natural language processing systems works for text.
It doesn't work for anything else.
So it doesn't really work for things like video or images.
So a natural idea is you take an image, you block some pieces of it,
and then you train some system to predict the pieces, reconstruct the parts of the image.
That's how DALL-E works, huh?
Not really, no.
Okay.
It's, you know, it uses, I mean, DALL-E 2 uses a diffusion model,
which is a kind of slightly different idea.
But I have to get to that later.
Okay.
But if you do it the way I just described, there are, like, one or two models that sort of work that use that.
One is called MAE, by my colleagues at FAIR.
That means masked autoencoder.
But it doesn't work as well as other techniques.
And those other techniques are not generative models.
So they are models that do not attempt to reconstruct missing information directly.
They are techniques that attempt to predict missing information, but not reconstruct the image itself; they reconstruct a representation of that image, an internal representation of that image.
Those techniques, I call them joint embedding architectures.
So essentially, in the denoising autoencoder, you have an encoder that produces a representation of the input, whether it's an image or text or video or whatever.
And then you try to reconstruct the uncorrupted version of the input, the input being corrupted, right?
That's through a decoder.
In a joint embedding architecture, you have two encoders.
One encoder sees the perfect version of the input.
The other encoder sees a corrupted version or distorted version of some kind.
You run those two things through both encoders, and then you tell the encoder that sees the corrupted input.
Can you predict the representation of the full input?
But you're not trying to reproduce all the details.
And it makes sense for images, or video.
So let's imagine we're on a scenario with video, right?
So I have a video clip.
And I have a complete video clip.
And what I do is I mask the last half of the video clip.
Okay?
The corrupted version of video clip is just the first half of that video clip.
Okay?
The rest is invisible.
And then the complete version, of course,
it's a full video clip, right?
So you run this full video clip through some neural net
that produces some representation of that video clip, right?
And then you train this guy to predict the representation
that that guy has produced.
And of course, implicitly, what it has to do
is predict the rest of that video clip,
but in representation space.
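Here is a bare-bones sketch of that joint embedding idea, assuming PyTorch. The sizes, the masking scheme, and the frozen target encoder are placeholders; real joint embedding methods need extra machinery, such as a momentum-updated target encoder or a variance term, to keep the representations from collapsing.

```python
# Minimal joint-embedding sketch: one encoder sees the full input, the other
# sees a corrupted version, and a predictor tries to match representations.
import torch
import torch.nn as nn

dim_in, dim_repr = 32, 8

target_encoder = nn.Sequential(nn.Linear(dim_in, 16), nn.ReLU(), nn.Linear(16, dim_repr))
context_encoder = nn.Sequential(nn.Linear(dim_in, 16), nn.ReLU(), nn.Linear(16, dim_repr))
predictor = nn.Linear(dim_repr, dim_repr)  # predicts the full input's representation

optimizer = torch.optim.Adam(
    list(context_encoder.parameters()) + list(predictor.parameters()), lr=1e-3
)

for step in range(100):
    full_clip = torch.randn(64, dim_in)    # stand-in for complete "video clips"
    corrupted = full_clip.clone()
    corrupted[:, dim_in // 2:] = 0.0       # mask the second half of each clip

    with torch.no_grad():                  # representation of the uncorrupted input
        target = target_encoder(full_clip)

    pred = predictor(context_encoder(corrupted))
    loss = ((pred - target) ** 2).mean()   # predict in representation space, not pixels

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```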
Now, why is that better than just reconstructing
the piece of the video clip that is missing, just predicting it?
The reason is there is an infinite number of things
that can happen.
after a particular video clip, right?
And we don't know how to represent
a distribution
of all the possible things
that could possibly happen.
We cannot do it in pixel space, right?
So, for example, you're seeing a green screen
behind me right now.
You're not seeing the back of my head, okay?
You can make...
So if I start rotating my head,
you might be able to predict
what the back of my head will look like.
And when you look at it, you might be surprised by what you see.
Maybe I have a small ponytail or something.
I don't.
But, you know.
And so, and there is, you know, an infinite number of variations of what could be possible there.
And I could decide to, you know, change the action and move my head in a particular way or something.
So you can't just predict every detail of what's going to happen.
Let's say you want to build, it's very important if you want to do things like building
self-driving cars.
Because to be able to drive safely, you'd like to be able to predict what cars around you are going to do or what pedestrians are going to do.
You see a kid on the sidewalk and a soccer ball kind of crossing the road.
Good chance that the kid is going to run after it, right?
So you slow down.
So, you know, we have these sort of internal models of the world that allow us to predict and then, as a consequence, plan kind of safe actions.
But then in that same street, where the kid is with a soccer ball, it's lined with trees and there is a pond behind the trees.
It's a windy day, so the tree, you know, all the leaves are moving and there is, you know, ripples on the pond and everything.
And you don't want to spend any resources predicting all of those details, which are essentially unpredictable.
And so that's why generative models essentially fail when you train them on images, because there's just so many details to predict.
the system gets completely lost in predicting all kinds of irrelevant stuff, which we call noise.
But, you know, how would the system know?
So one of the main things I'm proposing in that piece is to abandon those generative models, basically,
and focus on those joint embedding architectures for getting systems to learn models of the world, predictive models of the world.
Okay. And maybe that's what gets us
there. Okay, let's take a quick break here and then come back for about 10 more minutes
on the show. We have Yann LeCun with us. He's the VP and Chief AI Scientist at Facebook
and known as the father of deep learning. So plenty to talk about. We'll be back right after
the break. Hey, everyone. Let me tell you about the Hustle Daily Show, a podcast filled with
business, tech news, and original stories to keep you in the loop on what's trending. More than
two million professionals read the hustle's daily email for its irreverent and informative takes
on business and tech news. Now, they have a daily podcast called The Hustle Daily Show,
where their team of writers break down the biggest business headlines in 15 minutes or less
and explain why you should care about them. So, search for the Hustle Daily Show in your
favorite podcast app, like the one you're using right now. And we're back here with Yann LeCun,
the VP and chief AI scientist at Facebook.
Facebook? Meta?
I don't know.
Is it the same?
Is it different?
Meta.
It's called Meta.
I have two jobs, actually.
I'm also a professor at NYU.
And professor at NYU, right?
Of course.
Can't forget that.
So, Yann, let's talk a little bit.
Let's just go one level deeper about this in the time that we have left.
So Facebook, I know, released an application, or at least has it internally, where you can type
a sentence,
and it will make not, like, an image with DALL-E,
but like a small little video that actually resembles that sentence.
So is that taking us closer to this ability
to predict and understand the world that you're talking about?
Yes and no.
So, yeah, there are two little demo applications.
One is called Make-A-Scene,
and this is one, you know, a bit similar to DALL-E,
where you type a description of an image and it just produces an image.
It's based on slightly different ideas, but it's, you know, it came out before DALL-E, actually.
Right.
The main author of DALL-E, by the way, was Aditya Ramesh, a brilliant young guy who was an undergraduate student with me and did some research projects in my lab before going on to OpenAI as an intern and then being hired as a scientist.
So there's this thing called Make-A-Scene that produces still images,
and then there is a similar thing called Make-A-Video that produces short video clips, essentially.
And there are systems of this kind now from Google as well, etc.
They're not really sort of publicly available.
Yeah, you can't use it.
I saw the blog post.
I was like, oh, I want to use this.
Then I realized it wasn't public.
I imagine the problems that you had with Galactica would pale in comparison to whatever happens with this movie maker.
So then why not release it?
There's something with language that people kind of, I think, you know,
pay more attention to.
Now, the main issue with those things is like where does the training data come from and things like that?
So, you know, before we can roll them out, it has to be trained on data that is acceptable and all that stuff.
There is a similar system also in the works.
There's been some publication on it that can generate audio.
So it can generate sound effects, also by textual description, or simultaneously with the video.
There are systems also from FAIR that produce music.
They can do music continuation.
So it's like a large language model, but with music, right?
So you feed it audio, and it just continues.
Interesting.
So you're going to see more and more of those things.
And meta is really interested in those things in sort of creative aids because everyone
who is on an online service, whether it's a social network, like, like, you know, Facebook or
Instagram, or whether it's going to be, you know, in the Metaverse, everyone is going to need
to have easy ways to generate content and be creative without necessarily, you know,
being technically astute enough in terms of art to do all that. So being able to generate,
be creative and generate content, I think is something that's very, very important.
And so you're going to see things like that popping up on, you know, Meta's services in various, various spaces.
Interesting.
Generating images, effects, modifying images, generating video, generating music, sound effects, 3D models, okay, which of course is important for the Metaverse.
And eventually, you know, all the way out to intelligent, virtual characters and stuff like that.
Really? Okay. So Meta's going to go in all in on this stuff.
Oh, totally. Yeah.
Yeah. I want to ask you a question.
about it because, you know, the thing about, you know, art, song, videos, all this stuff,
there's a musician, songwriter, Nick Cave, who did this post about DALL-E writing,
I mean, not DALL-E, ChatGPT writing songs and the authenticity of all of it. Now, there's been
this debate, is this going to take everybody's job, or is it too soulless and dumb to actually
do this? This is sort of the argument, that this AI really can't do what
humans do. So this person writes, Nick writes, ChatGPT may be able to write a speech or an essay
or a sermon or an obituary, but it cannot create a genuine song. It could perhaps in time
create a song that is on the surface indistinguishable from an original, but it would always be,
it will always be a replication, kind of a burlesque. Songs arise out of suffering, by which I mean
they are predicated upon the complex internal human struggle of creation and, well, as far as I know,
algorithms don't feel.
So I'm curious what you'd think about that,
like from the perspective of a songwriter.
Can this stuff actually produce?
Yeah, there's a lot of questions around this, right?
So first of all, it is true that current AI systems don't feel, okay?
But it's not going to be true for much longer.
So if we have systems of the type that I was describing before,
that have objectives and that plan actions to optimize objectives,
they will have the ability to predict the outcome of a situation,
which means if they are able to satisfy their goal given a situation,
they're going to feel the equivalent of elation.
If they can predict that they're not going to be able to satisfy their goal,
they're going to be disappointed with themselves.
If they predict that the situation may be dangerous for either themselves
or the person they are interacting with,
they might actually have, you know, the equivalent of fear.
So if you have autonomous AI systems that work by optimizing objectives, and they have
the ability to predict, they are going to have emotions.
It's inseparable from autonomous intelligence.
Fascinating.
So maybe we'll have an AI that tries to fulfill some objective and fails miserably and
write a song about it, and that will be...
But a big part of art, I mean, it's certainly true for songwriting,
but also for novels and for jazz improvisation and all kinds of stuff, right,
is basically a vehicle to communicate human emotions.
And Nick Cave has a point, which is that there is nothing to communicate unless you have
emotions that the listener or reader can relate to, right?
So, and that really depends on the person.
Like, you know, I'm totally taken by, you know, good jazz improvisation.
But, like, words in songs, like, you know, in his case, like, do nothing to me.
Like, I mean, we all have different ways, right?
Different ways of reacting to different things.
But so, you know, he's right in the sense that if you want to generate the experience of art,
of communicating human experience or human emotions, it has to come from another human.
Even if machines eventually have emotions, they're going to be very different from humans.
So it's not going to replace this type of
generating art.
Could be a different form of art.
Yeah, like, you know, project yourself, like, back a couple hundred years ago,
where, like, if you wanted to buy a salad bowl or something,
it would be handmade, right?
It would be like a potter that would, you know, kind of...
Make that bowl.
Right, and just make it by hand and bake it and everything, right?
And as a consequence, it would be relatively expensive for most people.
And then came industrialization.
So now it became possible to build, you know,
ceramic bowls for pennies in sort of large quantities.
And did that kill the whole idea of making bowls by hand?
No.
In fact, handmade objects became more expensive as a result.
Right. And that leads me to a question that I have to ask you, which is that everybody's asking about the training data that these things are using.
In fact, I put a tweet out asking, hey, I'm going to talk about generative AI.
So we'll end with this generative AI ethics question.
What would you like to know?
Everybody wanted to know about how the people who these databases or programs are trained on are actually going to end up being compensated.
and whether we should actually even allow them to train on stuff that people have created without their consent.
Here's like two of those comments.
So Roger McNamee, early Facebook investor now critic, says, and he didn't know it was going to be a Facebook conversation,
a conversation with someone from Facebook, but he says, why should it be legal for computer scientists,
entrepreneurs, and investors to profit from unlicensed exploitation of the works of creative humans?
And we also had one more from Marshall Marr.
He said the inevitability of intrusive tech is a uniquely American phenomenon.
I don't know why American.
But anyway, this is what he says.
They asked you to post baby pictures and are now training lucrative AI engines with your images.
This was not disclosed at the time.
Pay me now.
So when you think about that, what's your what's your reaction and how should we think about it?
It's going to be a debate for society to figure out because I don't think the answer is totally clear.
You know, for example, photography, the invention of photography shrunk the market for portraits, painted portraits, by a lot.
It's not like portraits disappeared, but it certainly reduced the market for them.
Recorded music reduced the market for performing musicians.
And in every instance of those things, there were,
you know,
collectives of artists saying, like, you know,
we have to stop this because this is going to kill our business.
They were universally unsuccessful.
Okay, so you're not going to stop technology, right?
Now, the question is a legal one.
So if you assume that current legal interpretation of copyright,
if you want, is used,
then you cannot let those machines plagiarize.
So if you use a generative model that's been trained on whatever, and it produces, regardless of the process, a piece of art that is too similar to an existing one, the artist that produced that existing one is entitled to sue the person who is distributing this new piece of art and ask for compensation.
Now, but what if that piece of art is not copyrighted?
That generated piece of art is not copyrighted, so nobody can profit from it.
Who are you going to sue?
You know, are there grounds for suing?
Now, if, again, it's a copy, yes, there are grounds for suing, even if you give it away for free.
If that piece of art is in the same style as a known artist,
but it's not the same painting or whatever it is,
then that's where things become complicated
because human artists are absolutely
authorized to get inspired
and straight-up copy
someone else's style.
That happens absolutely all the time in art.
And so would it make sense to apply a different rule
for, let's call them
artificial artists
that generate things.
Like, you know, they can
get inspired by whatever they see
they've been trained on.
But then they cannot
just reproduce.
So that would be a perfectly
fine thing to do.
I imagine that a lot of artists
would not be happy with this.
But,
you know, that's a
definite possibility.
Now,
perhaps,
What you might want to do is what, you know, the early internet also authorized or put in place,
which is that, you know, you can put a little file called, you know, robots.txt or something that tells search engines and crawlers, do not use my content for anything, like, you know, you can't index it, you can't use it for anything.
So if you're an artist and you don't want your content to be used, you know, lobby for this kind of stuff and this kind of policy
to be respected by, you know, crawling algorithms.
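For reference, a conventional robots.txt looks something like the snippet below. The path is a placeholder, and, at least as of this conversation, there is no standardized directive that specifically opts content out of model training; the file only asks cooperating crawlers not to fetch certain content.

```
# robots.txt placed at the root of your site; honored only by cooperating crawlers
User-agent: *
Disallow: /portfolio/
```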
And then it could be like, you know, the next step would be, like, if you use my artistic
production as training data, you owe me money.
And I think that's a bit extreme.
So, you know, the answer is probably somewhere in between and not clear what.
And it's not for people like me to decide.
You know, I have no particular legitimacy to, like, tell people this is the right thing to do.
I'm a scientist, I'm an AI scientist.
I think this is for society at large to decide through the usual democratic process.
But you have to be careful. Like, in one of the remarks that you mentioned, it was said, like, you know,
why should, like, a large tech company in California profit from my art, me, a poor
artist. This is the wrong way to frame this because within a year or two, you know, any teenager in their parents' basement is going to be able to do this. So you don't want to use the, you know, current bad press that the tech industry has to motivate people to kind of go after this, because you're going to lose if you do this.
This is not something, you know, the big tech companies are just the first ones to have the technology to do this, but eventually everybody is going to be able to use this kind of stuff and train their own models and crawl the internet, right?
I mean, you know, countless startups and, and, you know, young programmers.
Yeah, and it is a point I made in my most recent newsletter, that we are not going to see the centralization in AI,
I'm sure you agree, that we saw with, you know, communication online, like with the Facebooks and the Twitters.
It's going to be much more distributed.
And the last thing I'll say is that I'm so torn about this because I have been plagiarized by someone using my words, feeding them into generative AI systems, and then publishing them as their own.
That's already happened to me.
However, I just love the systems.
I love using DALL-E to illustrate my stories.
And I love speaking with ChatGPT, where, I mean, I have it on speed dial.
It's so fun to speak with.
So, okay, that will do it for us.
Unfortunately, we're out of time, Yann. I feel like we could speak for hours each time, but unfortunately, that's the end of the show.
So thank you so much for coming on.
We'll have to do it again soon.
Well, thanks for having me, guys.
Okay, great.
It was a pleasure.
And that'll do it for us here on Big Technology Podcast.
Thank you so much, Yann LeCun, for joining.
Always great to have you on.
Hopefully we could do it again sometime soon.
And definitely, I feel like we could use at least two or three hours talking about this stuff.
Thank you, Nick Guatany for handling the audio.
I appreciate it. As always, thank you, LinkedIn, for having me as part of your podcast network.
Thanks to all of you, the listeners. Some programming notes, we're going to have our weekly episode
recapping the news with Ranjan Roy. I know we're already going to talk about the activist investor
that's in Salesforce and plenty more on Friday. So that will be live on LinkedIn Friday a little bit
earlier than the normal time and then dropping here on the feed right after that. So keep an eye out.
The following week, a week from today, Brian Stelter, former host of CNN's Reliable
Sources, is coming on. That's going to be a fun discussion. I can't wait for that one.
Please, if you can, rate and review the show. It always helps other people find the show, which
helps us get better guests, and the show gets better. So it's a virtuous cycle. Rating and
reviewing five stars would be great. That's really applicable on Apple and Spotify, but anywhere
you can find a set of stars and you want to hit five of them. That would be awesome.
Okay. Well, that will do it for us. So thanks again for listening and we will see you next time
on Big Technology Podcast.