Front Burner - AI video’s groundbreaking, controversial leap forward
Episode Date: February 20, 2024
OpenAI has just introduced a new tool, Sora, which turns text prompts into short, shockingly realistic videos. Sora hasn't been released to the public yet, but it's already sparking controversy about its potential implications for industries like animation and video games, as well as for deepfake videos — and for democracy as a whole. Today, Gary Marcus — a cognitive scientist, AI researcher and entrepreneur, and author of the forthcoming book Taming Silicon Valley — talks to us about the promise and potential consequences of Sora and other generative AI video tools.
Transcript
In the Dragon's Den, a simple pitch can lead to a life-changing connection.
Watch new episodes of Dragon's Den free on CBC Gem. Brought to you in part by National Angel
Capital Organization, empowering Canada's entrepreneurs through angel investment and
industry connections. This is a CBC Podcast.
Hi, I'm Jamie Poisson.
So on Thursday afternoon, sitting around the office, everyone has one eye on a script, the other on Twitter or X, whatever.
And these videos, they start popping up.
A hamster riding a half duck, half dragon as the sun sets.
Two golden retrievers sitting on a blanket on a mountain podcasting.
A train zooming through a futuristic city.
They were impressive, weird, but impressive.
And the product of OpenAI's new tool, a text-to-video product called Sora.
Watching these videos, we were all like, wow.
But then, as more and more of them kept getting posted, oh wow, it kind of turned to, oh no.
There was a realistic food influencer making gnocchi, a grandmother blowing out her birthday cake candles.
And while they weren't 100% accurate, compared to where the tech was a year ago,
they are definitely getting closer to mirroring reality.
Today, I'm speaking with Professor Gary Marcus.
He's a cognitive scientist, an AI researcher, an entrepreneur,
and author of the forthcoming book, Taming Silicon Valley.
And we're going to talk about Sora, what we know and don't know about how it works,
and its potential implications for creative industries, deep fakes, and even democracy as a whole.
Gary, thank you so much for coming on to Front Burner.
It's great to be back.
So what is Sora and what do we know about how it works?
Sora is something where you type in a piece of text, a description of something, and it generates a short video for you.
So you can say, make me a point of view video of some ants walking in an ant colony.
And it'll actually do that. It may not work perfectly, but it'll give you something that looks at least a little
bit like what you're talking about.
Often something that looks a lot like what you're talking about.
And often something that if you look really carefully, isn't quite right.
Last week, the OpenAI CEO, Sam Altman, he got people to tweet at him with like prompts, those text prompts that you're talking about.
And the videos that came out right away, I mean, there were some that were pretty amazing, like two golden retrievers podcasting on the top of a mountain.
Or one with dolphins and other animals riding bicycles on top of the ocean.
And they were fantastical scenarios,
but these ones, they did look incredibly realistic. I'm looking at one right now.
It's a woman in a leather jacket walking down a Tokyo street filled with warm sort of glowing neon
signs. And it is very impressive, I have to say. It looks almost completely real.
It looks almost completely real. Most of them have problems. So for example, the woman walking
on the street, there are people in the background, if that's the one I remember, where it looks like
they're almost like zombies kind of floating around. The woman actually takes two left steps
at about 28 seconds in, which is not biologically possible. So when you start
to look carefully at them, and of course, that's this generation, and we can ask what will happen
in future generations. But if you start to look carefully at these videos, there are often a lot
of violations of physical laws. The one that's maybe most disconcerting to me is that objects
pop in and out of existence, which six-month-old babies realize can't actually happen.
So there's one, for example, of wolf pups.
And if you look at it carefully, the number of wolf pups changes from one frame to the
next.
There's another one where, I don't know if they're archaeologists or what they're supposed
to be, but they dig a chair out of the ground.
The chair starts to levitate at some point.
And one of the people walks behind another. And when the camera
kind of shifts around that, the first person I think is in a tan shirt has just disappeared
altogether. So there are violations of physical laws. Another video showed, I mentioned already,
the ant walking through a colony. And if you look carefully at it, the ant only has four legs. And
most normal, well, all normal ants have six legs. It'd be very weird to
see an ant with four legs. And so somebody posted, wow, I can't
believe they even got the dynamics of the legs right. But
they didn't get the dynamics of the legs right. Wasn't even the
right number of legs. And if you watch that whole video, there's
like this weird two headed ant that pops up and so forth. So
there are a lot of you might call them glitches.
And from the perspective of cognitive science,
this matters because you want to know,
does this thing really understand the world?
And I think the answer is no.
I think something else is going on here.
But I would say they look fantastic.
They look photorealistic, detailed, sharp graphics.
So there's something absolutely stunning about them. But
there's also something, if you look carefully, for many of them, maybe not all, but quite a lot of
them, there are little glitches that don't quite meet reality. But having said all that, and that
was fascinating listening to you go through all those examples. I could like see the glitches in
the woman in Tokyo video in real time as you were describing it.
But having said all that, the speed at which this technology can improve is pretty breathtaking, right?
Like if we look at another example, the OpenAI tool ChatGPT, when it was first released in the fall of 2022,
and then now it's gotten so much better.
And what could that tell us about what Sora might be able to do
in even a few months from now?
Well, it has and it hasn't.
The prominent idea in the field for the last few years
has been sometimes paraphrased as scale is all you need.
The idea is if we just make them bigger and bigger,
they'll solve all the problems. And I've been arguing against that, saying that although they
get better and better in many dimensions, there's still certain basic problems that they have.
So here's what I think we will see for Sora itself in two years, because they will keep
trying to make it better. Although there's not so much headroom given how much data and money I think
they already put into it, there's surely some headroom.
It will continue to improve for a while.
But I think that these physical errors are going to continue.
Another one was a glass that turns on its side and starts floating in air and liquid
falls through the glass.
The lack of understanding of the everyday physical world we have seen in GPT-2, GPT-3, GPT-4. We've seen it in
Sora. I think we will still continue to see it. We might see fewer errors as there's more data;
they might reduce. But I think inherently this approach is not about representing things in the
world, saying I see an X here, a Y there. It's really about pixels and predicting patterns of pixels over
time. And that's why we're getting some of these quirks. And I would actually be surprised if the
quirks are entirely eliminated. I had a conversation today on Twitter with somebody about this,
and they said, well, I think these errors will reduce by 80% in two years. And my reply was,
well, if it reduces by 80%, but you're getting something like one per
five seconds of video, that's still actually a lot. That's still enough that for a high-production-value
film, it wouldn't be enough. It might be enough for an advertisement. It might be enough
for a misinformation campaign where people aren't going to look that close.
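To put rough numbers on that point, here is a back-of-the-envelope sketch in Python. The one-glitch-per-five-seconds rate and the 80% reduction are the illustrative figures from the conversation above, not measured statistics:

```python
# Back-of-the-envelope: how many visible glitches remain after an 80% reduction,
# assuming Marcus's illustrative rate of one glitch per five seconds of video.

glitches_per_second = 1 / 5   # illustrative current rate from the conversation
reduction = 0.80              # hypothesized improvement in two years
remaining_rate = glitches_per_second * (1 - reduction)  # 0.04 glitches/second

for label, seconds in [("30-second ad", 30),
                       ("10-minute short", 600),
                       ("90-minute film", 5400)]:
    print(f"{label}: ~{remaining_rate * seconds:.0f} expected glitches")

# 30-second ad: ~1 expected glitch
# 10-minute short: ~24 expected glitches
# 90-minute film: ~216 expected glitches
```

Even under that optimistic 80% reduction, a feature-length film would still be expected to contain a couple of hundred glitches, which is why an ad or a misinformation clip clears the bar long before high-production-value film does.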
I mean, certainly I didn't see some of what you pointed out in those examples, right?
Like I didn't see, you know, those glitches in the woman.
A lot of people didn't see the four legs on the ant.
Yeah. Yeah. I also didn't see the four legs on the ant.
Most people didn't.
Yeah. Let's say that is correct, right?
And with these models, you hear people say that the technology is never going to get worse than it is right now.
But let's say it gets only a little bit better, and then it kind of plateaus.
It could get more dangerous, but it won't get technically worse.
So maybe then, yeah, let's talk about that. Let's talk about how it could be used in the real world,
I guess, either for good or for evil. Could we start with good?
Sure.
I mean, I guess the positive use case for Sora is it enables a lot of people to be creative in ways that they couldn't before.
So I couldn't make a 30-second video or wouldn't have the patience for it.
And now, once it becomes available to the public and assuming it's a reasonable price,
lots of people will be able to make short videos and that will be fun. It's going to be empowering in something like the way
that Photoshop is. Not exactly, because you have more control over Photoshop, but you can do things
faster in this system. So it's kind of a trade-off. So it will be of some use to creative people.
A great use case might be if you want to make a movie, you used to often draw
storyboards one scene at a time. And now even if you don't have drawing skills, you could kind of
do that with Sora. They wouldn't be perfect. You would still need to remake them. So, you know,
there are limits. But I think as a prototyping tool, it's already, you know, looks like it's
pretty good. I haven't actually used it. They haven't released it to the general public, nor to the scientific community in any broad way.
But if it's as good as it looks in the demos, then it's a very cool prototyping tool. And,
you know, undeniably fun. Yeah, I was thinking, you know, if I had written like a short story,
right, I could probably use this tool to try and create a short video
about that story or something like that, right? Even though you say that it has these limitations.
If you wanted a little 90 second or let's say 30 second video that was kind of like an
advertisement for your story, I think you could easily do that. I think if you wanted a 10 minute
video, it would actually
turn out to be really frustrating. So like, it would probably make the protagonist look different
in each scene, the lighting would vary and so forth. And I suspect, although this is really
just speculation from the 12 or so videos that I've seen, I suspect that it would get very
frustrating if you wanted to do anything longer than a single clip.
So in many ways, this system is similar to DALL-E or DALL-E 3, or Midjourney.
And those are really good.
Which was images, right?
Like, yeah, still images.
Exactly.
They make still images.
But people who have played around with them a lot often get frustrated.
So for amateurs, again, it's a fantastic thing.
But if you want something
precise, it'll have, for example, text that isn't really right, and there's no way to fix that text,
or you want it from a slightly different angle, it can be hard to get it to do exactly what you
want. So in the near term, anyway, if you write a short story, and you really want to turn that
into a movie, I think you're going to be frustrated.
You know, if I'm one of these people in these creative industries, in the long term, what
might the implications be here?
Like, should I be really concerned, in your opinion?
It's a double-edged sword.
So one thing we haven't talked about is copyright.
And for working artists, this is a real problem.
You know, if your living is making concept art to help a movie designer
or director figure out what they want, you're in trouble
because this stuff can do a bunch of that.
If your job is to do set design throughout a film,
it's not really going to do that.
But to some degree, you know, Hollywood artists are already threatened.
Some film studios should be really upset.
You probably know about the New York Times lawsuit against OpenAI, which showed that
OpenAI could essentially plagiarize some of their work.
It was obviously trading on their data.
Yeah, just to flesh out the story for our listeners, the language model behind ChatGPT had
used an enormous amount of New York Times articles to
help train it. So that's why they're suing.
That's right. And in the lawsuit, they have
a hundred examples where things are almost word for word identical over a space of paragraphs
between what ChatGPT would do with a prompt that was the first few words of a story, and then it
would basically regurgitate the story. So Reid Southen, who's an artist who's worked with places like Marvel,
and I did some experiments in December, which we published in January in the IEEE Spectrum,
showing that the visual models do the same thing. So for example, you can say something like,
draw me a picture of an Italian plumber, and you're probably going to get Nintendo's Mario
character back. Well, Nintendo's not going to like that. So on the one hand, the film studios, I would suspect are going to be quite
upset about this. On the other hand, they're like, hmm, can I save money if I use this?
Yeah, I was just going to say that.
So I think some of the film studios are hanging back trying to decide what to do. I think a lot
of people are watching that New York Times lawsuit. If it actually goes to trial, it could
set a huge precedent either way. It could almost shut down the whole AI industry, or well, at least this
part of it, the generative AI industry, or it could give them license to use things. Most likely,
it'll be a settlement. So I've been joking that 2023 was the year of generative AI, and 2024 is
the year of generative AI litigation. There's going to be so many lawsuits filed.
In the Dragon's Den, a simple pitch can lead to a life-changing connection.
Watch new episodes of Dragon's Den free on CBC Gem.
Brought to you in part by National Angel Capital Organization,
empowering Canada's entrepreneurs through angel investment and industry connections.
Hi, it's Ramit Sethi here.
You may have seen my money show on Netflix.
I've been talking about money for 20 years.
I've talked to millions of people and I have some startling numbers to share with you. Did you know that of the people I speak to, 50% of them do not know
their own household income? That's not a typo, 50%. That's because money is confusing. In my new
book and podcast, Money for Couples, I help you and your partner create a financial vision together.
To listen to this podcast, just search for Money for Couples.
So I think this is a great spot for us to talk about misinformation
because it seems like the big one here.
And it's particularly concerning to me as a journalist.
It's the potential implications for deepfakes, for misinformation, for videos that look real, with real famous people, for example, doing and saying things that they never did.
And talk to me a bit about your concerns there, what it could mean for elections, for democracy, for our understanding of what's even real.
All of those things. I'm frankly terrified. So I actually posted about the four-legged ant as an example of this. So there are the obvious cases with elections, right? We've
already seen this. There was at least one election that may have turned on a deepfake.
This is Michal Šimečka. He is the leader of the main opposition party here in Slovakia. And on the eve
of this country's elections last year, he was the target of a deepfake.
Just two days before voting began in that high-stakes election,
this audio tape began circulating online.
It purported to be a recording of a conversation in which Šimečka talks about stealing the election.
His party, Progressive Slovakia, went on to lose the election by a few points.
You know, there are like 70 elections around the globe this year. The technology is
improving. People are using it more and more. So there's definitely a serious chance of having an
impact on elections. You can expect that, you know, in October 2024, we'll see, for example,
deepfake footage of, you know, one of the U.S. presidential candidates
falling down the steps in order to make them look infirm.
That kind of stuff seems inevitable.
And then there's another problem, which I would call the pollution of the information
ecosphere, which is scammers like to make fake stuff and monetize it.
And we've already seen this in different ways.
Last year, people had fake websites saying
that, um, Mayim Bialik, if I've got her name right, was selling CBD gummies. Well, she wasn't, but they
sold ads off of it. They didn't care that it wasn't true. Um, now the New York Times had an article
yesterday about fake books, and some of my friends have been hit by this, where they will write a book
and then somebody writes a book with a slightly different author name, slightly different title.
It takes them like five minutes with ChatGPT.
They put it on Amazon, and then the real authors get hurt.
And there are cases where people are putting out books about how to eat mushrooms, and
they're probably filled with mistakes because we know that these systems are not accurate.
And so there's going to start to be increasing risk that people are going to get bad information of that sort. With the four-legged ant, somebody said to me, well,
what's the big deal about, you know, having a video with a four-legged ant? Well, the problem
is that we are soon not going to trust any videos because there are going to be so many of these
fake videos put out in order to try to make money on YouTube and so forth. And we're going to reach
a point where we just have no idea what to believe or not.
And also, like politicians could say that something was fake when it wasn't actually fake.
Maybe they actually said that thing and got caught saying it on camera.
That's going to be a very common thing, right?
Essentially, the evidential value of video is going to drop to zero.
There are some efforts. Adobe is
leading a very nice effort to try to watermark videos. So if you have the right camera, it will
actually give some kind of authentication signal. So there are some ways to deal with this, but I
don't think they're enough. I think they're like, you know, fingers in dikes, and there's just
going to be an enormous amount of this stuff. And, you know, the only solutions I see that work even a little bit are to have governments demand that any AI-generated content is labeled as such.
And maybe we will get laws passed that do that.
I think the EU AI Act is somewhat in that direction.
It is adopted. Congratulations.
The European Union's parliament passing a draft law restricting AI, limiting the use of facial recognition software, and requiring AI companies to disclose more about the data behind their programs.
Even that's not going to be enough because a lot of cheaters are going to try to evade whatever protections we have.
So it's going to be like gun laws where if you actually catch somebody, you can do something about it, but it doesn't mean that it's literally impossible for somebody to secure a gun,
somebody that shouldn't have one.
We are going to be in a kind of cat and mouse chase forever on this stuff.
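For readers curious what the Adobe-led watermarking effort (the Content Authenticity Initiative, built on the C2PA standard) looks like mechanically: the camera signs the footage at capture time, and any later edit breaks the signature. The real standard embeds certificate-based signed manifests in the file; the sketch below is a deliberately simplified stand-in using a shared-secret HMAC from Python's standard library, just to illustrate the sign-at-capture, verify-later idea:

```python
# Toy sketch of camera-side content authentication, loosely in the spirit of
# C2PA/Content Credentials. Real systems use public-key certificates and
# signed manifests embedded in the media file; this stand-in uses a
# shared-secret HMAC only to show the core idea.
import hashlib
import hmac

# Hypothetical device secret; a real camera would hold a private key instead.
CAMERA_SECRET = b"device-private-key-stand-in"

def sign_video(video_bytes: bytes) -> str:
    """Tag the camera attaches at capture time."""
    return hmac.new(CAMERA_SECRET, video_bytes, hashlib.sha256).hexdigest()

def verify_video(video_bytes: bytes, tag: str) -> bool:
    """Any later change to the bytes invalidates the tag."""
    return hmac.compare_digest(sign_video(video_bytes), tag)

original = b"...raw video bytes..."
tag = sign_video(original)
print(verify_video(original, tag))          # True: untouched footage
print(verify_video(original + b"x", tag))   # False: tampered or regenerated
```

The limitation Marcus points to is visible even in this toy version: the scheme can prove a signed video is authentic, but it cannot do anything about the flood of unsigned video, which is why he expects a cat-and-mouse game rather than a clean fix.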
This idea that governments might be able to step in and regulate here and label everything generated by AI,
even if they took a real run at that, do you even have confidence that they can do that? I keep thinking about that moment several years back when Mark Zuckerberg
was in front of, you know,
lawmakers and, you know,
one of the lawmakers asked him,
like, essentially,
how does Facebook make its money?
We believe that we need to offer
a service that everyone can afford
and we're committed to doing that.
Well, if so,
how do you sustain a business model
in which users don't pay
for your service?
Senator, we run ads.
I see.
And you're like, oh, my God, they don't even understand how it works.
And so I guess the thing with AI is that it strikes me, though, that the people who made the AI don't even understand how it works.
So first of all, nobody fully understands current AI.
We can do kind of empirical science to say, in this kind of circumstance, what does this
system do?
But nobody understands it in the way that we understand a simple set of equations where
we can just plug in the numbers.
We have to actually run these models, which is expensive.
It takes millions or even billions of dollars to train this,
and then it takes a lot of money to run it.
And you can run experiments on it,
like it's an alien from another planet,
but it's not like we really fully understand
exactly how it works.
Nobody does, not at the companies, not in the government,
not outside scientists like myself,
which is cause for concern, right? We have the engineering behind bridges and
airplanes and so forth well under control. We don't have the engineering well under control around AI.
Then there's a sort of milder sense, which is like how many people in the Senate understand
the basic notion that you train on a large set of data, that the quality
of your results depends on the amount of that data, the quality of the data, even basic stuff
like that. And probably, it's a lot less than 100%, but it's a lot higher than zero.
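The "more data, better results" relationship Marcus alludes to is usually modeled as a power law in the scaling-laws literature (e.g., Kaplan et al., 2020). A minimal sketch, with invented constants purely for illustration rather than fitted values from any real model:

```python
# Minimal illustration of a data scaling law: loss falls as a power law in
# dataset size, L(D) = (D_c / D) ** alpha. Both constants are made up
# for illustration only.

D_C = 5e9       # hypothetical critical dataset size (tokens)
ALPHA = 0.095   # hypothetical scaling exponent

def loss(dataset_tokens: float) -> float:
    return (D_C / dataset_tokens) ** ALPHA

for tokens in [1e10, 1e11, 1e12, 1e13]:
    print(f"{tokens:.0e} tokens -> loss {loss(tokens):.3f}")

# Each 10x increase in data multiplies the loss by the same constant factor
# (10 ** -ALPHA), so quality keeps improving but with diminishing returns --
# the "headroom" question raised earlier in the conversation.
```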
I don't know if that gives me...
Gives you great hopes?
...if I find that reassuring.
You should be only slightly reassured.
You should have that "well, I guess it could be worse" reaction.
Yeah.
Oh, great.
That's always good.
Yeah.
And I guess also, too, you know, this stuff is coming from all over the world.
So it's hard.
How do you kind of regulate a group of people who are operating out of Moscow too, right?
There are those problems too.
So, you know, the U.S. has to get its own house in order, but then what do you do globally?
I have been advocating for some kind of global governance.
And there are questions about, like, would any country give up its sovereignty in order to be part of some global thing looking out for the dangers of AI?
I think the answer is actually yes.
You know, we've done that, for example, for nuclear weapons.
It's hard to negotiate these treaties.
They take like decades, not weeks, or a decade, I should say, you know, several years to negotiate
treaties like this.
But I think every country, I mean, Russia is on its own little axis, so I don't even
know what to say about Russia.
But China, I've actually talked to a bunch of people from China, including fairly high academics that are in close contact with the government.
China, you know, wants the same things that we do,
which is they want an orderly universe.
They may have a different definition of that than we do,
but they want an orderly universe.
They want their citizens to be safe and so forth.
So like, you know, China doesn't want its citizens
to all be robbed by cybercriminals either, right?
Right. I mean, right. Well, I imagine too that technology out of control
could threaten the leadership, the Chinese government, right?
Like, every government
should actually be worried about what Ian Bremmer calls a technopolar world, which would be a world
in which, you know, most of the power is in the hands of a
few unelected tech companies and not the governments themselves. We're already drifting towards that, right?
Well, talking about, yeah, talking about those tech companies, you know, what about
what they're doing, like the companies that are kind of leading the way here. So OpenAI and Sam
Altman, the CEO of OpenAI, would say that they're putting up guardrails and that they're doing this really carefully.
And what do you think?
So it's very difficult with current technology to build guardrails that really work.
So you have two problems.
One is that they can be too restrictive.
So people start complaining that the systems are too politically correct or too lazy and so forth, as they won't engage in perfectly
reasonable requests. Like I asked the system, what would be the religion of the first Jewish
president? And the guardrail intervened and said, well, it's impossible to tell what the religion
of the first Jewish president would be. That's just absurd, right? So there you have a guardrail that
is too restrictive, and sometimes they're too loose. So I testified in front of the U.S. Senate last May.
For my Senate testimony, I made, or had a friend actually make, an illustration in which there was a news story purporting that U.S.
senators were involved in a conspiracy with space aliens to keep human beings a one-planet species.
And the guardrails were not enough
to stop that whatsoever. So the truth is that current AI does not really understand humans,
does not really understand the world. And so the guardrails are hit or miss.
Gary, just to end this conversation, you've been working in AI for decades now. And I know that you have said that this kind of use of AI didn't even occur to
you. And I'm just curious to know, what did you imagine might be the uses for this technology?
Like, where did you think we would be right now?
You know, in some ways we're behind.
I thought maybe we would have the Star Trek computer by 2024.
And then the other side is I thought it would be used for good, like it would solve medicine.
You know, like I thought by now we would have solved Alzheimer's.
Come on.
Like, you know, many, many neuroscientists have worked on Alzheimer's, many companies
and so forth.
We can't seem to get those questions right. You'd think, well, maybe AI could help us
with that. But instead, the number one, you know, sexiest application of AI right now is making
videos that are kind of like screwing artists whose materials are being used to train them.
Like, that's not such a pro-social application. And I didn't go into AI to make some money by ripping off artists.
That's not why I'm in this field.
Too much of it right now has been about making a quick buck without, I think, enough concern
for the ethical consequences of what these systems are used for.
Gary, thank you so much for this.
That was really eye-opening.
Thank you for giving me a chance to say it.
And I really appreciate being back. Thanks for having me again.
All right. That is all for today. I'm Jamie Poisson. Thanks so much for listening.
Talk to you tomorrow.