The a16z Show - Beyond Uncanny Valley: Breaking Down Sora
Episode Date: February 24, 2024

In early 2024, the notion of high fidelity, believable AI-generated video seemed a distant future to many. Yet, a mere few weeks into the year, OpenAI unveiled Sora, its new state of the art text-to-video model producing videos of up to 60 seconds. The output shattered expectations, even for other builders and researchers within generative AI, sparking widespread speculation and awe.

How does Sora achieve such realism? And are explicit 3D modeling techniques or game engines at play?

In this episode of the a16z Podcast, a16z General Partner Anjney Midha connects with Stefano Ermon, Professor of Computer Science at Stanford and key figure at the lab behind the diffusion models now used in Sora, ChatGPT, and Midjourney. Together, they delve into the challenges of video generation, the cutting-edge mechanics of Sora, and what this all could mean for the road ahead.

Resources:
Find Stefano on Twitter: https://twitter.com/stefanoermon
Find Anjney on Twitter: https://twitter.com/anjneymidha
Learn more about Stefano’s Deep Generative Models course: https://deepgenerativemodels.github.io

Stay Updated:
Find a16z on Twitter: https://twitter.com/a16z
Find a16z on LinkedIn: https://www.linkedin.com/company/a16z
Subscribe on your favorite podcast app: https://a16z.simplecast.com/
Follow our host: https://twitter.com/stephsmithio

Please note that the content here is for informational purposes only; should NOT be taken as legal, business, tax, or investment advice or be used to evaluate any investment or security; and is not directed at any investors or potential investors in any a16z fund. a16z and its affiliates may maintain investments in the companies discussed. For more details please see a16z.com/disclosures.
Transcript
Yeah, honestly, I was very, very surprised.
I mean, I know the two of us often talk about how quickly the field is moving, how hard it is to keep track of all the things that are happening.
And I was not expecting a model so good coming out so soon.
We generally converged on, it was going to be a when, not an if.
I thought it was maybe six months out, a year out.
So I was shocked when I saw those videos, the quality of the videos, the length, I mean, the ability to generate
60-second videos. I was really amazed.
This is obviously the worst that this technology will ever be,
almost by definition.
Right?
We're at the earliest stages of progress here.
I've always felt that that is one of the secret weapons of diffusion models
and why they are so effective in practice.
If you had asked many people at the beginning of 2024
when we would get high-fidelity, believable AI-generated video,
most would have said that we were years away.
But on February 15th,
OpenAI surprised the world with examples from their new model,
Sora, bringing those predictions down from years to weeks.
And of course, the emergence of this model, with its impressive modeling of physics and videos of up to 60 seconds,
has spurred much speculation around not only how this was accomplished, but also how it happened so soon.
And although OpenAI has stated that the model uses a transformer-based diffusion model,
the results have been so good that some have even questioned whether explicit 3D
modeling or a game engine was involved. So naturally, we decided to bring in an expert.
Sitting down with a16z general partner Anjney Midha is professor of computer science at Stanford
Stefano Ermon, whose group pioneered the earliest diffusion models and their applications in generative
AI. Of course, these approaches laid the foundation of the very diffusion models deployed in Sora,
not to mention other household names like ChatGPT and Midjourney. And perhaps most importantly,
Stefano has been working on generative AI for more than a decade, long before many of us had even an inkling of what was to come.
So throughout this conversation, Stefano breaks down why video has historically been much harder than its text and image counterparts, how a model like Sora might work, and what all of this could mean for the road ahead.
And of course, if you want to stay updated on all things AI, make sure to go check out a16z.com/ai.
Enjoy.
As a reminder, the content here is for informational purposes only, should not be taken as legal, business, tax, or investment advice, or be used to evaluate any investment or security and is not directed at any investors or potential investors in any a16z fund.
Please note that a16z and its affiliates may also maintain investments in the companies discussed in this podcast.
For more details, including a link to our investments, please see a16z.com/disclosures.
This is like a conversation I've wanted to have with you for a while.
We've been talking about flavors of this conversation for a long time, but I think given
how quickly things are heating up in this space, it sounded like a good time for us to check in
and see how many of the assumptions we've been talking about, about the future of diffusion
models and video models are tracking. You're the world's expert in this area of research.
And so I think it'd just be great to start with your lab's sort of involvement in the origins.
Yeah, I'm excited to be here. Hi, everyone. My name is Stefano, and I'm a professor of computer
science at Stanford. I work in AI and I've been actually working in generative AI for more than 10
years. Way before these things were cool. I teach a class at Stanford on deep generative models.
It's something I started back in 2018. And I think it was the first in the world on this
topic. And yeah, I encourage you to check out the website. It's called CS236. There's lots of materials
if you want to dig deeper into how these methods work. And I've been doing research in generative
models for a long time, as you mentioned, with my former student, Yang Song, who is now at OpenAI,
who did some of the early work on diffusion models, or score-based models, as we used to call them
back then. Back in 2019, at the time, generative models of images, video, audio, these kinds of
continuous data modalities, were really dominated by GANs, generative adversarial networks.
And we were really the first to show that it was actually possible to beat GANs at their own game
using this new class of generative models called
diffusion models, where we essentially generate content,
we generate images by starting from pure noise
and progressively denoising it using a neural network
until we turn it into a beautiful sample.
We developed a lot of the theory behind these models,
how to train them, how to use score matching,
a lot of the initial architectures,
and some of those choices are still around today.
And I think that work really kick-started
a lot of the exciting things we're seeing today
around diffusion models,
Stable Diffusion, Sora, of course.
And in addition to that early work on the foundation of diffusion models,
we've worked on a number of other aspects of diffusion models, like DDIM,
which is pretty widely used, an efficient sampling procedure that allows you to generate images very quickly
without too much loss in quality.
With my student Chenlin Meng, who is the CTO of Pika Labs,
we developed SDEdit, which is one of the first methods to do controllable generation,
to generate images based on sketches and things like that.
So, yeah, I'm excited to be here today and discuss what's coming next.
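The sampling idea described in this answer (start from pure noise and progressively denoise it) can be illustrated with a toy sketch. This is not the sampler used by Sora or Stable Diffusion, and it is not the DDIM update itself. The neural network is replaced by the exact score of a one-dimensional Gaussian, and the names `DATA_MEAN`, `DATA_STD`, `score`, `noise_schedule`, and `sample`, along with the noise levels, are assumptions made purely for illustration. The update is a deterministic Euler step, so no fresh noise is injected along the way.

```python
import math
import random

DATA_MEAN, DATA_STD = 3.0, 0.5      # toy "data distribution" (assumed)
SIGMA_MAX, SIGMA_MIN = 50.0, 0.01   # largest and smallest noise levels (assumed)

def score(x, sigma):
    """Exact score of the toy data distribution blurred with noise of std sigma.

    In a real diffusion model this function is a trained neural network.
    """
    return -(x - DATA_MEAN) / (DATA_STD ** 2 + sigma ** 2)

def noise_schedule(n_steps):
    """Geometrically decreasing noise levels from SIGMA_MAX down to SIGMA_MIN."""
    ratio = (SIGMA_MIN / SIGMA_MAX) ** (1.0 / (n_steps - 1))
    return [SIGMA_MAX * ratio ** i for i in range(n_steps)]

def sample(n_steps=200, rng=random):
    """Start from pure noise and denoise it with deterministic Euler steps."""
    sigmas = noise_schedule(n_steps)
    x = rng.gauss(0.0, SIGMA_MAX)  # pure noise
    for cur, nxt in zip(sigmas[:-1], sigmas[1:]):
        # Probability-flow ODE for this noise model: dx/dsigma = -sigma * score(x, sigma)
        x += (nxt - cur) * (-cur * score(x, cur))
    return x

rng = random.Random(0)
samples = [sample(rng=rng) for _ in range(2000)]
mean = sum(samples) / len(samples)
std = math.sqrt(sum((s - mean) ** 2 for s in samples) / len(samples))
print(f"sample mean {mean:.2f}, sample std {std:.2f}")
```

If the sketch behaves as intended, the printed mean and standard deviation should land close to `DATA_MEAN` and `DATA_STD`, even though every sample started out as noise with a standard deviation of 50.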
So given all of that work, all of the experience you've had in really the ground truth of
diffusion models and their limitations, what was your reaction to seeing a model like Sora
come out last week?
Yeah, honestly, I was very, very surprised.
I mean, I know the two of us often talk about how quickly the field is moving, how hard it
is to keep track of all the things that are happening.
And I was not expecting a model so good coming out
so soon. I mean, I don't think there is anything fundamentally impossible and I knew it was coming.
It was just a matter of time and just more research, more investments, more people working on
these things. But I was not expecting something that good happening so soon. I thought it was
maybe six months out, a year out. So I was shocked when I saw those videos, the quality of the
videos, the length, I mean, the ability to generate 60 second videos. I was really amazed.
Yeah, I do think every time we've talked in the past, we've generally converged on, it was going to be a when, not an if, that we'd see video generation get so good.
And so it's reassuring to hear you were as surprised as I was when it came out.
But before we get into the details on what maybe some of the breakthroughs were there on this time frame, maybe we can just spend a few minutes talking about video diffusion 101 for folks who may not be as familiar with these models.
Why has video diffusion been so much more complex than text or image generation,
and what's historically been the main blocker
in making it work?
Yeah, that's a great question.
At a very high level,
you can think of a video
as just a collection of images.
So really, the first challenge you have to deal with
is that you're generating multiple images
at the same time.
And so the compute cost that you need to process
N images is at least N times larger
than what you would pay
if you just want to process one of them at a time.
And basically, this means a lot more compute,
a lot more memory,
and it's just much more expensive
to train a large-scale model on video data.
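To put rough numbers on that scaling, here is a back-of-the-envelope sketch. The frame rate, resolution, and 8-bit RGB encoding are assumptions chosen for illustration, not details of Sora or of any particular training setup, and `clip_cost` is a hypothetical helper.

```python
def clip_cost(seconds, fps, height, width, channels=3):
    """Return the frame count and raw size in bytes of an uncompressed clip.

    Assumes one byte per channel value (8-bit color).
    """
    frames = seconds * fps
    raw_bytes = frames * height * width * channels
    return frames, raw_bytes

# Assumed example: a 60-second clip at 24 frames per second in 1080p
frames, raw_bytes = clip_cost(seconds=60, fps=24, height=1080, width=1920)
single_image_bytes = 1080 * 1920 * 3

print(f"{frames} frames, {raw_bytes / 1e9:.2f} GB of raw pixels")
print(f"{raw_bytes // single_image_bytes}x the data of one image")
```

Under these assumptions a single one-minute clip carries as much raw data as 1,440 images, which is the lower bound on the extra cost mentioned above.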
The other challenge is just a data challenge.
I think a lot of the success we've seen in diffusion models for images
was partially due to the availability of publicly available datasets like LAION,
large-scale image and caption datasets.
They were scraped from the internet and they were made available
and people could use them to train large-scale models.
I think we don't quite have that for video.
I mean, there is a lot of video data,
but the quality is kind of like a mixed bag
and we don't have a good way to
filter or screen them
or there's not a go-to data set
that is available and everybody's using
to train these models.
I'm guessing some of the innovations
that went into the Sora model
were actually on just selecting good quality data
to train the models on.
Captions are also hard to get for video.
I mean, the video data is out there,
but getting good labels,
good descriptions of what's happening in the videos
is challenging and you need that
if you want to have good control
over the kind of content that you generate with these models.
And then there is also the challenge of video content.
It's just more complex.
There is more going on if you think about a sequence of images
as opposed to just one. There are complex relationships
between the frames.
There is physics.
There is object permanence.
And in principle, I think a high capacity model
with enough compute, enough data,
can potentially learn these things.
But it was always an empirical question,
how much data are you going to need?
How much compute are you going to need?
When is that going to happen?
Is the model really going to discover all these high-level concepts and statistics of the data,
essentially?
And it was surprising to see that it's doing so well.
You just laid out very clearly what the general obstacles have been,
both on architecture, on data set, on representations of the world via video as a format.
Since the release came out last week, there's been a lot of speculation around how this model
can achieve such impressive results.
Some folks were even speculating that there might be a game engine or 3D model,
sort of explicit 3D modeling or geometry involved in the inference pipeline.
But in their article, OpenAI describes the approach by saying that they train text-conditional
diffusion models jointly on videos and images with different durations and resolutions, and then
apply a transformer architecture on spacetime patches of video and image latent codes.
And so could you just break down in layman's terms, for folks who might not be as familiar
with scaling laws, what's going on here?
Sure, yeah, I can try.
I mean, there is certainly some secret sauce here.
and I can try to read between the lines
of what they said in their release.
The idea of training on videos and images
is not new.
It seems like one technical difference
that they are hinting at
is the use of a transformer architecture
for the backbone, for the denoiser,
for the score model.
People often used convolutional architectures,
going back to the days when Yang
initially started using U-Nets as the score model,
which was actually a key innovation
that really enabled a lot of the success on images,
and people kind of ported those architectures over to video data as well,
because they make sense.
We expect that there is a lot of convolutional structure,
so a convolutional architecture might be a good idea.
It seems like they moved on to a purely transformer-based architecture,
probably following the work by Saining Xie at NYU,
who did some of the initial work in this space,
developing good transformer architectures,
like ViT-based architectures, for diffusion models.
It's possible that that gives you better scaling with respect to compute and data and just happens to work better.
They are also referring to latent codes, so it seems like it's unlikely that they're working directly in pixel space.
Working on latent representations was one of the key things, the key innovations behind stable diffusion, latent diffusion,
like the idea of first compressing the data into a latent representation that is a little bit smaller or a little bit more compact.
We expect that there is a lot of redundancy
if you think about the different frames in a video,
and so it might be possible to compress it almost losslessly
into a lower dimensional representation.
And if you can do that,
then you can kind of train on this lower dimensional representation
and you can get much better trade-offs
in terms of like the compute that you pay
and the kind of memory that you need to process the data.
So they might have figured out a better way to encode video
to a semantically meaningful kind of like lower-dimensional latent space.
I would say this doesn't rule out
the possibility that they've used game engines
or 3D models to generate
training data. I mean, as we discussed before,
I think the quality of the training data is
really crucial. And it's
possible that they've used synthetic data
generated by game engines
or NeRF-like 3D models
to generate the kind of data
they want to see where there is a lot of motion.
They can probably use the internals of the
engine to get very good captions
about what's happening in the video, so they can
get a very good match between the text
and the content that they're trying to generate.
So, I mean, it does seem like it's a purely data-driven approach,
but it's possible that they've used the other pipelines
to generate synthetic data.
And when you contrast the diffusion transformer approach that they took
with sort of the prior generations of many flavors of generative models,
whether it was recurrent nets, GANs, or just vanilla autoregressive transformers,
why is it that a diffusion model here seemed, again,
to be sort of the uniquely suited best tool for the job?
I think prior to diffusion models, yeah,
people were using GANs, generative adversarial networks.
GANs are pretty good.
They're pretty flexible, but one challenge is they're very unstable to train.
So that was actually one of the main reasons.
We developed diffusion models in the first place.
We wanted to retain the flexibility of basically an arbitrary neural network
as the backbone, but a more principled,
statistical way of training the models that leads you to a stable training loss
where you can just keep training and the model
just keeps getting better and better.
Autoregressive models also have that property,
and they're trying to compress the data,
and with enough capacity and compute,
in principle, they can do a pretty good job
of modeling anything, including video.
They just tend to be very slow
because you have to generate one token at a time.
And if you think of video,
there are a lot of tokens,
and they've not been the model of choice for that reason.
Diffusion models, on the other hand,
can generate tokens essentially in parallel,
and so they can be much faster,
and that's one of the reasons they are preferred, I think, for these kinds of modalities.
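The speed difference comes down to how many sequential model calls each approach needs. The sketch below only counts calls. The two "models" (`toy_next_token` and `toy_denoise`) are hypothetical stand-ins that exist just to make the loops runnable, and the token and step counts are assumed values.

```python
import random

def generate_autoregressive(num_tokens, next_token):
    """Generate tokens one at a time; each new token needs its own model call."""
    tokens, calls = [], 0
    for _ in range(num_tokens):
        tokens.append(next_token(tokens))
        calls += 1
    return tokens, calls

def generate_diffusion(num_tokens, num_steps, denoise, rng):
    """Start from noise; each model call refines every token at once."""
    tokens = [rng.gauss(0.0, 1.0) for _ in range(num_tokens)]
    calls = 0
    for step in range(num_steps):
        tokens = denoise(tokens, step)
        calls += 1
    return tokens, calls

# Hypothetical stand-in "models": they do no real modeling
def toy_next_token(prefix):
    return len(prefix)

def toy_denoise(tokens, step):
    return [0.9 * t for t in tokens]

NUM_TOKENS, NUM_STEPS = 10_000, 50  # assumed sizes for illustration
_, ar_calls = generate_autoregressive(NUM_TOKENS, toy_next_token)
_, diff_calls = generate_diffusion(NUM_TOKENS, NUM_STEPS, toy_denoise, random.Random(0))
print(f"autoregressive: {ar_calls} sequential calls, diffusion: {diff_calls}")
```

Each diffusion call touches all tokens, so a single call is more expensive than a single autoregressive step. The point of the sketch is only that the number of steps that must happen one after another no longer grows with the number of tokens.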
Another more philosophical reason is that if you think about a diffusion model, in some sense,
at inference time you have access to a very deep computation graph where essentially you can
apply a neural network maybe a thousand times or if you take a continuous time perspective,
like a differential equation perspective, it can even be an infinitely deep kind of computation
graph that you can use to generate content, while at the same time you don't have to
unroll the entire computation graph at training time because the models are trained by score
matching and there is this clever way of kind of like trying to make the model better and better
without ever having to pay a huge price at training time. And so I've always felt that that is
one of the secret weapons of diffusion models and why they are so effective in practice,
because they allow you to use a lot of resources at inference time without having to pay that price
during training time. And to our earlier point about why did the Sora breakthrough happen so much
faster than many expected, it sounds like the stability of the diffusion transformer model
and being able to swap out essentially training time for inference time, which is much cheaper,
much more parallelizable, much more efficient, was a big contributor to compressing the training
times here.
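The point about score matching, that training never has to unroll the long inference-time computation graph, can be sketched with a toy denoising score matching loop. Everything here is an assumption made for illustration: the data is a one-dimensional Gaussian, the "network" `model_score` has a single parameter `theta`, and the learning rate, batch size, and noise range are arbitrary. The thing to notice is that every training example is noised to one randomly chosen noise level and the model is evaluated once on it.

```python
import math
import random

DATA_MEAN, DATA_STD = 3.0, 0.5  # toy data distribution (assumed)

def model_score(x, sigma, theta):
    """One-parameter stand-in for the score network; theta plays the role of the weights."""
    return -(x - theta) / (DATA_STD ** 2 + sigma ** 2)

def dsm_grad(theta, rng, batch_size=32):
    """Gradient of the denoising score matching loss on one random minibatch."""
    grad = 0.0
    for _ in range(batch_size):
        x0 = rng.gauss(DATA_MEAN, DATA_STD)  # clean training example
        # A single random noise level per example: no chain of denoising steps is unrolled
        sigma = math.exp(rng.uniform(math.log(0.1), math.log(10.0)))
        eps = rng.gauss(0.0, 1.0)
        x = x0 + sigma * eps                 # noised example
        target = -eps / sigma                # score of the Gaussian noising kernel
        residual = model_score(x, sigma, theta) - target
        # loss = sigma^2 * residual^2, and d(model_score)/d(theta) = 1 / (DATA_STD^2 + sigma^2)
        grad += 2.0 * sigma ** 2 * residual / (DATA_STD ** 2 + sigma ** 2)
    return grad / batch_size

rng = random.Random(0)
theta = 0.0
for _ in range(3000):
    theta -= 0.05 * dsm_grad(theta, rng)  # plain stochastic gradient descent
print(f"learned theta: {theta:.2f}")
```

In this toy setting `theta` should drift toward `DATA_MEAN`. At inference time the same learned score could then be applied hundreds of times in a sampling loop, which is the trade-off described above: a deep computation at generation time, a shallow one per training step.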
Yeah.
And there is a question of what's the backbone, right?
That could be a convolutional network.
It can be a state-space model.
That could be a transformer.
There, I think we're still just scratching the surface in terms of what works and what doesn't,
and all kinds of combinations are possible.
Right.
You can build autoregressive models that are convolutional,
you can build autoregressive models that are based on transformers.
You can build autoregressive models that are based on state-space architectures.
And similarly, you can build diffusion models that are based on convolutional architectures,
which is what people tend to do.
And now it seems like maybe OpenAI is pushing towards,
no, let's just use a transformer as the backbone in a diffusion model.
And I'm starting to see people exploring state-space models,
which might allow for very long contexts, for example.
So I think there is an exciting space of different
kinds of combinations that we can try that might give us better scaling, better properties
for really getting to the kind of quality we want to see from these models.
One of the most elegant parts of the transformer backbone architecture is that it works really
well with the idea of tokenization, right? And in language models, so much of what made the scaling
laws work, and allowed models like GPT-3 and GPT-4 to be developed so quickly and generalize to
all kinds of tasks, was that the process of tokenizing
language allows almost like a transformation or translation into a format that the models can
understand across many, many different types of language, whether that's good old-fashioned
English or it's code or it's health records or it's, in some cases, multilingual
datasets. And so the beauty of tokenization is it's like this one-size-fits-all process of
turning language data into a format that the transformer backbone really understands well and is
able to learn on. It seems like
there was a similar key unlock around how visual data was broken down here into small
patches, right?
They essentially tokenized image and video data into this sort of intermediate representation
of a patch.
Is that why this approach created meaningfully better output than other models we've seen in the past?
That's a great question.
And honestly, I don't know the answer.
I think tokenization makes a lot of sense for discrete data, like code or text.
I'm less of a fan of tokenization or patchifying images and
videos and audio. I feel like you have to do it if you want to use a transformer architecture,
but it makes less sense to me just because the data is continuous and the patch is very arbitrary,
and you're losing some of the structure there by going through a tokenization.
You kind of have to do it if you want to use transformers, and transformers are great because
they scale well and we have very good implementations and they are very hardware friendly.
And so maybe that's the way to go.
It's the bitter lesson once again.
But it feels like what they have is, again, some kind of latent representation,
and maybe once you go to a latent space, then tokenizing it might make more sense
because you've already lost a lot of the structure.
And so it may be a combination of both.
I mean, it seems like they might have access to a very good latent space
where they get rid of a lot of the kind of redundancy that exists in natural data,
especially in videos, where two frames next to each other are very similar, right?
So there is a lot of redundancy in videos.
And if they got rid of some of that through a clever encoding scheme and then applied tokenization,
I think that starts to make more sense and makes things more scalable, with less compute, less memory, and just better.
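For readers who have not seen "patchifying" before, here is a minimal sketch of cutting a small video-like grid into spacetime patches, each of which would become one token for a transformer. OpenAI has not published the details of its patching scheme, so the function `patchify`, the patch sizes, and the single-channel toy "latent video" are all assumptions for illustration.

```python
def patchify(video, pt, ph, pw):
    """Split a [T][H][W] grid of values into flattened spacetime patches.

    Each patch spans pt frames, ph rows, and pw columns.
    """
    T, H, W = len(video), len(video[0]), len(video[0][0])
    if T % pt or H % ph or W % pw:
        raise ValueError("patch size must evenly divide the video dimensions")
    patches = []
    for t0 in range(0, T, pt):
        for h0 in range(0, H, ph):
            for w0 in range(0, W, pw):
                patches.append([
                    video[t][h][w]
                    for t in range(t0, t0 + pt)
                    for h in range(h0, h0 + ph)
                    for w in range(w0, w0 + pw)
                ])
    return patches

# Tiny 4-frame, 4x4 toy "latent video"; each value encodes its own (t, h, w) position
video = [[[100 * t + 10 * h + w for w in range(4)] for h in range(4)] for t in range(4)]
tokens = patchify(video, pt=2, ph=2, pw=2)
print(len(tokens), "tokens of length", len(tokens[0]))
```

The arbitrariness mentioned above is visible even here: where the patch boundaries fall depends only on the chosen patch size, not on what is in the video.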
We've seen so many text-to-video models, but very few were able to actually generate longer-form videos, right, more than a few seconds.
And there were often issues with temporal coherence and consistency, even across those short-form generations of three to five seconds.
And here we've got a model that's able to do one-minute-long, sort of 60-second generations.
And arguably some of the long-form generations are actually dramatically better than the short-form generations from Sora too.
You start to see these emergent properties of temporal coherence only in the 60-second clips, right?
What's going on there?
What are they doing differently that enables these videos to have such amazing continuity for long lengths and temporal coherence of the subjects across those lengths?
I think that was the most surprising aspect of Sora, this ability to generate long content
that is so coherent and consistent and beautiful.
I think that was the part that really amazed me
because I know it's very hard to do,
because exactly,
you have to keep track of a lot of things to make things consistent
and the model doesn't know what's important to keep track of
and what's not.
And somehow the model they've trained seems to be able to do it.
Again, it's not entirely surprising
because at the end of the day,
these models are trained to essentially compress the training data.
And so if you have high-quality
training data where there are transitions, and of course, the content is consistent with physics
and has the right properties that we would expect a real, natural, good quality
video to have, then in order to compress the data as effectively as possible, the model should
learn about physics, should learn about object permanence, should learn about 3D geometry and all
of that. What's surprising is that there are many other kinds of shallower correlations that
a model could discover. And what was surprising to me is that it seems like it's really able to learn
some of that. We don't know why. I think it's one of the mysteries of deep learning. It's probably a
combination of training data, the right architectures, and scale, but it was amazing. And the 3D
properties in their videos emerge without any explicit inductive biases for 3D objects, right? They're
purely phenomena of scale. What does that mean? Is physics just an emergent property? Well, it's not
inconceivable that at the end of the day, physics is a framework that can help you understand
the world. It helps you make better predictions. Like if I understand Newton's law, I can make
predictions about what's going to happen if I drop an object. And it's a very simple formula that
allows me to make a lot of different predictions. So if I'm being tasked to compress a lot of videos,
if I know Newton's law, if I knew some physics, I can probably do a better job at predicting
what the next frame is going to look like, right?
And at the end of the day, these models,
although they are trained by score matching,
it's possible to relate in a very formal sense,
the training objective that we're using to train a diffusion model
to a compression-based objective,
literally just trying to compress video data in this case as much as you can.
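The link between knowing some physics and compressing video can be made concrete with a toy example. The sketch below compares two next-frame predictors on the height of a dropped object. The smaller the prediction error, the less information is left over to store. The setup (a 10 meter drop, 24 frames per second, no air resistance) and the function names are assumptions for illustration and say nothing about how Sora works internally.

```python
G = 9.81          # gravitational acceleration in m/s^2
DT = 1.0 / 24.0   # time between frames (assumed 24 frames per second)

# Height of a dropped object in each frame of a one-second toy clip
heights = [10.0 - 0.5 * G * (i * DT) ** 2 for i in range(25)]

def mean_abs_error(predict):
    """Average error when predicting each frame from the two frames before it."""
    errors = [abs(predict(heights[i - 2], heights[i - 1]) - heights[i])
              for i in range(2, len(heights))]
    return sum(errors) / len(errors)

def copy_last_frame(prev2, prev1):
    """Shallow predictor: assume nothing moves."""
    return prev1

def newton(prev2, prev1):
    """Physics-aware predictor: constant acceleration G between frames."""
    return 2.0 * prev1 - prev2 - G * DT ** 2

copy_error = mean_abs_error(copy_last_frame)
newton_error = mean_abs_error(newton)
print(f"copy-last-frame error: {copy_error:.4f} m, Newton error: {newton_error:.2e} m")
```

The physics-aware predictor leaves essentially nothing to encode beyond floating-point rounding, while the shallow one leaves a residual in every frame, which is the sense in which a short formula buys a lot of compression.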
And so it's quite possible that knowing something about physics,
knowing something about camera views and 3D structures of objects
and object permanence, all these kinds of properties, is helpful
in compressing data because they reveal structure that is helpful to make predictions,
which means you can compress the data better. What's exciting is that it emerges just by training
the model, right? You could imagine other kinds of shallower correlations that exist in the training
data, but they're not as useful or as predictive as Newton's laws or a real physics
understanding of the scenes and what's going on. And it's hard to tell what exactly
is going on. It's possible that there is no real understanding of physics, but it's certainly
very effective, right? And at the end of the day, maybe that's enough. What you seem to be pointing
out there is if the data that these models are trained on is kind of like their diet, and you are what
you eat, in a sense. If you're eating a ton of physics, then you are better at physics as a model.
Is that how we should begin to sort of explain other emergent properties? There was a clip that
they shared called Bling Zoo, which was a single-prompt generated video that had multiple transitions
baked in without any editing and so on. It had multiple camera shots. It almost seemed like somebody had
manually stitched together different camera angles, right?
How should we explain that?
Is that just essentially it mimicking what it's seen in the training data?
Is there something else going on?
In that case, I'd imagine that, yes.
If you've trained it on high-quality video data
where you can see these transitions across different kinds of shots,
again, what these models do is they try to understand
what all the training videos have in common,
what is the high-level structure of these videos,
and then they try to replicate that.
So a sufficiently good model
might understand that the videos in the training set,
they tend to have this structure where we transition across different views and shots,
and then we combine them and they're combined in interesting ways,
and then it's able to replicate.
Again, what's magic here is that it's provably an impossible task in general, right?
There is so many other ways of interpolating between the things you see in the training set,
and most of them are wrong, right, are generalizations that you don't want to see.
Right.
And somehow these deep neural networks are able to find
interpolations or generalizations that are the ones we want,
the ones that make sense.
And they discover the kind of structure that we want the model to replicate,
as opposed to the ones that are there by chance,
which are not the kind of structure that we want them to pick up.
And that's what's amazing and mostly unexplained at this point.
We do not understand why this happens.
Right.
So we're here now in early 2024.
And the question of when video models will get good enough to cross the uncanny valley
has just been answered, right?
We've just gotten to that point.
So if we look ahead now,
Sora is still in beta,
but there are several other generative AI
video efforts out there already.
Realistically, how expensive do you think
it's going to be to generate AI video
at any kind of sort of consumer scale,
a readily available scale?
I'm sure this release from OpenAI
set off a lot of competitors to the races
and trying to catch up.
And I'm sure we'll see developments
coming from all the various competitors in this space.
I think there is training costs, which are huge.
I'm sure they used thousands of GPUs to train Sora,
and scale was a big part of the success they've seen.
And so it's definitely going to be out of reach for academics,
but there is going to be industry players who will have the resources
to try to compete with them and try to replicate what they did
or achieve similar results in a different way.
The good news is that now we have an artifact,
we have a system that can do it.
And so I think it's a lot easier to try to catch up.
There is a lot less uncertainty
about whether it's even possible.
Right now, we have an example.
It's feasible, and a lot of people will make the right investments
to really catch up.
I don't know how long it's going to take.
Maybe it's going to be six months.
Maybe it's going to be 12 months.
But somebody will come up with similar performance,
I would imagine, as we've seen with LLMs
and in other spaces where people catch up eventually.
The question is how far ahead will OpenAI be by then?
How much better will the system be in six months or in 12 months?
and that's hard to say.
The other question that I think you were hinting at is inference,
like how expensive is it going to be to serve these models
and provide video generation on demand to users or personalized videos,
like all these really cool applications that could emerge
from a really good video-generative model.
Again, there I'm pretty optimistic,
especially because the underlying architecture is a diffusion model.
Once you have a potentially big, expensive, large, clunky model
that can generate high-quality
results, there's been a lot of success in distilling these models down into smaller ones
that are almost as capable, but way faster. So I'm pretty optimistic that once we get to high
enough quality, it's going to be possible to get systems that can serve similar quality results
in a very inexpensive way. So I'm pretty excited to see the kind of crazy use cases people come
up with once this technology becomes available. When we talk about compute, training or inference is a
huge part of the calculus there. But there's this other bucket of costs that come from the
datasets required to actually train these models and get scaling laws to work. And in terms of training
data for language models, there's already billions of data points that these labs can get across
the web. But for video, a lot of the data, like you were saying earlier, even if it exists,
it's not particularly well labeled or captioned. And so how do you think video model teams will
overcome that challenge? We've seen recently that Reddit agreed to do a deal to license its
data to Google for $60 million.
Do you think we'll see video production studios
begin to license their content out?
That's a good question.
And first of all, it will probably depend a little bit
on whether you're thinking of startups versus
more established industry players.
I think startups might be willing to move fast and break things,
maybe be less concerned about copyright and then just
scrape from the internet and then train models and then
worry about licensing the data later.
Bigger players, their legal teams are very, very, very
worried about big lawsuits, and so they'll want to have something that is properly licensed.
It's going to be interesting to see whether, as you said, the studios or people who currently
own the content will be willing to license it, because it could be an existential threat to their
entire business model. I can see Reddit licensing their stuff because it's maybe not
as much of an existential threat. The other thing you mentioned is labeling. It's a great point. It's going to be a
big challenge. There I'm pretty optimistic, though, that people will be able to set up human-in-the-loop
pipelines. I mean, we've seen great success in visual language models, even video language models.
They are maybe not good enough to provide high-quality captions out of the box, but I could imagine
that they could drastically speed up a human-in-the-loop pipeline where they provide suggestions that
then get fixed or improved by human labelers. And so I'm pretty optimistic on the captioning
front that we will be able to find pretty scalable solutions to that.
Is the sort of logical path there to start with a human-in-the-loop implementation,
build a working model of annotations there, and then ultimately move towards synthetic captioning?
It seems like that's a direction that people are exploring and using,
and there's been a lot of success on kind of like enhancing captions in synthetic ways,
using LLMs.
And so, yeah, that would be my bet.
I mean, honestly, I don't know to what extent the bottleneck
is the captioning as opposed to actually having good high-quality videos. Even just having access
to the raw high-quality video data is non-trivial, and my understanding is that it's
actually one of the bottlenecks that we will have to solve first. Ah, gotcha. Why don't we switch gears
a little bit to the usage of these models, right? One of the things that consumers and creators
have really found useful with language models, generative models, is context windows, right? The bigger
the context window, the more flexibility there is for the input. You can give it more detailed
And it's been exponential progress on the language side.
In a very short amount of time, we've gone from tiny context windows to millions of tokens worth of context windows.
In video, do you expect a similar approach or is there some fundamental limitation?
I was just reading the Gemini 1.5 paper, and they're talking about this very long context,
a million tokens, 10 million tokens.
And actually, one of the applications they mentioned is actually video summarization,
video understanding, like trying to process a one-hour-long video.
It seems like these very long contexts are going to be very useful to solve a variety of video processing and video understanding kinds of tasks.
And therefore, I would be very surprised if they don't also end up being very useful for video generation.
And in fact, it's entirely possible that that's already a component that played a role in the OpenAI system.
And part of the reason they are able to generate long videos is because they can handle very long contexts
and they are able to scale transformers to very long sequences,
and it's entirely possible that that was part of the secret sauce.
So I'm again excited about either these attention-based ways to scale up context
through embeddings or ring attention or clever,
hardware-focused implementations like FlashAttention that can be scaled up to very long contexts,
or more research-y things like state space models that people are starting to use,
also in the context of diffusion models, that might allow you to get to very long contexts.
And so I think it's going to be an interesting space to watch for sure.
Well, looking further out, the timelines and breakthroughs,
this is obviously the worst that this technology will ever be, almost definitionally.
Right.
We're at the earliest stages of progress here.
Why did people underestimate the timeline so far?
And what does that mean about what's to come next and how quickly it's going to come?
Yeah.
This is a really hard question.
And I think I was wrong with my predictions on how long it was going to
take to get to the point we are at right now with Sora.
And when we're moving exponentially fast, it's very hard to make predictions and errors
can be pretty big. But it's going to be exciting for sure.
When you look at sort of video generation as a breakthrough in the broader journey of
scaling laws towards generalizable artificial intelligence or whether you want to call
it ASI or AGI. How do you quantify this advancement in that broader journey?
I'm very excited about this because I tend to think of a video generative model as a pretty natural kind of world model.
I mean, as we were discussing before, in order to be able to generate videos of this quality,
there must be some level of understanding of physics, of object permanence, 3D structures.
Like, there must be a lot of knowledge that is somehow latent in these models.
And I'm excited about all the different ways in which we might be able to extract this knowledge and use it in different applications.
When we think about, especially, robotics or other agents that really interact with the real world,
I think the kind of knowledge that is embedded in these video models, that has been extracted by watching,
essentially, a lot of videos is going to be very, very useful.
And what's interesting is I think it's going to be pretty complementary to the kind of knowledge that,
for example, is baked into an LLM.
You can learn a lot about the world by reading books, essentially, but there is just going to be a lot of experience that you only get by,
it's more similar to the kind of experiences you get as a child, right?
You just walk around the world, you see things and you learn about how the world works just through your eyes and through video essentially, right?
So the kind of experience that we're feeding to a video diffusion model, and the kind of knowledge that we might be able to extract from that, is therefore going to be, I think, pretty useful.
I think the title of the blog post that they put out was "Video generation models as world simulators."
I think there is a lot of promise there.
I think if you can do it at the pixel level,
then that means that you've solved the harder version of the problem.
Right.
And you can do a lot of things on top,
whether you're an autonomous vehicle or whether you're building robots
or whether you just want to build an agent
that understands how the world works
and combines the knowledge
it's gotten by crawling the Internet
with what you can see in the real world.
I think it's all going to be pretty exciting.
Well, look, arguably, we would not be here as an industry
if it wasn't for your lab.
So it's just so exciting to see
how your worldview of where the world was always going to go has accelerated.
I know we could spend hours talking about all the tricks that were required from a research
perspective to get here and where we're going to go.
But we're going to wrap up for today.
Thank you so much for making the time.
I'm sure we'll have more to talk about soon.
Thank you very much.
I enjoyed this very much.
If you liked this episode, if you made it this far, help us grow the show.
Share with a friend or if you're feeling really ambitious,
you can leave us a review at ratethispodcast.com/a16z.
You know, candidly, producing a podcast can sometimes feel like you're just
talking into a void. And so if you did like this episode, if you liked any of our episodes,
please let us know. I'll see you next time.
