a16z Podcast - Beyond Uncanny Valley: Breaking Down Sora
Episode Date: February 24, 2024

In early 2024, the notion of high-fidelity, believable AI-generated video seemed a distant future to many. Yet, a mere few weeks into the year, OpenAI unveiled Sora, its new state-of-the-art text-to-video model producing videos of up to 60 seconds. The output shattered expectations – even for other builders and researchers within generative AI – sparking widespread speculation and awe.

How does Sora achieve such realism? And are explicit 3D modeling techniques or game engines at play?

In this episode of the a16z Podcast, a16z General Partner Anjney Midha connects with Stefano Ermon, Professor of Computer Science at Stanford and a key figure at the lab behind the diffusion models now used in Sora, ChatGPT, and Midjourney. Together, they delve into the challenges of video generation, the cutting-edge mechanics of Sora, and what this all could mean for the road ahead.

Resources:
Find Stefano on Twitter: https://twitter.com/stefanoermon
Find Anjney on Twitter: https://twitter.com/anjneymidha
Learn more about Stefano's Deep Generative Models course: https://deepgenerativemodels.github.io

Stay Updated:
Find a16z on Twitter: https://twitter.com/a16z
Find a16z on LinkedIn: https://www.linkedin.com/company/a16z
Subscribe on your favorite podcast app: https://a16z.simplecast.com/
Follow our host: https://twitter.com/stephsmithio

Please note that the content here is for informational purposes only; should NOT be taken as legal, business, tax, or investment advice or be used to evaluate any investment or security; and is not directed at any investors or potential investors in any a16z fund. a16z and its affiliates may maintain investments in the companies discussed. For more details please see a16z.com/disclosures.
Transcript
Yeah, honestly, I was very, very surprised.
I mean, I know the two of us often talk about how quickly the field is moving,
how hard it is to keep track of all the things that are happening.
And I was not expecting a model so good coming out so soon.
We've generally converged on, it was going to be a when, not an if.
I thought it was maybe six months out, a year out.
So I was shocked when I saw those videos, the quality of the videos,
the length, I mean, the ability to generate
60-second videos. I was really amazed.
This is obviously the worst that this technology will ever be,
almost definitely, right?
We're at the earliest stages of progress here.
I've always felt that that is one of the secret weapons of diffusion models
and why they are so effective in practice.
If you were to ask many people at the beginning of 2024,
when we get high-fidelity, believable AI-generated video,
most would have said that we were years away.
But on February 15th, OpenAI surprised the world with examples from their new model,
Sora, bringing those predictions down from years to weeks.
And of course, the emergence of this model and its impressive modeling of physics and videos of up to 60 seconds
have spurred much speculation around not only how this was accomplished, but also how it happened so soon.
And although OpenAI has stated that the model uses a transformer-based diffusion model,
the results have been so good that some have even questioned whether explicit 3D modeling or
a game engine was involved. So naturally, we decided to bring in an expert. Sitting down with
a16z general partner Anjney Midha is Professor of Computer Science at Stanford, Stefano Ermon,
whose group pioneered the earliest diffusion models and their applications in generative AI.
Of course, these approaches laid the foundation of the very diffusion models deployed in Sora,
not to mention other household names like ChatGPT and Midjourney.
And perhaps most importantly, Stefano has been working on generative AI for more than a decade,
long before many of us had even an inkling of what was to come.
So throughout this conversation, Stefano breaks down why video has historically been much harder
than its text and image counterparts, how a model like Sora might work,
and what all of this could mean for the road ahead.
And of course, if you want to stay updated on all things AI,
make sure to go check out a16z.com slash AI. Enjoy.
As a reminder, the content here is for informational purposes only, should not be taken
as legal, business, tax, or investment advice, or be used to evaluate any investment or security
and is not directed at any investors or potential investors in any a16z fund.
Please note that A16Z and its affiliates may also maintain investments in the companies
discussed in this podcast. For more details, including a link to our investments, please see
a16z.com slash disclosures.
This is like a conversation I've wanted to have with you for a while.
We've been talking about flavors of this conversation for a long time,
but I think given how quickly things are heating up in this space,
it sounded like a good time for us to check in
and see how many of the assumptions we've been talking about
about the future of diffusion models and video models are tracking.
You're the world's expert in this area of research.
And so I think it'd just be great to start
with your lab's sort of involvement in the origins?
Yeah, I'm excited to be here.
Hi, everyone.
My name is Stefano, and I'm a professor of computer science at Stanford.
I work in AI, and I've been actually working in generative AI for more than 10 years.
Way before these things were cool.
I teach a class at Stanford on deep generative models.
It's something I started back in, again, 2018.
And I think it was the first in the world on this topic.
And, yeah, I encourage you to check out the website.
That is called CS-236.
There's lots of materials if you want to dig deeper into how these methods work.
And I've been doing research in generative models for a long time.
As you mentioned, with my former student, Yang Song, who is now at OpenAI,
who did some of the early work on diffusion models or score-based models,
as we used to call them back then.
Back in 2019, at the time, generative models of images, video, audio,
like these kinds of continuous data modalities, were really dominated by GANs,
generative adversarial networks.
And we were really the first to show that it was actually possible to beat GANs at their own game
using this new class of generative models called diffusion models,
where we essentially generate content, we generate images by starting from pure noise
and progressively denoising it using a neural network until we turn it into a beautiful sample.
We developed a lot of the theory behind these models, how to train them, how to use score matching.
A lot of the initial architectures and some of those choices are still around today.
I think that work really kick-started a lot of the exciting things we're seeing today around diffusion models, Stable Diffusion, Sora, of course.
And in addition to that early work on the foundation of diffusion models, we've worked on a number of other aspects of diffusion models like DDIM, a pretty widely used, efficient sampling procedure that allows you to generate images very quickly without too much loss in quality.
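As a rough picture of what Stefano is describing, from starting at pure noise and progressively denoising through to DDIM-style deterministic sampling, here is a minimal sketch. The noise schedule, placeholder denoiser, and shapes are illustrative assumptions, not anyone's production code.

```python
# Minimal sketch of DDIM-style deterministic sampling (toy setup, not production code).
import torch

def make_alpha_bars(num_steps: int = 1000):
    # Toy linear beta schedule; alpha_bar[t] is the cumulative signal level at step t.
    betas = torch.linspace(1e-4, 0.02, num_steps)
    return torch.cumprod(1.0 - betas, dim=0)

@torch.no_grad()
def ddim_sample(eps_model, shape, num_steps=50, total_steps=1000):
    """Start from pure Gaussian noise and progressively denoise it."""
    alpha_bar = make_alpha_bars(total_steps)
    timesteps = torch.linspace(total_steps - 1, 0, num_steps).long()
    x = torch.randn(shape)  # pure noise
    for i, t in enumerate(timesteps):
        a_t = alpha_bar[t]
        a_prev = alpha_bar[timesteps[i + 1]] if i + 1 < len(timesteps) else torch.tensor(1.0)
        eps = eps_model(x, t)                                    # predicted noise at step t
        x0_pred = (x - (1 - a_t).sqrt() * eps) / a_t.sqrt()      # current guess of the clean sample
        x = a_prev.sqrt() * x0_pred + (1 - a_prev).sqrt() * eps  # deterministic DDIM-style update
    return x

# Placeholder "denoiser": a real one is a U-Net or transformer trained by score matching.
toy_eps_model = lambda x, t: torch.zeros_like(x)
sample = ddim_sample(toy_eps_model, shape=(1, 3, 64, 64))
print(sample.shape)
```

The point of the DDIM-style update is that it is deterministic and can skip steps: here 50 network calls stand in for the full 1,000-step schedule.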
With my student Chenlin Meng, who is the CTO of Pika Labs,
we developed SDEdit. It's one of the first methods
to do controllable generation to generate images based on sketches
and things like that. So yeah, I'm excited to be here today
and discuss what's coming next. So given all of that work,
all of the experience you've had in really the ground truth of diffusion models
and their limitations, what was your reaction to seeing a model like Sora come out
last week? Yeah, honestly, I was very, very surprised. I mean, I know the two of us often
talk about how quickly the field is moving, how hard it is to keep track of all the things that
are happening. And I was not expecting a model so good coming out so soon. I mean, I don't think
there is anything fundamentally impossible and I knew it was coming. It was just a matter of time
and just more research, more investments, more people working on these things. But I was not
expecting something that good happening so soon. I thought it was maybe six months out, a year
out. And so I was shocked when I saw those videos, the quality of the videos, the length,
I mean, the ability to generate 60-second videos. I was really amazed. Yeah, I do think every time
we've talked in the past, we've generally converged on, it was going to be a when, not an if,
that we'd see video generation get so good. And so it's reassuring to hear you were as
surprised as I was when it came out. But before we get into the details on what maybe some of
the breakthroughs were there on this time frame, maybe we can just spend a few
minutes talking about video diffusion, sort of 101, for folks who may not be as familiar
with these models, why has video diffusion been so much more complex than text or image
generation? And what's historically been the main blocker in making it work?
Yeah, that's a great question. At a very high level, you can think about videos as just a
collection of images. And so really, the first challenge you have to deal with is that you're
generating multiple images at the same time. And so the compute cost that you need to process
N images is at least N times larger than what you would pay if you just want to process one of them at a time.
And basically, this means a lot more compute, a lot more memory, and just much more expensive to train all your large-scale model on video data.
The other challenge is just the data challenge.
I think a lot of the success we've seen in diffusion models for images was partially due to the availability of publicly available data sets like LAION, large-scale image and text
data sets that were scraped from the internet and were made available, and people could
use them to train large-scale models.
I think we don't quite have that for video.
I mean, there is a lot of video data, but the quality is kind of like a mixed bag and
we don't have a good way to filter or screen them or there's not a go-to data set that
is available and everybody's using to train these models.
So I'm guessing some of the innovations that went into the Sora model are actually on
just selecting good quality data to train the models on.
Captions are also hard to get for video.
I mean, the video data is out there, but getting good labels, good descriptions of what's
happening in the videos is challenging, and you need that if you want to have good control over
the kind of content that you generate with these models.
And then there is also a challenge of video content.
It's just more complex.
There is more going on if you think about a sequence of images as opposed to just one. There
are complex relationships between the frames.
There is physics.
There is object permanence.
And in principle, I think a high-capacity
model with enough compute, enough data can potentially learn these things, but it was always
an empirical question: how much data are you going to need, how much compute are you going
to need? When is that going to happen? Is the model really going to discover all these high-level
concepts and statistics of the data, essentially? And it was surprising to see that it's doing
so well. You just laid out very clearly what the general obstacles have been, both on architecture,
on data sets, on representations of the world via video as a format. Since the release
came out last week, there's been a lot of speculation around how this model can achieve such
impressive results. Some folks were even speculating that there might be a game engine or a 3D model,
sort of explicit 3D modeling or geometry involved in the inference pipeline. But in their article,
OpenAI describes the approach by saying that they train a text-conditional diffusion model jointly
on videos and images with different durations and resolutions and then apply a transformer
architecture on space-time patches of video and image latent codes.
And so could you just break down in layman terms
for folks who might not be as familiar
with scaling laws and what's going on here?
Sure, yeah, I can try.
I mean, there is certainly some secret sauce here
and I can try to read between the lines
of what they said in their release.
The idea of training on videos and images
is not new.
It seems like one technical difference
that they are hinting at
is the use of a transformer architecture
for the backbone, for the denoiser,
for the score model.
People often use the convolutional architecture
back from the days where, kind of like, Yang initially started using U-Nets as a score model,
which was actually a key innovation that really enabled a lot of the success on images,
and people kind of still ported those kind of architectures over to video data as well,
because they make sense.
We expect that there is a lot of convolutional structure
and convolutional architecture might be a good idea.
It seems like they moved on to a purely transformer-based architecture
and probably following the work by Saining Xie
at NYU, who did some of the initial work in this space
in developing good transformer architectures,
ViT-based architectures for diffusion models.
It's possible that that gives you better scaling
with respect to compute and data
and just happens to work better.
They are also referring to latent codes,
so it seems like it's unlikely
that they're working directly in pixel space.
Working on latent representations was one of the key things,
the key innovations behind Stable Diffusion,
latent diffusion,
like the idea of first compressing the data
into a latent representation
that is a little bit smaller
or a little bit more compact
because we expect that there is a lot of redundancy
if you think about the different frames in a video
and so it might be possible to compress it
almost losslessly into a lower dimensional representation
and if you can do that,
then you can kind of train on this lower dimensional representation
and you can get much better tradeoffs
in terms of like the compute that you pay
and the kind of memory that you need to process the data.
So they might have figured out a better way
to encode video to a semantically meaningful kind of like lower dimensional latent space.
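To make that compression argument concrete, here is a back-of-the-envelope calculation. All of the numbers (frame rate, downsampling factors, channel count) are made-up assumptions for illustration, not Sora's actual configuration.

```python
# Back-of-the-envelope sketch of why a latent space helps (hypothetical numbers only).
frames, height, width, channels = 60 * 24, 1080, 1920, 3   # one minute at 24 fps
pixel_values = frames * height * width * channels

# Suppose an encoder downsamples 8x spatially and 4x temporally into a
# 16-channel latent, exploiting redundancy between neighboring frames.
lat_frames, lat_h, lat_w, lat_c = frames // 4, height // 8, width // 8, 16
latent_values = lat_frames * lat_h * lat_w * lat_c

print(f"pixel tensor:  {pixel_values:,} values")
print(f"latent tensor: {latent_values:,} values")
print(f"compression:   ~{pixel_values / latent_values:.0f}x fewer values to diffuse over")
```

Diffusing over a few dozen times fewer values is the kind of trade-off that makes minute-long video generation tractable at all.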
I would say this doesn't rule out the possibility that they've used game engines or 3D models
to generate training data.
I mean, as we discussed before, I think the quality of the training data is really crucial.
And it's possible that they've used synthetic data generated by game engines or NeRF-like 3D models
to generate the kind of data they want to see where there is a lot of motion.
They can probably use the internals of the engine to get very good captions about what's happening in the video,
so they can get a very good match between the text and the content that they're trying to generate.
So, I mean, it does seem like it's a purely data-driven approach,
but it's possible that they've used the other pipelines to generate synthetic data.
And when you contrast the diffusion transformer approach that they took with sort of the prior generations of many flavors of generative models,
whether it was recurrent nets, GANs, just vanilla autoregressive transformers,
Why is it that a diffusion model here seemed, again, to be sort of the uniquely suited best tool for the job?
I think prior to diffusion models, yeah, people were using GANs, generative adversarial networks.
GANs are pretty good. They're pretty flexible, but one challenge is they're very unstable to train.
So that was actually one of the main reasons we developed diffusion models in the first place.
We wanted to retain the flexibility of basically an arbitrary neural network as the backbone,
but a more principled statistical way of training the models that leads you to a stable training
loss where you can just keep training and the model just keeps getting better and better.
Autoregressive models also have that property, in that they're trying to compress the data,
and with enough capacity and compute, in principle, they can do a pretty good job of modeling anything,
including video.
They just tend to be very slow because you have to generate one token at a time.
If you think of video, there is a lot of tokens,
and they've not been the model of choice for that reason.
Diffusion models, on the other hand,
they can generate basically tokens essentially in parallel,
and so they can be much faster.
And that's one of the reasons they are preferred,
I think, for these kind of modalities.
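A quick, hypothetical bit of arithmetic illustrates the sequential-depth point; the token count and step count below are assumptions, not measured numbers.

```python
# Rough, hypothetical comparison of sequential network calls at generation time.
num_tokens = 100_000        # assume a minute of video becomes ~100k latent patch tokens
ar_calls = num_tokens        # autoregressive: one call per token, each waiting on the last
diffusion_steps = 50         # diffusion: every token is refined together at each step

print(f"autoregressive: {ar_calls:,} sequential calls")
print(f"diffusion:      {diffusion_steps} sequential calls, each over all tokens in parallel")
# Each diffusion call does more work per pass, but the sequential depth,
# and therefore the latency, is orders of magnitude smaller.
```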
Another more philosophical reason is that if you think about a diffusion model,
in some sense, at the inference time,
you have access to a very deep computation graph,
where essentially you can apply a neural network
maybe a thousand times,
or if you take a continuous time perspective,
like a differential equation perspective,
it can even be an infinitely deep kind of computation graph
that you can use to generate content,
while at the same time,
you don't have to unroll the entire computation graph at training time
because the models are trained by score matching,
and there is this clever way of kind of like trying to make the model better
and better without ever having to pay a huge price at training time.
And so I've always felt that that is one of the secret weapons
of diffusion models and why they are so effective in practice
because they allow you to use a lot of resources at inference time
without having to pay that price during training time.
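For readers who want the formal version of that "infinitely deep computation graph" point, this is the standard continuous-time formulation from the score-based diffusion literature (a sketch of the published framework, not anything specific to Sora's internals):

```latex
% Forward noising process as a stochastic differential equation:
%   dx = f(x, t)\,dt + g(t)\,dw
% Sampling can follow the probability-flow ODE, which shares the same marginals
% and can be integrated with as many or as few steps as you like:
\frac{dx}{dt} = f(x, t) - \tfrac{1}{2}\, g(t)^2 \,\nabla_x \log p_t(x)
% Training never unrolls that trajectory; denoising score matching only needs
% individual noisy samples x_t drawn from the forward process:
\mathcal{L}(\theta) = \mathbb{E}_{t,\, x_0,\, x_t}\Big[ \lambda(t)\,
  \big\| s_\theta(x_t, t) - \nabla_{x_t} \log p_t(x_t \mid x_0) \big\|_2^2 \Big]
```

The asymmetry in the last line is the "secret weapon" Stefano mentions: training costs one network evaluation per noisy sample, while the depth of the sampling computation is chosen freely at inference time.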
And to our earlier point about why did the Sora breakthrough happen so much faster
than many expected, it sounds like the stability of the diffusion transformer model
and being able to swap out essentially training time for inference time,
which is much cheaper, much more parallelizable, much more efficient,
was a big contributor to compressing the training times here.
Yeah. And there is a question of what's the backbone, right?
And that could be a convolutional network, it can be a state-space model,
that could be a transformer.
There, I think we're still just scratching the surface
in terms of what works and what doesn't
and all kinds of combinations are possible.
Right.
You can build autoregressive models that are convolutional.
You can build autoregressive models
that are based on transformers.
You can build autoregressive models
that are based on state-space architectures.
And similarly, you can build diffusion models
that are based on convolutional architecture,
which is what people tend to do.
And now it seems like maybe OpenAI is pushing towards,
no, let's just use a transformer as the backbone
in a diffusion model.
And I'm starting to see people exploring state-space models,
which might allow for very long context, for example.
So I think there is an exciting space of different kinds of combinations that we can try
and might give us better scaling, better properties for really getting to the kind of qualities
we want to see for these models.
One of the most elegant parts of the transformer backbone architecture is that it works really
well with the idea of tokenization, right?
And in language models, so much of the scaling laws work that allowed models like GPT-3 and 4 to be developed so quickly and generalized to all kinds of tasks was that the process of tokenizing language allows almost like a transformation or translation into a format that the models can understand across many, many different types of languages, whether that's good old-fashioned English, or it's code, or it's health records, or it's different, in some cases multilingual,
data sets. And so the beauty of tokenization is it's like this one-size-fits-all process of turning
language data into a format that the transformer backbone really understands well and is able to learn
on. It seems like there was a similar key unlock around how visual data was broken down here
into small patches, right? They essentially tokenized image and video data into this sort of
intermediary representation of a patch. Is that approach what created a meaningfully better output than other
models we've seen in the past?
That's a great question.
And honestly, I don't know the answer.
I think tokenization makes a lot of sense for discrete data, like code or text.
I'm less of a fan of tokenization or patchifying images and videos and audio.
I feel like you have to do it if you want to use a transformer architecture, but it makes
less sense to me just because the data is continuous and the patch is very arbitrary and
you're losing some of the structure there by going through tokenization.
But you kind of have to do it if you want to use transformers, and transformers are great because
they scale well and we have very good implementations and they are very hardware friendly and so maybe
that's the way to go. It's the bitter lesson once again. But it feels like what they have is
some kind of latent, again, representation and maybe once you go to a latent space, then
tokenizing it might make more sense because you've already lost a lot of the structure.
And so it may be a combination of both. I mean, it seems like they might have access
to a very good latent space
where they get rid of a lot of the
kind of redundancy that exists
in natural data, especially in videos.
Two frames next to each other
are very similar, right?
So there's a lot of redundancy in videos,
and if they got rid of some of that
through a clever encoding scheme,
then they apply tokenization.
I think that starts to make more sense
and make things more scalable,
less compute, less memory,
and just better.
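As a concrete picture of patchifying after the latent step, here is a minimal sketch of turning a latent video tensor into spacetime patch tokens for a transformer. The shapes, channel count, and patch sizes are illustrative assumptions only, not what OpenAI actually uses.

```python
# Minimal sketch of turning a latent video into spacetime patch tokens.
import torch

def patchify(latent_video, patch_t=2, patch_h=2, patch_w=2):
    """latent_video: (T, C, H, W) latent frames -> (num_patches, patch_dim) tokens."""
    T, C, H, W = latent_video.shape
    x = latent_video.reshape(T // patch_t, patch_t, C, H // patch_h, patch_h, W // patch_w, patch_w)
    # Group the three patch axes so each token is one small spacetime cube.
    x = x.permute(0, 3, 5, 1, 2, 4, 6)   # (T', H', W', patch_t, C, patch_h, patch_w)
    tokens = x.reshape(-1, patch_t * C * patch_h * patch_w)
    return tokens

latent = torch.randn(16, 8, 32, 32)       # 16 latent frames, 8 channels, 32x32 each
tokens = patchify(latent)
print(tokens.shape)                       # torch.Size([2048, 64]): 8*16*16 patches of 64 values
```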
We've seen so many text-to-video models,
but very few were able to actually generate
longer form videos, right, more than a few seconds.
And there were often issues with temporal coherence and consistency,
even across those short form generation, three to five seconds.
And here we've got a model that's able to do one minute long,
sort of 60 second generations.
And arguably some of the long-form generations are actually dramatically better
than the short-form generations from Sora, too.
You start to see these emergent properties of temporal coherence
only in the 60-second clips, right?
What's going on there?
What are they doing differently that enables these videos to have such amazing continuity
for long lengths and temporal coherence
of the subjects across those lengths.
I think that was the most surprising
aspect of Sora,
like this ability to generate long content
that is so coherent and consistent and beautiful.
I think that was the part that really amazed me
because I know it's very hard to do
because exactly, like you have to keep track
of a lot of things to make things consistent
and the model doesn't know
what's important to keep track of and what's not.
And somehow the model they've trained
seems to be able to do it.
Again, it's not entirely surprising
because at the end of the day,
these models are trained to essentially compress the training data.
So if you have high-quality training data
where there are transitions,
and of course, the content is consistent with physics,
and it's consistent, and has the right properties
that we would expect a real, natural,
good-quality video to have,
then in order to compress the data as effectively as possible,
then the model should learn about physics,
should learn about object permanence,
should learn about 3D geometry and all of that.
What's surprising is that there are many other kinds of shallower correlations
that your model could discover.
And what was surprising to me is that it seems like
it's really able to learn some of that.
We don't know why.
I think it's one of the mysteries of deep learning.
It's probably a combination of training data,
the right architectures, and of scale,
but it was amazing.
And the 3D properties in their videos,
emerge without any explicit inductive biases for 3D objects, right?
They're purely phenomena of scale.
What does that mean?
Is physics just an emergent property?
Well, it's not inconceivable that at the end of the day,
physics is a framework that can help you understand the world.
It helps you make better predictions.
Like if I understand Newton's law, I can make predictions about what's going to happen
if I drop an object.
And it's a very simple formula that allows me to make a lot of different predictions.
So, if I'm being tasked to compress a lot of videos, if I know Newton's law, if I knew some physics,
I can probably do a better job at predicting what the next frame is going to look like, right?
And at the end of the day, these models, although they are trained by score matching,
it's possible to relate in a very formal sense, the training objective that we're using to train a diffusion model
to a compression-based objective, literally just trying to compress video data in this case as much as you can.
And so it's quite possible that knowing something about physics, knowing something about camera views and 3D structures of objects and object permanence, but these kind of properties are helpful in compressing data because they reveal structure that is helpful to make predictions, which means you can compress the data better.
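The formal link between the diffusion training objective and compression that Stefano alludes to is, roughly, the maximum-likelihood-weighting result from the score-based diffusion literature: with the specific weighting λ(t) = g(t)², the denoising score matching loss upper-bounds the negative log-likelihood, which is an idealized code length, up to a constant that does not depend on the model. Sketched:

```latex
-\,\mathbb{E}_{p_{\mathrm{data}}}\big[\log p_\theta(x_0)\big]
  \;\le\;
  \mathbb{E}_{t,\, x_0,\, x_t}\Big[ \tfrac{g(t)^2}{2}\,
    \big\| s_\theta(x_t, t) - \nabla_{x_t} \log p_t(x_t \mid x_0) \big\|_2^2 \Big]
  \;+\; C
% C does not depend on the model parameters; driving the right-hand side down
% means fewer expected bits to encode the training videos, i.e., better compression.
```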
What's exciting is that it emerges just by training the model, right, that you could imagine other kinds of shallower correlations that exist in the training data, but they are not as useful or as predictive as Newton's law or
like a real physics understanding of the scenes and what's going on.
And it's hard to tell what exactly is going on.
Like, it's possible that there is no real understanding of physics,
but it's certainly very effective, right?
And at the end of the day, maybe that's enough.
What you seem to be pointing out there is if the data that these models are trained on
is kind of like their diet and you are what you eat, in a sense,
if you're eating a ton of physics, then you are better at physics as a model.
Is that how we should begin to sort of explain other emergent properties?
There was a clip that they shared called Bling Zoo,
which was a single prompt-generated video
that had multiple transitions baked in
without any editing and so on.
It was multiple camera shots.
It almost seemed like somebody had manually stitched together
different camera angles, right?
How should we explain that?
Is that just essentially it mimicking
what it's seen in the training data
or is there something else going on?
In that case, I'd imagine that yes.
If you've trained it on high-quality video data
where you can see these transitions across different kinds of shots,
again, what these models do is they try to understand
what all the training videos have in common,
what is the high-level structure of these videos,
and then they try to replicate that.
So a sufficiently good model might understand
that the videos in the training set,
they tend to have this structure
where we transition across different views and shots,
and then we combine them,
and they're combined in interesting ways,
and then it's able to replicate.
Again, what's magic here is that it's provably
an impossible task in general, right?
There are so many other ways of interpolating
between the things you see in the training set.
And most of them are wrong, right?
They are generalizations that you don't want to see.
And somehow these deep neural networks are able to find interpolations or generalizations
that are the ones we want, the ones that make sense.
And they discover the kind of structure that we want the model to replicate
as opposed to the ones that are there by chance and are not the kind of structure that we want
them to pick up.
And that's what's amazing and mostly unexplained at this point.
We do not understand why this happens.
Right.
So we're here now in early 2024, and the question of when video models will get good enough
to cross the uncanny valley has just been answered. We've just gotten to that point.
So if we look ahead now, Sora is still in beta, but there's several other AI generative
sort of video efforts out there already. Realistically, how expensive do you think it's going to be
to generate AI video at any kind of sort of consumer scale, a readily available scale?
I'm sure this release from OpenAI set a lot of competitors off to the races
trying to catch up, and I'm sure we'll see developments coming from all the various competitors
in this space.
I think there is training costs, which are huge.
I'm sure they use thousands of GPUs to train Sora, and scale was a big part of the
success they've seen, and so it's definitely going to be out of reach for academics,
but there is going to be industry players who will have the resources to try to compete,
with them and try to replicate what they did or achieve similar results in a different way.
The good news is that now we have an artifact, we have a system that can do it.
And so I think it's a lot easier to try to catch up.
There is a lot less uncertainty now in whether it's even possible.
Right now we have an example.
It's feasible and a lot of people will make the right investments to really catch up.
I don't know how long it's going to take.
Is it going to be six months?
Is it going to be 12 months?
But somebody will come up with similar performance, I would imagine.
as we've seen with LLMs and in other spaces
where people did catch up.
The question is, how far ahead will OpenAI be
by then? How much better will the system be in six months
or in 12 months?
And that's hard to say.
The other question that I think you were hinting at
is inference, like how expensive is it going to be
to serve these models and provide video generation on demand
to users or personalized videos,
like all these really cool applications
that could emerge from a really good video generation
model. Again, I'm pretty optimistic, especially because the underlying architecture is a
diffusion model. Once you have a potentially big, expensive, large, clunky model that can generate
high quality results, there's been a lot of success in distilling these models down into smaller
ones that are almost as capable, but way faster. So I'm pretty optimistic that once we get
to high enough quality, it's going to be possible to get systems that can serve similar-
quality results in a very inexpensive way.
So I'm pretty excited to see the kind of crazy use cases people come up with
once this technology becomes available.
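"Distilling these models down" typically means training a fast student to match, in one step, what the slow teacher produces in several. Here is a toy sketch of a progressive-distillation-style loss; the models, schedule, and step function are placeholders, not any lab's specific method.

```python
# Toy sketch of step distillation: the student learns to jump, in one call,
# to where the teacher lands after two of its own smaller steps.
import torch

def sampler_step(model, x, t_from, t_to):
    # Placeholder deterministic denoising step from t_from to t_to (details elided).
    return x - (t_from - t_to) * model(x, t_from)

def distillation_loss(student, teacher, x_t, t, t_mid, t_next):
    with torch.no_grad():
        x_mid = sampler_step(teacher, x_t, t, t_mid)           # teacher: first small step
        target = sampler_step(teacher, x_mid, t_mid, t_next)   # teacher: second small step
    pred = sampler_step(student, x_t, t, t_next)               # student: one big step
    return torch.mean((pred - target) ** 2)

# Placeholder networks standing in for real denoisers.
teacher = lambda x, t: 0.1 * x
student = lambda x, t: 0.1 * x
x = torch.randn(2, 3, 8, 8)
print(distillation_loss(student, teacher, x, t=1.0, t_mid=0.5, t_next=0.0).item())
```

Repeating this halving a few times is how a sampler with hundreds of steps can be squeezed into a handful of steps with limited quality loss.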
When we talk about compute, training or inference is a huge part of the calculus there.
But there's this other bucket of costs that come from the datasets required to actually
train these models and get scaling laws to work.
And in terms of training data for language models, there's already billions of data
points that these labs can get across the web.
But for video, a lot of the data, like you were
saying earlier, even if it exists, it's not particularly well labeled or captioned.
And so how do you think video model teams will overcome that challenge?
We've seen recently that Reddit agreed to do a deal to license its data to Google for
$60 million.
Do you think we'll see video production studios begin to license their content out?
That's a good question.
And first of all, it will probably depend a little bit on whether you're thinking of
startups versus more established industry players.
I think startups might be willing to move fast and break things,
maybe be less concerned about copyright
and just scrape from the internet
and then train models and then worry
about licensing the data later.
Bigger players, their legal teams are very, very worried about
big lawsuits. And so they'll want to have something
that is properly licensed. It's going to be
interesting to see whether, as you said,
the studios or people who currently
own the content will be willing to license
it because it could be an existential threat
to their entire business model. I mean, I can see Reddit
licensing their stuff because it's maybe not as much of an existential threat.
The other thing you mentioned is labeling.
It's a great point.
It's going to be a big challenge.
But I'm pretty optimistic, though, that people will be able to set up human-in-the-loop pipelines.
I mean, we've seen great success in visual language models, even video-language models.
They are maybe not good enough to provide high-quality captions out of the box.
But I could imagine that they could drastically speed up, like a human-in-the-loop pipeline,
where they provide suggestions that then get fixed or improved by human labelers.
And so I'm pretty optimistic on the captioning front that we'll be able to find pretty
scalable solutions to that.
Is the sort of logical path there to start with a human in the loop implementation,
build a working model of annotations there, and then ultimately move towards synthetic captioning?
It seems like that's a direction that people are exploring and using, and there's been a lot of
success on kind of like enhancing captions, you know, in synthetic ways using LLMs.
And so, yeah, that would be my bet.
I mean, honestly, I don't know to what extent the bottleneck is on the captioning as
opposed to actually having good high-quality videos, even just the raw high-quality video data,
having access to that.
It's non-trivial.
And my understanding is that's actually one of the bottlenecks that we will have to solve first.
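For what a human-in-the-loop captioning setup might look like in practice, here is a hypothetical sketch; every function is a stand-in, and no real captioning model or API is being referenced.

```python
# Hypothetical human-in-the-loop captioning pipeline of the kind discussed above.
from dataclasses import dataclass

@dataclass
class Clip:
    path: str
    draft_caption: str = ""
    final_caption: str = ""

def draft_with_vlm(clip: Clip) -> str:
    # Stand-in for a video-language model producing a cheap first-pass caption.
    return f"auto-generated description of {clip.path}"

def human_review(draft: str) -> str:
    # Stand-in for an annotator fixing or approving the draft.
    return draft

def caption_dataset(clips):
    for clip in clips:
        clip.draft_caption = draft_with_vlm(clip)               # scalable first pass
        clip.final_caption = human_review(clip.draft_caption)   # human fixes errors
    return clips

labeled = caption_dataset([Clip("clip_0001.mp4"), Clip("clip_0002.mp4")])
print(labeled[0].final_caption)
```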
Ah, gotcha.
Why don't we switch gears a little bit to the usage of these models, right?
One of the things that consumers and creators have really found useful with language models,
generative models, is context windows, right?
The bigger the context window, the more flexibility there is for the input.
You can give it more detailed context, and there's been exponential progress on the language side.
In a very short amount of time, we've gone from tiny context windows to millions of tokens
worth of context windows.
In video, do you expect a similar approach or is there some fundamental limitation?
I was just reading the Gemini 1.5 paper, and they're talking about this
very long context, a million tokens, 10 million tokens.
And actually, one of the applications they mentioned is actually video summarization,
video understanding, like trying to process a one hour long video.
It seems like these very long contexts are going to be very useful to solve a variety of
video processing, video understanding kind of tasks.
And therefore, I would be very surprised if they don't also end up being very useful for video generation.
And in fact, it's entirely possible that that's already
a component that played a role in the OpenAI system.
And part of the reason they are able to generate long videos
is because they can handle very long context
and they are able to scale transformers to very long sequences.
And it's entirely possible that that was part of the secret sauce.
So I'm again, excited about either these attention-based ways
to scale up context through embeddings or ring attention or clever
hardware-focused implementations
like flash attention
that can be scaled up
to very long contexts
or more research in things
like state-space models,
which people are starting
to use also in the context
of diffusion models
that might allow you to get
to very long contexts
and so I think it's going to be
an interesting space to watch for sure.
Well, looking further out
the timelines and breakthroughs,
this is obviously the worst
that this technology will ever be
almost definitionally, right?
We're at the earliest stages
of progress.
Why did people underestimate the timeline so far?
And what does that mean about what's to come next and how quickly it's going to come?
Yeah, this is a really hard question.
And I think I was wrong with my predictions on how long it was going to take to get to the point we are right now with SORA.
And when we're moving exponentially fast, it's very hard to make predictions.
And errors can be pretty big.
But it's going to be exciting for sure.
When you look at sort of video generation as a breakthrough in the broader journey of scaling
laws towards generalizable artificial intelligence, or whether you want to call it ASI or AGI,
how do you quantify this advancement in that broader journey?
I'm very excited about this because I tend to think of a video generative model
as a pretty natural kind of like world model. I mean, as we were discussing before,
in order to be able to generate videos of these qualities, there must be some level of understanding
of physics, of object permanence, 3D structures. Like, there must be a lot of knowledge
that is somehow latent in these models.
And I'm excited about all the different ways
in which we might be able to extract this knowledge
and use it and different applications.
And when we think about, especially like robotics
or other agents that really interact with the real world,
I think the kind of knowledge that is embedded
in these video models that has been extracted by watching,
essentially, a lot of videos is going to be very, very useful.
And what's interesting is I think it's going to be
pretty complementary to the kind of knowledge
that, for example, is baked into an LLM.
You can learn a lot about the world by reading books, essentially,
but there is just going to be a lot of experience that you only get in a way
that is more similar to the kind of experiences you get as a child.
You just walk around the world, you see things,
and you learn about how the world works just through your eyes
and through video essentially, right?
So that kind of experience that we're feeding through a video diffusion model,
and the kind of knowledge that we might be able to extract from that,
is therefore going to be, I think, pretty useful.
I think the title of the blog post that they put out was video generation models as world simulators.
I think there is a lot of promise there.
I think if you can do it at the pixel level, then that means that you've solved the harder version of the problem, right?
And you can do a lot of things on top, whether you're an autonomous vehicle or whether you're building robots or whether you just want to build an agent that understands how the world works and combines the knowledge it's gotten by crawling the internet with what you can see in the real world.
I think it's all going to be pretty exciting.
Well, look, arguably, we would not be here as an industry if it wasn't for your lab.
So it's just so exciting to see how your worldview of where the world was always going to go has accelerated.
I know we could spend hours talking about all the tricks that were required from a research perspective to get here and where we're going to go.
But we're going to wrap up for today.
Thank you so much for making the time.
I'm sure we'll have more to talk about soon.
Thank you very much.
I enjoyed this very much.
If you liked this episode, if you made it this far, help us grow the show.
Share with a friend, or if you're feeling really ambitious, you can leave us a review at rate
thispodcast.com slash a16z.
You know, candidly, producing a podcast can sometimes feel like you're just talking into a void.
And so if you did like this episode, if you liked any of our episodes, please let us know.
I'll see you next time.