The AI Daily Brief: Artificial Intelligence News and Analysis - The Science Behind Sora, OpenAI's Game Changing Video Model
Episode Date: February 18, 2024
On today's episode NLW digs into the recently published research on OpenAI's Sora.
Read more: https://openai.com/research/video-generation-models-as-world-simulators
ABOUT THE AI BREAKDOWN
The AI Bre...akdown helps you understand the most important news and discussions in AI.
Subscribe to The AI Breakdown newsletter: https://theaibreakdown.beehiiv.com/subscribe
Subscribe to The AI Breakdown on YouTube: https://www.youtube.com/@TheAIBreakdown
Join the community: bit.ly/aibreakdown
Learn more: http://breakdown.network/
Transcript
Today on the AI Breakdown, we're digging into the research behind OpenAI's incredible new video model, Sora.
The AI Breakdown is a daily podcast and video about the most important news and discussions in AI.
Go to breakdown.network for more information about our YouTube, our Discord, and our newsletter.
Welcome back to the AI Breakdown.
It is now a few days after the launch, or at least the announcement, of Sora, the new video model from OpenAI.
As you might imagine, the first reaction of the world was utter shock and disbelief.
Somehow OpenAI had figured out a way to blow us away once again. A technology which
Yann LeCun from Meta, for example, had said just days earlier we didn't have
turned out to be there waiting for us inside OpenAI's research halls.
Now, as some people tried to grapple with what it meant for deepfakes to have video this convincing,
or how it would change the nature of the entertainment industry,
others were trying to tear it down saying it wasn't that good,
or that everyone was being hyperbolic.
I would suggest, when you see things like that,
double-checking to make sure that the person's business model isn't being a skeptic.
In any case, a few hours after the announcement, OpenAI also dropped their research.
So what we're going to do today is to try to go a little bit behind the scenes of what's actually going on underneath the hood.
This is pretty dense, and we're going to be going off of a lot of what OpenAI has published,
but hopefully this gives you a better sense of the technology itself.
You can find this research paper on OpenAI's website.
The research is called Video Generation Models as World Simulators.
And one of the most important lines of the abstract is this one.
Our results suggest that scaling video generation models is a promising path towards building
general purpose simulators of the physical world.
In this line, you get a picture of why this matters to OpenAI.
It is not just that they are trying to build the best version of any generative AI tool,
although on an intermediate scale that certainly is the case.
But at the same time, they also see this as the means to an end of getting to AGI.
The ability to simulate the physical world is a key capacity in their estimation of AGI,
and so in many ways, Sora is actually doing double duty.
Let's start with their section, Turning Visual Data into Patches.
They write, we take inspiration from LLMs, which acquire generalist capabilities by training
on internet-scale data.
The success of the LLM paradigm is enabled in part by the use of tokens that elegantly
unify diverse modalities of text, code, math, and various natural languages.
In this work, we consider how generative models of visual data can inherit such benefits.
Whereas LLMs have text tokens, Sora has visual patches.
Patches have previously been shown to be an effective representation for models of visual data.
We find that patches are a highly scalable and effective representation for training generative
models on diverse types of videos and images.
At a high level, we turn videos into patches by first compressing videos into a lower-dimensional
latent space, and subsequently decomposing the representation into space-time patches.
Next, they discuss their video compression network. They say this network takes raw video as input
and outputs a latent representation that is compressed both temporally and spatially. Sora is trained
on and subsequently generates videos within this compressed latent space. Continuing, they write,
given a compressed input video, we extract a sequence of space-time patches which act as transformer
tokens. This scheme works for images too, since images are just videos with a single frame. And as an
aside, this is something that people are noticing: part of the way that they're making Sora
may actually influence how they think about image generation in the future as well.
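To make the patch idea a little more concrete, here is a minimal sketch in Python of cutting a compressed video into space-time patches. OpenAI has not published Sora's code, so everything here is an assumption for illustration: the function name `to_spacetime_patches`, the tensor layout (frames, height, width, channels), and the patch sizes are all made up, and the random-looking array stands in for the output of a compression network.

```python
import numpy as np

def to_spacetime_patches(latent, pt, ph, pw):
    """Split a latent video of shape (T, H, W, C) into flattened space-time patches.

    Returns an array of shape (num_patches, pt * ph * pw * C), one row per token.
    This is an illustrative layout, not OpenAI's actual implementation.
    """
    T, H, W, C = latent.shape
    # This sketch requires each dimension to divide evenly by the patch size.
    assert T % pt == 0 and H % ph == 0 and W % pw == 0
    # Break each axis into (number of blocks, block size).
    x = latent.reshape(T // pt, pt, H // ph, ph, W // pw, pw, C)
    # Bring the block indices to the front and the within-block axes to the back.
    x = x.transpose(0, 2, 4, 1, 3, 5, 6)
    return x.reshape(-1, pt * ph * pw * C)

# A made-up latent "video": 8 frames, a 16x16 spatial grid, 4 channels.
video = np.arange(8 * 16 * 16 * 4, dtype=np.float32).reshape(8, 16, 16, 4)
tokens = to_spacetime_patches(video, pt=2, ph=4, pw=4)
print(tokens.shape)  # (64, 128)

# A single image is just a one-frame video, so the same function applies.
image = video[:1]
image_tokens = to_spacetime_patches(image, pt=1, ph=4, pw=4)
print(image_tokens.shape)  # (16, 64)
```

Each row of `tokens` plays the role that a text token plays for an LLM: one small block of the video, spanning a few frames and a small spatial region.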
Now, an important section is called scaling transformers for video generation.
They write, Sora is a diffusion model.
Given input noisy patches and conditioning information like text prompts,
it's trained to predict the original clean patches.
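As a rough illustration of what "predict the original clean patches" means as a training objective, here is a small Python sketch. The research post does not give the noise schedule, the loss, or the model, so the blending formula (a common variance-preserving choice), the names `add_noise` and `denoising_loss`, and the pass-through "model" are all assumptions, not a description of how Sora is actually trained.

```python
import numpy as np

def add_noise(clean, noise, signal_level):
    """Blend clean patches with Gaussian noise.

    signal_level is in [0, 1]: 1.0 returns the clean patches, 0.0 returns pure noise.
    """
    return np.sqrt(signal_level) * clean + np.sqrt(1.0 - signal_level) * noise

def denoising_loss(predicted_clean, clean):
    """Mean squared error between the model's guess and the true clean patches."""
    return float(np.mean((predicted_clean - clean) ** 2))

rng = np.random.default_rng(0)
clean = rng.normal(size=(64, 128))   # 64 patch tokens, 128 values each (made-up sizes)
noise = rng.normal(size=clean.shape)
noisy = add_noise(clean, noise, signal_level=0.5)

# Stand-in for the model: a real system would run a transformer over the noisy
# patches plus the text conditioning. Here the noisy input is passed straight through.
predicted = noisy
loss = denoising_loss(predicted, clean)
print(loss > 0.0)  # True

# A perfect prediction gives zero loss.
print(denoising_loss(clean, clean))  # 0.0
```

Training would repeat this over many videos and noise levels, nudging the model so that its prediction gets closer to the clean patches each time.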
Importantly, Sora is a diffusion transformer.
Transformers have demonstrated remarkable scaling properties across a variety of domains,
including language modeling, computer vision, and image generation.
In this work, we find that diffusion transformers scale effectively as video
models as well. Sample quality improves markedly as training compute increases. This is another area
where checking out the YouTube channel and seeing the actual images would be really valuable. They show a video
of a Shiba Inu in a blue knitted hat, frolicking in the snow, and they show it at a base compute
rendering, a 4x compute, and a 16x compute, and the upgrade is dramatic in each case. Then they get
into a section that starts to articulate how they've done some things differently than others have.
They write, past approaches to image and video generation typically resize, crop, or trim videos to a standard size,
e.g. 4-second videos at 256 by 256 resolution. We find that instead, training on data at its native size provides several benefits.
One is sampling flexibility. They write, Sora can sample widescreen 1920 by 1080p videos,
vertical 1080 by 1920 videos, and everything in between. This, they say, lets Sora create content for different devices directly at their native aspect ratios.
It also lets us quickly prototype content at lower sizes before generating at full resolution,
all with the same model.
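One way to picture why a patch-based model can handle different sizes is that a video of any shape just becomes a longer or shorter sequence of patches. The Python sketch below counts patches for a widescreen, a vertical, and a small draft video. The compression factors and patch sizes are invented for illustration, as is the helper `num_patches`; OpenAI has not published the real values.

```python
def num_patches(frames, height, width, t_down, s_down, pt, ps):
    """Count space-time patches for a video after a hypothetical compression step.

    t_down / s_down: temporal and spatial downsampling factors of the compressor.
    pt / ps: patch size in latent frames and latent pixels.
    """
    latent_t = frames // t_down
    latent_h = height // s_down
    latent_w = width // s_down
    return (latent_t // pt) * (latent_h // ps) * (latent_w // ps)

# All factors below are made up for illustration.
settings = dict(t_down=4, s_down=8, pt=1, ps=2)
wide = num_patches(frames=48, height=1080, width=1920, **settings)
tall = num_patches(frames=48, height=1920, width=1080, **settings)
draft = num_patches(frames=48, height=270, width=480, **settings)
print(wide, tall, draft)
```

Under these assumed settings the widescreen and vertical videos produce the same number of patches, just arranged differently, while the low-resolution draft produces far fewer, which is one plausible reading of why prototyping at lower sizes is quicker.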
They also say that they, quote, empirically find that training on videos at their native aspect
ratios improves composition and framing.
One of the things they noticed is that when they worked with a version of their model that
crops all training videos to be square, which they say is common practice, it had a tendency
to generate videos where the subject was only partially in view.
Now, another piece of this training has to do with language.
They write, training text-to-video generation systems requires a large amount of videos with
corresponding text captions. We apply the re-captioning technique introduced in DALL·E 3 to videos.
We first train a highly descriptive captioner model and then use it to produce text captions
for all videos in our training set. We find that training on highly descriptive video captions
improves text fidelity, as well as the overall quality of videos. Similar to DALL·E 3, we also
leverage GPT to turn short user prompts into longer detailed captions that are sent to the video model.
This enables Sora to generate high-quality videos that accurately follow user prompts.
Next, the research paper shows something that they actually weren't showing off on their Sora landing page.
Everything on that landing page that we saw was text-to-video samples. However, Sora can also be prompted with pre-existing images or video.
One of the use cases for that is, for example, animating DALL·E images, turning 2D images, in other words, into 3D animations.
Sora also has the capacity to extend other videos that have been generated with AI, which, among other things, allows them to extend a video both forward and backward to produce an infinite loop.
Sora has video-to-video capacity.
And in one of the things that is most impressive,
and has been showing up on Twitter the most,
they write,
We can also use Sora to gradually interpolate between two input videos,
creating seamless transitions between videos
with entirely different subjects and scene compositions.
The example they give is a drone flying through what looks like Roman ruins
that turns into a butterfly,
which is then underwater with the ruins growing coral.
This is another area where seeing what I'm describing is going to be much more useful,
but I think part of why people are picking up on it is that this makes it really easy to imagine
how AI could be used to do some things that would be very hard to replicate without significant
Hollywood budgets. Being able to take two unlike things and transition them from one to the other
opens up just a world of creative possibilities. Finally, they conclude with a really interesting
section on emerging simulation capabilities. OpenAI writes, we find that video models exhibit
a number of interesting emergent capabilities when trained at scale. These capabilities enable Sora to simulate some aspects
of people, animals, and environments from the physical world.
These properties emerge without any explicit inductive biases for 3D, objects, etc.
They are purely phenomena of scale.
The examples they give are 3D consistency.
Sora, they write, can generate videos with dynamic camera motion.
As the camera shifts and rotates, people and scene elements move consistently through three-dimensional
space.
Another example is long-range coherence and object permanence.
They write,
A significant challenge for video generation systems has been maintaining temporal consistency
when sampling long videos.
We find that Sora is often, though not always, able to effectively model both short and long-range dependencies.
For example, our model can persist people, animals, and objects even when they are occluded or leave the frame.
Likewise, it can generate multiple shots of the same character in a single sample, maintaining their appearance throughout the video.
A third emergent capability is interacting with the world.
They write, Sora can sometimes simulate actions that affect the state of the world in simple ways.
For example, a painter can leave new strokes along a canvas that persist over time,
or a man can eat a burger and leave bite marks.
Finally, a fourth emerging capability is simulating digital worlds.
They write, Sora can simultaneously control the player in Minecraft with a basic policy
while also rendering the world and its dynamics in high fidelity.
These capabilities can be elicited zero-shot by prompting Sora with captions mentioning Minecraft.
Combined, they say, these capabilities suggest that continued scaling of video models
is a promising path towards the development of highly capable simulators of the physical and digital
world and the objects, animals, and people that live within them.
Now let's go to how some others
were trying to simplify some of these concepts.
Dan Shipper over at Every wrote a piece
called Sora and the Future of Filmmaking.
In a section called How Sora uses massive amounts of data
to make mind-blowing video clips,
he tries to break down this idea of patches.
He says, imagine a film print of The Dark Knight.
You unroll the film from its scroll
and chop off the first 100 frames of cellophane.
You take each frame,
here the Joker laughing maniacally,
there Batman grimacing,
and perform the following strange ritual.
You use an X-Acto knife to cut a gash
in the shape of an amoeba in the first one.
You excise this cellophane amoeba with tweezers as carefully as a watchmaker and put it in a safe place.
Then you move on to the next frame.
You cut the same amoeba-shaped hole from the same part of the next cellophane frame.
You remove this new amoeba, shaped precisely the same as the last one, with tweezers and stack it carefully on top of the first one.
Keep going until you've done this to all 100 frames.
You now have a multicolored amoeba extruding through its y-axis, a tower of cellophane that could be run through a projector to show a small area of The Dark Knight,
as if someone stuck their hand in a loose fist in front of the projector, letting only a little bit of the movie through.
This tower is then compressed and turned into what's called a patch, a smear of color changing
through time. The patch is the basic unit of Sora in the same way that the token is the basic
unit of GPT-4. Tokens are bits of words, while patches are bits of movies. GPT-4 has been trained
to take in a sequence of tokens and output the next token in the sequence. Sora has been trained
to do the same thing. It takes in a sequence of patches and outputs the next patch in the sequence.
Patches are innovative, and Sora appears to be so powerful, because they allow
OpenAI to train Sora on an immense amount of image and video data. Imagine patches
cut out of every video in existence, infinite towers of cellophane, stacked and fed into the model.
Previous text-to-image approaches required images and videos used in training to all be the same size,
which required significant pre-processing to cut videos down to size.
But because Sora trains on patches instead of the full frame of the video,
it can gobble up any video or image without requiring it to be cut down.
As a result, more data can be used for training for a higher quality output.
He also says,
Another big advance with Sora is the architecture.
Traditionally, text-to-video models like Runway are diffusion models,
while text models like GPT-4 are transformers.
Sora is a diffusion transformer, a mashup of the two.
Instead of predicting the next piece of text in a sequence,
Sora predicts the next patch in a sequence of patches.
By using this architecture, OpenAI can throw much more data and compute at training Sora,
and the results are stunning.
If you haven't been over to every.to,
Dan writes some great stuff about AI, and I highly suggest checking it out.
Now, of the people who are talking about the limits of Sora,
the most interesting ones are those who are actually zooming out to how they might use it
when it becomes available.
Martin Nebelong writes,
Systems like OpenAI's new Sora video model
just shows us that anyone can create a pretty video
or a beautiful picture using nothing but a text prompt.
But if we are to use the technology to create interesting art,
thought-provoking cinema, and great games,
we need control of the output down to a granular level
if that's what our project calls for.
We need better tools and better interfaces to interact with the generative matter.
Martin is calling for things like video inpainting,
more precise control over specific frames,
specific texture and movement,
all of which there are lots of people thinking about,
and so I anticipate those coming to bear as well.
But still, everything that I've seen for the last couple days reinforces for me
that we really are now once again living in a post-X world,
and X this time is not ChatGPT, but Sora.
Will it have as big of an impact?
It's hard to say.
Using text is more universal than using images,
which is more universal than using videos.
And so perhaps it won't ultimately be as influential.
But I'm not so sure.
In the same way that Midjourney and DALL·E 3 are democratizing
how people create images and who actually makes them,
I'll be interested to see how Sora and other tools like it, as they catch up, do the same for video.
Fascinating times, but that is going to do it for this AI Breakdown.
Hope you're having a great weekend.
Until next time, peace.
