The AI Daily Brief: Artificial Intelligence News and Analysis - OpenAI's Stunning New Video Model SORA Shocks the AI World
Episode Date: February 16, 2024
Welp, they've done it again. On a day when it was supposed to be all about Google's million+ token lengths in Gemini 1.5, OpenAI completely stole their thunder with the announcement of Sora, easily the most advanced text-to-video generation model we've seen.
ABOUT THE AI BREAKDOWN
The AI Breakdown helps you understand the most important news and discussions in AI.
Subscribe to The AI Breakdown newsletter: https://theaibreakdown.beehiiv.com/subscribe
Subscribe to The AI Breakdown on YouTube: https://www.youtube.com/@TheAIBreakdown
Join the community: bit.ly/aibreakdown
Learn more: http://breakdown.network/
Transcript
Today on the AI Breakdown, we are looking at OpenAI's new Sora video generation model,
which not only blows everything we've seen so far out of the water,
but which many are already calling the GPT-4 moment for AI video generation.
The AI Breakdown is a daily podcast and video about the most important news and discussions in AI.
Go to breakdown.network for more information about our YouTube, our Discord, and our newsletter.
Hello, friends. Quick note before we get into the show,
SORA was such an exciting thing to discuss that I went a little bit longer than I normally do,
and I decided to just make this the focus of the entire show.
We will be back to our normal format of Briefs and Mains next week.
But for now, let's just talk some Sora.
Welcome back to the AI Breakdown.
Usually I try to make sure that any content I'm doing works for both YouTube and for the podcast.
However, if you are just listening to this, I highly suggest you also check out the YouTube version.
Inherently, this is a conversation that is better had with the visual reference points of all of the videos that we're going to be talking about, starting with this one that's on the screen now.
It's a famous set of video generations of Will Smith eating noodles that looks something like if Midjourney version 1 or version 2 came to life.
The movement is weird, the faces are frankly disturbing, but this is where we were just a year ago.
Now OpenAI has released SORA.
OpenAI COO Brad Lightcap wrote, this is one of those things you tell yourself is coming and you think you're
ready for it and couldn't possibly be surprised by it, but then you see it and don't quite
believe it, and you're not sure why you didn't think you'd be surprised. So what is Sora?
Sora is a video generation model. OpenAI writes, it's an AI model that can create realistic
and imaginative scenes from text instructions. What it really is is easily the most advanced
and impressive AI video generation model we've ever seen, and it's not particularly close.
Unlike video generators like Pika and Runway right now, which work in four or five or six-second segments,
Sora can generate videos of up to 60 seconds.
The quality, just based on anyone's visual reference, also blows pretty much everything we've
seen out of the water.
The level of detail, the sheer physics involved in these videos, makes them nearly indistinguishable
from something captured by film or created in all of the old-fashioned ways.
The demo site has about 30 examples, and they run the gamut from people to earth,
urban settings, to landscape visuals, to animal scenes, to cartoon style generations, to historic
footage of California during the Gold Rush. Now, unfortunately for all of us who instantly want to get
our hands on this, this is a rare instance of OpenAI announcing something that isn't broadly accessible.
They write, today, Sora is becoming available to red teamers to assess critical areas for
harms or risks. We're also granting access to a number of visual artists, designers, and filmmakers
to gain feedback on how to advance the model to be most helpful for creative professionals.
We're sharing our research progress early to start working with and getting feedback from people
outside of OpenAI and to give the public a sense of what AI capabilities are on the horizon.
Now, it's not just the length that makes SORA different.
As OpenAI writes, Sora is able to generate complex scenes with multiple characters,
specific types of motion, and accurate details of the subject and background.
The model understands not only what the user has asked for in the prompt,
but how those things exist in the physical world.
Part of this comes from, as they put it,
the model's deep understanding of language,
which helps it accurately interpret prompts
and, quote, generate compelling characters
that express vibrant emotions.
Now, Dr. Jim Fan from NVIDIA
wrote about how much more there is likely to be
than even meets the eye.
He writes,
If you think OpenAI Sora is a creative toy like DALL-E,
think again.
Sora is a data-driven physics engine.
It is a simulation of many worlds real or fantastical.
The simulator learns intricate rendering, intuitive physics, long-horizon reasoning, and semantic grounding, all by some denoising and gradient maths.
I wouldn't be surprised if SORA is trained on lots of synthetic data using Unreal Engine 5.
He then breaks down a video of two pirate ships in a cup of coffee for which the prompt was,
photorealistic close-up video of two pirate ships battling each other as they sail inside a cup of coffee.
Jim explains what's going on behind the scenes, writing,
The simulator instantiates two exquisite 3D assets, pirate ships with different decorations.
SORA has to solve text to 3D implicitly in its latent space.
3D objects are consistently animated as they sail and avoid each other's paths.
Fluid dynamics of the coffee, even the foams that form around the ships.
Fluid simulation is an entire subfield of computer graphics, which traditionally requires
very complex algorithms and equations.
Photorealism, almost like rendering with ray tracing.
The simulator takes into account the small size of the cup compared to oceans and applies
tilt-shift photography to give a minuscule vibe.
The semantics of the scene does not exist in the real world, but the engine
still implements the correct physical rules that we expect.
Now, if Dr. Jim is explaining what's going on,
the vast majority of people who are tweeting just have their jaw on the floor.
Andrew Cotei writes,
Once again, OpenAI seemed to be one to two years ahead of everybody else.
Sully Omar says,
this is genuinely the first time I've audibly said what the F for an AI video.
100% AI generated.
The face, the bread, the pores, everything about this feels real.
McKay Wrigley writes,
I don't even know what to say.
These clips generated by OpenAI's Sora model have me speechless.
We knew good AI text-to-video would come, but this quickly? Unreal. We're stepping into a new world.
Buckle up.
Now, fascinatingly, this comes on the heels, of course, of yesterday morning's announcement
of Google's Gemini 1.5, with its 1 million token context window.
I even said at the end of yesterday's video that I'd be interested to see how OpenAI responded
to that.
Well, clearly, this wasn't just some reaction to Gemini's announcement, but boy, just when you thought
that Google was catching up to OpenAI from a public perception standpoint, they go and drop SORA
and just absolutely rip the conversation back in their orbit. Indeed, the simplest way to describe
how most people feel about this comes from Siqi Chen, who says, OpenAI just casually dropping
the GPT-4 of video generation on a Thursday morning. Going back to Sora's announcement post, they do
say that this model has weaknesses. They suggest it may struggle with accurately simulating the physics
of a complex scene, and that it may not understand specific instances of cause and effect.
The example they give is a person taking a bite out of a cookie, and afterward the cookie not having a bite mark.
They also say that the model might confuse details, such as mixing up left and right,
or struggling with, quote, precise descriptions of events that take place over time, like following a specific camera trajectory.
Now, when it comes to how this model works, OpenAI writes SORA is a diffusion model,
which generates a video by starting off with one that looks like static noise and gradually transforms it by removing the noise over many steps.
Sora is capable of generating entire videos all at once, or extending generated videos to make
them longer. By giving the model foresight of many frames at a time, we've solved a challenging
problem of making sure a subject stays the same, even when it goes out of view temporarily.
Similar to GPT models, Sora uses a transformer architecture, unlocking superior scaling performance.
We represent videos and images as collections of smaller units of data called patches, each of which
is akin to a token in GPT. By unifying how we represent data, we can train diffusion
transformers on a wider range of visual data than was possible before.
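To make the patch idea a little more concrete, here is a minimal sketch of cutting a video into spacetime blocks and flattening each one into a token-like vector. This is an illustration only: OpenAI has not published Sora's patch sizes or code, and the function names, the patch sizes, and the single-channel dummy video below are all assumptions made for the example.

```python
# Toy illustration only: patch sizes and names are hypothetical, not Sora's.
from typing import List

Video = List[List[List[float]]]  # frames x height x width, one channel


def make_video(frames: int, height: int, width: int) -> Video:
    """Build a dummy video whose pixel value encodes its (t, y, x) position."""
    return [
        [[float(t * 10000 + y * 100 + x) for x in range(width)] for y in range(height)]
        for t in range(frames)
    ]


def to_spacetime_patches(video: Video, pt: int, ph: int, pw: int) -> List[List[float]]:
    """Cut a video into non-overlapping pt x ph x pw blocks, each flattened to one vector."""
    frames, height, width = len(video), len(video[0]), len(video[0][0])
    if frames % pt or height % ph or width % pw:
        raise ValueError("video dimensions must be divisible by the patch size")
    patches = []
    for t0 in range(0, frames, pt):
        for y0 in range(0, height, ph):
            for x0 in range(0, width, pw):
                # One patch spans several frames as well as a small image region.
                patch = [
                    video[t][y][x]
                    for t in range(t0, t0 + pt)
                    for y in range(y0, y0 + ph)
                    for x in range(x0, x0 + pw)
                ]
                patches.append(patch)
    return patches


video = make_video(frames=4, height=4, width=6)
patches = to_spacetime_patches(video, pt=2, ph=2, pw=3)
print(len(patches), len(patches[0]))
```

Each entry in `patches` plays the role that a token plays in a language model: a transformer would see the video as a sequence of these vectors rather than as raw frames. A still image is just the one-frame case, which is one way to read the point about unifying how the data is represented.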
They also note that, in addition to being able to generate video just from text, the model can
also take an existing still image and generate a video from that. As with everything they do,
they make it clear that this is not just an endpoint, but is a milestone for a bigger objective.
They conclude their page about Sora saying, Sora serves as a foundation for models that
can understand and simulate the real world, a capability we believe will be an important milestone
for achieving AGI. Now, some people, even if they were incredibly impressed by the quality
of the videos, did call out the challenge of having AI-generated video that's this convincing.
Avichal Garg from Electric Capital writes,
Deepfakes are going to be a big problem in 2024. How do we mitigate the risk of deepfakes?
One, content should have a verifiable cryptographic signature.
Two, publish the signature to an openly readable, difficult to corrupt database.
AI handshake crypto.
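As a rough sketch of what those two steps could look like, the example below signs a piece of content and publishes the signature to an append-only, hash-chained log. It is not any real system: the key, the function names, and the `SignatureLog` class are hypothetical, and HMAC with a shared secret stands in for the public-key signature a real deployment would need so that anyone could verify without holding the secret.

```python
import hashlib
import hmac

# Hypothetical demo key. HMAC is a stand-in here; a real scheme would use a
# private signing key with a published public key.
SECRET_KEY = b"demo-key-not-for-production"


def sign_content(content: bytes, key: bytes = SECRET_KEY) -> str:
    """Step one: return a hex HMAC-SHA256 tag over the content bytes."""
    return hmac.new(key, content, hashlib.sha256).hexdigest()


def verify_content(content: bytes, signature: str, key: bytes = SECRET_KEY) -> bool:
    """Check that the content still matches the published signature."""
    return hmac.compare_digest(sign_content(content, key), signature)


class SignatureLog:
    """Step two: an append-only log where each entry commits to the previous one."""

    def __init__(self) -> None:
        self.entries = []

    def publish(self, signature: str) -> dict:
        prev_hash = self.entries[-1]["entry_hash"] if self.entries else "0" * 64
        entry_hash = hashlib.sha256((prev_hash + signature).encode()).hexdigest()
        entry = {"signature": signature, "prev_hash": prev_hash, "entry_hash": entry_hash}
        self.entries.append(entry)
        return entry

    def is_intact(self) -> bool:
        """Recompute the hash chain; any edited entry breaks it."""
        prev_hash = "0" * 64
        for entry in self.entries:
            expected = hashlib.sha256((prev_hash + entry["signature"]).encode()).hexdigest()
            if entry["prev_hash"] != prev_hash or entry["entry_hash"] != expected:
                return False
            prev_hash = entry["entry_hash"]
        return True


video_bytes = b"official video bytes"
log = SignatureLog()
tag = sign_content(video_bytes)
log.publish(tag)
print(verify_content(video_bytes, tag))
print(verify_content(b"tampered bytes", tag))
print(log.is_intact())
```

The hash chain is what makes the log hard to corrupt quietly: changing an old signature changes every hash after it, so anyone holding a later hash can spot the edit.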
Now, lest you think it's just the crypto people saying this,
the Biden White House seems to be thinking in similar ways when it comes to verifying
all official White House communications.
Matthew Kobach writes, I just watched the OpenAI videos and then I scrolled Instagram
for a minute. And now everything seems like it could be AI.
This changes our perception of video forever.
Now, on that safety front, OpenAI writes that they're working right now with red teamers,
especially those who are domain experts in areas like misinformation, hateful content, and bias
to adversarially test the model.
In terms of other safety initiatives, they also say that they're building tools like a detection classifier
that can detect when a video was generated by SORA,
and they reiterate their position that the only way to actually figure out how AI is going to work in the world is by trying it.
Despite extensive research and testing, they say,
we cannot predict all the beneficial ways people will use our technology,
nor all the ways people will abuse it. That's why we believe that learning from real-world use
is a critical component of creating and releasing increasingly safe AI systems over time.
Now there is another dimension of this which I wanted to discuss, which is the marketing
dimension. I jokingly said that the most shocking part of SORA was that it seems like they
actually had the marketing team involved in the name. No more GPT-4 or DALL-E 3 or anything like that.
Instead, we get this pretty-to-say four-letter word that actually feels like a brand.
Even beyond that, though, VC Rachel Horwitz pointed out the difference between the way
that the CEO of Google announced Gemini 1.5 and how Sam Altman was introducing SORA.
The way she summed it up was scrappy digital native founder casually interacting with people
on the web to show off mind-blowing new tech over slick, overly produced marketing videos and PR-approved
statements. What she's referring to is that just after the announcement, Sam Altman also tweeted,
we'd like to show you what Sora can do. Please reply with captions for videos you'd like to see and
we'll start making some. Don't hold back on the detail or difficulty. So far on his thread he's
published, a wizard wearing a pointed hat and a blue robe with white stars casting a spell
that shoots lightning from his hand and holding an old tome in his other hand. A half duck,
half dragon flies through a beautiful sunset with a hamster dressed in adventure gear on its
back. A street-level tour through a futuristic city which in harmony with nature and also
simultaneously cyberpunk and high tech. The city should be clean with advanced futuristic trams,
beautiful fountains, giant holograms everywhere, and robots all over. A futuristic drone race
at sunset on the planet Mars. Two golden retrievers podcasting on top of a mountain,
an instructional cooking session for homemade gnocchi, hosted by a grandmother social media influencer
set in a rustic Tuscan country kitchen with cinematic lighting. And it just goes on and on,
each in some way more impressive than the last. Now, if you're wondering what Runway and Pika
are thinking about this right now, all we've seen so far is the CEO of Runway, Cristobal
Valenzuela saying, game on, which is of course exactly what we want to hear. Meanwhile,
Prati, the creator of Sink, took the image of the cyberpunk-looking woman walking through Tokyo
and added a clip of dialogue from the Matrix with matching lip syncing.
In this way, showing how these AI tools will interplay with one another.
In other words, he built off this AI generation that came from OpenAI
and added his own technology to turn it into a totally different visual scene.
Now, one of the big byproducts of having a product with this potential level of impact
is that people are zooming out and talking about the future.
ST on Twitter writes, there are two schools of thought here.
One, in five years, entertainment networks (they're not really
social networks) don't need creators. They just churn out an endless stream of AI-generated content
with a bunch of AI-generated ads linking you to AI-generated product pages, keeping you amused
till the end of time. Two, in five years, this leads to an explosion in creator-generated content.
The baseline quality of content goes up. We get Hollywood-level production with more niche,
more extreme storylines. The new creators are the prompters. Now, obviously none of us know
exactly how this is going to go. However, I can share my observation that if you look at the
communities that have already started to arise around Runway and around Pika, what you're
seeing is people who are taking advantage of a new medium to tell classic stories, but just in a
different way. There is a huge difference between the really great AI-generated stuff and the
more boring, not-so-good AI-generated stuff. And when it comes to story, much of that isn't just
about the quality of the technology, but about narrative and about how people put disparate elements
together. When we think about film or TV, it's not just a collection of sets. It's a collection of
very specific decisions about script, about character, about narrative arc, and even visually about the
way that one shot interplays with the next. Yes, of course, AI will be able to do and replicate
much of that, but it seems likely to me that there's going to be a huge chasm between the best
AI-generated entertainment and the worst. I tend to agree that the people who are going to be best at
creating that AI-generated entertainment are likely not exclusively the people who are creating
content already. I almost think that the inevitable reality is that both of these two models actually
come to bear. The idea that entertainment slash social networks churn out an endless stream of AI-generated
content that's perfectly personalized to the people who are consuming it, there's no universe in
which that's not going to happen, at least in some way. And yet at the same time, I also believe
that there's no universe in which the other thing, the explosion in creator-generated content, doesn't happen
as well. I think just in the same way that there is a huge range of highbrow to lowbrow to different
format content that we consume on any given day, so too will it be like that when we just have
more content because of how AI has changed our ability to create it. There was a thought coming into
2024 that perhaps this was going to be the year for AI video in the same way that 2022 was the
real breakout year for AI-generated images. Between the end of 2022 and where we are now,
tools like Midjourney and DALL-E 3 have made enormous, unbelievable strides.
There is more control, more adherence to prompts, more ability to get exactly what you want out of them.
And yet, until we saw SORA today, I think it was still hard for many to imagine what the sort of phase shift moment was going to be that made it feel like that level of advance.
Now, of course, comparing those Will Smith eating noodles videos from a year ago to these incredibly detailed hyper-realistic renderings from Sora now is truly a reminder
of how fast things are moving and what we have to expect in the future.
Ben Tossell from Ben's Bites got at that when he tweeted,
Assume token lengths will be billions. Assume videos will be indistinguishable.
Assume images will be indistinguishable.
Assume every sentence is by AI fully or in part.
Assume day-to-day tasks can be 90% automated.
What mad stuff will blow us away then?
You can, if you look hard enough, find people who are cynics about this,
or who try to tell you that it's not impressive or not a good use of
AI, but in spite of their squawking, I think that the Sora moment will be one we look back on
as a fundamental before and after break.
It feels very much like just the beginning of something, with all the anxiety and excitement
that that brings.
Still, perhaps the best summary of yesterday, between Gemini 1.5's million-token context
lengths and Sora's unbelievable advances in AI video generation, comes from swyx, who writes,
it's been a fantastic year of AI progress in the past hour.
And so, friends, that is where we will wrap the AI
Breakdown. I can't wait to get my hands on Sora and see what we can create. And I will be
eagerly awaiting all of the creations of people who do have it in the meantime. Appreciate you
listening or watching as always. And until next time, peace.
