No Priors: Artificial Intelligence | Technology | Startups - OpenAI’s Sora team thinks we’ve only seen the "GPT-1 of video models"
Episode Date: April 25, 2024

AI-generated videos are not just leveled-up image generators; they could be a big step forward on the path to AGI. This week on No Priors, the team from Sora is here to discuss OpenAI's recently announced generative video model, which can take a text prompt and create realistic, visually coherent, high-definition clips that are up to a minute long. Sora team leads Aditya Ramesh, Tim Brooks, and Bill Peebles join Elad and Sarah to talk about developing Sora. The generative video model isn't yet available for public use, but the examples of its work are very impressive. The team believes we're still in the GPT-1 era of AI video models, and they are focused on a slow rollout to ensure the model is in the best place possible to offer value to the user and, more importantly, that they've applied all the safety measures possible to avoid deepfakes and misinformation. They also discuss what they're learning from implementing diffusion transformers, why they believe video generation is taking us one step closer to AGI, and why entertainment may not be the main use case for this tool in the future.

Show Links:
Bling Zoo video
Man eating a burger video
Tokyo Walk video

Sign up for new podcasts every week. Email feedback to show@no-priors.com

Follow us on Twitter: @NoPriorsPod | @Saranormous | @EladGil | @_tim_brooks | @billpeeb | @model_mechanic

Show Notes:
(0:00) Sora team introduction
(1:05) Simulating the world with Sora
(2:25) Building the most valuable consumer product
(5:50) Alternative use cases and simulation capabilities
(8:41) Diffusion transformers explanation
(10:15) Scaling laws for video
(13:08) Applying end-to-end deep learning to video
(15:30) Tuning the visual aesthetic of Sora
(17:08) The road to "desktop Pixar" for everyone
(20:12) Safety for visual models
(22:34) Limitations of Sora
(25:04) Learning from how Sora is learning
(29:32) The biggest misconceptions about video models
Transcript
Hi listeners, welcome to another episode of No Priors.
Today we're excited to be talking to the team behind OpenAI's SORA,
which is a new generative video model that can take a text prompt and return a clip that is high definition, visually coherent, and up to a minute long.
SORA also raised the question of whether these large video models are world simulators and applied the scalable Transformers architecture.
to the video domain.
We're here with the team behind it.
Aditya Ramesh, Tim Brooks, and Bill Peebles.
Welcome to No Priors, guys.
Thanks for having us.
To start off, why don't we just ask each of you
to introduce yourselves so our listeners know
who we're talking to. Aditya, mind starting us off?
Sure, I'm Aditya. I lead the Sora team
together with Tim and Bill.
Hi, I'm Tim. I also lead the Sora team.
And I'm Bill, I also lead the Sora team.
Simple enough.
Maybe we can just start with, you know,
the OpenAI mission,
AGI, right? Greater intelligence. Is text-to-video, like, on the path to that mission? How'd you end up working on this?
Yeah, we absolutely believe models like SORA are really on the critical pathway to AGI.
We think one sample that illustrates this kind of nicely is a scene with a bunch of people walking through Tokyo during the winter.
And in that scene, there's so much complexity. So you have a camera which is flying through the scene.
There's lots of people which are interacting with one another. They're talking. They're holding hands.
There are people selling items at nearby stalls. And we really think this sample
illustrates how SORA is on a pathway towards being able to model extremely complex environments
and worlds all within the weights of a neural network. And looking forward, you know, in order to
generate truly realistic video, you have to have learned some model of how people work, how they
interact with others, how they think ultimately, and not only people, also animals and really
any kind of object you want to model. And so looking forward as we continue to scale up models
like SORA, we think we're going to be able to build these like world simulators where
essentially, you know, anybody can interact with them. I as a human can have my own simulator
running and I can go and give a human in that simulator work to go do and they can come back
with it after they're done. And we think this is a pathway to AGI, which is just going to happen
as we scale up SORA in the future. It's been said that we're still far away despite massive
demand for a consumer product. Like, is that on the roadmap? What do you have to work on
before you have broader access to SORA? Tim, you want to talk about it? Yeah, so
we really want to engage with people outside of OpenAI in thinking about how SORA will impact
the world, how it will be useful to people. And so we don't currently have immediate plans or even
a timeline for creating a product. But what we are doing is we're giving access to SORA to a small
group of artists as well as to Red Teamers to start learning about what impact SORA will have.
And so we're getting feedback from artists about how we can make it most useful as a tool for
them as well as feedback from red teamers about how we can make this safe, how we could introduce
it to the public. And this is going to set our roadmap for our future research and inform
if we do in the future end up coming up with the product or not exactly what timelines that would
have. Can you tell us about some of the feedback you've gotten? Yeah. So we have given access to SORA
to like a small handful of artists and creators just to get early feedback. In general, I think
a big thing is just controllability.
So right now, the model really only accepts text as input.
And while that's useful, it's still pretty constraining in terms of being able to specify
precise descriptions of what you want.
So we're thinking about, like, you know, how to extend the capabilities of the model
potentially in the future so that you can supply inputs other than just text.
Do you all have a favorite thing that you've seen artists or others use it for or a favorite
video or something that you've found really inspiring?
I know that when it launched, a lot of people were really struck by just how beautiful some of the images were, how striking, how you'd see the shadow of a cat in a pool of water, things like that.
But as just curious, what you've seen sort of emerge as people, more and more people start using it.
Yeah, it's been really amazing to see what the artists do with the model, because we have our own ideas of some things to try, but then people who, for their profession, are making creative content are, like, so creatively brilliant and do such amazing things.
So shy kids have this really cool video that they made, this short story Air Head, with this character that has a balloon.
And they really, like, made this story.
And there it was really cool to see a way that Sora can unlock and make this story easier for them to tell.
And I think there it's even less about, like, a particular clip or video that Sora made and more about this story that these artists want to tell and are able to share.
and that Sora can help enable that. So that is really amazing to see.
You mentioned the Tokyo scene. Others?
My personal favorite sample that we've created is the Bling Zoo.
So I posted this on my Twitter the day we launched Sora.
And it's essentially a multi-shot scene of a zoo in New York, which is also a jewelry store.
And so you see like saber-toothed tigers kind of like decked out with Bling.
It was very surreal.
Yeah, yeah. And so I love those kinds of samples because as someone who
you know, loves to generate creative content, but doesn't really have the skills to do it.
It's like so easy to go play with this model and to just fire off a bunch of ideas and get
something that's pretty compelling. Like the time it took to actually generate that in terms
of iterating on prompts was, you know, really like less than an hour. So I get something I really
loved. So I had so much fun just playing with the model to get something like that out of it.
And it's great to see the artists are also enjoying using the models and getting great content
from that. What do you think is a timeline to broader use of these sorts of models for
films or other things? Because if you look at, for example, the evolution of
Pixar, they really started making these Pixar shorts, and then a subset of them
turned into these longer-format movies. And a lot of it had to do with how well
could they actually world-model even little things like the movement of hair or
things like that. And so it's been interesting to watch the evolution of that
prior generation of technology, which I now think is 30 years old or something
like that. Do you have a prediction on when we'll start to see actual content,
either from SORA or from other models, that will be professionally produced and
sort of part of the broader media genre?
That's a good question.
I don't have a prediction on the exact timeline,
but one thing related to this I'm really interested in
is what things other than traditional films people might use this for.
I do think that maybe over the next couple years
we'll see people starting to make more and more films,
but I think people will also find completely new ways
to use these models that are just different
from the current media that we're used to.
Because it's a very different paradigm when you can tell these models kind of what you want them to see and they can respond in a way.
And maybe they're just like new modes of interacting with content that like really creative artists will come up with.
So I'm actually like most excited for what totally new things people will be doing.
That's just different from what we currently have.
It's really interesting because one of the things you mentioned earlier, this is also a way to do world modeling.
And I think you've been at OpenAI for something like five years.
And so you've seen a lot of the evolution of models and the company and what you've worked on.
And I remember going to the office really early on, and it was initially things like robotic arms.
And it was self-playing games or self-play for games and things like that.
As you think about the capabilities of this world simulation model, do you think it'll become a physics engine for simulation where people are, you know, actually simulating like wind tunnels?
Is it a basis for robotics?
Or, you know, is it something else?
I'm just sort of curious where are some of these other future-forward applications that could emerge.
Yeah, I totally think that carrying out simulations in the video model is something that we're going to be able to do in the future at some point.
Bill actually has a lot of thoughts about this sort of thing, so maybe you can...
Yeah, I mean, I think you hit the nail on the head with applications like robotics.
You know, there's so much you learn from video, which you don't necessarily get from other modalities,
which companies like OpenAI have invested in a lot in the past, like language.
You know, like the minutiae of, like, how arms and joints move through space, you know, again, getting back to
that scene in Tokyo, how those legs are moving and how they're making contact with the ground
in a physically accurate way. So you learned so much about the physical world just from training
on raw video that we really believe that it's going to be essential for things like physical
embodiment moving forward. And talking more about the model itself, there are a bunch of really
interesting innovations here, right? So not to put you on the spot, Tim, but can you describe for
a broad technical audience what a diffusion transformer is? Totally. So SORA builds
on research from both the DALL-E models and the GPT models at OpenAI. And diffusion
is a process that creates data, in our case videos, by starting from noise and iteratively
removing noise many times until eventually you've removed so much noise that it just creates
a sample.
And so that is our process for generating the videos.
We start from a video of noise and we remove it incrementally.
But then, architecturally, it's really important
that our models are scalable and that they can learn from a lot of data and learn these really
complex and challenging relationships in videos.
And so we use an architecture that is similar to the GPT models and that's called a transformer.
And so diffusion transformers combine these two concepts.
And the transformer architecture allows us to scale these models and as we put more compute and
more data into training them, they get better and better.
And we even released a technical report on SORA and we show the results that you get from
the same prompt when you use a smaller amount of compute, an intermediate amount of compute
and more compute.
And by using this method, as you use more and more compute, the results get better and better.
And we strongly believe this trend will continue so that by using this really simple methodology,
we'll be able to continue improving these models by adding more compute, adding more data,
and they will be able to do all these amazing things we've been talking about, having
better simulation and longer-term generations.
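To make Tim's description concrete, here is a minimal, hypothetical sketch of a diffusion-transformer training step in PyTorch: video tokens are corrupted with noise, and a small transformer is trained to predict that noise so it can later be removed step by step at sampling time. The class, shapes, and toy noise schedule are illustrative assumptions, not the actual Sora implementation.

```python
# Hypothetical sketch of a diffusion transformer training step.
# This is NOT Sora's code; names, shapes, and the noise schedule are illustrative.
import torch
import torch.nn as nn

class TinyDiffusionTransformer(nn.Module):
    def __init__(self, token_dim=64, n_heads=4, n_layers=2, n_steps=1000):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model=token_dim, nhead=n_heads,
                                           batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, num_layers=n_layers)
        self.step_embed = nn.Embedding(n_steps, token_dim)  # conditions on the noise level
        self.out = nn.Linear(token_dim, token_dim)          # predicts the added noise

    def forward(self, noisy_tokens, t):
        # noisy_tokens: (batch, n_tokens, token_dim); t: (batch,) diffusion step indices
        h = noisy_tokens + self.step_embed(t).unsqueeze(1)
        return self.out(self.backbone(h))

# One training step: add noise to clean video tokens, ask the model to predict that noise.
model = TinyDiffusionTransformer()
clean_tokens = torch.randn(2, 16, 64)            # stand-in for patchified video latents
t = torch.randint(0, 1000, (2,))                 # random diffusion step per sample
alpha = (1.0 - t.float() / 1000).view(-1, 1, 1)  # toy linear noise schedule
noise = torch.randn_like(clean_tokens)
noisy_tokens = alpha.sqrt() * clean_tokens + (1 - alpha).sqrt() * noise
loss = nn.functional.mse_loss(model(noisy_tokens, t), noise)
loss.backward()  # at sampling time, the predicted noise is subtracted step by step
```

At generation time the same model is applied repeatedly, starting from pure noise and removing a fraction of the predicted noise at each step until a clean sample remains, which is the iterative denoising process described above.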
Bill, can we characterize at all what the scaling laws for this type of model look like yet? Good question. So as Tim alluded to, you know,
one of the benefits of using transformers is that you inherit all of their great properties that
we've seen in other domains like language. So you absolutely can begin to come up with scaling
laws for video as opposed to language. And this is something that, you know, we're actively
looking at in our team and, you know, not only constructing them, but figuring out ways to make
them better. So, you know, if I use the same amount of training compute, can I get an even
better loss without fundamentally increasing the amount of compute needed? So these are a lot of
the questions that we tackle day-to-day on the research team to make SORA and future models
as good as possible.
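As a rough illustration of what constructing a scaling law can look like in practice, here is a small sketch that fits a power law, loss ≈ a · C^(−b), to a handful of (compute, loss) points. The numbers are invented and the functional form is just the common assumption carried over from language-model scaling work, not a disclosed Sora result.

```python
# Hypothetical sketch: fitting a power-law scaling curve, loss ~ a * compute**(-b).
# The data points below are invented for illustration only.
import numpy as np

compute = np.array([1e18, 1e19, 1e20, 1e21])   # training compute (FLOPs), made up
loss    = np.array([2.10, 1.75, 1.48, 1.27])   # validation loss, made up

# Fit log(loss) = log(a) - b * log(compute) with ordinary least squares.
slope, log_a = np.polyfit(np.log(compute), np.log(loss), 1)
a, b = np.exp(log_a), -slope
print(f"loss ~ {a:.2f} * C^(-{b:.3f})")

# Extrapolate: predicted loss at 10x more compute than the largest run.
print("predicted loss at 1e22 FLOPs:", a * 1e22 ** (-b))
```

Improving the scaling law in the sense Bill mentions would mean shifting this curve down, getting a lower loss at the same compute budget.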
One of the questions about applying, you know, transformers in this domain is, like, tokenization, right? And so, by the way, I don't know who came up with this name, but, like,
latent space-time patches is like a great sci-fi name here. Can you explain, like, what that is
and why it is relevant here?
Because the ability to do minute-long generation
and get to visual and temporal coherence
is really amazing.
I don't think we came up with it as a name
so much as like a descriptive thing of exactly what,
like that's what we call it.
Yeah, even better though.
Yeah, yeah.
So one of the critical successes for the LLM paradigm
has been this notion of tokens.
So if you look at the internet,
there's all kinds of text data on it.
There's books, there's code, there's math,
And what's beautiful about language models
is that they have this singular notion of a token
which enables them to be trained
on this vast swath of very diverse data.
There's really no analog for prior visual generative models.
So, you know, what was very standard in the past before SORA
is that you would train, say, an image generative model
or a video generative model on just, like, 256 by 256 resolution images
or 256 by 256 video that's exactly like four seconds long.
And this is very limiting
because it limits the types of data you can use.
You have to throw away so much of, you know, the visual data that exists on the internet.
And that limits, like, the generalist capabilities of the model.
So with SORA, we introduced this notion of spacetime patches, where you can essentially just represent data,
however it exists, in an image or a really long video or, like, a tall vertical video,
by just taking out cubes.
So you can essentially imagine, right, a video is just like a stack, a vertical stack of individual images.
So you can just take these, like, 3D cubes out of it.
And that is our notion of a token when we ultimately feed it into the transformer.
And the result of this is that SORA, you know, can do a lot more than just generate, say, like, 720p video at some fixed duration, right?
You can generate vertical videos, widescreen videos.
You can do anything between, like, a one-to-two aspect ratio and two-to-one.
It can generate images.
It's an image generation model.
And so this is really the first generative model of visual content that has breadth in a way that language models have breadth.
So that was really why we pursued this direction.
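To illustrate the spacetime-patch idea Bill just walked through, here is a minimal, hypothetical sketch that cuts a video tensor into 3D cubes spanning a few frames and a small spatial window, and flattens each cube into one token. The patch sizes, shapes, and function name are assumptions for illustration, not Sora's actual configuration; the point is that the same operation handles widescreen, vertical, and very short clips alike.

```python
# Hypothetical sketch of "spacetime patches": cut a video into 3D cubes and
# flatten each cube into one token. Sizes below are illustrative assumptions.
import torch

def video_to_spacetime_patches(video, pt=4, ph=16, pw=16):
    """video: (frames, height, width, channels) -> (n_tokens, token_dim)."""
    f, h, w, c = video.shape
    assert f % pt == 0 and h % ph == 0 and w % pw == 0, "pad the video first"
    # Split each axis into (number of patches, patch size), then group the patch axes.
    cubes = video.reshape(f // pt, pt, h // ph, ph, w // pw, pw, c)
    cubes = cubes.permute(0, 2, 4, 1, 3, 5, 6)      # (nf, nh, nw, pt, ph, pw, c)
    tokens = cubes.reshape(-1, pt * ph * pw * c)    # one row per spacetime cube
    return tokens

# Works for different durations and aspect ratios, yielding variable-length sequences:
widescreen = torch.randn(16, 256, 448, 3)   # 16 frames, 256x448, RGB
vertical   = torch.randn(64, 448, 256, 3)   # longer, tall video
short_clip = torch.randn(4, 256, 256, 3)    # very short, near-still clip
for clip in (widescreen, vertical, short_clip):
    print(video_to_spacetime_patches(clip).shape)
```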
It feels just as important on the, like, input and training side, right, in terms of being able to take in different types of video?
Absolutely. And so a huge part of this project was really developing the infrastructure and systems needed to be able to work with this vast data in a way that hasn't been needed for previous image or video generation systems.
A lot of the models before SORA that were working on video were really looking at extending image generation models.
and so there was a lot of great work on image generation,
and what many people have been doing
is taking an image generator and extending it a bit:
instead of doing one image, you can do a few seconds.
But what was really important for SORA,
and was really this difference in architecture,
was instead of starting from an image generator
and trying to add on video, we started from scratch.
And we started with the question of,
how are we going to do a minute of HD footage?
And that was our goal.
And when you have that goal,
we knew that we couldn't just extend an image generator.
We knew that in order to emit a minute of HD footage,
we needed something that was scalable,
that broke down data into a really simple way
so that we could use scalable models.
So I think that really was the architectural evolution
from image generators to what led us to SORA.
That's a really interesting framework
because it feels like it could be applied
to all sorts of other areas
where people aren't currently applying end-to-end deep learning.
Yeah, I think that's right.
And it makes sense because in the short term, right, we weren't the first to come out with a video generator.
A lot of people have done impressive work on video generation.
But we were like, okay, we'd rather pick a point further in the future and just, you know, work for a year on that.
And there is this pressure to do things fast because AI is so fast.
And the fastest thing to do is, oh, let's take what's working now and let's kind of like add on something to it.
And that probably is, as you're saying, more general than just image to video, but other things.
But sometimes it takes taking a step back and saying, like, what will the solution to this look like in three years?
Let's start building that.
Yeah, it seems like a very similar transition happened in self-driving recently, where people went from bespoke edge-case sort of predictions and heuristics and all of that to, like, end-to-end deep learning in some of the new models.
So it's very exciting to see it apply to video.
One of the striking things about Sora is just the visual aesthetic of it.
And I'm a little bit curious, how did you go about either tuning or crafting that aesthetic?
Because I know that in some of the more traditional image gen models, you both have feedback that helps impact evolution of aesthetic over time.
But in some cases, people are literally tuning the models.
And I'm a little bit curious how you thought about it in the context of SORA.
Yeah.
Well, to be honest, we didn't spend a ton of effort on it for SORA.
The world is just beautiful?
Yeah.
Oh, this is a great answer.
I think that's maybe the honest answer to most of it.
I think SORA's language understanding definitely allows the user to steer it in a way that would be more difficult with, like, other models.
So you can provide a lot of like hints and visual cues that will sort of steer the model toward the type of generations that you want.
But it's not like the aesthetic is like deeply embedded.
Yeah, not yet.
But I think moving to the future, you know, I feel like the models kind of empowering people to sort of
get it to grok your personal sense of aesthetic is going to be something that a lot of people
will look forward to. Many of the artists and creators that we talk to, they'd love to just like
upload their whole portfolio of assets to the model and be able to draw upon like a large body of
work when they're writing captions, and have the model understand, like, the jargon of their design
firm accumulated over many decades and so on. So I think personalization and how that will kind of
work together with aesthetics is going to be a cool thing to explore later on.
I think, to the point Tim was making about just, like, new applications beyond traditional
entertainment. I work and I travel and I have young kids. And so I don't know if this is like
something to be judged for or not. But one of the things I do today is generate what amount to
like short audiobooks with voice cloning, DALL-E images and, you know, stories in the style
of, like, The Magic Tree House or whatever, around some topic
that either I'm interested in, like, oh, you know, hang out with Roman Emperor X, right?
Or something the girls, my kids, are interested in.
But this is computationally expensive and hard and not quite possible.
But I imagine there's some version of, like, desktop Pixar for everyone, which is like, you know,
I think kids are going to find this first, but I'm going to narrate a story and have, like,
magical visuals happen in real time.
I think this is a very different entertainment paradigm than we have now.
Totally.
I mean...
Are we going to get it?
Yeah.
I think we're headed there and a different entertainment paradigm and also a different
educational paradigm and a communication paradigm.
Entertainment's a big part of that, but I think there are actually many potential
applications once this really understands our world and so much of our world and how we
experience it is visual.
And something really cool about these models is that they're starting to better understand
our world and what we live in and the things that we do and we can potentially use them
to entertain us, but also to educate us. And, like, sometimes if I'm trying to learn something,
the best thing would be if I could get a custom tailored educational video to explain it to me.
Or if I'm trying to communicate something to someone, you know, maybe the best communication I could
do is make a video to explain my point. So I think that entertainment, but also kind of a much
broader set of potential things that video models could be useful for.
That makes sense. I mean, that resonates, in that I think if you asked people under some
certain age cutoff, they'd say the biggest driver of education in the world is YouTube today.
Right.
Better or worse?
Yeah.
Have you all tried applying this to things like digital avatars?
I mean, there's companies like Synthesia, HeyGen, etc.
They're doing interesting things in this area, but having a true, something that really
encapsulates a person in a very deep and rich way seems kind of fascinating as one potential
adaptive approach to this.
I'm just sort of curious if you've tried anything along those lines,
or if it's not really applicable given that it's more like text-to-video prompts.
So we haven't, we've really focused on just the core technology behind it so far.
So we haven't focused that much, for that matter,
on particular applications, including the idea of avatars,
which makes a lot of sense.
And I think that would be very cool to try.
I think where we are in the trajectory of SORA right now is like,
this is the GPT1 of this new paradigm of visual models.
And we're really looking at the fundamental
research into making these way better, making it a way better engine that could power all these
different things. So that's, so our focus is just on this fundamental development of the
technology right now, maybe more so than specific downstream applications. That makes sense. Yeah, one of the
reasons I ask about the avatar stuff as well is it starts to open questions around safety,
and so I'm a little bit curious, you know, how you all thought about safety in the context of video
models and the potential to do deepfakes or spoofs or things like that. Yeah, I can speak a little
bit to that. It's definitely a pretty complex topic. I think a lot of the safety mitigations
could probably be ported over from DALL-E 3. For example, the way we handle, like, racy images
or gory images, things like that. There's definitely going to be new safety issues to worry about,
for example, misinformation. Or for example, like, do we allow users to generate images that have
offensive words on them? And I think one key thing to figure out here is like how much responsibility
do the companies deploying this technology bear?
How much should social media companies do?
For example, to inform users that content they're seeing may not be from a trusted source.
And how much responsibility does the user bear for using this technology to create something in the first place?
So I think it's tricky and we need to think hard about these issues to sort of reach a position that we think is going to be best for people.
That makes sense.
It's also there's a lot of precedent.
Like people used to use Photoshop to manipulate images and then publish them and make claims.
And it's not like people said that therefore the maker of Photoshop is liable for somebody abusing the technology.
So it seems like there's a lot of precedent in terms of how you can think about some of these things as well.
Yeah, totally.
Like we want to release something that people feel like they really have the freedom to express themselves and do what they want to do.
But at the same time, sometimes that's at odds with, you know, doing something that is responsible and sort of gradually
releasing the technology in a way that people can get used to it.
I guess a question for all of you, maybe starting with Tim, is like, and if you can share this, great, if not, understood, but what is the thing you're most excited about in terms of the future product roadmap or where you're heading or some of the capabilities that you're working on next?
Yeah, great question. I'm really excited about the things that people will create with this. I think there are so many brilliant creative people with ideas of things that they want to make. And sometimes being able to make that is really hard because it requires resources.
or tools or things that you don't have access to.
And there's the potential for this technology
to enable so many people with brilliant creative ideas to make things.
And I'm really excited for what awesome things they're going to make
and that this technology will help to make.
Bill, maybe one question for you would just be,
if this is, as you just mentioned, like, the GPT-1,
we have a long way to go.
This isn't something that the general public
has an opportunity to experiment with yet.
Can you sort of characterize what the limitations are or the gaps are that you want to work on
besides the obvious around like length, right?
Yeah.
So I think in terms of making this something that's more widely available, you know, there's a lot of serving kind of considerations that have to go in there.
So a big one here is making it cheap enough for people to use.
So we've said, you know, in the past that in terms of generating videos, it depends a lot on the exact
parameters of, you know, like the resolution and the duration of the video you're creating.
But, you know, it's not instant, and you have to wait at least, like, a few minutes
for, like, these really long videos that we're generating. And so we're actively working on
threads here to make that cheaper in order to democratize this more broadly. I think
there's a lot of considerations, as Aditya and Tim were alluding to, on the safety side as
well. So in order for this to really become more broadly accessible, we need to, you know,
make sure that, especially in an election year, we're being really careful with the
potential for misinformation and any surrounding risks. We're actively working
on addressing these threads today.
That's a big part of our research roadmap.
What about just core, for lack of better term,
quality issues?
Yeah, yeah.
Are there specific things,
like if it's object permanence
or certain types of interactions you're thinking through?
Yeah, so as we look forward to the GPT2 or GPT3 moments,
I think we're really excited for very complex long-term
physical interactions to become much more accurate.
So to give a concrete example of where SORA falls short today,
you know, if I have a video of someone, like, playing soccer
and they're kicking around a ball. At some point, you know, that ball is probably going to, like,
vaporize and maybe come back. So it can do certain kinds of simpler interactions, pretty
reliably, you know, things like people walking, for example. But these types of more detailed
object-to-object interactions are definitely, you know, still a feature that's in the oven,
and we think it's going to get a lot better with scale. But that's something to look forward to
moving forward. There's one sample that I think is, like, a glimpse of the future. I mean, sure,
there are many. But there's one I've seen, which is, you know, a man taking a
bite of a burger and the bite being in the burger in terms of like keeping state, which is very
cool. Yeah. We are really excited about that one. Also, there's another one where it's like a woman
like painting with watercolors on a canvas and it actually leaves a trail. So there's like
glimmers of, you know, this kind of capability in the current model, as you said. And we think it's
going to get much better in the future. Is there anything you can say about how the work you've done
with SORA sort of affects the broader research roadmap? Yeah. So I think something here is about
the knowledge that SORA ends up learning about the world, just from seeing all this visual data.
It understands 3D, which is one cool thing because we haven't trained it to.
We didn't explicitly bake 3D information into it whatsoever.
We just trained it on video data, and it learned about 3D because 3D exists in those videos.
And it learned that when you take a bite out of a hamburger, that you leave a bite mark.
So it's learning so much about our world.
And when we interact with the world, so much of it is visual, so much of what we see and learn
throughout our lives is visual information.
So we really think that just in terms of intelligence, in terms of leading toward AI models
that are more intelligent that better understand the world like we do, this will actually
be really important for them to have this grounding of like, hey, this is the world that we
live in, there's so much complexity in it, there's so much about how people interact, how
things happen, how events in the past end up impacting events in the future, that this will
actually lead to just much more intelligent AI models more broadly than even generating
videos.
It's almost like you invented like the future visual cortex plus some part of the reasoning
parts of the brain or something, sort of simultaneously.
Yeah, and that's a cool comparison because a lot of the intelligence that humans have is
actually about world modeling, right? All the time when we're thinking about how we're going
to do things. We're playing out scenarios in our head. We have dreams where we're playing
out scenarios in our head. We're thinking in advance of doing things: if I did this, this thing
would happen. If I did this other thing, what would happen? So we have a world model, and building
SORA as a world model is very similar to a big part of the intelligence that humans have.
How do you guys think about the sort of analogy to humans as having a very approximate world
model versus something that is as accurate as like, let's say, a physics engine in the traditional
sense, right? Because if I, you know, hold an apple and I drop it, I expect it to fall at a certain
rate. But most humans do not think of that as articulating a path with a speed as a calculation.
Do you think that sort of learning is like parallel in large models?
I think it's a really interesting observation. I think how we think about things is that it's almost
like a deficiency, you know, in humans that it's not so high fidelity. So, you know, the fact
that we actually can't do very accurate long-term prediction when you get down to a really
narrow set of physics is something that we can improve upon with some of these systems. And so
we're optimistic that Sora will, you know, supersede that kind of capability and will, you know,
in the long run, be able to be more intelligent one day than humans as world models. But it is,
you know, certainly an existence proof that it's not necessary for other types of intelligence.
Regardless of that, it's still something that SORA and models in the future will be able to improve upon.
Okay, so it's very clear that the trajectory prediction for, like, throwing a football is going to be better in the next, next versions of these models than mine, let's say.
If I could add something to that, this relates to the paradigm of scale and the bitter lesson a bit, about how we want methods that, as you increase compute, get better and better.
And something that works really well in this paradigm is doing the simple but challenging task of just
predicting data. And you can try coming up with more complicated tasks, for example, something that
doesn't use video explicitly, but is maybe in some like space that simulates approximate things or
something. But all this complexity actually isn't beneficial when it comes to the scaling laws
of how methods improve as you increase scale. And what works really well,
as you increase scale is just predict data.
And that's what we do with text.
We just predict text.
And that's exactly what we're doing with visual data with SORA,
which is we're not making some complicated,
trying to figure out some new thing to optimize.
We're saying, hey, the best way to learn intelligence
in a scalable manner is to just predict data.
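As a toy illustration of the "just predict data" point, here is a short sketch contrasting the two objectives discussed in this conversation: next-token prediction for text and noise prediction for video tokens. Everything here, including the stand-in model outputs, is an assumption for illustration rather than production code.

```python
# Toy illustration: both paradigms boil down to "predict the data."
# Names and shapes are illustrative assumptions.
import torch
import torch.nn.functional as F

# 1) Language: predict the next token given the previous ones.
vocab_size, seq_len = 100, 8
logits = torch.randn(1, seq_len, vocab_size)         # stand-in model output
tokens = torch.randint(0, vocab_size, (1, seq_len + 1))
next_token_loss = F.cross_entropy(
    logits.reshape(-1, vocab_size), tokens[:, 1:].reshape(-1))

# 2) Video (diffusion): predict the noise that was added to the data.
clean = torch.randn(1, 16, 64)                       # stand-in video tokens
noise = torch.randn_like(clean)
noisy = 0.7 * clean + 0.3 * noise
predicted_noise = torch.randn_like(noise)            # stand-in model output
denoising_loss = F.mse_loss(predicted_noise, noise)

print(next_token_loss.item(), denoising_loss.item())
```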
That makes sense in relation to what you said, Bill,
like predictions will just get much better
with no necessary limit that approximates humans.
Right.
Is there anything you feel
like the general public misunderstands about video models or about SORA, or you want them to know?
I think maybe the biggest update to people with the release of SORA is that internally, we've
always made an analogy, as Bill and Tim said, between SORA and GPT models, in that, you know,
when GPT1 and GPT2 came out, it started to become increasingly clear to some people that simply
scaling up these models would give them amazing capabilities.
And it wasn't clear right away if like, oh,
will scaling up next-token prediction result in a language model that's helpful for writing code.
To us, like, it's felt pretty clear that applying the same methodology to video models
is also going to result in really amazing capabilities.
And I think SORA 1 is kind of an existence proof that there's one point on the scaling curve now,
and we're very excited for what this is going to lead to.
Yeah, amazing.
Well, I don't know why it's such a surprise to everybody, but the bitter lesson wins again.
Yeah.
I would just say that, as both Tim and Aditya were alluding to, we really do feel like this is the GPT-1 moment,
and these models are going to get a lot better very quickly.
And we're really excited both for the incredible benefits we think this is going to bring to the creative world,
and what the implications are long-term for AGI.
And at the same time, we're trying to be very mindful about the safety considerations
and building a robust stack now to make sure that society is actually going to get the benefits of this
while mitigating the downsides.
But it's exciting times, and we're looking forward to what future models are going to be capable of.
Yeah, congrats on such an amazing release.
Find us on Twitter at NoPriorsPod.
Subscribe to our YouTube channel if you want to see our faces, follow the show on Apple Podcasts, Spotify, or wherever you listen.
That way you get a new episode every week.
And sign up for emails or find transcripts for every episode at no-priors.com.