No Priors: Artificial Intelligence | Technology | Startups - OpenAI’s Sora team thinks we’ve only seen the "GPT-1 of video models"

Episode Date: April 25, 2024

AI-generated videos are not just leveled-up image generators; they could be a big step forward on the path to AGI. This week on No Priors, the team from Sora is here to discuss OpenAI's recently announced generative video model, which can take a text prompt and create realistic, visually coherent, high-definition clips up to a minute long. Sora team leads Aditya Ramesh, Tim Brooks, and Bill Peebles join Elad and Sarah to talk about developing Sora. The generative video model isn't yet available for public use, but the examples of its work are very impressive. The team believes we're still in the GPT-1 era of AI video models, and they're focused on a slow rollout to ensure the model is in the best place possible to offer value to users and, more importantly, that safety measures are in place to guard against deepfakes and misinformation. They also discuss what they're learning from implementing diffusion transformers, why they believe video generation is taking us one step closer to AGI, and why entertainment may not be the main use case for this tool in the future.

Show Links:
Bling Zoo video
Man eating a burger video
Tokyo Walk video

Sign up for new podcasts every week. Email feedback to show@no-priors.com
Follow us on Twitter: @NoPriorsPod | @Saranormous | @EladGil | @_tim_brooks | @billpeeb | @model_mechanic

Show Notes:
(0:00) Sora team introduction
(1:05) Simulating the world with Sora
(2:25) Building the most valuable consumer product
(5:50) Alternative use cases and simulation capabilities
(8:41) Diffusion transformers explanation
(10:15) Scaling laws for video
(13:08) Applying end-to-end deep learning to video
(15:30) Tuning the visual aesthetic of Sora
(17:08) The road to "desktop Pixar" for everyone
(20:12) Safety for visual models
(22:34) Limitations of Sora
(25:04) Learning from how Sora is learning
(29:32) The biggest misconceptions about video models

Transcript
Starting point is 00:00:00 Hi listeners, welcome to another episode of No Priors. Today we're excited to be talking to the team behind OpenAI's Sora, which is a new generative video model that can take a text prompt and return a clip that is high definition, visually coherent, and up to a minute long. Sora also raised the question of whether these large video models are world simulators and applied the scalable transformer architecture to the video domain. We're here with the team behind it: Aditya Ramesh, Tim Brooks, and Bill Peebles. Welcome to No Priors, guys.
Starting point is 00:00:39 Thanks for having us. To start off, why don't we just ask each of you to introduce yourselves so our listeners know who we're talking to. Aditya, mind starting us off? Sure, I'm Aditya. I lead the Sora team together with Tim and Bill. Hi, I'm Tim. I also lead the Sora team. And I'm Bill, I also lead the Sora team.
Starting point is 00:00:55 Simple enough. Maybe we can just start with, you know, the OpenAI mission, AGI, right? Greater intelligence. Is text-to-video, like, on the path to that mission? How'd you end up working on this? Yeah, we absolutely believe models like Sora are really on the critical pathway to AGI. We think one sample that illustrates this kind of nicely is a scene with a bunch of people walking through Tokyo during the winter. And in that scene, there's so much complexity. So you have a camera which is flying through the scene. There's lots of people who are interacting with one another. They're talking. They're holding hands.
Starting point is 00:01:26 They're people selling items at nearby stalls. And we really think the same thing. illustrates how SORA is on a pathway towards being able to model extremely complex environments and worlds all within the weights of a neural network. And looking forward, you know, in order to generate truly realistic video, you have to have learned some model of how people work, how they interact with others, how they think ultimately, and not only people, also animals and really any kind of object you want to model. And so looking forward as we continue to scale up models like SORA, we think we're going to be able to build these like world simulators where essentially, you know, anybody can interact with them. I as a human can have my own simulator
Starting point is 00:02:02 running and I can go and give a human in that simulator work to go do and they can come back with it after they're done. And we think this is a pathway to AGI, which is just going to happen as we scale up SORA in the future. It's been said that we're still far away despite massive demand for a consumer product. Like what is, is that on the roadmap? What do you have to work on before you have broader access to SORA? Tim, you want to talk about it? Yeah, so we, we We really want to engage with people outside of Open AI and thinking about how SORA will impact the world, how it will be useful to people. And so we don't currently have immediate plans or even a timeline for creating a product. But what we are doing is we're giving access to SORA to a small
Starting point is 00:02:45 group of artists as well as to Red Teamers to start learning about what impact SORA will have. And so we're getting feedback from artists about how we can make it most useful as a tool for them as well as feedback from red teamers about how we can make this safe, how we could introduce it to the public. And this is going to set our roadmap for our future research and inform if we do in the future end up coming up with the product or not exactly what timelines that would have. Can you tell us about some of the feedback you've gone? Yeah. So we have given access to SORA to like a small handful of artists and creators just to get early feedback. In general, I think a big thing is just controllability.
Starting point is 00:03:25 So right now, the model really only accepts text as input. And while that's useful, it's still pretty constraining in terms of being able to specify precise descriptions of what you want. So we're thinking about, like, you know, how to extend the capabilities of the model potentially in the future so that you can supply inputs other than just text. Do you all have a favorite thing that you've seen artists or others use it for or a favorite video or something that you've found really inspiring? I know that when it launched, a lot of people were really stricken by just how beautiful some of the images were, how striking, how you'd see the shadow of a cat in a pool of water, things like that.
Starting point is 00:04:00 But as just curious, what you've seen sort of emerge as people, more and more people start using it. Yeah, it's been really amazing to see what the artists do with the model, because we have our own ideas of some things to try, but then people who, for their profession, are making creative content are, like, so creatively brilliant and do such amazing things. So shy kids have this really cool video that they made this short story airhead with this character that has a balloon. And they really, like, made this story. And there it was really cool to see a way that Sora can unlock and make this story easier for them to tell. And I think there it's even less about, like, a particular clip or video that Sora made and more about this story that these artists want to tell and are able to share. and that SOR can help enable that. So that is really amazing to see. You mentioned the Tokyo scene. Others?
Starting point is 00:04:56 My personal favorite sample that we've created is the Bling Zoo. So I posted this on my Twitter the day we launched Sora. And it's essentially a multi-shot scene of a zoo in New York, which is also a jewelry store. And so you see like saber-toothed tigers kind of like decked out with Bling. It was very surreal. Yeah, yeah. And so I love those kinds of samples because as someone who you know, loves to generate creative content, but doesn't really have the skills to do it. It's like so easy to go play with this model and to just fire off a bunch of ideas and get
Starting point is 00:05:29 something that's pretty compelling. Like the time it took to actually generate that in terms of iterating on prompts was, you know, really like less than an hour. So I get something I really loved. So I had so much fun just playing with the model to get something like that out of it. And it's great to see the artists are also enjoying using the models and getting great content from that. What do you think is a timeline to broader use of these sorts of models for films or other things because if you look at for example the evolution of Pixar they really started making these Pixar shorts and then a subset of them turned into these longer format movies and a lot of it had to do with how well
Starting point is 00:06:01 could they actually world model even little things like the movement of hair or things like that and so it's been interesting to watch the evolution of that prior generation of technology which I now think is 30 years old or something like that do you have a prediction on when we'll start to see actual content either from SORA or from other models that will be professionally produced and sort of part of the broader media genre? That's a good question. I don't have a prediction on the exact timeline,
Starting point is 00:06:25 but one thing related to this I'm really interested in is what things other than traditional films people might use this for. I do think that maybe over the next couple years we'll see people starting to make more and more films, but I think people will also find completely new ways to use these models that are just different from the current media that we're used to. Because it's a very different paradigm when you can tell these models kind of what you want them to see and they can respond in a way.
Starting point is 00:06:54 And maybe they're just like new modes of interacting with content that like really creative artists will come up with. So I'm actually like most excited for what totally new things people will be doing. That's just different from what we currently have. It's really interesting because one of the things you mentioned earlier, this is also a way to do world modeling. And I think you've been at OpenEI for something like five years. And so you've seen a lot of the evolution of models and the company and what you've worked on. And I remember going to the office really early on, and it was initially things like robotic arms. And it was self-playing games or self-play for games and things like that.
Starting point is 00:07:26 As you think about the capabilities of this world simulation model, do you think it'll become a physics engine for simulation where people are, you know, actually simulating like wind tunnels? Is it a basis for robotics? And you say, there, is it something else? I'm just sort of curious where are some of these other future-forward applications that could emerge. Yeah, I totally think that carrying out simulations in the video model is something that we're going to be able to do in the future at some point. Bill actually has a lot of thoughts about this sort of thing, so maybe you can... Yeah, I mean, I think you hit the nail on the head with applications like robotics. You know, there's so much you learn from video, which you don't necessarily get from other modalities,
Starting point is 00:08:06 which companies like Open Eye have invested a lot in the past, like, language. You know, like the minutia of like how arms and joints move through space, you know, again, getting back to, that scene in Tokyo, how those legs are moving and how they're making contact with the ground in a physically accurate way. So you learned so much about the physical world just from training on raw video that we really believe that it's going to be essential for things like physical embodiment moving forward. And talking more about the model itself, there are a bunch of really interesting innovations here, right? So not to put you on the spot, Tim, but can you describe for a broad technical audience what a diffusion transformer is? Totally. So SORA build
Starting point is 00:08:44 on research from both the Dolly models and the GPT models at Open AI, and diffusion is a process that creates data in our case videos by starting from noise and iteratively removing noise many times until eventually you've removed so much noise that it just creates a sample. And so that is our process for generating the videos. We start from a video of noise and we remove it incrementally. But then architecturally, it's really important. that our models are scalable and that they can learn from a lot of data and learn these really
Starting point is 00:09:19 complex and challenging relationships and videos. And so we use an architecture that is similar to the GPT models and that's called a transformer. And so diffusion transformers combining these two concepts. And the transformer architecture allows us to scale these models and as we put more compute and more data into training them, they get better and better. And we even released a technical report on SORA and we show the results that you get from the same prompt when you use a smaller amount of compute, an intermediate amount of compute and more compute.
Starting point is 00:09:54 And by using this method, as you use more and more compute, the results get better and better. And we strongly believe this trend will continue so that by using this really simple methodology, we'll be able to continue improving these models by adding more compute, adding more data, and they will be able to do all these amazing things we've been talking about, having better simulation and longer-term generations. Bill, can we characterize it all with the scaling laws for this type of model look like yet? Good question. So as Tim alluded to, you know, one of the benefits of using transformers is that you inherit all of their great properties that we've seen in other domains like language. So you absolutely can begin to come up with scaling
Starting point is 00:10:34 laws for video as opposed to language. And this is something that, you know, we're actively looking at in our team and, you know, not only constructing them, but figuring out ways to make them better. So, you know, if I use the same amount of training compute, can I get an even better loss without fundamentally increasing the amount of compute needed? So these are a lot of the questions that we tackle day-to-day on the research team to make SORA and future models as good as possible. One of the questions about applying, you know, Transformers in this domain is like tokenization, right? And so, by the way, I don't know who came up with this name, but like latent space-time patches is like a great sci-fi name here. Can you explain, like, what that is
Starting point is 00:11:11 and why it is relevant here? Because the ability to do minute-long generation and get to visual and temporal coherence is really amazing. I don't think we came up with it as a name so much as like a descriptive thing of exactly what, like that's what we call it. Yeah, even better though.
Starting point is 00:11:30 Yeah, yeah. So one of the critical successes for the LLM paradigm has been this notion of tokens. So if you look at the internet, there's all kinds of text data on it. There's books, there's code, there's math, And what's beautiful about language models is that they have this singular notion of a token
Starting point is 00:11:46 which enables them to be trained on this vast swath of very diverse data. There's really no analog for prior visual generative models. So, you know, what was very standard in the past before SORA is that you would train, say, an image generative model or a video generative model on just, like, 256 by 256 resolution images or 256 by 256 video that's exactly like four seconds long. And this is very limiting
Starting point is 00:12:09 because it limits the types of data you can use. You have to throw away so much of, you know, the visual data that exists on the internet. And that limits, like, the generalist capabilities of the model. So with SORA, we introduced this notion of spacetime patches, where you can essentially just represent data, however it exists in an image and a really long video and, like, a tall vertical video, by just taking out cubes. So you can essentially imagine, right, a video is just like a stack, a vertical stack of individual images. So you can just take these, like, 3D cubes out of it.
Starting point is 00:12:38 And that is our notion of a token when we ultimately feed it into the transform. And the result of this is that SORA, you know, can do a lot more than just generate, say, like, 720P video at, for some like fixed duration, right? You can generate vertical videos, widescreen videos. You can do anything between like one to two aspect ratio to two to one. It can generate images. It's an image generation model. And so this is really the first generative model of visual content that has breadth in a way that language models have breadth. So that was really why we pursued this direction.
Starting point is 00:13:08 It feels just as important on the, like, input and training side, right, in terms of being able to take in different types of video? Absolutely. And so a huge part of this project was really developing the infrastructure and systems needed to be able to work with this vast data in a way that hasn't been needed for previous image or video generation systems. A lot of the models before SORA that were working on video were really looking at extending image generation models. and so there was a lot of great work on image generation and what many people have been doing is taking an image generator and extending it a bit instead of doing one image you can do a few seconds but what was really important for SORA
Starting point is 00:13:50 and was really this difference in architecture was instead of starting from an image generator and trying to add on video we started from scratch and we started with the question of how are we going to do a minute of HD footage And that was our goal. And when you have that goal, we knew that we couldn't just extend an image generator.
Starting point is 00:14:10 We knew that in order to emit an HD footage, we needed something that was scalable, that broke down data into a really simple way so that we could use scalable models. So I think that really was the architectural evolution from image generators to what led us to SORA. That's a really interesting framework because it feels like it could be applied
Starting point is 00:14:27 to all sorts of other areas where people aren't currently applying end-to-end deep learning. Yeah, I think that's right. And it makes sense because in the shortest term, right, we weren't the first to come out with a video generator. A lot of people, and a lot of people have done impressive work on video generation. But we were like, okay, we'd rather pick a point further in the future and just, you know, work for a year on that. And there is this pressure to do things fast because AI is so fast. And the fastest thing to do is, oh, let's take what's working now and let's kind of like add on something to it.
Starting point is 00:15:01 And that probably is, as you're saying, more general than just image to video, but other things. But sometimes it takes taking a step back and saying, like, what will the solution to this look like in three years? Let's start building that. Yeah, it seems like a very similar transition happened in self-driving recently, where people went from bespoke edge case sort of predictions and heuristics and all about a deal to like end-to-end deep learning in some of the new models. So it's very exciting to see it apply to video. One of the striking things about Sora is just the visual aesthetic of it. And I'm a little bit curious, how did you go about either tuning or crafting that aesthetic? Because I know that in some of the more traditional image gen models, you both have feedback that helps impact evolution of aesthetic over time.
Starting point is 00:15:45 But in some cases, people are literally tuning the models. And I'm a little bit curious how you thought about it in the context of SORA. Yeah. Well, to be honest, we didn't spend a ton of effort on it for SORA. The world is just beautiful? Yeah. Oh, this is a great answer. I think that's maybe the honest answer to most of it.
Starting point is 00:16:02 I think SORA's language understanding definitely allows the user to steer it in a way that would be more difficult with, like, other models. So you can provide a lot of like hints and visual cues that will sort of steer the model toward the type of generations that you want. But it's not like the aesthetic is like deeply embedded. Yeah, not yet. But I think moving to the future, you know, I feel like the models kind of empowering people to sort of get it to grok your personal sense of aesthetic is going to be something that a lot of people will look forward to. Many of the artists and creators that we talk to, they'd love to just like upload their whole portfolio of assets to the model and be able to draw upon like a large body of
Starting point is 00:16:44 work when they're writing captions and how the model understand like the jargon of their design firm accumulated over many decades and so on. So I think personalization and how that will kind of work together with aesthetics is going to be a cool thing to explore later on. I think to the point Tim was making about just like a new applications beyond traditional entertainment. I work and I travel and I have young kids. And so I don't know if this is like something to be judged for or not. But one of the things I do today is generate what amount to like short audio books with voice cloning, dolly images and, you know, stories in the style of like the magic tree house or whatever in around some top.
Starting point is 00:17:28 that either I'm interested in, like, oh, you know, hang out with Roman Emperor X, right? Or something the girls my kids are interested in. But this is computationally expensive and hard and not quite possible. But I imagine there's some version of, like, desktop Pixar for everyone, which is like, you know, I think kids are going to find this first, but I'm going to narrate a story and have, like, magical visuals happen in real time. I think this is a very different entertainment paradigm than we have now. Totally.
Starting point is 00:17:55 I mean... Are we going to get it? Yeah. I think we're headed there and a different entertainment paradigm and also a different educational paradigm and a communication paradigm. Entertainment's a big part of that, but I think there are actually many potential applications once this really understands our world and so much of our world and how we experience it is visual.
Starting point is 00:18:19 And something really cool about these models is that they're starting to better understand our world and what we live in and the things that we do and we can potentially use them to entertain us, but also to educate us. And, like, sometimes if I'm trying to learn something, the best thing would be if I could get a custom tailored educational video to explain it to me. Or if I'm trying to communicate something to someone, you know, maybe the best communication I could do is make a video to explain my point. So I think that entertainment, but also kind of a much broader set of potential things that video models could be useful for. That makes sense. I mean, that resonates in that. I think if you asked people under some
Starting point is 00:18:56 certain age cut off that they'd say the biggest driver of educational world to YouTube today. Right. Better or worse? Yeah. Have you all tried applying this to things like digital avatars? I mean, there's companies like synesthesia, hegen, etc. They're doing interesting things in this area, but having a true, something that really encapsulates a person in a very deep and rich way seems kind of fascinating as one potential
Starting point is 00:19:18 adaptive approach to this. I'm just sort of curious if you've tried anything along those lines, yeah. or if it's not really applicable given that it's more like text to video prompts. So we haven't, we've really focused on just the core technology behind it so far. So we haven't focused that much on, for that matter, of particular applications, including the idea of avatars, which makes a lot of sense. And I think that would be very cool to try.
Starting point is 00:19:42 I think where we are in the trajectory of SORA right now is like, this is the GPT1 of this new paradigm of visual models. And that we're really looking at the fundamental. research into making these way better, making it a way better engine that could power all these different things. So that's, so our focus is just on this fundamental development of the technology right now, maybe more so than specific downstream. That makes sense. Yeah, one of the reasons I ask about the avatar stuff as well as it starts to open questions around safety and says a little bit curious, you know, how you all thought about safety in the context of video
Starting point is 00:20:15 models and the potential to do deep fakes or spoofs or things like that. Yeah, I can speak a little bit to that. It's definitely a pretty complex topic. I think a lot of the safety mitigations could probably be ported over from Dali 3. For example, the way we handle like Gracie images or gory images, things like that. There's definitely going to be new safety issues to worry about, for example, misinformation. Or for example, like, do we allow users to generate images that have offensive words on them? And I think one key thing to figure out here is like how much responsibility do the companies deploying this technology bear? How much should social media companies do?
Starting point is 00:20:56 For example, to inform users that content they're seeing may not be from a trusted source. And how much responsibility does the user bear for using this technology to create something in the first place? So I think it's tricky and we need to think hard about these issues to sort of reach a position that we think is going to be best for people. That makes sense. It's also there's a lot of precedent. Like people used to use Photoshop to manipulate images and then publish them and make claims. And it's not like people said that therefore the maker of Photoshop is liable for somebody abusing the technology. So it seems like there's a lot of precedent in terms of how you can think about some of these things as well.
Starting point is 00:21:33 Yeah, totally. Like we want to release something that people feel like they really have the freedom to express themselves and do what they want to do. But at the same time, sometimes that's at odds with, you know, doing something that is responsible and sort of gradually. releasing the technology in a way that people can get used to it. I guess a question for all of you, maybe starting with Tim, is like, and if you can share this great, if not understood, but what is the thing you're most excited about in terms of the future product roadmap or where you're heading or some of the capabilities that you're working on next? Yeah, great question. I'm really excited about the things that people will create with this. I think there are so many brilliant creative people with ideas of things that they want to make. And sometimes being able to make that is really hard because it requires resources. or tools or things that you don't have access to. And there's the potential for this technology
Starting point is 00:22:22 to enable so many people with brilliant creative ideas to make things. And I'm really excited for what awesome things they're going to make and that this technology will help to make. Bill, maybe one question for you would just be, if this is, as you just mentioned, like, the GPT-1, we have a long way to go. This isn't something that the general public has an opportunity experiment with yet.
Starting point is 00:22:45 Can you sort of characterize what the limitations are or the gaps are that you want to work on besides the obvious around like length, right? Yeah. So I think in terms of making this something that's more widely available, you know, there's a lot of serving kind of considerations that have to go in there. So a big one here is making it cheap enough for people to use. So we've said, you know, in the past that in terms of generating videos, it depends a lot on the exact parameters of, you know, like the resolution and the duration of the video you're creating. but you know it's not instant and you have to wait at least like a few minutes
Starting point is 00:23:18 for like these really long videos that we're generating and so we're actively working on threads here to make that cheaper in order to democratize this more broadly I think there's a lot of considerations as a Ditya and Tim we're alluding to on the safety side as well so in order for this to really become more broadly accessible we need to you know make sure that especially in an election here we're being really careful with the potential for misinformation and any surrounding risks we're actively working on addressing these threads today. That's a big part of our research roadmap.
Starting point is 00:23:48 What about just core, for lack of better term, quality issues? Yeah, yeah. Are there specific things, like if it's object permanence or certain types of interactions you're thinking through? Yeah, so as we look forward to the GPT2 or GPT3 moments, I think we're really excited for very complex long-term
Starting point is 00:24:06 physical interactions to become much more accurate. So to give a concrete example of where SORA fall short today, you know, if I have a video of someone like playing soft and they're kicking around a ball. At some point, you know, that ball is probably going to, like, vaporize and maybe come back. So it can do certain kinds of simpler interactions, pretty reliably, you know, things like people walking, for example. But these types of more detailed object-to-object interactions are definitely, you know, still a feature that's in the oven, and we think it's going to get a lot better with scale. But that's something to look forward to
Starting point is 00:24:35 moving forward. There's one sample that I think is, like, a glimpse of the few. I mean, sure, there are many. But there's one I've seen, which is, you know, a man taking a bite of a burger and the bite being in the burger in terms of like keeping state, which is very cool. Yeah. We are really excited about that one. Also, there's another one where it's like a woman like painting with watercolors on a canvas and it actually leaves a trail. So there's like glimmers of, you know, this kind of capability in the current model, as you said. And we think it's going to get much better in the future. Is there anything you can say about how the work you've done with SORA sort of affects the broader research roadmap? Yeah. So I think something here is about
Starting point is 00:25:12 the knowledge that SORA ends up learning about the world, just from seeing all this visual data. It understands 3D, which is one cool thing because we haven't trained it to. We didn't explicitly bake 3D information into it whatsoever. We just trained it on video data, and it learned about 3D because 3D exists in those videos. And it learned that when you take a bite out of a hamburger, that you leave a bite mark. So it's learning so much about our world. and when we interact with the world, so much of it is visual, so much of what we see and learn throughout our lives is visual information.
Starting point is 00:25:48 So we really think that just in terms of intelligence, in terms of leading toward AI models that are more intelligent that better understand the world like we do, this will actually be really important for them to have this grounding of like, hey, this is the world that we live in, there's so much complexity in it, there's so much about how people interact, how things happen, how events in the past end up impacting events in the future, that this will actually lead to just much more intelligent AI models more broadly than even generating videos. It's almost like you invented like the future visual cortex plus some part of the reasoning
Starting point is 00:26:25 parts of the brain or something, sort of simultaneously. Yeah, and that's a cool comparison because a lot of the intelligence that humans have is actually about world modeling, right? All the time when we're thinking about how we're going to do things. We're playing out scenarios in our head. We have dreams where we're playing out scenarios in head. We're thinking in advance of doing things, if I did this, this thing would happen. If I did this other thing, what would happen? So we have a world model, and building SORA as a world model is very similar to a big part of the intelligence that humans have. How do you guys think about the sort of analogy to humans as having a very approximate world
Starting point is 00:27:03 model versus something that is as accurate as like, let's say, a physics engine in the traditional sense, right? Because if I, you know, hold an apple and I drop it, I expect it to fall at a certain rate. But most humans do not think of that as articulating a path with a speed as a calculation. Do you think that sort of learning is like parallel in large models? I think it's a really interesting observation. I think how we think about things is that it's almost like a deficiency, you know, in humans that it's not so high fidelity. So, you know, the fact that we actually can't do very accurate long-term prediction when you get down to a really narrow set of physics is something that we can improve upon with some of these systems. And so
Starting point is 00:27:46 we're optimistic that Sora will, you know, supersede that kind of capability and will, you know, in the long run, enable to be more intelligent one day than humans as world models. But it is, you know, certainly an existence proof that it's not necessary for other types of intelligence. Regardless of that, it's still something that SORA and models in the future will be able to improve upon. Okay, so it's very clear that the trajectory prediction for, like, throwing a football is going to be better than the next, next versions of these models and minus, let's say. If I could add something to that, this relates to the paradigm of scale and the bitter lesson a bit about how we want methods that as you increase compute get better and better. and something that works really well in this paradigm is doing the simple but challenging task of just predicting data. And you can try coming up with more complicated tasks, for example, something that
Starting point is 00:28:43 doesn't use video explicitly, but is maybe in some like space that simulates approximate things or something. But all this complexity actually isn't beneficial when it comes to the scaling laws of how methods improve as you increase scale. And what works really well, as you increase scale is just predict data. And that's what we do with text. We just predict text. And that's exactly what we're doing with visual data with SORA, which is we're not making some complicated,
Starting point is 00:29:09 trying to figure out some new thing to optimize. We're saying, hey, the best way to learn intelligence in a scalable matter is to just predict data. That makes sense in relating to what you said, Bill, like predictions will just get much better with no necessary limit that approximates humans. Right. Is there anything you feel?
Starting point is 00:29:28 like the general public misunderstands about video models or about SORA, or you want them to know? I think maybe the biggest update to people with the release of SORA is that internally, we've always made an analogy, as Bill and Tim said, between SORA and GPT models, in that, you know, when GPT1 and GPT2 came out, it started to become increasingly clear to some people that simply scaling up these models would give them amazing capabilities. And it wasn't clear right away if like, oh, we'll scaling up next token prediction result in a language model that's helpful for writing code. To us, like, it's felt pretty clear that applying the same methodology to video models
Starting point is 00:30:11 is also going to result in really amazing capabilities. And I think SORA 1 is kind of an existence proof that there's one point on the scaling curve now, and we're very excited for what this is going to lead to. Yeah, amazing. Well, I don't know why it's such a surprise to everybody, but their lesson wins again. Yeah. I would just say that, as both Tim and Aditya we're alluding to, we really do feel like this is the GPT1 moments, and these models are going to get a lot better very quickly.
Starting point is 00:30:39 And we're really excited both for the incredible benefits we think this is going to bring to the creative world, what the implications are long-term for AGI. And at the same time, we're trying to be very mindful about the safety considerations and building a robust stack now to make sure that society is actually going to get the benefits of this while mitigating the downsides. But it's exciting times, and we're looking forward to what future models are going to be capable of. Yeah, congrats on such an amazing release. Find us on Twitter at No Pryor's Pod.
Starting point is 00:31:07 Subscribe to our YouTube channel if you want to see our faces, follow the show on Apple Podcasts, Spotify, or wherever you listen. That way you get a new episode every week. And sign up for emails or find transcripts for every episode at no-priars.com.
