a16z Podcast - Beyond Uncanny Valley: Breaking Down Sora
Episode Date: February 24, 2024

In early 2024, the notion of high-fidelity, believable AI-generated video seemed a distant future to many. Yet, a mere few weeks into the year, OpenAI unveiled Sora, its new state-of-the-art text-to-video model producing videos of up to 60 seconds. The output shattered expectations – even for other builders and researchers within generative AI – sparking widespread speculation and awe.

How does Sora achieve such realism? And are explicit 3D modeling techniques or game engines at play?

In this episode of the a16z Podcast, a16z General Partner Anjney Midha connects with Stefano Ermon, Professor of Computer Science at Stanford and a key figure at the lab behind the diffusion models now used in Sora, ChatGPT, and Midjourney. Together, they delve into the challenges of video generation, the cutting-edge mechanics of Sora, and what this all could mean for the road ahead.

Resources:
Find Stefano on Twitter: https://twitter.com/stefanoermon
Find Anjney on Twitter: https://twitter.com/anjneymidha
Learn more about Stefano's Deep Generative Models course: https://deepgenerativemodels.github.io

Stay Updated:
Find a16z on Twitter: https://twitter.com/a16z
Find a16z on LinkedIn: https://www.linkedin.com/company/a16z
Subscribe on your favorite podcast app: https://a16z.simplecast.com/
Follow our host: https://twitter.com/stephsmithio

Please note that the content here is for informational purposes only; should NOT be taken as legal, business, tax, or investment advice or be used to evaluate any investment or security; and is not directed at any investors or potential investors in any a16z fund. a16z and its affiliates may maintain investments in the companies discussed. For more details please see a16z.com/disclosures.
Transcript
Yeah, honestly, I was very, very surprised.
I mean, I know the two of us often talk about how quickly the field is moving,
how hard it is to keep track of all the things that are happening.
And I was not expecting a model so good coming out so soon.
We've generally converged on, it was going to be a when, not an if.
I thought it was maybe six months out, a year out.
So I was shocked when I saw those videos, the quality of the videos,
the length, I mean, the ability to generate
60-second videos. I was really amazed.
This is obviously the worst that this technology will ever be,
almost definitely, right?
We're at the earliest stages of progress here.
I've always felt that that is one of the secret weapons of diffusion models
and why they are so effective in practice.
If you were to ask many people at the beginning of 2024,
when we get high-fidelity, believable AI-generated video,
most would have said that we were years away.
But on February 15th, OpenAI surprised the world with examples from their new model,
Sora, bringing those predictions down from years to weeks.
And of course, the emergence of this model and its impressive modeling of physics and videos of up to 60 seconds
have spurred much speculation around not only how this was accomplished, but also how it happened so soon.
And although OpenAI has stated that the model uses a transformer-based diffusion model,
the results have been so good that some have even questioned whether explicit 3D modeling or
a game engine was involved. So naturally, we decided to bring in an expert. Sitting down with
a16z general partner Anjney Midha is Professor of Computer Science at Stanford, Stefano Ermon,
whose group pioneered the earliest diffusion models and their applications in generative AI.
Of course, these approaches laid the foundation of the very diffusion models deployed in Sora,
not to mention other household names like ChatGPT and Midjourney.
And perhaps most importantly, Stefano has been working on generative AI for more than a decade,
long before many of us had even an inkling of what was to come.
So throughout this conversation, Stefano breaks down why video has historically been much harder
than its text and image counterparts, how a model like Sora might work,
and what all of this could mean for the road ahead.
And of course, if you want to stay updated on all things AI,
make sure to go check out a16z.com slash AI. Enjoy.
As a reminder, the content here is for informational purposes only, should not be taken
as legal, business, tax, or investment advice, or be used to evaluate any investment or security
and is not directed at any investors or potential investors in any a16z fund.
Please note that A16Z and its affiliates may also maintain investments in the companies
discussed in this podcast. For more details, including a link to our investments, please see
a16z.com slash disclosures.
This is like a conversation I've wanted to have with you for a while.
We've been talking about flavors of this conversation for a long time,
but I think given how quickly things are heating up in this space,
it sounded like a good time for us to check in
and see how many of the assumptions we've been talking about
about the future of diffusion models and video models are tracking.
You're the world's expert in this area of research.
And so I think it'd just be great to start
with your lab's sort of involvement in the origins?
Yeah, I'm excited to be here.
Hi, everyone.
My name is Stefano, and I'm a professor of computer science at Stanford.
I work in AI, and I've been actually working in generative AI for more than 10 years.
Way before these things were cool.
I teach a class at Stanford on deep generative models.
It's something I started back in, again, 2018.
And I think it was the first in the world on this topic.
And, yeah, I encourage you to check out the website.
That is called CS-236.
There's lots of materials if you want to dig deeper into how these methods work.
And I've been doing research in generative models for a long time.
As you mentioned, with my former student, Yang Song, who is now at OpenAI,
who did some of the early work on diffusion models or score-based models,
as we used to call them back then.
Back in 2019, at the time, generative models of images, video, audio,
like these kinds of continuous data modalities, were really dominated by GANs,
generative adversarial networks.
And we were really the first to show that it was actually possible to beat GANs at their own game
using this new class of generative models called diffusion models,
where we essentially generate content, we generate images by starting from pure noise
and progressively denoising it using a neural network until we turn it into a beautiful sample.
We developed a lot of the theory behind these models, how to train them, how to use score matching.
A lot of the initial architectures and some of those choices are still around today.
I think that work really kick-started a lot of the exciting things we're seeing today around diffusion models, Stable Diffusion, Sora, of course.
And in addition to that early work on the foundation of diffusion models, we've worked on a number of other aspects of diffusion models like DDIM, a pretty widely used, efficient sampling procedure that allows you to generate images very quickly without too much loss in quality.
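As a rough picture of what Stefano is describing, from starting at pure noise and progressively denoising through to DDIM-style deterministic sampling, here is a minimal sketch. The noise schedule, placeholder denoiser, and shapes are illustrative assumptions, not anyone's production code.

```python
# Minimal sketch of DDIM-style deterministic sampling (toy setup, not production code).
import torch

def make_alpha_bars(num_steps: int = 1000):
    # Toy linear beta schedule; alpha_bar[t] is the cumulative signal level at step t.
    betas = torch.linspace(1e-4, 0.02, num_steps)
    return torch.cumprod(1.0 - betas, dim=0)

@torch.no_grad()
def ddim_sample(eps_model, shape, num_steps=50, total_steps=1000):
    """Start from pure Gaussian noise and progressively denoise it."""
    alpha_bar = make_alpha_bars(total_steps)
    timesteps = torch.linspace(total_steps - 1, 0, num_steps).long()
    x = torch.randn(shape)  # pure noise
    for i, t in enumerate(timesteps):
        a_t = alpha_bar[t]
        a_prev = alpha_bar[timesteps[i + 1]] if i + 1 < len(timesteps) else torch.tensor(1.0)
        eps = eps_model(x, t)                                    # predicted noise at step t
        x0_pred = (x - (1 - a_t).sqrt() * eps) / a_t.sqrt()      # current guess of the clean sample
        x = a_prev.sqrt() * x0_pred + (1 - a_prev).sqrt() * eps  # deterministic DDIM-style update
    return x

# Placeholder "denoiser": a real one is a U-Net or transformer trained by score matching.
toy_eps_model = lambda x, t: torch.zeros_like(x)
sample = ddim_sample(toy_eps_model, shape=(1, 3, 64, 64))
print(sample.shape)
```

The point of the DDIM-style update is that it is deterministic and can skip steps: here 50 network calls stand in for the full 1,000-step schedule.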
With my student Chenlin Meng, who is the CTO of Pika Labs,
we developed SDEdit. It's one of the first methods
to do controllable generation to generate images based on sketches
and things like that. So yeah, I'm excited to be here today
and discuss what's coming next. So given all of that work,
all of the experience you've had in really the ground truth of diffusion models
and their limitations, what was your reaction to seeing a model like Sora come out
last week? Yeah, honestly, I was very, very surprised. I mean, I know the two of us often
talk about how quickly the field is moving, how hard it is to keep track of all the things that
are happening. And I was not expecting a model so good coming out so soon. I mean, I don't think
there is anything fundamentally impossible and I knew it was coming. It was just a matter of time
and just more research, more investments, more people working on these things. But I was not
expecting something that good happening so soon. I thought it was maybe six months out, a year
out. And so I was shocked when I saw those videos, the quality of the videos, the length,
I mean, the ability to generate 60-second videos. I was really amazed. Yeah, I do think every time
we've talked in the past, we've generally converged on, it was going to be a when, not an if,
that we'd see video generation get so good. And so it's reassuring to hear you were as
surprised as I was when it came out. But before we get into the details on what maybe some of
the breakthroughs were there on this time frame, maybe we can just spend a few
minutes talking about video diffusion, sort of 101, for folks who may not be as familiar
with these models, why has video diffusion been so much more complex than text or image
generation? And what's historically been the main blocker in making it work?
Yeah, that's a great question. At a very high level, you can think about videos as just a
collection of images. And so really, the first challenge you have to deal with is that you're
generating multiple images at the same time. And so the compute cost that you need to process
N images is at least N times larger than what you would pay if you just want to process one of them at a time.
And basically, this means a lot more compute, a lot more memory, and just much more expensive to train all your large-scale model on video data.
The other challenge is just the data challenge.
I think a lot of the success we've seen in diffusion models for images was partially due to the availability of publicly available data sets like LAION, large-scale image and text
data sets that were scraped from the internet and were made available, and people could
use them to train large-scale models.
I think we don't quite have that for video.
I mean, there is a lot of video data, but the quality is kind of like a mixed bag and
we don't have a good way to filter or screen them or there's not a go-to data set that
is available and everybody's using to train these models.
So I'm guessing some of the innovations that went into the Sora model are actually on
just selecting good quality data to train the models on.
Captions are also hard to get for video.
I mean, the video data is out there, but getting good labels, good descriptions of what's
happening in the videos is challenging, and you need that if you want to have good control over
the kind of content that you generate with these models.
And then there is also a challenge of video content.
It's just more complex.
There is more going on if you think about a sequence of images as opposed to just one. There
are complex relationships between the frames.
There is physics.
There is object permanence.
And in principle, I think a high-capacity
model with enough compute, enough data can potentially learn these things, but it was always
an empirical question: how much data are you going to need, how much compute are you going
to need? When is that going to happen? Is the model really going to discover all these high-level
concepts and statistics of the data, essentially? And it was surprising to see that it's doing
so well. You just laid out very clearly what the general obstacles have been, both on architecture,
on data sets, on representations of the world via video as a format. Since the release
came out last week, there's been a lot of speculation around how this model can achieve such
impressive results. Some folks were even speculating that there might be a game engine or a 3D model,
sort of explicit 3D modeling or geometry involved in the inference pipeline. But in their article,
OpenAI describes the approach by saying that they train a text-conditional diffusion model jointly
on videos and images with different durations and resolutions and then apply a transformer
architecture on space-time patches of video and image latent codes.
And so could you just break down in layman terms
for folks who might not be as familiar
with scaling laws and what's going on here?
Sure, yeah, I can try.
I mean, there is certainly some secret sauce here
and I can try to read between the lines
of what they said in their release.
The idea of training on videos and images
is not new.
It seems like one technical difference
that they are hinting at
is the use of a transformer architecture
for the backbone, for the denoiser,
for the score model.
People often use the convolutional architecture
back from the days where, kind of like, Yang initially started using U-Nets as a score model,
which was actually a key innovation that really enabled a lot of the success on images,
and people kind of still ported those kind of architectures over to video data as well,
because they make sense.
We expect that there is a lot of convolutional structure
and convolutional architecture might be a good idea.
It seems like they moved on to a purely transformer-based architecture
and probably following the work by Saining Xie
at NYU, who did some of the initial work in this space
in developing good transformer architectures,
ViT-based architectures for diffusion models.
It's possible that that gives you better scaling
with respect to compute and data
and just happens to work better.
They are also referring to latent codes,
so it seems like it's unlikely
that they're working directly in pixel space.
Working on latent representations was one of the key things,
the key innovations behind Stable Diffusion,
latent diffusion,
like the idea of first compressing the data
into a latent representation
that is a little bit smaller
or a little bit more compact
because we expect that there is a lot of redundancy
if you think about the different frames in a video
and so it might be possible to compress it
almost losslessly into a lower dimensional representation
and if you can do that,
then you can kind of train on this lower dimensional representation
and you can get much better tradeoffs
in terms of like the compute that you pay
and the kind of memory that you need to process the data.
So they might have figured out a better way
to encode video to a semantically meaningful kind of like lower dimensional latent space.
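To make that compression argument concrete, here is a back-of-the-envelope calculation. All of the numbers (frame rate, downsampling factors, channel count) are made-up assumptions for illustration, not Sora's actual configuration.

```python
# Back-of-the-envelope sketch of why a latent space helps (hypothetical numbers only).
frames, height, width, channels = 60 * 24, 1080, 1920, 3   # one minute at 24 fps
pixel_values = frames * height * width * channels

# Suppose an encoder downsamples 8x spatially and 4x temporally into a
# 16-channel latent, exploiting redundancy between neighboring frames.
lat_frames, lat_h, lat_w, lat_c = frames // 4, height // 8, width // 8, 16
latent_values = lat_frames * lat_h * lat_w * lat_c

print(f"pixel tensor:  {pixel_values:,} values")
print(f"latent tensor: {latent_values:,} values")
print(f"compression:   ~{pixel_values / latent_values:.0f}x fewer values to diffuse over")
```

Diffusing over a few dozen times fewer values is the kind of trade-off that makes minute-long video generation tractable at all.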
I would say this doesn't rule out the possibility that they've used game engines or 3D models
to generate training data.
I mean, as we discussed before, I think the quality of the training data is really crucial.
And it's possible that they've used synthetic data generated by game engines or NeRF-like 3D models
to generate the kind of data they want to see where there is a lot of motion.
They can probably use the internals of the engine to get very good captions about what's happening in the video,
so they can get a very good match between the text and the content that they're trying to generate.
So, I mean, it does seem like it's a purely data-driven approach,
but it's possible that they've used the other pipelines to generate synthetic data.
And when you contrast the diffusion transformer approach that they took with sort of the prior generations of many flavors of generative models,
whether it was recurrent nets, GANs, just vanilla autoregressive transformers,
Why is it that a diffusion model here seemed, again, to be sort of the uniquely suited best tool for the job?
I think prior to diffusion models, yeah, people were using GANs, generative adversarial networks.
GANs are pretty good. They're pretty flexible, but one challenge is they're very unstable to train.
So that was actually one of the main reasons we developed diffusion models in the first place.
We wanted to retain the flexibility of basically an arbitrary neural network as the backbone,
but a more principled statistical way of training the models that leads you to a stable training
loss where you can just keep training and the model just keeps getting better and better.
Autoregressive models also have that property, in that they're trying to compress the data,
and with enough capacity and compute, in principle, they can do a pretty good job of modeling anything,
including video.
They just tend to be very slow because you have to generate one token at a time.
If you think of video, there is a lot of tokens,
and they've not been the model of choice for that reason.
Diffusion models, on the other hand,
they can generate basically tokens essentially in parallel,
and so they can be much faster.
And that's one of the reasons they are preferred,
I think, for these kind of modalities.
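A quick, hypothetical bit of arithmetic illustrates the sequential-depth point; the token count and step count below are assumptions, not measured numbers.

```python
# Rough, hypothetical comparison of sequential network calls at generation time.
num_tokens = 100_000        # assume a minute of video becomes ~100k latent patch tokens
ar_calls = num_tokens        # autoregressive: one call per token, each waiting on the last
diffusion_steps = 50         # diffusion: every token is refined together at each step

print(f"autoregressive: {ar_calls:,} sequential calls")
print(f"diffusion:      {diffusion_steps} sequential calls, each over all tokens in parallel")
# Each diffusion call does more work per pass, but the sequential depth,
# and therefore the latency, is orders of magnitude smaller.
```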
Another more philosophical reason is that if you think about a diffusion model,
in some sense, at the inference time,
you have access to a very deep computation graph,
where essentially you can apply a neural network
maybe a thousand times,
or if you take a continuous time perspective,
like a differential equation perspective,
it can even be an infinitely deep kind of computation graph
that you can use to generate content,
while at the same time,
you don't have to unroll the entire computation graph at training time
because the models are trained by score matching,
and there is this clever way of kind of like trying to make the model better
and better without ever having to pay a huge price at training time.
And so I've always felt that that is one of the secret weapons
of diffusion models and why they are so effective in practice
because they allow you to use a lot of resources at inference time
without having to pay that price during training time.
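For readers who want the formal version of that "infinitely deep computation graph" point, this is the standard continuous-time formulation from the score-based diffusion literature (a sketch of the published framework, not anything specific to Sora's internals):

```latex
% Forward noising process as a stochastic differential equation:
%   dx = f(x, t)\,dt + g(t)\,dw
% Sampling can follow the probability-flow ODE, which shares the same marginals
% and can be integrated with as many or as few steps as you like:
\frac{dx}{dt} = f(x, t) - \tfrac{1}{2}\, g(t)^2 \,\nabla_x \log p_t(x)
% Training never unrolls that trajectory; denoising score matching only needs
% individual noisy samples x_t drawn from the forward process:
\mathcal{L}(\theta) = \mathbb{E}_{t,\, x_0,\, x_t}\Big[ \lambda(t)\,
  \big\| s_\theta(x_t, t) - \nabla_{x_t} \log p_t(x_t \mid x_0) \big\|_2^2 \Big]
```

The asymmetry in the last line is the "secret weapon" Stefano mentions: training costs one network evaluation per noisy sample, while the depth of the sampling computation is chosen freely at inference time.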
And to our earlier point about why did the Sora breakthrough happen so much faster
than many expected, it sounds like the stability of the diffusion transformer model
and being able to swap out essentially training time for inference time,
which is much cheaper, much more parallelizable, much more efficient,
was a big contributor to compressing the training times here.
Yeah. And there is a question of what's the backbone, right?
And that could be a convolutional network, it can be a state-space model,
that could be a transformer.
There, I think we're still just scratching the surface
in terms of what works and what doesn't
and all kinds of combinations are possible.
Right.
You can build autoregressive models that are convolutional.
You can build autoregressive models
that are based on transformers.
You can build autoregressive models
that are based on state-space architectures.
And similarly, you can build diffusion models
that are based on convolutional architecture,
which is what people tend to do.
And now it seems like maybe OpenAI is pushing towards,
no, let's just use a transformer as the backbone
in a diffusion model.
And I'm starting to see people exploring state-space models,
which might allow for very long context, for example.
So I think there is an exciting space of different kinds of combinations that we can try
and might give us better scaling, better properties for really getting to the kind of qualities
we want to see for these models.
One of the most elegant parts of the transformer backbone architecture is that it works really
well with the idea of tokenization, right?
And in language models, so much of the scaling laws work that allowed models like GPT-3 and 4 to be developed so quickly and generalized to all kinds of tasks was that the process of tokenizing language allows almost like a transformation or translation into a format that the models can understand across many, many different types of languages, whether that's good old-fashioned English, or it's code, or it's health records, or it's different, in some cases multilingual,
data sets. And so the beauty of tokenization is it's like this one-size-fits-all process of turning
language data into a format that the transformer backbone really understands well and is able to learn
on. It seems like there was a similar key unlock around how visual data was broken down here
into small patches, right? They essentially tokenized image and video data into this sort of
intermediary representation of a patch. Is that approach what created a meaningfully better output than other
models we've seen in the past?
That's a great question.
And honestly, I don't know the answer.
I think tokenization makes a lot of sense for discrete data, like code or text.
I'm less of a fan of tokenization or patchifying images and videos and audio.
I feel like you have to do it if you want to use a transformer architecture, but it makes
less sense to me just because the data is continuous and the patch is very arbitrary and
you're losing some of the structure there by going through tokenization.
But you kind of have to do it if you want to use transformers, and transformers are great because
they scale well and we have very good implementations and they are very hardware friendly and so maybe
that's the way to go. It's the bitter lesson once again. But it feels like what they have is
some kind of latent, again, representation and maybe once you go to a latent space, then
tokenizing it might make more sense because you've already lost a lot of the structure.
And so it may be a combination of both. I mean, it seems like they might have access
to a very good latent space
where they get rid of a lot of the
kind of redundancy that exists
in natural data, especially in videos.
Two frames next to each other
are very similar, right?
So there's a lot of redundancy in videos,
and if they got rid of some of that
through a clever encoding scheme,
then they apply tokenization.
I think that starts to make more sense
and make things more scalable,
less compute, less memory,
and just better.
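As a concrete picture of patchifying after the latent step, here is a minimal sketch of turning a latent video tensor into spacetime patch tokens for a transformer. The shapes, channel count, and patch sizes are illustrative assumptions only, not what OpenAI actually uses.

```python
# Minimal sketch of turning a latent video into spacetime patch tokens.
import torch

def patchify(latent_video, patch_t=2, patch_h=2, patch_w=2):
    """latent_video: (T, C, H, W) latent frames -> (num_patches, patch_dim) tokens."""
    T, C, H, W = latent_video.shape
    x = latent_video.reshape(T // patch_t, patch_t, C, H // patch_h, patch_h, W // patch_w, patch_w)
    # Group the three patch axes so each token is one small spacetime cube.
    x = x.permute(0, 3, 5, 1, 2, 4, 6)   # (T', H', W', patch_t, C, patch_h, patch_w)
    tokens = x.reshape(-1, patch_t * C * patch_h * patch_w)
    return tokens

latent = torch.randn(16, 8, 32, 32)       # 16 latent frames, 8 channels, 32x32 each
tokens = patchify(latent)
print(tokens.shape)                       # torch.Size([2048, 64]): 8*16*16 patches of 64 values
```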
We've seen so many text-to-video models,
but very few were able to actually generate
longer form videos, right, more than a few seconds.
And there were often issues with temporal coherence and consistency,
even across those short form generation, three to five seconds.
And here we've got a model that's able to do one minute long,
sort of 60 second generations.
And arguably some of the long-form generations are actually dramatically better
than the short-form generations from Sora, too.
You start to see these emergent properties of temporal coherence
only in the 60-second clips, right?
What's going on there?
What are they doing differently that enables these videos to have such amazing continuity
for long lengths and temporal coherence
of the subjects across those lengths.
I think that was the most surprising
aspect of Sora,
like this ability to generate long content
that is so coherent and consistent and beautiful.
I think that was the part that really amazed me
because I know it's very hard to do
because exactly, like you have to keep track
of a lot of things to make things consistent
and the model doesn't know
what's important to keep track of and what's not.
And somehow the model they've trained
seems to be able to do it.
Again, it's not entirely surprising
because at the end of the day,
these models are trained to essentially compress the training data.
So if you have high-quality training data
where there are transitions,
and of course, the content is consistent with physics,
and it's consistent, and has the right properties
that we would expect a real, natural,
good-quality video to have,
then in order to compress the data as effectively as possible,
then the model should learn about physics,
should learn about object permanence,
should learn about 3D geometry and all of that.
What's surprising is that there are many other kinds of shallower correlations
that your model could discover.
And what was surprising to me is that it seems like
it's really able to learn some of that.
We don't know why.
I think it's one of the mysteries of deep learning.
It's probably a combination of training data,
the right architectures, and of scale,
but it was amazing.
And the 3D properties in their videos,
emerge without any explicit inductive biases for 3D objects, right?
They're purely phenomena of scale.
What does that mean?
Is physics just an emergent property?
Well, it's not inconceivable that at the end of the day,
physics is a framework that can help you understand the world.
It helps you make better predictions.
Like if I understand Newton's law, I can make predictions about what's going to happen
if I drop an object.
And it's a very simple formula that allows me to make a lot of different predictions.
So, if I'm being tasked to compress a lot of videos, if I know Newton's law, if I knew some physics,
I can probably do a better job at predicting what the next frame is going to look like, right?
And at the end of the day, these models, although they are trained by score matching,
it's possible to relate in a very formal sense, the training objective that we're using to train a diffusion model
to a compression-based objective, literally just trying to compress video data in this case as much as you can.
And so it's quite possible that knowing something about physics, knowing something about camera views and 3D structures of objects and object permanence, but these kind of properties are helpful in compressing data because they reveal structure that is helpful to make predictions, which means you can compress the data better.
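The formal link between the diffusion training objective and compression that Stefano alludes to is, roughly, the maximum-likelihood-weighting result from the score-based diffusion literature: with the specific weighting λ(t) = g(t)², the denoising score matching loss upper-bounds the negative log-likelihood, which is an idealized code length, up to a constant that does not depend on the model. Sketched:

```latex
-\,\mathbb{E}_{p_{\mathrm{data}}}\big[\log p_\theta(x_0)\big]
  \;\le\;
  \mathbb{E}_{t,\, x_0,\, x_t}\Big[ \tfrac{g(t)^2}{2}\,
    \big\| s_\theta(x_t, t) - \nabla_{x_t} \log p_t(x_t \mid x_0) \big\|_2^2 \Big]
  \;+\; C
% C does not depend on the model parameters; driving the right-hand side down
% means fewer expected bits to encode the training videos, i.e., better compression.
```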
What's exciting is that it emerges just by training the model, right, that you could imagine other kinds of shallower correlations that exist in the training data, but they are not as useful or as predictive as Newton's law or
like a real physics understanding of the scenes and what's going on.
And it's hard to tell what exactly is going on.
Like, it's possible that there is no real understanding of physics,
but it's certainly very effective, right?
And at the end of the day, maybe that's enough.
What you seem to be pointing out there is if the data that these models are trained on
is kind of like their diet and you are what you eat, in a sense,
if you're eating a ton of physics, then you are better at physics as a model.
Is that how we should begin to sort of explain other emergent properties?
There was a clip that they shared called Bling Zoo,
which was a single prompt-generated video
that had multiple transitions baked in
without any editing and so on.
It was multiple camera shots.
It almost seemed like somebody had manually stitched together
different camera angles, right?
How should we explain that?
Is that just essentially it mimicking
what it's seen in the training data
or is there something else going on?
In that case, I'd imagine that yes.
If you've trained it on high-quality video data
where you can see these transitions across different kinds of shots,
again, what these models do is they try to understand
what all the training videos have in common,
what is the high-level structure of these videos,
and then they try to replicate that.
So a sufficiently good model might understand
that the videos in the training set,
they tend to have this structure
where we transition across different views and shots,
and then we combine them,
and they're combined in interesting ways,
and then it's able to replicate.
Again, what's magic here is that it's provably
an impossible task in general, right?
There are so many other ways of interpolating
between the things you see in the training set.
And most of them are wrong, right?
They are generalizations that you don't want to see.
And somehow these deep neural networks are able to find interpolations or generalizations
that are the ones we want, the ones that make sense.
And they discover the kind of structure that we want the model to replicate
as opposed to the ones that are there by chance and are not the kind of structure that we want
them to pick up.
And that's what's amazing and mostly unexplained at this point.
We do not understand why this happens.
Right.
So we're here now in early 2024, and the question of when video models will get good enough
to cross the uncanny valley has just been answered. We've just gotten to that point.
So if we look ahead now, Sora is still in beta, but there's several other AI generative
sort of video efforts out there already. Realistically, how expensive do you think it's going to be
to generate AI video at any kind of sort of consumer scale, a readily available scale?
I'm sure this release from OpenAI set a lot of competitors off to the races
trying to catch up, and I'm sure we'll see developments coming from all the various competitors
in this space.
I think there is training costs, which are huge.
I'm sure they use thousands of GPUs to train Sora, and scale was a big part of the
success they've seen, and so it's definitely going to be out of reach for academics,
but there is going to be industry players who will have the resources to try to compete,
with them and try to replicate what they did or achieve similar results in a different way.
The good news is that now we have an artifact, we have a system that can do it.
And so I think it's a lot easier to try to catch up.
There is a lot less uncertainty now in whether it's even possible.
Right now we have an example.
It's feasible and a lot of people will make the right investments to really catch up.
I don't know how long it's going to take.
Is it going to be six months?
Is it going to be 12 months?
But somebody will come up with similar performance, I would imagine.
as we've seen with LLMs and in other spaces
where people did catch up.
The question is, how far ahead will OpenAI be
by then? How much better will the system be in six months
or in 12 months?
And that's hard to say.
The other question that I think you were hinting at
is inference, like how expensive is it going to be
to serve these models and provide video generation on demand
to users or personalized videos,
like all these really cool applications
that could emerge from a really good video generation
model. Again, I'm pretty optimistic, especially because the underlying architecture is a
diffusion model. Once you have a potentially big, expensive, large, clunky model that can generate
high quality results, there's been a lot of success in distilling these models down into smaller
ones that are almost as capable, but way faster. So I'm pretty optimistic that once we get
to high enough quality, it's going to be possible to get systems that can serve similar-
quality results in a very inexpensive way.
So I'm pretty excited to see the kind of crazy use cases people come up with
once this technology becomes available.
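"Distilling these models down" typically means training a fast student to match, in one step, what the slow teacher produces in several. Here is a toy sketch of a progressive-distillation-style loss; the models, schedule, and step function are placeholders, not any lab's specific method.

```python
# Toy sketch of step distillation: the student learns to jump, in one call,
# to where the teacher lands after two of its own smaller steps.
import torch

def sampler_step(model, x, t_from, t_to):
    # Placeholder deterministic denoising step from t_from to t_to (details elided).
    return x - (t_from - t_to) * model(x, t_from)

def distillation_loss(student, teacher, x_t, t, t_mid, t_next):
    with torch.no_grad():
        x_mid = sampler_step(teacher, x_t, t, t_mid)           # teacher: first small step
        target = sampler_step(teacher, x_mid, t_mid, t_next)   # teacher: second small step
    pred = sampler_step(student, x_t, t, t_next)               # student: one big step
    return torch.mean((pred - target) ** 2)

# Placeholder networks standing in for real denoisers.
teacher = lambda x, t: 0.1 * x
student = lambda x, t: 0.1 * x
x = torch.randn(2, 3, 8, 8)
print(distillation_loss(student, teacher, x, t=1.0, t_mid=0.5, t_next=0.0).item())
```

Repeating this halving a few times is how a sampler with hundreds of steps can be squeezed into a handful of steps with limited quality loss.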
When we talk about compute, training or inference is a huge part of the calculus there.
But there's this other bucket of costs that come from the datasets required to actually
train these models and get scaling laws to work.
And in terms of training data for language models, there's already billions of data
points that these labs can get across the web.
But for video, a lot of the data, like you were
saying earlier, even if it exists, it's not particularly well labeled or captioned.
And so how do you think video model teams will overcome that challenge?
We've seen recently that Reddit agreed to do a deal to license its data to Google for
$60 million.
Do you think we'll see video production studios begin to license their content out?
That's a good question.
And first of all, it will probably depend a little bit on whether you're thinking of
startups versus more established industry players.
I think startups might be willing to move fast and break things,
maybe be less concerned about copyright
and just scrape from the internet
and then train models and then worry
about licensing the data later.
Bigger players, their legal teams are very, very worried about
big lawsuits. And so they'll want to have something
that is properly licensed. It's going to be
interesting to see whether, as you said,
the studios or people who currently
own the content will be willing to license
it because it could be an existential threat
to their entire business model. I mean, I can see Reddit
licensing their stuff because it's maybe not as much of an existential threat.
The other thing you mentioned is labeling.
It's a great point.
It's going to be a big challenge.
But I'm pretty optimistic, though, that people will be able to set up human-in-the-loop pipelines.
I mean, we've seen great success in visual language models, even video-language models.
They are maybe not good enough to provide high-quality captions out of the box.
But I could imagine that they could drastically speed up, like a human-in-the-loop pipeline,
where they provide suggestions that then get fixed or improved by human labelers.
And so I'm pretty optimistic on the captioning front that we'll be able to find pretty
scalable solutions to that.
Is the sort of logical path there to start with a human in the loop implementation,
build a working model of annotations there, and then ultimately move towards synthetic captioning?
It seems like that's a direction that people are exploring and using, and there's been a lot of
success on kind of like enhancing captions, you know, in synthetic ways using LLMs.
And so, yeah, that would be my bet.
I mean, honestly, I don't know to what extent the bottleneck is on the captioning as
opposed to actually having good high-quality videos, even just the raw high-quality video data,
having access to that.
It's non-trivial.
And my understanding is that's actually one of the bottlenecks that we will have to solve first.
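For what a human-in-the-loop captioning setup might look like in practice, here is a hypothetical sketch; every function is a stand-in, and no real captioning model or API is being referenced.

```python
# Hypothetical human-in-the-loop captioning pipeline of the kind discussed above.
from dataclasses import dataclass

@dataclass
class Clip:
    path: str
    draft_caption: str = ""
    final_caption: str = ""

def draft_with_vlm(clip: Clip) -> str:
    # Stand-in for a video-language model producing a cheap first-pass caption.
    return f"auto-generated description of {clip.path}"

def human_review(draft: str) -> str:
    # Stand-in for an annotator fixing or approving the draft.
    return draft

def caption_dataset(clips):
    for clip in clips:
        clip.draft_caption = draft_with_vlm(clip)               # scalable first pass
        clip.final_caption = human_review(clip.draft_caption)   # human fixes errors
    return clips

labeled = caption_dataset([Clip("clip_0001.mp4"), Clip("clip_0002.mp4")])
print(labeled[0].final_caption)
```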
Ah, gotcha.
Why don't we switch gears a little bit to the usage of these models, right?
One of the things that consumers and creators have really found useful with language models,
generative models, is context windows, right?
The bigger the context window, the more flexibility there is for the input.
You can give it more detailed context, and there's been exponential progress on the language side.
In a very short amount of time, we've gone from tiny context windows to millions of tokens
worth of context windows.
In video, do you expect a similar approach or is there some fundamental limitation?
I was just reading the Gemini 1.5 paper, and they're talking about this
very long context, a million tokens, 10 million tokens.
And actually, one of the applications they mentioned is actually video summarization,
video understanding, like trying to process a one hour long video.
It seems like these very long contexts are going to be very useful to solve a variety of
video processing, video understanding kind of tasks.
And therefore, I would be very surprised if they don't also end up being very useful for video generation.
And in fact, it's entirely possible that that's already
a component that played a role in the OpenAI system.
And part of the reason they are able to generate long videos
is because they can handle very long context
and they are able to scale transformers to very long sequences.
And it's entirely possible that that was part of the secret sauce.
So I'm again, excited about either these attention-based ways
to scale up context through embeddings or ring attention or clever
hardware-focused implementations
like flash attention
that can be scaled up
to very long contexts
or more research in things
like state-space models,
which people are starting
to use also in the context
of diffusion models
that might allow you to get
to very long contexts
and so I think it's going to be
an interesting space to watch for sure.
Well, looking further out
the timelines and breakthroughs,
this is obviously the worst
that this technology will ever be
almost definitionally, right?
We're at the earliest stages
of progress.
Why did people underestimate the timeline so far?
And what does that mean about what's to come next and how quickly it's going to come?
Yeah, this is a really hard question.
And I think I was wrong with my predictions on how long it was going to take to get to the point we are right now with SORA.
And when we're moving exponentially fast, it's very hard to make predictions.
And errors can be pretty big.
But it's going to be exciting for sure.
When you look at sort of video generation as a breakthrough in the broader journey of scaling
laws towards generalizable artificial intelligence, or whether you want to call it ASI or AGI,
how do you quantify this advancement in that broader journey?
I'm very excited about this because I tend to think of a video generative model
as a pretty natural kind of like world model. I mean, as we were discussing before,
in order to be able to generate videos of these qualities, there must be some level of understanding
of physics, of object permanence, 3D structures. Like, there must be a lot of knowledge
that is somehow latent in these models.
And I'm excited about all the different ways
in which we might be able to extract this knowledge
and use it and different applications.
And when we think about, especially like robotics
or other agents that really interact with the real world,
I think the kind of knowledge that is embedded
in these video models that has been extracted by watching,
essentially, a lot of videos is going to be very, very useful.
And what's interesting is I think it's going to be
pretty complementary to the kind of knowledge
that, for example, is baked into an LLM.
You can learn a lot about the world by reading books, essentially,
but there is just going to be a lot of experience that you only get in a way
that is more similar to the kind of experiences you get as a child.
You just walk around the world, you see things,
and you learn about how the world works just through your eyes
and through video essentially, right?
So that kind of experience that we're feeding through a video diffusion model,
and the kind of knowledge that we might be able to extract from that,
is therefore going to be, I think, pretty useful.
I think the title of the blog post that they put out was video generation models as world simulators.
I think there is a lot of promise there.
I think if you can do it at the pixel level, then that means that you've solved the harder version of the problem, right?
And you can do a lot of things on top, whether you're an autonomous vehicle or whether you're building robots or whether you just want to build an agent that understands how the world works and combines the knowledge it's gotten by crawling the internet with what you can see in the real world.
I think it's all going to be pretty exciting.
Well, look, arguably, we would not be here as an industry if it wasn't for your lab.
So it's just so exciting to see how your worldview of where the world was always going to go has accelerated.
I know we could spend hours talking about all the tricks that were required from a research perspective to get here and where we're going to go.
But we're going to wrap up for today.
Thank you so much for making the time.
I'm sure we'll have more to talk about soon.
Thank you very much.
I enjoyed this very much.
If you liked this episode, if you made it this far, help us grow the show.
Share with a friend, or if you're feeling really ambitious, you can leave us a review at rate
thispodcast.com slash a16z.
You know, candidly, producing a podcast can sometimes feel like you're just talking into a void.
And so if you did like this episode, if you liked any of our episodes, please let us know.
I'll see you next time.