The a16z Show - Beyond Uncanny Valley: Breaking Down Sora

Episode Date: February 24, 2024

In early 2024, the notion of high fidelity, believable AI-generated video seemed a distant future to many. Yet, a mere few weeks into the year, OpenAI unveiled Sora, its new state-of-the-art text-to-video model producing videos of up to 60 seconds. The output shattered expectations – even for other builders and researchers within generative AI – sparking widespread speculation and awe.

How does Sora achieve such realism? And are explicit 3D modeling techniques or game engines at play?

In this episode of the a16z Podcast, a16z General Partner Anjney Midha connects with Stefano Ermon, Professor of Computer Science at Stanford and key figure at the lab behind the diffusion models now used in Sora, ChatGPT, and Midjourney. Together, they delve into the challenges of video generation, the cutting-edge mechanics of Sora, and what this all could mean for the road ahead.

Resources:
Find Stefano on Twitter: https://twitter.com/stefanoermon
Find Anjney on Twitter: https://twitter.com/anjneymidha
Learn more about Stefano’s Deep Generative Models course: https://deepgenerativemodels.github.io

Stay Updated:
Find a16z on Twitter: https://twitter.com/a16z
Find a16z on LinkedIn: https://www.linkedin.com/company/a16z
Subscribe on your favorite podcast app: https://a16z.simplecast.com/
Follow our host: https://twitter.com/stephsmithio

Please note that the content here is for informational purposes only; should NOT be taken as legal, business, tax, or investment advice or be used to evaluate any investment or security; and is not directed at any investors or potential investors in any a16z fund. a16z and its affiliates may maintain investments in the companies discussed. For more details please see a16z.com/disclosures.

Transcript
Starting point is 00:00:00 Yeah, honestly, I was very, very surprised. I mean, I know the two of us often talk about how quickly the field is moving, how hard it is to keep track of all the things that are happening. And I was not expecting a model so good coming out so soon. We generally converged on, it was going to be a when, not an if. I thought it was maybe six months out, a year out. So I was shocked when I saw those videos, the quality of the videos, the length, I mean, the ability to generate 60-second videos. I was really amazed. This is obviously the worst that this technology will ever be,
Starting point is 00:00:36 almost by definition, right? We're at the earliest stages of progress here. I've always felt that that is one of the secret weapons of diffusion models and why they are so effective in practice. If you were to ask many people at the beginning of 2024 when we'd get high-fidelity, believable AI-generated video, most would have said that we were years away.
Starting point is 00:00:58 But on February 15th, OpenAI surprised the world with examples from their new model, Sora, bringing those predictions down from years to weeks. And of course, the emergence of this model, and its impressive modeling of physics and videos of up to 60 seconds, have spurred much speculation around not only how this was accomplished, but also how it came so soon. And although OpenAI has stated that the model is a transformer-based diffusion model, the results have been so good that some have even questioned whether explicit 3D modeling or a game engine was involved. So naturally, we decided to bring in an expert.
Starting point is 00:01:36 Sitting down with a16z General Partner Anjney Midha is Professor of Computer Science at Stanford, Stefano Ermon, whose group pioneered the earliest diffusion models and their applications in generative AI. Of course, these approaches laid the foundation of the very diffusion models deployed in Sora, not to mention other household names like ChatGPT and Midjourney. And perhaps most importantly, Stefano has been working on generative AI for more than a decade, long before many of us had even an inkling of what was to come. So throughout this conversation, Stefano breaks down why video has historically been much harder than its text and image counterparts, how a model like Sora might work, and what all of this could mean for the road ahead. And of course, if you want to stay updated on all things AI, make sure to go check out a16z.com/ai. Enjoy.
Starting point is 00:02:27 As a reminder, the content here is for informational purposes only, should not be taken as legal, business, tax, or investment advice, or be used to evaluate any investment or security and is not directed at any investors or potential investors in any a16z fund. Please note that a16z and its affiliates may also maintain investments in the companies discussed in this podcast. For more details, including a link to our investments, please see a16z.com/disclosures. This is like a conversation I've wanted to have with you for a while. We've been talking about flavors of this conversation for a long time, but I think given how quickly things are heating up in this space, it sounded like a good time for us to check in and see how many of the assumptions we've been talking about, about the future of diffusion models and video models, are tracking. You're the world's expert in this area of research.
Starting point is 00:03:20 And so I think it'd just be great to start with your lab's sort of involvement in the origins. Yeah, I'm excited to be here. Hi, everyone. My name is Stefano, and I'm a professor of computer science at Stanford. I work in AI and I've been actually working in generative AI for more than 10 years. Way before these things were cool. I teach a class at Stanford on deep generative models. It's something I started back in, again, 2018. And I think it was the first in the world on this topic. And yeah, I encourage you to check out the website. It's called CS236. There's lots of materials if you want to dig deeper into how these methods work. And I've been doing research in generative models for a long time, as you mentioned, with my former student, Yang Song, who is now at OpenAI,
Starting point is 00:04:05 who did some of the early work on diffusion models, or score-based models, as we used to call them back then. Back in 2019, generative models of images, video, audio, these kinds of continuous data modalities, were really dominated by GANs, generative adversarial networks. And we were really the first to show that it was actually possible to beat GANs at their own game using this new class of generative models called diffusion models, where we essentially generate content, we generate images, by starting from pure noise and progressively denoising it using a neural network
Starting point is 00:04:38 until we turn it into a beautiful sample. We developed a lot of the theory behind these models, how to train them, how to use score matching, a lot of the initial architectures, and some of those choices are still around today. And I think that work really kick-started a lot of the exciting things we're seeing today around diffusion models,
Starting point is 00:04:55 Stable Diffusion, Sora, of course. And in addition to that early work on the foundations of diffusion models, we've worked on a number of other aspects of diffusion models, like DDIM, which is pretty widely used, an efficient sampling procedure that allows you to generate images very quickly, without too much loss in quality. With my student Chenlin Meng, who is the CTO of Pika Labs, we developed SDEdit, which is one of the first methods to do controllable generation, to generate images based on sketches and things like that.
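The idea of starting from pure noise and progressively denoising it can be sketched in a few lines of Python. This is a toy illustration under stated assumptions, not the method used in any of the systems discussed: the "dataset" is a made-up 1-D Gaussian (`DATA_MEAN`, `DATA_STD`), so its score is known in closed form and stands in for a trained neural network, and the sampler is a simple deterministic Euler integration in the spirit of the differential-equation view of diffusion, not the exact DDIM update.

```python
import random
import statistics

# Toy assumption: the data distribution is a 1-D Gaussian, so the score of the
# noised data is available in closed form instead of coming from a trained network.
DATA_MEAN, DATA_STD = 3.0, 0.5


def score(x, sigma):
    """Gradient of log N(x; DATA_MEAN, DATA_STD^2 + sigma^2) with respect to x."""
    return -(x - DATA_MEAN) / (DATA_STD ** 2 + sigma ** 2)


def sample(rng, num_steps=200, sigma_max=50.0, sigma_min=0.01):
    """Start from pure noise and denoise with Euler steps on dx/dsigma = -sigma * score."""
    # Geometric noise schedule from sigma_max down to sigma_min (hypothetical values).
    ratio = (sigma_min / sigma_max) ** (1.0 / num_steps)
    sigma = sigma_max
    x = rng.gauss(0.0, sigma_max)  # pure noise
    for _ in range(num_steps):
        next_sigma = sigma * ratio
        # One denoising step: move x along the score as the noise level drops.
        x += (next_sigma - sigma) * (-sigma * score(x, sigma))
        sigma = next_sigma
    return x


rng = random.Random(0)
samples = [sample(rng) for _ in range(5000)]
# The sample statistics should land close to DATA_MEAN and DATA_STD.
print(statistics.mean(samples), statistics.stdev(samples))
```

In a real model the `score` function would be a large neural network conditioned on the noise level (and on text), and each sample would be an image or video rather than a single number.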
Starting point is 00:05:27 So, yeah, I'm excited to be here today and discuss what's coming next. So given all of that work, all of the experience you've had in really the ground truth of diffusion models and their limitations, what was your reaction to seeing a model like Sora come out last week? Yeah, honestly, I was very, very surprised. I mean, I know the two of us often talk about how quickly the field is moving, how hard it is to keep track of all the things that are happening. And I was not expecting a model so good coming out
Starting point is 00:05:54 so soon. I mean, I don't think there is anything fundamentally impossible and I knew it was coming. It was just a matter of time and just more research, more investments, more people working on these things. But I was not expecting something that good happening so soon. I thought it was maybe six months out, a year out. So I was shocked when I saw those videos, the quality of the videos, the length, I mean, the ability to generate 60-second videos. I was really amazed. Yeah, I do think every time we've talked in the past, we've generally converged on, it was going to be a when, not an if, that we'd see video generation get so good. And so it's reassuring to hear you were as surprised as I was when it came out. But before we get into the details on what maybe some of the breakthroughs were there on this time frame, maybe we can just spend a few minutes talking about video diffusion, sort of a 101 for folks who may not be as familiar with these models.
Starting point is 00:06:48 Why has video diffusion been so much more complex than text or image generation? And what's historically been the main blocker in making it work? Yeah, that's a great question. At a very high level, you can think of a video as just a collection of images. So really, the first challenge you have to deal with
Starting point is 00:07:04 is that you're generating multiple images at the same time. And so the compute cost that you need to process N images is at least N times larger than what you would pay if you just want to process one of them at a time. And basically, this means a lot more compute, a lot more memory,
Starting point is 00:07:20 and it's just much more expensive to train a large-scale model on video data. The other challenge is just a data challenge. I think a lot of the success we've seen in diffusion models for images was partially due to the availability of publicly available datasets like LAION, large-scale image-and-caption datasets. They were scraped from the internet and they were made available and people could use them to train large-scale models.
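The "at least N times" cost argument can be made concrete with a back-of-the-envelope calculation. The sketch below is an assumption-laden illustration: the token count per frame, the frame rate, and the two cost models (linear per-token work and quadratic full self-attention across all frames) are hypothetical choices for the arithmetic, not figures from the conversation or from any published system.

```python
def relative_cost(num_frames, tokens_per_frame=256):
    """Cost of one video relative to one image, under two simple cost models."""
    image_tokens = tokens_per_frame
    video_tokens = num_frames * tokens_per_frame
    return {
        # Per-token work (convolutions, MLP layers) grows linearly with token count.
        "linear": video_tokens / image_tokens,
        # Full self-attention over every frame grows with the square of token count.
        "quadratic_attention": (video_tokens / image_tokens) ** 2,
    }


# Hypothetical clip: 60 seconds at 24 frames per second, i.e. 1440 frames.
print(relative_cost(num_frames=60 * 24))
```

The linear term is the "at least N times" lower bound; any component that lets every frame attend to every other frame pushes the cost well above it.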
Starting point is 00:07:47 I think we don't quite have that for video. I mean, there is a lot of video data, but the quality is kind of like a mixed bag and we don't have a good way to filter or screen them or there's not a go-to data set that is available and everybody's using to train these models.
Starting point is 00:08:02 I'm guessing some of the innovations that went into the Sora model were actually on just selecting good quality data to train the models on. Captions are also hard to get for video. I mean, the video data is out there, but getting good labels, good descriptions of what's happening in the videos
Starting point is 00:08:17 is challenging and you need that if you want to have good control over the kind of content that you generate with these models. And then there is also a challenge of video content. It's just more complex. There is more going on if you think about a sequence of images as opposed to just one. There are complex relationships between the frames.
Starting point is 00:08:34 There is physics. There is object permanence. And in principle, I think a high capacity model with enough compute, enough data, can potentially learn these things. But it was always an empirical question, how much data are you going to need? How much compute are you going to need?
Starting point is 00:08:50 When is that going to happen? Is the model really going to discover all these high-level concepts and statistics of the data, essentially? And it was surprising to see that it's doing so well. You just laid out very clearly what the general obstacles have been, both on architecture, on datasets, on representations of the world via video as a format. Since the release came out last week, there's been a lot of speculation around how this model can achieve such impressive results.
Starting point is 00:09:17 Some folks were even speculating that there might be a game engine or 3D model, sort of explicit 3D modeling or geometry involved in the inference pipeline. But in their article, OpenAI describes the approach by saying that they train text-conditional diffusion models jointly on videos and images with different durations and resolutions and then apply a transformer architecture on spacetime patches of video and image latent codes. And so could you just break that down in layman's terms for folks who might not be as familiar with scaling laws and what's going on here? Sure, yeah, I can try.
Starting point is 00:09:47 I mean, there is certainly some secret sauce here, and I can try to read between the lines of what they said in their release. The idea of training on videos and images is not new. It seems like one technical difference that they are hinting at is the use of a transformer architecture
Starting point is 00:10:02 for the backbone, for the denoiser, for the score model. People often use convolutional architectures, going back to the days when Yang initially started using U-Nets as the score model, which was actually a key innovation that really enabled a lot of the success on images, and people kind of still ported those kinds of architectures over to video data as well,
Starting point is 00:10:23 because they make sense. We expect that there is a lot of convolutional structure, and a convolutional architecture might be a good idea. It seems like they moved on to a purely transformer-based architecture, probably following the work by Saining Xie at NYU, who did some of the initial work in this space, developing good transformer architectures, like ViT-based architectures for diffusion models.
Starting point is 00:10:46 It's possible that that gives you better scaling with respect to compute and data and just happens to work better. They are also referring to latent codes, so it seems unlikely that they're working directly in pixel space. Working on latent representations was one of the key things, the key innovations behind Stable Diffusion, latent diffusion, like the idea of first compressing the data into a latent representation that is a little bit smaller or a little bit more compact. We expect that there is a lot of redundancy if you think about the different frames in a video, and so it might be possible to compress it almost losslessly into a lower dimensional representation.
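To get a feel for why working in a latent space saves compute and memory, the arithmetic below compares the number of values in pixel space and in a compressed latent. The image numbers mirror the commonly cited Stable Diffusion configuration (8x spatial downsampling into 4 latent channels); the video shape and the temporal compression factor are purely hypothetical, since the details of Sora's encoder have not been published.

```python
def count_values(frames, height, width, channels):
    """Total number of scalar values in a (frames, height, width, channels) tensor."""
    return frames * height * width * channels


def latent_shape(frames, height, width,
                 spatial_factor=8, temporal_factor=1, latent_channels=4):
    """Shape after downsampling space (and optionally time) into latent channels."""
    return (frames // temporal_factor,
            height // spatial_factor,
            width // spatial_factor,
            latent_channels)


# A single 512x512 RGB image versus its latent.
image = (1, 512, 512, 3)
image_ratio = count_values(*image) / count_values(*latent_shape(1, 512, 512))
print(image_ratio)  # 48.0

# Hypothetical 240-frame clip with an assumed extra 4x compression along time.
video = (240, 512, 512, 3)
video_ratio = count_values(*video) / count_values(
    *latent_shape(240, 512, 512, temporal_factor=4))
print(video_ratio)
```

Because neighboring frames are highly redundant, compressing along time as well as space is where a video encoder could plausibly gain the most, which is the point being made in the conversation.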
Starting point is 00:11:24 And if you can do that, then you can kind of train on this lower dimensional representation and you can get much better trade-offs in terms of like the compute that you pay and the kind of memory that you need to process the data. So they might have figured out a better way to encode video to a semantically meaningful kind of like lower-dimensional latent space. I would say this doesn't rule out
Starting point is 00:11:45 the possibility that they've used game engines or 3D models to generate training data. I mean, as we discussed before, I think the quality of the training data is really crucial. And it's possible that they've used synthetic data generated by game engines or NeRF-like 3D models
Starting point is 00:12:01 to generate the kind of data they want to see, where there is a lot of motion. They can probably use the internals of the engine to get very good captions about what's happening in the video, so they can get a very good match between the text and the content that they're trying to generate. So, I mean, it does seem like it's a purely data-driven approach,
Starting point is 00:12:19 but it's possible that they've used other pipelines to generate synthetic data. And when you contrast the diffusion transformer approach that they took with sort of the prior generations of many flavors of generative models, whether it was recurrent nets, GANs, just vanilla autoregressive transformers, why is it that a diffusion model here seemed, again, to be sort of the uniquely suited best tool for the job? I think prior to diffusion models, yeah,
Starting point is 00:12:48 people were using GANs, generative adversarial networks. GANs are pretty good. They're pretty flexible, but one challenge is they're very unstable to train. So that was actually one of the main reasons we developed diffusion models in the first place. We wanted to retain the flexibility of basically an arbitrary neural network as the backbone, but with a more principled, statistical way of training the models that leads you to a stable training loss
Starting point is 00:13:10 where you can just keep training and the model just keeps getting better and better. Autoregressive models also have that property, and they're trying to compress the data, and with enough capacity and compute, in principle, they can do a pretty good job on modeling anything, including video. They just tend to be very slow
Starting point is 00:13:26 because you have to generate one token at a time. And if you think of video, there are a lot of tokens, and they've not been the model of choice for that reason. Diffusion models, on the other hand, can generate tokens essentially in parallel, and so they can be much faster, and that's one of the reasons they are preferred, I think, for these kinds of modalities.
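The speed difference between the two families comes down to how many sequential network calls are needed. The sketch below counts those calls with a placeholder model; `CountingModel`, the token count, and the number of denoising steps are all hypothetical, and no real generation happens.

```python
class CountingModel:
    """Placeholder network that only counts how many times it is called."""

    def __init__(self):
        self.calls = 0

    def __call__(self, tokens):
        self.calls += 1
        return [0.0 for _ in tokens]  # dummy output, one value per input token


def generate_autoregressive(model, num_tokens):
    """One sequential call per generated token."""
    tokens = []
    for _ in range(num_tokens):
        output = model(tokens + [0.0])  # predict the next token from the prefix
        tokens.append(output[-1])
    return tokens


def generate_diffusion(model, num_tokens, num_steps=50):
    """One call per denoising step; every token is refined in parallel."""
    tokens = [0.0] * num_tokens  # stand-in for pure noise
    for _ in range(num_steps):
        tokens = model(tokens)
    return tokens


ar_model, diffusion_model = CountingModel(), CountingModel()
generate_autoregressive(ar_model, num_tokens=1000)
generate_diffusion(diffusion_model, num_tokens=1000)
print(ar_model.calls, diffusion_model.calls)
```

The autoregressive call count grows with the length of the output, while the diffusion call count depends only on the chosen number of denoising steps, which is why the gap widens for long videos.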
Starting point is 00:13:46 Another more philosophical reason is that if you think about a diffusion model, in some sense, at inference time you have access to a very deep computation graph where essentially you can apply a neural network maybe a thousand times or if you take a continuous time perspective, like a differential equation perspective, it can even be an infinitely deep kind of computation graph that you can use to generate content, while at the same time you don't have to unroll the entire computation graph at training time because the models are trained by score matching and there is this clever way of kind of like trying to make the model better and better without ever having to pay a huge price at training time. And so I've always felt that that is
Starting point is 00:14:21 one of the secret weapons of diffusion models and why they are so effective in practice, because they allow you to use a lot of resources at inference time without having to pay that price during training time. And to our earlier point about why the Sora breakthrough happened so much faster than many expected, it sounds like the stability of the diffusion transformer model and being able to swap out essentially training time for inference time, which is much cheaper, much more parallelizable, much more efficient, was a big contributor to compressing the training times here. Yeah.
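The point about not unrolling the computation graph at training time can be illustrated with a denoising objective, which is closely related to score matching: each training example picks one random noise level and calls the model once, even though sampling later chains many calls. In this sketch the 1-D Gaussian "dataset", the noise range, and both denoisers are assumptions made for illustration; `optimal_denoiser` is a closed-form stand-in for a trained network, and its score can be recovered as `(denoised - noisy) / sigma**2`.

```python
import math
import random

DATA_MEAN, DATA_STD = 3.0, 0.5  # toy 1-D Gaussian "dataset" (assumption)


def optimal_denoiser(noisy, sigma):
    """Posterior mean E[x0 | noisy] for Gaussian data; stands in for a trained network."""
    weight = DATA_STD ** 2 / (DATA_STD ** 2 + sigma ** 2)
    return DATA_MEAN + weight * (noisy - DATA_MEAN)


def identity_denoiser(noisy, sigma):
    """Deliberately bad baseline that performs no denoising at all."""
    return noisy


def training_loss(denoiser, rng, sigma_min=0.01, sigma_max=50.0):
    """One training example: a single random noise level and a single model call."""
    x0 = rng.gauss(DATA_MEAN, DATA_STD)
    sigma = math.exp(rng.uniform(math.log(sigma_min), math.log(sigma_max)))
    noisy = x0 + sigma * rng.gauss(0.0, 1.0)
    return (denoiser(noisy, sigma) - x0) ** 2


def average_loss(denoiser, num_examples=2000, seed=0):
    """Average the per-example loss; a real trainer would backpropagate through it."""
    rng = random.Random(seed)
    return sum(training_loss(denoiser, rng) for _ in range(num_examples)) / num_examples


good = average_loss(optimal_denoiser)
bad = average_loss(identity_denoiser)
print(good, bad)  # the better denoiser gets the lower loss
```

Nothing in `training_loss` runs the sampling loop, so the cost of a training step does not depend on how many denoising steps are used at inference time.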
Starting point is 00:14:49 And there is a question of what's the backbone, right? That could be a convolutional network. It can be a state-space model. It could be a transformer. There, I think we're still just scratching the surface in terms of what works and what doesn't, and all combinations are possible, right? You can build autoregressive models that are convolutional,
Starting point is 00:15:05 you can build autoregressive models that are based on transformers. You can build autoregressive models that are based on state-space architectures. And similarly, you can build diffusion models that are based on convolutional architectures, which is what people tend to do. And now it seems like maybe OpenAI is pushing towards, no, let's just use a transformer as the backbone in a diffusion model. And I'm starting to see people exploring state-space models, which might allow for very long contexts, for example.
Starting point is 00:15:31 So I think there is an exciting space of different kinds of combinations that we can try that might give us better scaling, better properties for really getting to the kind of quality we want to see from these models. One of the most elegant parts of the transformer backbone architecture is that it works really well with the idea of tokenization, right? And in language models, so much of the scaling laws work that allowed models like GPT-3 and GPT-4 to be developed so quickly and generalize to all kinds of tasks was that the process of tokenizing language allows almost like a transformation or translation into a format that the models can
Starting point is 00:16:12 understand across many, many different types of languages, whether that's good old-fashioned English or it's code or it's health records or it's different, in some cases, multilingual datasets. And so the beauty of tokenization is it's like this one-size-fits-all process of turning language data into a format that the transformer backbone really understands well and is able to learn on. It seems like there was a similar key unlock around how visual data was broken down here into small patches, right? They essentially tokenized image and video data into this sort of intermediate representation
Starting point is 00:16:47 of a patch. Is that why this approach created a meaningfully better output than other models we've seen in the past? That's a great question. And honestly, I don't know the answer. I think tokenization makes a lot of sense for discrete data, like code or text. I'm less of a fan of tokenizing or patchifying images and videos and audio. I feel like you have to do it if you want to use a transformer architecture, but it makes less sense to me just because the data is continuous and the patches are very arbitrary
Starting point is 00:17:16 and you're losing some of the structure there by going through a tokenization. You kind of have to do it if you want to use transformers, and transformers are great because they scale well and we have very good implementations and they are very hardware friendly. And so maybe that's the way to go. It's the bitter lesson once again. But it feels like what they have is some kind of latent representation, again, and maybe once you go to a latent space, then tokenizing it might make more sense because you've already lost a lot of the structure.
Starting point is 00:17:47 And so it may be a combination of both. I mean, it seems like they might have access to a very good latent space where they get rid of a lot of the kind of redundancy that exists in natural data, especially in videos, where two frames next to each other are very similar, right? So there is a lot of redundancy in videos. And if they got rid of some of that through a clever encoding scheme, and then they apply tokenization, I think that starts to make more sense and makes things more scalable, less compute, less memory, and just better. We've seen so many text-to-video models, but very few were able to actually generate longer form videos, right, more than a few seconds.
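Cutting a latent video into spacetime patches, each of which becomes one token for the transformer, can be sketched as follows. This is a minimal illustration and not Sora's tokenizer, whose details are not public: the single-channel toy latent, its shape, and the patch sizes `pt`, `ph`, `pw` are hypothetical.

```python
def patchify(video, pt, ph, pw):
    """Split video[t][y][x] into flattened spacetime patches of size pt x ph x pw."""
    frames, height, width = len(video), len(video[0]), len(video[0][0])
    if frames % pt or height % ph or width % pw:
        raise ValueError("patch sizes must divide the video dimensions")
    patches = []
    for t0 in range(0, frames, pt):
        for y0 in range(0, height, ph):
            for x0 in range(0, width, pw):
                # Flatten one block spanning pt frames and a ph x pw region.
                patches.append([video[t][y][x]
                                for t in range(t0, t0 + pt)
                                for y in range(y0, y0 + ph)
                                for x in range(x0, x0 + pw)])
    return patches


# Toy single-channel latent: 4 frames of 8x8, with values that encode (t, y, x).
latent = [[[t * 100 + y * 10 + x for x in range(8)] for y in range(8)]
          for t in range(4)]
tokens = patchify(latent, pt=2, ph=4, pw=4)
print(len(tokens), len(tokens[0]))  # number of tokens, values per token
```

Where the patch boundaries fall is arbitrary with respect to the content, which is the concern raised above; applying this to an already compressed latent rather than raw pixels is one way that arbitrariness might matter less.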
Starting point is 00:18:24 And there were often issues with temporal coherence and consistency, even across those short-form generations, three to five seconds. And here we've got a model that's able to do one-minute-long, sort of 60-second generations. And arguably some of the long-form generations are actually dramatically better than the short-form generations from Sora too. You start to see these emergent properties of temporal coherence only in the 60-second clips, right? What's going on there? What are they doing differently that enables these videos to have such amazing continuity for long lengths and temporal coherence of the subjects across those lengths? I think that was the most surprising aspect of Sora, like this ability to generate long content
Starting point is 00:19:03 I think that was the part that really amazed me because I know it's very hard to do because exactly, like you have to keep track of a lot of things to make things consistent and the model doesn't know what's important to keep track of and what's not. And somehow the model they've trained seems to be able to do it. Again, it's not entirely surprising
Starting point is 00:19:23 because at the end of the day, these models are trained to essentially compress the training data. And so if you have high-quality training data where there are transitions, and of course, the content is consistent with physics, and it has the right properties that we would expect a real, natural, good quality video to have, then in order to compress the data as effectively as possible, the model should learn about physics, should learn about object permanence, should learn about 3D geometry and all of that. What's surprising is that there are many other kind of like shallower correlations that
Starting point is 00:19:59 a model could discover. And what was surprising to me is that it seems like it's really able to learn some of that. We don't know why. I think it's one of the mysteries of deep learning. It's probably a combination of training data, the right architectures and scale, but it was amazing. And the 3D properties in their videos emerge without any explicit inductive biases for 3D objects, right? They're purely phenomena of scale. What does that mean? Is physics just an emergent property? Well, it's not inconceivable that at the end of the day, physics is a framework that can help you understand the world. It helps you make better predictions. Like if I understand Newton's law, I can make predictions about what's going to happen if I drop an object. And it's a very simple formula that
Starting point is 00:20:42 allows me to make a lot of different predictions. So if I'm being tasked to compress a lot of videos, if I know Newton's law, if I knew some physics, I can probably do a better job at predicting what the next frame is going to look like, right? And at the end of the day, these models, although they are trained by score matching, it's possible to relate in a very formal sense, the training objective that we're using to train a diffusion model to a compression-based objective,
Starting point is 00:21:08 literally just trying to compress video data in this case as much as you can. And so it's quite possible that knowing something about physics, knowing something about camera views and 3D structures of objects and object permanence, all these kinds of properties are helpful in compressing data because they reveal structure that is helpful to make predictions, which means you can compress the data better. What's exciting is that it emerges just by training the model, right? You could imagine other kinds of shallower correlations that exist in the training data, but they're not as useful or as predictive as a Newton's law or like a real physics
Starting point is 00:21:43 understanding of the scenes and what's going on. And it's hard to tell what exactly is going on, like it's possible that there is no real understanding of physics, but it's certainly very effective, right? And at the end of the day, maybe that's enough. What you seem to be pointing out there is if the data that these models are trained on is kind of like their diet and you are what you eat in a sense. If you're eating a ton of physics, then you are better at physics as a model. Is that how we should begin to sort of explain other emergent properties? There was a clip that they shared called Bling Zoo, which was a single-prompt generated video that had multiple transitions baked in without any editing and so on. It was multiple camera shots. It almost seemed like somebody had
Starting point is 00:22:20 manually stitched together different camera angles, right? How should we explain that? Is that just essentially it mimicking what it's seen in the training data? Is there something else going on? In that case, I'd imagine that, yes. If you've trained it on high-quality video data where you can see these transitions across different kinds of shots, again, what these models do is they try to understand
Starting point is 00:22:40 what all the training videos have in common, what is the high-level structure of these videos, and then they try to replicate that. So a sufficiently good model might understand that the videos in the training set tend to have this structure where we transition across different views and shots, and then we combine them and they're combined in interesting ways, and then it's able to replicate that.
Starting point is 00:23:02 Again, what's magic here is that it's provably an impossible task in general, right? There are so many other ways of interpolating between the things you see in the training set, and most of them are wrong, right, are generalizations that you don't want to see. Right. And somehow these deep neural networks are able to find interpolations or generalizations that are the ones we want, the ones that make sense. And they discover the kind of structure that we want the model to replicate,
Starting point is 00:23:28 as opposed to the ones that are there by chance and are not the kind of structure that we want them to pick up. And that's what's amazing and mostly unexplained at this point. We do not understand why this happens. Right. So we're here now in early 2024. And the threshold of when video models get good enough to cross the uncanny valley has just been breached, right?
Starting point is 00:23:52 We've just gotten to that point. So if we look ahead now, Sora is still in beta, but there are several other generative AI video efforts out there already. Realistically, how expensive do you think it's going to be to generate AI video at any kind of sort of consumer scale,
Starting point is 00:24:07 a readily available scale? I'm sure this release from OpenAI set off a lot of competitors to the races and trying to catch up. And I'm sure we'll see developments coming from all the various competitors in this space. I think there are training costs, which are huge. I'm sure they used thousands of GPUs to train Sora,
Starting point is 00:24:27 and scale was a big part of the success they've seen. And so it's definitely going to be out of reach for academics, but there are going to be industry players who will have the resources to try to compete with them and try to replicate what they did or achieve similar results in a different way. The good news is that now we have an artifact, we have a system that can do it. And so I think it's a lot easier to try to catch up,
Starting point is 00:24:50 there is a lot less uncertainty about whether it's even possible, right? Now we have an example. It's feasible and a lot of people will make the right investments to really catch up. I don't know how long it's going to take. Is it going to be six months? Is it going to be 12 months?
Starting point is 00:25:03 But somebody will come up with similar performance, I would imagine, as we've seen with LLMs and in other spaces where people can catch up eventually. The question is how far ahead will OpenAI be by then? How much better will the system be in six months or in 12 months? That's hard to say. The other question that I think you were hinting at is inference, like how expensive is it going to be to serve these models
Starting point is 00:25:26 and provide video generation on demand to users or personalized videos, like all these really cool applications that could emerge from a really good video generative model. Again, there I'm pretty optimistic, especially because the underlying architecture is a diffusion model. Once you have a potentially big, expensive, large, clunky model that can generate high-quality results, there's been a lot of success in distilling these models down into smaller ones
Starting point is 00:25:53 that are almost as capable, but way faster. So I'm pretty optimistic that once we get to high enough quality, it's going to be possible to get systems that can serve similar quality results in a very inexpensive way. So I'm pretty excited to see the kind of crazy use cases people come up with once this technology becomes available. When we talk about compute, training or inference is a huge part of the calculus there. But there's this other bucket of costs that come from the datasets required to actually train these models and get scaling laws to work. And in terms of training data for language models, there's already billions of data points that these labs can get across the web. But for video, a lot of the data, like you were saying earlier, even if it exists,
Starting point is 00:26:37 it's not particularly well labeled or captioned. And so how do you think video model teams will overcome that challenge? We've seen recently that Reddit agreed to do a deal to license its data to Google for $60 million. Do you think we'll see video production studios begin to license their content out? That's a good question. And first of all, it will probably depend a little bit on whether you're thinking of startups versus
Starting point is 00:26:59 more established industry players. I think startups might be willing to move fast and break things, maybe be less concerned about copyright and then just scrape from the internet and then train models and then worry about licensing the data later. Bigger players, their legal teams are very, very, very worried about big lawsuits, and so they'll want to have something that is properly licensed. It's going to be interesting to see whether, as you said, the studios or people who currently
Starting point is 00:27:27 own the content will be willing to license it, because it could be an existential threat to their entire business model. I can see Reddit licensing their stuff because it's maybe not as much of an existential threat. The other thing you mentioned is labeling. It's a great point. It's going to be a big challenge. I'm pretty optimistic, though, that people will be able to set up human-in-the-loop pipelines. I mean, we've seen great success in visual language models, even video language models. They are maybe not good enough to provide high-quality captions out of the box, but I could imagine that they could drastically speed up a human-in-the-loop pipeline where they provide suggestions that then get fixed or improved by human labelers. And so I'm pretty optimistic on the captioning
Starting point is 00:28:14 front that we will be able to find pretty scalable solutions to that. Is the sort of logical path there to start with a human-in-the-loop implementation, build a working model of annotations there, and then ultimately move towards synthetic captioning? It seems like that's a direction that people are exploring and using, and there's been a lot of success in enhancing captions in synthetic ways using LLMs. And so, yeah, that would be my bet. I mean, honestly, I don't know to what extent the bottleneck is
Starting point is 00:28:44 on the captioning as opposed to actually having good high-quality videos. Even just having access to the raw high-quality video data is non-trivial, and my understanding is that it's actually one of the bottlenecks that we will have to solve first. Ah, gotcha. Why don't we switch gears a little bit to the usage of these models, right? One of the things that consumers and creators have really found useful with language models, generative models, is context windows, right? The bigger the context window, the more flexibility there is for the input. You can give it more detail. And there's been exponential progress on the language side. In a very short amount of time, we've gone from tiny context windows to millions of tokens' worth of context windows.
Starting point is 00:29:23 In video, do you expect a similar approach or is there some fundamental limitation? I was just reading the Gemini 1.5 paper, and they're talking about this very long context, a million tokens, 10 million tokens. And actually, one of the applications they mention is video summarization, video understanding, like trying to process a one-hour-long video. It seems like these very long contexts are going to be very useful for solving a variety of video processing and video understanding tasks. And therefore, I would be very surprised if they don't also end up being very useful for video generation. And in fact, it's entirely possible that that's already a component that played a role in the OpenAI system.
Starting point is 00:30:06 And part of the reason they are able to generate long videos is because they can handle very long context, and they are able to scale transformers to very long sequences, and it's entirely possible that that was part of the secret sauce. So I'm again excited about either these attention-based ways to scale up context, through embeddings or Ring Attention or clever, hardware-focused implementations like FlashAttention that can be scaled up to very long contexts, or more researchy things like state space models that people are starting to use, also in the context of diffusion models, that might allow you to get to very long contexts.
Starting point is 00:30:46 And so I think it's going to be an interesting space to watch for sure. Well, looking further out, the timelines and breakthroughs, this is obviously the worst that this technology will ever be, almost definitionally. Right. We're at the earliest stages of progress here. Why did people underestimate the timeline so far? And what does that mean about what's to come next and how quickly it's going to come? Yeah.
Starting point is 00:31:06 This is a really hard question. And I think I was wrong with my predictions on how long it was going to take to get to the point we are at right now with Sora. And when we're moving exponentially fast, it's very hard to make predictions, and errors can be pretty big. But it's going to be exciting for sure. When you look at video generation as a breakthrough in the broader journey of scaling laws towards generalizable artificial intelligence, whether you want to call it ASI or AGI, how do you quantify this advancement in that broader journey?
Starting point is 00:31:39 I'm very excited about this because I tend to think of a video generative model as a pretty natural kind of world model. I mean, as we were discussing before, in order to be able to generate videos of this quality, there must be some level of understanding of physics, of object permanence, 3D structures. Like, there must be a lot of knowledge that is somehow latent in these models. And I'm excited about all the different ways in which we might be able to extract this knowledge and use it in different applications. When we think about, especially, robotics or other agents that really interact with the real world, I think the kind of knowledge that is embedded in these video models, that has been extracted by watching, essentially, a lot of videos, is going to be very, very useful.
Starting point is 00:32:21 And what's interesting is I think it's going to be pretty complementary to the kind of knowledge that, for example, is baked into an LLM. You can learn a lot about the world by reading books, essentially, but there is just going to be a lot of experience that is more similar to the kind of experience you get as a child, right? You just walk around the world, you see things, and you learn about how the world works just through your eyes and through video, essentially, right? So that kind of experience that we're feeding to a video diffusion model, and the kind of knowledge that we might be able to extract from that, is therefore going to be, I think, pretty useful. I think the title of the blog post that they put out was "Video generation models as world simulators." I think there is a lot of promise there.
Starting point is 00:33:05 I think if you can do it at the pixel level, then that means that you've solved the harder version of the problem. Right. And you can do a lot of things on top, whether you're an autonomous vehicle or whether you're building robots or whether you just want to build an agent that understands how the world works and combines the knowledge
Starting point is 00:33:22 it's gotten by crawling the Internet with what you can see in the real world. I think it's all going to be pretty exciting. Well, look, arguably, we would not be here as an industry if it wasn't for your lab. So it's just so exciting to see how your worldview of where the world was always going to go has accelerated. I know we could spend hours talking about all the tricks that were required from a research
Starting point is 00:33:44 perspective to get here and where we're going to go. But we're going to wrap up for today. Thank you so much for making the time. I'm sure we'll have more to talk about soon. Thank you very much. I enjoyed this very much. If you liked this episode, if you made it this far, help us grow the show. Share with a friend or if you're feeling really ambitious,
Starting point is 00:34:05 you can leave us a review at ratethispodcast.com/a16z. You know, candidly, producing a podcast can sometimes feel like you're just talking into a void. And so if you did like this episode, if you liked any of our episodes, please let us know. I'll see you next time.
