The AI Daily Brief: Artificial Intelligence News and Analysis - The Science Behind Sora, OpenAI's Game Changing Video Model

Starting point is 00:00:00 Today on the AI Breakdown, we're digging into the research behind OpenAI's incredible new video model, Sora. The AI Breakdown is a daily podcast and video about the most important news and discussions in AI. Go to breakdown.network for more information about our YouTube, our Discord, and our newsletter. Welcome back to the AI Breakdown. It is now a few days after the launch, or at least the announcement, of Sora, the new video model from OpenAI. As you might imagine, the first reaction of the world was utter shock and disbelief. Somehow OpenAI had figured out a way to blow us away once again. A technology which Yann LeCun from Meta, for example, said we just didn't have just days earlier,

Starting point is 00:00:48 turned out to be there waiting for us inside OpenAI's research halls. Now, as some people tried to grapple with what it meant for deepfakes to have video this convincing, or how it would change the nature of the entertainment industry, others were trying to tear it down, saying it wasn't that good or that everyone was being hyperbolic. I would suggest, when you see things like that, double-checking that the person's business model isn't being a skeptic. In any case, a few hours after the announcement, OpenAI also dropped their research.

Starting point is 00:01:16 So what we're going to do today is try to go a little bit behind the scenes of what's actually going on underneath the hood. This is pretty dense, and we're going to be going off of a lot of what OpenAI has published, but hopefully this gives you a better sense of the technology itself. You can find this research paper on OpenAI's website. The research is called Video Generation Models as World Simulators. And one of the most important lines of the abstract is this one. Our results suggest that scaling video generation models is a promising path towards building general purpose simulators of the physical world.

Starting point is 00:01:49 In this line, you get a picture of why this matters to OpenAI. It is not just that they are trying to build the best version of any generative AI tool, although on an intermediate scale that certainly is the case. At the same time, they also see this as a means to the end of getting to AGI. The ability to simulate the physical world is a key capacity in their estimation of AGI, and so in many ways, Sora is actually doing double duty. Let's start with their section called Turning Visual Data into Patches. They write, we take inspiration from LLMs, which acquire generalist capabilities by training

Starting point is 00:02:22 on internet-scale data. The success of the LLM paradigm is enabled in part by the use of tokens that elegantly unify diverse modalities of text, code, math, and various natural languages. In this work, we consider how generative models of visual data can inherit such benefits. Whereas LLMs have text tokens, Sora has visual patches. Patches have previously been shown to be an effective representation for models of visual data. We find that patches are a highly scalable and effective representation for training generative models on diverse types of videos and images.

Starting point is 00:02:53 At a high level, we turn videos into patches by first compressing videos into a lower-dimensional latent space, and subsequently decomposing the representation into space-time patches. They then discuss their video compression network. They say this network takes raw video as input and outputs a latent representation that is compressed both temporally and spatially. Sora is trained on and subsequently generates video within this compressed latent space. Continuing, they write, given a compressed input video, we extract a sequence of space-time patches which act as transformer tokens. The scheme works for images too, since images are just videos with a single frame. And as an aside, this is something that people are noticing, that part of the way that they're making Sora

Starting point is 00:03:31 may actually influence how they think about image generation in the future as well. Now, an important section is called Scaling Transformers for Video Generation. They write, Sora is a diffusion model. Given input noisy patches and conditioning information like text prompts, it's trained to predict the original clean patches. Importantly, Sora is a diffusion transformer. Transformers have demonstrated remarkable scaling properties across a variety of domains, including language modeling, computer vision, and image generation.
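
To make the training objective described here concrete (noisy patches in, clean patches as the target), here is a minimal sketch in plain Python. It is an illustration under stated assumptions, not OpenAI's implementation: the linear blend in `add_noise`, the mean-squared-error loss, and the `model(noisy, level, prompt)` call signature are all hypothetical stand-ins, and a patch is represented as a flat list of numbers.

```python
import random


def add_noise(clean, noise_level, rng):
    """Blend a clean patch with Gaussian noise.

    noise_level is in [0, 1]: 0 returns the clean patch, 1 is pure noise.
    This linear blend is a simplification chosen for illustration.
    """
    return [(1 - noise_level) * x + noise_level * rng.gauss(0.0, 1.0)
            for x in clean]


def mse(pred, target):
    """Mean squared error between a predicted patch and the clean patch."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(target)


def training_loss(model, clean_patches, prompt, rng):
    """Average loss over patches: corrupt each patch, ask the model to
    recover the clean version given the text prompt, and score the result."""
    total = 0.0
    for patch in clean_patches:
        level = rng.random()                  # random corruption strength
        noisy = add_noise(patch, level, rng)
        pred = model(noisy, level, prompt)    # hypothetical model signature
        total += mse(pred, patch)
    return total / len(clean_patches)
```

A stand-in model that simply returns its noisy input gets a positive loss, while one that always returns the clean patch gets zero. Training a real network means adjusting its weights to push this loss down.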

Starting point is 00:03:58 In this work, we find that diffusion transformers scale effectively as video models as well. Sample quality improves markedly as training compute increases. This is another area where checking out the YouTube channel and seeing the actual images would be really valuable. They show a video of a Shiba Inu in a blue knitted hat, frolicking in the snow, and they show it at base compute, 4x compute, and 16x compute, and the upgrade is dramatic in each case. Then they get into a section that starts to articulate how they've done some things differently than others have. They write, past approaches to image and video generation typically resize, crop, or trim videos to a standard size, e.g. 4-second videos at 256x256 resolution. We find that instead, training on data at its native size provides several benefits.
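
The contrast between forcing every clip to a standard size and keeping it at its native size can be put into rough numbers. The sketch below is only arithmetic under assumptions: the patch shape of 4 frames by 32 by 32 pixels and the 16-frame clip length are hypothetical, and the counts are taken over raw pixels rather than the compressed latent space the paper describes, since OpenAI has not published those details.

```python
import math


def square_crop_kept_fraction(width, height):
    """Fraction of a frame's pixels that survive a center crop to a square."""
    side = min(width, height)
    return (side * side) / (width * height)


def patch_count(width, height, frames, patch=(4, 32, 32)):
    """Number of space-time patches needed to tile a clip.

    patch is (frames, height, width) per patch; the default is hypothetical.
    Partial patches at the edges are counted as if the clip were padded.
    """
    pt, ph, pw = patch
    return (math.ceil(frames / pt)
            * math.ceil(height / ph)
            * math.ceil(width / pw))


# A square crop of a widescreen frame throws away a large share of it.
widescreen_kept = square_crop_kept_fraction(1920, 1080)

# With patches, different shapes just become different sequence lengths.
small_square_tokens = patch_count(256, 256, frames=16)
widescreen_tokens = patch_count(1920, 1080, frames=16)
vertical_tokens = patch_count(1080, 1920, frames=16)
```

Under these assumptions, a center crop to a square keeps only 56.25 percent of a 1920 by 1080 frame, whereas a patch-based model sees the whole frame as a longer token sequence (8,160 patches for the 16-frame widescreen clip against 256 for the 256 by 256 one).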

Starting point is 00:04:42 One is sampling flexibility. They write, Sora can sample widescreen 1920 by 1080p videos, vertical 1080 by 1920 videos, and everything in between. This, they say, lets Sora create content for different devices directly at their native aspect ratios. It also lets us quickly prototype content at lower sizes before generating at full resolution, all with the same model. They also say that they, quote, empirically find that training on videos at their native aspect ratios improves composition and framing. One of the things they noticed is that when they worked with a version of their model that crops all training videos to be square, which they say is common practice, it had a tendency

Starting point is 00:05:15 to generate videos where the subject was only partially in view. Now, another piece of this training has to do with language. They write, training text-to-video generation systems requires a large amount of videos with corresponding text captions. We apply the re-captioning technique introduced in DALL·E 3 to videos. We first train a highly descriptive captioner model and then use it to produce text captions for all videos in our training set. We find that training on highly descriptive video captions improves text fidelity, as well as the overall quality of videos. Similar to DALL·E 3, we also leverage GPT to turn short user prompts into longer detailed captions that are sent to the video model.

Starting point is 00:05:48 This enables Sora to generate high-quality videos that accurately follow user prompts. Next, the research paper shows something that they actually weren't showing off on their Sora landing page. Everything on that landing page that we saw was text-to-video samples. However, Sora can also be prompted with pre-existing images or video. One of the use cases for that is, for example, animating DALL·E images, turning 2D images, in other words, into 3D animations. Sora also has the capacity to extend other videos that have been generated with AI, which, among other things, allows them to extend a video both forward and backward to produce an infinite loop. Sora has video-to-video capacity. And in one of the things that is most impressive, and has been showing up on Twitter the most,

Starting point is 00:06:27 they write, We can also use Sora to gradually interpolate between two input videos, creating seamless transitions between videos with entirely different subjects and scene compositions. The example they give is a drone flying through what looks like Roman ruins that turns into a butterfly, which is then underwater with the ruins growing coral. This is another area where seeing what I'm describing is going to be much more useful,

Starting point is 00:06:49 but I think part of why people are picking up on it is that this makes it really easy to imagine how AI could be used to do some things that would be very hard to replicate without significant Hollywood budgets. Being able to take two unlike things and transition from one to the other opens up just a world of creative possibilities. Finally, they conclude with a really interesting section on emerging simulation capabilities. OpenAI writes, we find that video models exhibit a number of interesting emergent capabilities when trained at scale. These capabilities enable Sora to simulate some aspects of people, animals, and environments from the physical world. These properties emerge without any explicit inductive biases for 3D, objects, etc.

Starting point is 00:07:25 They are purely phenomena of scale. The examples they give are 3D consistency. Sora, they write, can generate videos with dynamic camera motion. As the camera shifts and rotates, people and scene elements move consistently through three-dimensional space. Another example is long-range coherence and object permanence. They write, A significant challenge for video generation systems has been maintaining temporal consistency

Starting point is 00:07:46 when sampling long videos. We find that Sora is often, though not always, able to effectively model both short and long-range dependencies. For example, our model can persist people, animals, and objects even when they are occluded or leave the frame. Likewise, it can generate multiple shots of the same character in a single sample, maintaining their appearance throughout the video. A third emergent capability is interacting with the world. They write, Sora can sometimes simulate actions that affect the state of the world in simple ways. For example, a painter can leave new strokes along a canvas that persist over time, or a man can eat a burger and leave bite marks.

Starting point is 00:08:16 Finally, a fourth emerging capability is simulating digital worlds. They write, Sora can simultaneously control the player in Minecraft with a basic policy while also rendering the world and its dynamics in high fidelity. These capabilities can be elicited zero-shot by prompting Sora with captions mentioning Minecraft. Combined, they say, these capabilities suggest that continued scaling of video models is a promising path towards the development of highly capable simulators of the physical and digital world and the objects, animals, and people that live within them. Now let's go to how some others

Starting point is 00:08:46 were trying to simplify some of these concepts. Dan Shipper over at Every wrote a piece called Sora and the Future of Filmmaking. In a section called How Sora Uses Massive Amounts of Data to Make Mind-Blowing Video Clips, he tries to break down this idea of patches. He says, imagine a film print of The Dark Knight. You unroll the film from its spool

Starting point is 00:09:03 and chop off the first 100 frames of cellophane. You take each frame, here the Joker laughing maniacally, there Batman grimacing, and perform the following strange ritual. You use an X-Acto knife to cut a gash in the shape of an amoeba in the first one. You excise this cellophane amoeba with tweezers as carefully as a watchmaker and put it in a safe place.

Starting point is 00:09:19 Then you move on to the next frame. You cut the same amoeba-shaped hole from the same part of the next cellophane frame. You remove this new amoeba, shaped precisely the same as the last one, with tweezers and stack it carefully on top of the first one. Keep going until you've done this to all 100 frames. You now have a multicolored amoeba extruding through its y-axis, a tower of cellophane that could be run through a projector to show a small area of The Dark Knight, as if someone stuck their hand in a loose fist in front of the projector, letting only a little bit of the movie through. This tower is then compressed and turned into what's called a patch, a smear of color changing through time. The patch is the basic unit of Sora in the same way that the token is the basic

Starting point is 00:09:52 unit of GPT-4. Tokens are bits of words, while patches are bits of movies. GPT-4 has been trained to take in a sequence of tokens and output the next token in the sequence. Sora has been trained to do the same thing. It takes in a sequence of patches and outputs the next patch in the sequence. Patches are innovative and Sora appears to be so powerful because they allow OpenAI to train Sora on an immense amount of image and video data. Imagine patches cut out of every video in existence, infinite towers of cellophane, stacked and fed into the model. Previous text-to-image approaches required images and videos used in training to all be the same size, which required significant pre-processing to cut videos down to size.
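
The cellophane-tower picture maps onto a short piece of code: cut the same small region out of several consecutive frames and stack the pieces into one unit. The sketch below does that for a toy clip stored as nested Python lists. It is a simplified illustration, not Sora's pipeline: it uses rectangular regions rather than amoebas, works on raw pixel values rather than a compressed latent, and the names `extract_spacetime_patches` and `make_clip` and the patch sizes are hypothetical.

```python
def extract_spacetime_patches(video, pt, ph, pw):
    """Split video[t][y][x] into flattened space-time patches.

    Each patch covers pt consecutive frames and a ph-by-pw region of each
    frame. Assumes every dimension divides evenly by the patch size.
    """
    frames, height, width = len(video), len(video[0]), len(video[0][0])
    if frames % pt or height % ph or width % pw:
        raise ValueError("clip dimensions must be multiples of the patch size")

    patches = []
    for t0 in range(0, frames, pt):
        for y0 in range(0, height, ph):
            for x0 in range(0, width, pw):
                # Same region, consecutive frames: one small "tower".
                patch = [video[t][y][x]
                         for t in range(t0, t0 + pt)
                         for y in range(y0, y0 + ph)
                         for x in range(x0, x0 + pw)]
                patches.append(patch)
    return patches


def make_clip(frames, height, width):
    """Toy clip whose pixel value encodes its (t, y, x) position."""
    return [[[t * 10000 + y * 100 + x for x in range(width)]
             for y in range(height)]
            for t in range(frames)]


# Clips of different shapes go through the same function unchanged;
# they just produce sequences of different lengths.
wide_patches = extract_spacetime_patches(make_clip(4, 4, 8), 2, 2, 2)
square_patches = extract_spacetime_patches(make_clip(2, 4, 4), 2, 2, 2)
```

Here the wide clip yields 16 patches and the small square clip yields 4, each patch holding 8 values. That is the property the passage points to: nothing has to be cropped to a common size before it is turned into a sequence.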

Starting point is 00:10:26 But because Sora trains on patches instead of the full frame of the video, it can gobble up any video or image without requiring it to be cut down. As a result, more data can be used for training for a higher quality output. He also says, Another big advance with Sora is the architecture. Traditionally, text-to-video models like Runway are diffusion models, while text models like GPT-4 are transformers. Sora is a diffusion transformer, a mashup of the two.

Starting point is 00:10:48 Instead of predicting the next piece of text in a sequence, Sora predicts the next patch in a sequence of patches. By using this architecture, OpenAI can throw much more data and compute at training Sora, and the results are stunning. If you haven't been over to every.to, Dan writes some great stuff about AI, and I highly suggest checking it out. Now, of the people who are talking about the limits of Sora, the most interesting ones are those who are actually zooming out to how they might use it

Starting point is 00:11:11 when it becomes available. Martin Nebelong writes, Systems like OpenAI's new Sora video model just show us that anyone can create a pretty video or a beautiful picture using nothing but a text prompt. But if we are to use the technology to create interesting art, thought-provoking cinema, and great games, we need control of the output down to a granular level

Starting point is 00:11:27 if that's what our project calls for. We need better tools and better interfaces to interact with the generative matter. Martin is calling for things like video inpainting, more precise control over specific frames, specific texture and movement, all of which lots of people are thinking about, and so I anticipate those coming to bear as well. But still, everything that I've seen for the last couple of days reinforces for me

Starting point is 00:11:46 that we really are now once again living in a post-X world, and X this time is not ChatGPT, but Sora. Will it have as big of an impact? It's hard to say. Using text is more universal than using images, which is more universal than using videos. And so perhaps it won't ultimately be as influential. But I'm not so sure.

Starting point is 00:12:04 In the same way that Midjourney and DALL·E 3 are democratizing how people create images and who actually makes them, I'll be interested to see how Sora and other tools like it, as they catch up, do the same for video. Fascinating times, but that is going to do it for this AI Breakdown. Hope you're having a great weekend. Until next time, peace.
