The AI Daily Brief: Artificial Intelligence News and Analysis - OpenAI's Stunning New Video Model SORA Shocks the AI World

Starting point is 00:00:00 Today on the AI Breakdown, we are looking at OpenAI's new Sora video generation model, which not only blows everything we've seen so far out of the water, but which many are already calling the GPT-4 moment for AI video generation. The AI Breakdown is a daily podcast and video about the most important news and discussions in AI. Go to breakdown.network for more information about our YouTube, our Discord, and our newsletter. Hello, friends. Quick note before we get into the show: Sora was such an exciting thing to discuss that I went a little bit longer than I normally do, and I decided to just make this the focus of the entire show.

Starting point is 00:00:36 We will be back to our normal format of Briefs and Mains next week. But for now, let's just talk some Sora. Welcome back to the AI Breakdown. Usually I try to make sure that any content I'm doing works for both YouTube and for the podcast. However, if you are just listening to this, I highly suggest you also check out the YouTube version. Inherently this is a conversation that is better had with the visual reference points of all of the videos that we're going to be talking about, starting with this one that's on the screen now. It's a famous set of video generations of Will Smith eating noodles that looks something like if Midjourney version 1 or version 2 came to life. The movement is weird, the faces are frankly disturbing, but this is where we were just a year ago.

Starting point is 00:01:20 Now OpenAI has released Sora. OpenAI COO Brad Lightcap wrote, this is one of those things you tell yourself is coming and you think you are ready for it and couldn't possibly be surprised by it, but then you see it and don't quite believe it, and you're not sure why you didn't think you'd be surprised. So what is Sora? Sora is a video generation model. OpenAI writes, it's an AI model that can create realistic and imaginative scenes from text instructions. What it really is is easily the most advanced and impressive AI video generation model we've ever seen, and it's not particularly close. Unlike video generators like Pika and Runway right now, which work in four or five or six-second segments,

Starting point is 00:02:00 Sora can generate videos of up to 60 seconds. The quality, just based on anyone's visual reference, also blows pretty much everything we've seen out of the water. The level of detail, the sheer physics involved in these videos, makes them nearly indistinguishable from something captured by film or created in all of the old-fashioned ways. The demo site has about 30 examples, and they run the gamut from people, to urban settings, to landscape visuals, to animal scenes, to cartoon-style generations, to historic footage of California during the Gold Rush. Now, unfortunately for all of us who instantly want to get

Starting point is 00:02:35 our hands on this, this is a rare instance of OpenAI announcing something that isn't broadly accessible. They write, today, Sora is becoming available to red teamers to assess critical areas for harms or risks. We're also granting access to a number of visual artists, designers, and filmmakers to gain feedback on how to advance the model to be most helpful for creative professionals. We're sharing our research progress early to start working with and getting feedback from people outside of OpenAI and to give the public a sense of what AI capabilities are on the horizon. Now, it's not just the length that makes Sora different. As OpenAI writes, Sora is able to generate complex scenes with multiple characters,

Starting point is 00:03:12 specific types of motion, and accurate details of the subject and background. The model understands not only what the user has asked for in the prompt, but how those things exist in the physical world. Part of this comes from, as they put it, the model's deep understanding of language, which helps it accurately interpret prompts and, quote, generate compelling characters that express vibrant emotions.

Starting point is 00:03:32 Now, Dr. Jim Fan from NVIDIA wrote about how much more there is likely to be than even meets the eye. He writes, If you think OpenAI Sora is a creative toy like DALL-E, think again. Sora is a data-driven physics engine. It is a simulation of many worlds, real or fantastical.

Starting point is 00:03:49 The simulator learns intricate rendering, intuitive physics, long-horizon reasoning, and semantic grounding, all by some denoising and gradient maths. I wouldn't be surprised if Sora is trained on lots of synthetic data using Unreal Engine 5. He then breaks down a video of two pirate ships in a cup of coffee for which the prompt was, photorealistic close-up video of two pirate ships battling each other as they sail inside a cup of coffee. Jim explains what's going on behind the scenes, writing, The simulator instantiates two exquisite 3D assets, pirate ships with different decorations. Sora has to solve text-to-3D implicitly in its latent space. 3D objects are consistently animated as they sail and avoid each other's paths.

Starting point is 00:04:27 Fluid dynamics of the coffee, even the foams that form around the ships. Fluid simulation is an entire subfield of computer graphics, which traditionally requires very complex algorithms and equations. Photorealism, almost like rendering with ray tracing. The simulator takes into account the small size of the cup compared to oceans and applies tilt-shift photography to give a minuscule vibe. The semantics of the scene does not exist in the real world, but the engine still implements the correct physical rules that we expect.

Starting point is 00:04:52 Now, if Dr. Jim is explaining what's going on, the vast majority of people who are tweeting just have their jaw on the floor. Andrew Cotei writes, Once again, OpenAI seems to be one to two years ahead of everybody else. Sully Omar says, this is genuinely the first time I've audibly said what the F for an AI video. 100% AI generated. The face, the bread, the pores, everything about this feels real.

Starting point is 00:05:14 McKay Wrigley writes, I don't even know what to say. These clips generated by OpenAI's Sora model have me speechless. We knew good AI text-to-video would come, but this quickly? Unreal. We're stepping into a new world. Buckle up. Now, fascinatingly, this comes on the heels, of course, of yesterday morning's announcement of Google's Gemini 1.5, with its 1 million token context window. I even said at the end of yesterday's video that I'd be interested to see how OpenAI responded

Starting point is 00:05:42 to that. Well, clearly, this wasn't just some reaction to Gemini's announcement, but boy, just when you thought that Google was catching up to OpenAI from a public perception standpoint, they go and drop Sora and just absolutely rip the conversation back into their orbit. Indeed, the simplest way to describe how most people feel about this comes from Siqi Chen, who says, OpenAI just casually dropping the GPT-4 of video generation on a Thursday morning. Going back to Sora's announcement post, they do say that this model has weaknesses. They suggest it may struggle with accurately simulating the physics of a complex scene, and that it may not understand specific instances of cause and effect.

Starting point is 00:06:17 The example they give is a person taking a bite out of a cookie, and afterward the cookie not having a bite mark. They also say that the model might confuse details, such as mixing up left and right, or struggling with, quote, precise descriptions of events that take place over time, like following a specific camera trajectory. Now, when it comes to how this model works, OpenAI writes, Sora is a diffusion model, which generates a video by starting off with one that looks like static noise and gradually transforms it by removing the noise over many steps. Sora is capable of generating entire videos all at once, or extending generated videos to make them longer. By giving the model foresight of many frames at a time, we've solved a challenging problem of making sure a subject stays the same, even when it goes out of view temporarily.
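The "start from static and remove the noise over many steps" idea can be sketched in a few lines of Python. This is only a toy under loud assumptions, not OpenAI's method: the "video" is a short list of numbers, and `toy_denoiser` is a hypothetical stand-in that already knows the answer, where a real diffusion model would use a trained network to predict it from the noisy input and the text prompt.

```python
import random


def toy_denoiser(noisy, target):
    """Hypothetical stand-in for a trained network.

    A real model would predict the clean sample from the noisy input
    and the prompt; this toy simply returns the known target.
    """
    return target


def denoise_step(noisy, predicted_clean, step, total_steps):
    """Move the sample part of the way toward the predicted clean sample."""
    # The blend weight grows each step and reaches 1.0 on the final step.
    weight = 1.0 / (total_steps - step)
    return [x + weight * (c - x) for x, c in zip(noisy, predicted_clean)]


def generate(target, total_steps=10, seed=0):
    """Start from pure noise and refine it step by step; return every stage."""
    rng = random.Random(seed)
    sample = [rng.gauss(0.0, 1.0) for _ in target]  # "static noise"
    history = [sample]
    for step in range(total_steps):
        guess = toy_denoiser(sample, target)
        sample = denoise_step(sample, guess, step, total_steps)
        history.append(sample)
    return history


def error(sample, target):
    """Total absolute distance between a sample and the target."""
    return sum(abs(a - b) for a, b in zip(sample, target))


if __name__ == "__main__":
    target = [0.0, 0.5, 1.0]
    for i, stage in enumerate(generate(target)):
        print(i, round(error(stage, target), 4))
```

Running it prints the distance to the target shrinking at every step, which is the only point of the sketch: generation here is repeated refinement of noise rather than a single pass.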

Starting point is 00:06:55 Similar to GPT models, Sora uses a transformer architecture, unlocking superior scaling performance. We represent videos and images as collections of smaller units of data called patches, each of which is akin to a token in GPT. By unifying how we represent data, we can train diffusion transformers on a wider range of visual data than was possible before. They also note that, in addition to being able to generate video just from text, the model can also take an existing still image and generate a video from that. As with everything they do, they make it clear that this is not just an endpoint, but is a milestone for a bigger objective. They conclude their page about Sora saying, Sora serves as a foundation for models that

Starting point is 00:07:29 can understand and simulate the real world, a capability we believe will be an important milestone for achieving AGI. Now, some people, even if they were incredibly impressed by the quality of the videos, did call out the challenge of having AI-generated video that's this convincing. Avichal from Electric Capital writes, Deepfakes are going to be a big problem in 2024. How do we mitigate the risk of deepfakes? One, content should have a verifiable cryptographic signature. Two, publish the signature to an openly readable, difficult to corrupt database. AI handshake crypto.
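The two steps in that tweet, sign the content and publish the signature somewhere hard to corrupt, can be sketched roughly as follows. Everything here is an illustrative assumption rather than anything the tweet specifies: `sign_content`, `PublicLog`, and `verify_content` are hypothetical names, the "database" is an in-memory hash-chained list, and HMAC-SHA256 stands in for a real public-key signature (such as Ed25519) only because it ships with Python's standard library. With HMAC the verifier needs the secret key, which a real public scheme would avoid.

```python
import hashlib
import hmac
import json

GENESIS = "0" * 64  # placeholder "previous hash" for the first log entry


def sign_content(content: bytes, secret_key: bytes) -> str:
    """Return an HMAC-SHA256 tag; a stand-in for a public-key signature."""
    return hmac.new(secret_key, content, hashlib.sha256).hexdigest()


def _hash_entry(body: dict) -> str:
    """Hash an entry body in a stable (sorted-key) JSON encoding."""
    return hashlib.sha256(json.dumps(body, sort_keys=True).encode()).hexdigest()


class PublicLog:
    """Toy append-only log: each entry commits to the one before it."""

    def __init__(self):
        self.entries = []

    def publish(self, publisher: str, signature: str) -> None:
        prev_hash = self.entries[-1]["entry_hash"] if self.entries else GENESIS
        body = {"publisher": publisher, "signature": signature,
                "prev_hash": prev_hash}
        self.entries.append({**body, "entry_hash": _hash_entry(body)})

    def is_intact(self) -> bool:
        """Recompute the hash chain; any edited entry breaks it."""
        prev_hash = GENESIS
        for entry in self.entries:
            body = {k: entry[k] for k in ("publisher", "signature", "prev_hash")}
            if entry["prev_hash"] != prev_hash:
                return False
            if entry["entry_hash"] != _hash_entry(body):
                return False
            prev_hash = entry["entry_hash"]
        return True

    def contains(self, signature: str) -> bool:
        return any(hmac.compare_digest(e["signature"], signature)
                   for e in self.entries)


def verify_content(content: bytes, secret_key: bytes, log: PublicLog) -> bool:
    """Content checks out if the log is intact and holds its signature."""
    return log.is_intact() and log.contains(sign_content(content, secret_key))


if __name__ == "__main__":
    key = b"demo-key"  # hypothetical key, for illustration only
    log = PublicLog()
    log.publish("newsroom", sign_content(b"original video bytes", key))
    print(verify_content(b"original video bytes", key, log))
    print(verify_content(b"tampered video bytes", key, log))
```

The hash chain is what makes the toy log "difficult to corrupt": changing any earlier entry changes its hash and breaks the link to every entry after it.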

Starting point is 00:08:00 Now, lest you think it's just the crypto people saying this, the Biden White House seems to be thinking in similar ways when it comes to verifying all official White House communications. Matthew Kobach writes, I just watched the OpenAI videos and then I scrolled Instagram for a minute. And now everything seems like it could be AI. This changes our perception of video forever. Now, on that safety front, OpenAI writes that they're working right now with red teamers, especially those who are domain experts in areas like misinformation, hateful content, and bias

Starting point is 00:08:25 to adversarially test the model. In terms of other safety initiatives, they also say that they're building tools like a detection classifier that can detect when a video was generated by Sora, and they reiterate their position that the only way to actually figure out how AI is going to work in the world is by trying it. Despite extensive research and testing, they say, we cannot predict all the beneficial ways people will use our technology, nor all the ways people will abuse it. That's why we believe that learning from real-world use is a critical component of creating and releasing increasingly safe AI systems over time.

Starting point is 00:08:54 Now there is another dimension of this which I wanted to discuss, which is the marketing dimension. I jokingly said that the most shocking part of Sora was that it seems like they actually had the marketing team involved in the name. No more GPT-4 or DALL-E 3 or anything like that. Instead, we get this pretty-to-say four-letter word that actually feels like a brand. Even beyond that, though, VC Rachel Horwitz pointed out the difference between the way that the CEO of Google announced Gemini 1.5 and how Sam Altman was introducing Sora. The way she summed it up was scrappy digital native founder casually interacting with people on the web to show off mind-blowing new tech over slick, overly produced marketing videos and PR-approved

Starting point is 00:09:29 statements. What she's referring to is that just after the announcement, Sam Altman also tweeted, we'd like to show you what Sora can do. Please reply with captions for videos you'd like to see and we'll start making some. Don't hold back on the detail or difficulty. So far on his thread he's published, a wizard wearing a pointed hat and a blue robe with white stars casting a spell that shoots lightning from his hand and holding an old tome in his other hand. A half duck, half dragon flies through a beautiful sunset with a hamster dressed in adventure gear on its back. A street-level tour through a futuristic city which in harmony with nature and also simultaneously cyberpunk and high tech. The city should be clean with advanced futuristic trams,

Starting point is 00:10:04 beautiful fountains, giant holograms everywhere, and robots all over. A futuristic drone race at sunset on the planet Mars. Two golden retrievers podcasting on top of a mountain. An instructional cooking session for homemade gnocchi, hosted by a grandmother social media influencer, set in a rustic Tuscan country kitchen with cinematic lighting. And it just goes on and on, each in some way more impressive than the last. Now, if you're wondering what Runway and Pika are thinking about this right now, all we've seen so far is the CEO of Runway, Cristobal Valenzuela, saying, game on, which is of course exactly what we want to hear. Meanwhile, Prati, the creator of Sink, took the image of the cyberpunk-looking woman walking through Tokyo

Starting point is 00:10:42 and added a clip of dialogue from The Matrix with matching lip syncing, in this way showing how these AI tools will interplay with one another. In other words, he built off this AI generation that came from OpenAI and added his own technology to turn it into a totally different visual scene. Now, one of the big byproducts of having a product with this potential level of impact is that people are zooming out and talking about the future. ST on Twitter writes, there are two schools of thought here. One, in five years, entertainment networks, they're not really

Starting point is 00:11:12 social networks, don't need creators. They just churn out an endless stream of AI-generated content with a bunch of AI-generated ads linking you to AI-generated product pages, keeping you amused till the end of time. Two, in five years, this leads to an explosion in creator-generated content. The baseline quality of content goes up. We get Hollywood-level production with more niche, more extreme storylines. The new creators are the prompters. Now, obviously none of us know exactly how this is going to go. However, I can share my observation that if you look at the communities that have already started to arise around Runway and around Pika, what you're seeing is people who are taking advantage of a new medium to tell classic stories, but just in a

Starting point is 00:11:50 different way. There is a huge difference between the really great AI-generated stuff and the more boring, not-so-good AI-generated stuff. And when it comes to story, much of that isn't just about the quality of the technology, but about narrative and about how people put disparate elements together. When we think about film or TV, it's not just a collection of sets. It's a collection of very specific decisions about script, about character, about narrative arc, and even visually about the way that one shot interplays with the next. Yes, of course, AI will be able to do and replicate much of that, but it seems likely to me that there's going to be a huge chasm between the best AI-generated entertainment and the worst. I tend to agree that the people who are going to be best at

Starting point is 00:12:33 creating that AI-generated entertainment are likely not exclusively the people who are creating content already. I almost think that the inevitable reality is that both of these two models actually come to bear. The idea that entertainment slash social networks churn out an endless stream of AI-generated content that's perfectly personalized to the people who are consuming it, there's no universe in which that's not going to happen, at least in some way. And yet at the same time, I also believe that there's no universe in which the other thing, the explosion in creator-generated content, doesn't happen as well. I think just in the same way that there is a huge range of highbrow to lowbrow to different format content that we consume on any given day, so too will it be like that when we just have

Starting point is 00:13:13 more content because of how AI has changed our ability to create it. There was a thought coming into 2024 that perhaps this was going to be the year for AI video in the same way that 2022 was the real breakout year for AI-generated images. Between the end of 2022 and where we are now, tools like Midjourney and DALL-E 3 have made enormous, unbelievable strides. There is more control, more adherence to prompts, more ability to get exactly what you want out of them. And yet, until we saw Sora today, I think it was still hard for many to imagine what the sort of phase shift moment was going to be that made it feel like that level of advance. Now, of course, comparing those Will Smith eating noodles videos from a year ago to these incredibly detailed hyper-realistic renderings from Sora now is truly a reminder of how fast things are moving and what we have to expect in the future.

Starting point is 00:14:04 Ben Tossell from Ben's Bites got at that when he tweeted, Assume token lengths will be billions. Assume videos will be indistinguishable. Assume images will be indistinguishable. Assume every sentence is by AI fully or in part. Assume day-to-day tasks can be 90% automated. What mad stuff will blow us away then? You can, if you look hard enough, find people who are cynics about this, or who try to tell you that it's not impressive or not a good use of

Starting point is 00:14:30 AI, but in spite of their squawking, I think that the Sora moment will be one we look back on as a fundamental before and after break. It feels very much like just the beginning of something, with all the anxiety and excitement that that brings. Still, perhaps the best summary of yesterday, between Gemini 1.5's million-token context length and Sora's unbelievable advances in AI video generation, comes from swyx, who writes, it's been a fantastic year of AI progress in the past hour. And so, friends, that is where we will wrap the AI

Starting point is 00:15:00 Breakdown. I can't wait to get my hands on Sora and see what we can create. And I will be eagerly awaiting all of the creations of people who do have it in the meantime. Appreciate you listening or watching as always. And until next time, peace.
