Latent Space: The AI Engineer Podcast - Generative Video WorldSim, Diffusion, Vision, Reinforcement Learning and Robotics — ICML 2024 Part 1

Episode Date: December 10, 2024

Regular tickets are now sold out for Latent Space LIVE! at NeurIPS! We have just announced our last speaker and newest track, friend of the pod Nathan Lambert who will be recapping 2024 in Reasoning Models like o1! We opened up a handful of late bird tickets for those who are deciding now; use code DISCORDGANG if you need it. See you in Vancouver!

We’ve been sitting on our ICML recordings for a while (from today’s first-ever SOLO guest cohost, Brittany Walker), and in light of Sora Turbo’s launch (blogpost, tutorials) today, we figured it would be a good time to drop part one which had been gearing up to be a deep dive into the state of generative video worldsim, with a seamless transition to vision (the opposite modality), and finally robots (their ultimate application).

Sora, Genie, and the field of Generative Video World Simulators

Bill Peebles, author of Diffusion Transformers, gave his most recent Sora talk at ICML, which begins our episode:
* William (Bill) Peebles - SORA (slides)

Something that is often asked about Sora is how much inductive biases were introduced to achieve these results. Bill references the same principles brought by Hyung Won Chung from the o1 team - “sooner or later those biases come back to bite you”.

We also recommend these reads from throughout 2024 on Sora.
* Lilian Weng’s literature review of Video Diffusion Models
* Sora API leak
* Estimates of 100k-700k H100s needed to serve Sora (not Turbo)
* Artist guides on using Sora for professional storytelling

Google DeepMind had a remarkably strong presence at ICML on Video Generation Models, winning TWO Best Paper awards for:
* Genie: Generative Interactive Environments (covered in oral, poster, and workshop)
* VideoPoet: A Large Language Model for Zero-Shot Video Generation (see website)

We end this part by taking in Tali Dekel’s talk on The Future of Video Generation: Beyond Data and Scale.

Part 2: Generative Modeling and Diffusion

Since 2023, Sander Dieleman’s perspectives (blogpost, tweet) on diffusion as “spectral autoregression in the frequency domain” while working on Imagen and Veo have caught the public imagination, so we highlight his talk:
* Wading through the noise: an intuitive look at diffusion models

Then we go to Ben Poole for his talk on Inferring 3D Structure with 2D Priors, including his work on NeRFs and DreamFusion.

Then we investigate two flow matching papers - one from the Flow Matching co-authors - Ricky T. Q. Chen (FAIR, Meta). And how it is implemented in Stable Diffusion 3 with Scaling Rectified Flow Transformers for High-Resolution Image Synthesis.

Our last hit on Diffusion is a couple of oral presentations on speech, which we leave you to explore via our audio podcast
* NaturalSpeech 3: Zero-Shot Speech Synthesis with Factorized Codec and Diffusion Models
* Speech Self-Supervised Learning Using Diffusion Model Synthetic Data

Part 3: Vision

The ICML Test of Time winner was DeCAF, which Trevor Darrell notably called “the OG vision foundation model”.

Lucas Beyer’s talk on “Vision in the age of LLMs - a data-centric perspective” was also well received online, and he talked about his journey from Vision Transformers to PaliGemma.

We give special honorable mention to MLLM-as-a-Judge: Assessing Multimodal LLM-as-a-Judge with Vision-Language Benchmark.

Part 4: Reinforcement Learning and Robotics

We segue vision into robotics with the help of Ashley Edwards, whose work on both the Gato and the Genie teams at Deepmind is summarized in Learning actions, policies, rewards, and environments from videos alone.

Brittany highlighted two poster session papers:
* Behavior Generation with Latent Actions
* We also recommend Lerrel Pinto’s On Building General-Purpose Robots
* PIVOT: Iterative Visual Prompting Elicits Actionable Knowledge for VLMs

However we must give the lion’s share of space to Chelsea Finn, now founder of Physical Intelligence, who gave FOUR talks on
* "What robots have taught me about machine learning"
* developing robot generalists
* robots that adapt autonomously
* how to give feedback to your language model
* special mention to PI colleague Sergey Levine on Robotic Foundation Models

We end the podcast with a position paper that links generative environments and RL/robotics: Automatic Environment Shaping is the Next Frontier in RL.

Timestamps
* [00:00:00] Intros
* [00:02:43] Sora - Bill Peebles
* [00:44:52] Genie: Generative Interactive Environments
* [01:00:17] Genie interview
* [01:12:33] VideoPoet: A Large Language Model for Zero-Shot Video Generation
* [01:30:51] VideoPoet interview - Dan Kondratyuk
* [01:42:00] Tali Dekel - The Future of Video Generation: Beyond Data and Scale.
* [02:27:07] Sander Dieleman - Wading through the noise: an intuitive look at diffusion models
* [03:06:20] Ben Poole - Inferring 3D Structure with 2D Priors
* [03:30:30] Ricky Chen - Flow Matching
* [04:00:03] Patrick Esser - Stable Diffusion 3
* [04:14:30] NaturalSpeech 3: Zero-Shot Speech Synthesis with Factorized Codec and Diffusion Models
* [04:27:00] Speech Self-Supervised Learning Using Diffusion Model Synthetic Data
* [04:39:00] ICML Test of Time winner: DeCAF
* [05:03:40] Lucas Beyer: “Vision in the age of LLMs - a data-centric perspective”
* [05:42:00] Ashley Edwards: Learning actions, policies, rewards, and environments from videos alone.
* [06:03:30] Behavior Generation with Latent Actions interview
* [06:09:52] Chelsea Finn: "What robots have taught me about machine learning"
* [06:56:00] Position: Automatic Environment Shaping is the Next Frontier in RL

This is a public episode. If you'd like to discuss this with other subscribers or get access to bonus episodes, visit www.latent.space/subscribe

Transcript
Starting point is 00:00:00 Welcome to the Latent Space coverage of ICML 2024. This is Charlie, your AI co-host. We know it's been a few months since ICML actually happened, but now that all the talks are available online, and we are in final preparations for NeurIPS 2024, we figured this was a good time to release our conference recap to get you in the mood. As a side note, regular tickets are now sold out for Latent Space Live at NeurIPS,
Starting point is 00:00:26 where we have announced our dream speakers to recap the best of 2024 across the top-voted domains in vision, open models, post-transformers, synthetic data, small models, agents, GPU scaling, and a special 2024 in AI keynote from our friend and fellow podcaster Sarah Guo of Conviction Capital. Today, we are announcing our very last speaker and newest track, friend of the pod, Nathan Lambert, who will be recapping 2024 in reasoning models like OpenAI's o1. See you in Vancouver. Coming back to ICML, this is a very special episode in more ways than one because it is the very first episode not hosted by swyx or Alessio.
Starting point is 00:01:13 We are continuing to experiment with guest hosts, adding different opinions and voices to the show, and in this case, to cover conferences we personally weren't physically able to attend. So we're very grateful to our friend Brittany Walker of CRV for stepping in as your guest co-host for ICML 24. Our goal with these conference recaps is to give you an audio experience of what it's like to be there
Starting point is 00:01:38 and to provide a filtered recommendation of papers and backstories of authors that will be useful for the AI engineer today and tomorrow. Brittany worked enormously hard to put together the poster chats you will hear, and we're very grateful. Given that OpenAI has launched Sora Turbo today, we have bumped up our planned second episode to release first, since generative video happened to be a huge focus at ICML. Let's not bury the lede and go straight into the Sora talk from Bill Peebles, first author of the Diffusion Transformers paper and research scientist leading Sora model development.
Starting point is 00:02:16 Since we're talking about video models, you may wish to tap into the show notes for direct links to the public talks. However, we think there is still value in editing the audio for eyes-free browsing. We believe this is the most recent public academic discussion of SORA before the SORA Turbo public release today. So we hope this episode is valuable background for anyone getting up to speed on video diffusion. Watch out and take care. I'm Bill and thanks a lot to Joanna for organizing this conference. Really excited to be giving a talk here. So I'm going to be talking about SORA today. So this is video generation models as world simulators. This was joint work with my good friend Tim Brooks and also some other wonderful colleagues at OpenAI.
Starting point is 00:03:02 So let's dive right in. So Sora is OpenAI's first video generation model. And in advance, I'm sorry for any kind of like FPS delay with screen sharing videos. It's always like the hardest part about working with videos is like showing results to other people over the internet. But this is a sample from Sora. And the text prompt is a stylish woman walks down a Tokyo street filled with warm glowing neon, and you can see the rest of it.
Starting point is 00:03:26 SORA is capable of generating 1080p video up to a minute long. And what's remarkable about SORA is kind of all of the simple things that we take for granted about the visual world. It really begins to pick up on when you train on video data at scale. So if you see that blue sign in the background, even when there's a shot change and it's occluded, it's maintained. And we see this very consistently for a large number of samples from SORA. So it really has a good understanding, not only, for example, of how light interacts within
Starting point is 00:03:55 scenes in complicated ways, but object permanence and lots of other capabilities that have been very difficult for video generation models to grok in the past. So of course, it can do more than just photorealistic style. So this prompt is a gorgeously rendered papercraft world of a coral reef, rife with colorful fish and sea creatures. So Sora, again, is capable of generating in non-photorealistic styles.
Starting point is 00:04:18 It can also do a number of scene transitions. So we didn't stitch these video samples together. This is all one continuous output from Sora. It's capable of figuring out that if you want a scene with like a variety of sea life, you know, maybe there should be a shot of seahorses, turtles, etc. And it's also capable of modeling complex scenes. So this prompt is beautiful snowy Tokyo City is bustling. And so there's a large number of people in the scene.
Starting point is 00:04:41 You can see the camera is flying through. And while it's doing that, it's able to have, you know, interactions between people like this couple is holding hands. There are people selling goods at the stalls. There's sakura petals flying through the air. So Sora has really begun to pick up on the intricacies of how scenes should look and do a great job at rendering them. One final example here is a movie trailer featuring the adventures of the 30-year-old spaceman. So what's cool about this is Sora kind of zero-shot learns that you should have character consistency throughout a number of scene transitions.
Starting point is 00:05:15 So, you know, it knows movie trailers do not normally like change the leading actor halfway through. And so the man is the same across these different environments and different scenes. And all of this is just learned automatically by training on video data at scale. So now I want to go into a few technical details about SORA. A lot of the inspiration for SORA came from language models. And in particular, this notion of a unified representation of text data. So, you know, one of the key ingredients to the success of LLMs over the years has been this idea that, you know, you could take stories, you can take code, you can take math.
Starting point is 00:05:53 But at the end of the day, all this information is represented with a unified vocabulary being tokens, which makes it very easy to train on data at scale. This imbues language models with very generalist capabilities. It makes them polymaths at like a number of tasks. Now, we were really thinking like what the analog of this would be for visual data. And, you know, in particular, you know, there's no shortage of very diverse sources of visual data in the world. You know, there's vertical video, there's squared images out there.
Starting point is 00:06:23 You have like every kind of data of different durations, of different resolutions, of different aspect ratios. And the question is, how can you train on all of that in a unified representation so we don't have to throw away any visual data? And so this is really one of like the key ingredients for the success of Sora is coming up with this unified notion of kind of a visual representation on which we can train on, you know, internet-scale visual data. And so in order to accomplish this, we use a VAE, kind of inspired by latent diffusion models from Robin Rombach. And what we do with this is encode all this information into one unified latent space. So the idea here is on the far left, you know, we have like a video of a butterfly swimming underwater. We go through this visual encoder, and this will compress videos both spatially and temporally into a single sequence of data. And at the end of the day, we do this,
Starting point is 00:07:21 of course, so we can train transformers on this sequence of data. We train diffusion transformers at scale. And the benefit of this is we get a number of just great properties of scaling transformers up specifically for video and image data. So, you know, the name of the game here is like, how does visual quality improve as you throw more flops at the problem? And we find that improves like pretty steadily, which is great. So on the far left here, you know, we have a base compute trained SORA model. So this is trained with a small amount of compute. And you can see it gets like some details right. So for example, it kind of has some idea of like if a camera is moving through a scene, there should be some notion of consistency. But all the textures are wrong and it's not high
Starting point is 00:08:00 fidelity. If you 4x the amount of training compute you pump into that model, it begins to figure out kind of like, you know, what dogs look like, what humans look like. But the visuals are still not great. And if you really crank up the amount of flops you're pouring into these things to 32x, you begin to see that it gets a lot of these fine-grained details right: the interaction of like the owner's hand with the dog, all of like the snowy textures on the ground. And so we're finding that these models scale extremely effectively if you kind of nail the basics, right? So in particular, if you can create this, you know, setup where you have this unified representation of visual data and crank up diffusion transformers,
Starting point is 00:08:41 they can really start to learn to do amazing things. Another cool property of SORA is how generalist it is at test time. So, you know, when you actually want to sample content, you can do it at any aspect ratio and resolution. And this is really great from the perspective of kind of like controllable generation, specifically as it relates to, you know, different devices. So now if I'm like watching a movie on my iPhone and then I transition to watching it on my laptop, those are going to use two totally different aspect ratios.
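To make the unified-representation idea concrete, here is a minimal sketch of how clips with different durations, resolutions, and aspect ratios all reduce to one flat token sequence of varying length. The compression factors and patch size below (`t_down`, `s_down`, `patch`) are hypothetical values picked for illustration; the talk does not state the actual ones, and the function name is also an assumption.

```python
from math import ceil


def latent_token_count(frames, height, width, t_down=4, s_down=8, patch=2):
    """Count the spacetime-patch tokens for one clip.

    t_down, s_down and patch are hypothetical illustration values,
    not figures given in the talk.
    """
    # The encoder compresses the clip temporally and spatially.
    lat_t = ceil(frames / t_down)
    lat_h = ceil(height / s_down)
    lat_w = ceil(width / s_down)
    # The latent grid is cut into patches; each patch becomes one token.
    return lat_t * ceil(lat_h / patch) * ceil(lat_w / patch)


# (frames, height, width) for three differently shaped inputs.
clips = {
    "widescreen video": (64, 1080, 1920),
    "vertical video": (64, 1920, 1080),
    "square image": (1, 1024, 1024),
}
for name, (f, h, w) in clips.items():
    print(name, latent_token_count(f, h, w))
```

Under these assumptions nothing is cropped or padded: a still image is simply a one-frame clip, and a vertical video yields the same number of tokens as its widescreen counterpart, just arranged on a different grid.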
Starting point is 00:09:12 And normally, you either have to just pad with black bars or crop it. But with models like SORA, it's now possible to generate content natively for any device, which is pretty exciting to think about the possibilities of how that can affect content creation in the future. So the sea turtle here is just rendered out with different aspect ratios. Another exciting aspect of this very generalist training recipe is we can kind of move on from the days of just like cropping data for training generative models. So you know, back when I was like in grad school, I was always like spending time, you know, cropping to like 256 by 256 resolution to
Starting point is 00:09:48 train like whatever version of StyleGAN I was working with. And while that works well, it has certain downsides. So, you know, there are certain biases actually within data. For example, the photographer's bias of centering objects. And so on the left here, we have a baseline Sora model where we don't train with native size video and image data. Instead, we actually do this like hard cropping to center. And you can see that the model essentially inherits some weaknesses of this cropping strategy, right? Sometimes like the scuba diver is going to be off center, which isn't actually ideal framing. If you do this native size training, it's actually much more effective at composing scenes. So you inherit some nice benefits of the training data in the model by just, you know, not throwing away
Starting point is 00:10:32 pixels and training on everything you have. So Sora is also an image generation model. So the prompt here is digital art of a young tiger under an apple tree in a matte painting style with gorgeous details. Here's another sample. We find that Sora in particular really excels at photorealistic kinds of content.
Starting point is 00:10:53 So there's a lot of details kind of in the woman's face here, which it does a great job at rendering out. This is at 2K by 2K resolution. And of course, we can interact with Sora in other ways beyond just text. So all of the results before were text to video or text to image samples. But Sora can also accept visual inputs as conditioning. And so here we were seeding it with an image from DALL·E 3 and then having Sora extend this out in time. So Sora is capable of kind of understanding, you know, what's going on in an image and then extrapolating from there.
Starting point is 00:11:27 And so we had a lot of fun with this. So these are DALL·E 2 samples on the left here. And so Sora can take video conditioning or image conditioning at any temporal index. So here we condition the model in the middle of the sequence with the Shiba Inu, and then we extend it both backwards and forwards in time from that position. And you can see it's able to animate the dog's face. Of course, it can also do kind of more fun animated styles here. We've been using this to make emojis internally.
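One generic way to express conditioning at an arbitrary temporal index is a per-frame mask: known frames are held fixed and every other frame is left for the model to generate. The sketch below only builds that bookkeeping. The talk does not say how Sora implements conditioning, so the function name `build_conditioning` and the file names are hypothetical placeholders.

```python
def build_conditioning(num_frames, known):
    """Return (frames, mask) for a clip in which some frames are given.

    known maps a frame index to a placeholder frame value.
    mask[i] is True where the frame still has to be generated.
    This is a generic inpainting-style setup, not Sora's actual method.
    """
    for idx in known:
        if not 0 <= idx < num_frames:
            raise IndexError(f"frame index {idx} out of range")
    # Known frames are copied in; unknown frames are left as None.
    frames = [known.get(i) for i in range(num_frames)]
    mask = [i not in known for i in range(num_frames)]
    return frames, mask


# One image in the middle: frames before and after it are generated.
frames, mask = build_conditioning(9, {4: "shiba_inu.png"})

# Only the final frame is fixed: everything leading up to it is generated.
end_frames, end_mask = build_conditioning(9, {8: "logo.png"})
```

Framed this way, extending forwards, extending backwards, and filling in around a middle frame are the same operation with a different mask.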
Starting point is 00:12:02 We have this nice Sora Slack emoji now. And another cool thing with Sora is its ability to extend backwards in time. So of course, whether you're doing like temporal outpainting, like forward or backwards in time, it's all kind of like the same to these models. And so here we have the model end in the same way, which is with this San Francisco logo, but all of the events leading up to it are resampled by the model. So it's very flexible in how you can use this to edit or extend videos. Another cool aspect of Sora is its zero-shot editing capabilities.
Starting point is 00:12:44 So there's been a ton of great work from the academic community over the years on finding creative ways to use diffusion models to do, for example, image editing tasks. So one really nice work in that area is SDEdit. And we find that techniques like this, of course, just work right out of the box with Sora because it's a diffusion model at the end of the day. So these are SDEdit results. So the top left is the source video. This particular source video was generated with Sora,
Starting point is 00:13:13 but of course, it doesn't have to be. It could also be a real video. And you can use a variety of different text prompts to re-render this scene automatically. So for example, on the top right, we can rewrite the video in a pixel art style. And as you would expect with SDEdit, if you kind of use the right noise level,
Starting point is 00:13:30 you can get it to maintain most of the structure in the scene and just only update the style, which is cool. So towards the end of this video, you can see that there's like a cave that the car in the top left goes into. And across all of these like re-rendered styles, you see that it preserves like some notion of like a cave or like an overhang that the car goes through. Another thing that's cool is Sora is sort of smart about figuring out whether or not, you know, certain correlations make sense. So for example, in the bottom right, you can say, change the video to a medieval theme. Sora knows there weren't cars in the medieval time.
Starting point is 00:14:01 So instead you get a red horse carriage. So it's kind of fun to see where Sora takes liberties in re-rendering your video. Another cool capability that Sora has is blending between videos. So the far left and far right videos here define the endpoints of this interpolation. And the middle video is Sora's imagining of how you connect the dots. And so you can see you get these kind of fantastic creatures in this case where you can never quite see where it goes from being a chameleon to a bird.
Starting point is 00:14:31 It happens very seamlessly. And you can use this for all kinds of scenes. They don't even have to be like particularly related. So on the far left here, we have a drone flying through the Colosseum, and the far right is the butterfly flying underwater. And you can see that Sora is able to come up with a pretty reasonable interpolation between these two videos. So you gradually see the Colosseum decay and move underwater. And at some points, the drone morphs into the butterfly very subtly because it kind of like has put these two things into correspondence automatically and infers that like this is like a reasonable thing that it should focus on blending between.
Starting point is 00:15:17 And here's an example of blending two scenes with totally different styles. So the far left is like a photorealistic aerial drone shot. And then the far right video is this kind of nice like gingerbread village. And it comes up with a really creative way to make this work. So rather than kind of morph the whole style of the scene
Starting point is 00:15:34 in one shot, it decides that maybe this like gingerbread village is kind of hidden off to the side of this photorealistic town and it zooms in. So one other technique that we use for Sora is this notion of video recaptioning. And this is a technique that was actually pioneered by DALL·E 3 by some other folks at OpenAI. And the high level idea is during training, diffusion models and really all generative models benefit from having a much cleaner source of conditioning than we've historically given them in the past. So, you know, there's like very crude text captions out there, like alt text,
Starting point is 00:16:19 for example, which doesn't actually contain a lot of information about the scene. They're like very coarse keywords, for example. Sometimes the content's like pretty unrelated actually to what's in your image or video, etc. And one of the key breakthroughs in DALL·E 3 was generating synthetic captions that are much more detailed and contain much more mutual information with the content that you actually want to generate. And so what we saw with DALL·E 3, this is an example figure from DALL·E 3, is that this really improved the controllability of the model
Starting point is 00:16:51 and enabled you to create much more intricate scenes with a lot more ease than in the past. And so we took inspiration with Sora to also apply this technique to video. And one of the features of this is that at test time, when you're actually interacting with the model, rather than just directly kind of upload a prompt to Sora, we'll actually use GPT under the hood to essentially upsample a user's base prompt
Starting point is 00:17:23 into a much more detailed video description. And so this figure here is the system prompt that we used for DALL·E 3 in order to do this upsampling. It's actually pretty involved to get this to work well. And so there's a lot of prompt engineering, even at OpenAI, to get these systems to be reliable. But under the hood, this is what we're doing to achieve some of these kind of like
Starting point is 00:17:46 finer grained control of SORA. So the last topic I want to talk about is this notion of like emerging simulation capabilities. And this is really the aspect that we are most excited about with SORA looking forward. You know, we often get asked the question, you know, at OpenAI, you know, how does video generation really relate to the core mission
Starting point is 00:18:09 of AGI. And on the Sora team, we're actually really passionate about this being a model for world simulation moving forward. And so what exactly does that mean? Like, how do we actually use these models long term to do interesting tasks and to really extract intelligence out of the world? And what we believe is when we really scale up video generation models, they're going to get so good at simulating such a variety of complex scenes with, you know, different agents in them, that it's going to need to ultimately learn an underlying model of how people interact, of how people do tasks, of how people think, if it's truly generating, you know, high fidelity content.
Starting point is 00:18:49 Like at some point, right, if the conversation I'm having at a dinner table within SORA is not realistic, that means it's failed to do its job of, like, accurately learning, like, the distribution of, like, human behavior. And so as we approach the limits, you know, of achieving, like, the irreducible loss there, we think, like, pretty amazing things are going to emerge from these models. and it's going to play a really key role in developing more intelligent systems in the future. So SORA is obviously not there today, but we already see some cool phenomena by training on video data at scale that we just want to highlight. And we think this list is only going to grow in the future
Starting point is 00:19:21 as Sora continues to scale up. So the first one I'll talk about is 3D consistency. And so this is pretty clear from a lot of the samples thus far. But even when you have these very dynamic scenes with a lot of people moving in them and the camera being non-stationary, you can see that a large number of elements in the scene really do move with like what appears to be like accurate geometry. And so this is achieved without any kind of, you know, hard coded inductive biases for 3D within the model. It's all learned jointly end to end as part of, you know, large scale diffusion training. It was really important to us when we were doing this project that whatever solution we came to for video generation was scalable and could just absorb a lot
Starting point is 00:20:01 of flops. And one way to do that, right, is to really strip out these inductive biases that in the past have sometimes been useful for achieving certain kinds of behaviors at low scale, but it's not clear, when you really crank up the training compute, if they'll help or hinder you. And so we find that it's totally fine to not have these kinds of inductive biases as long as you're training at scale. Here's another sample. This one's kind of fun. So it's an aerial view of Yosemite showing both hikers as well as a gorgeous waterfall. The hikers do some very extreme hiking right here. I would not recommend trying this at home. And Ben Mildenhall, who used to be at Google, he took some Sora samples when we released them, and then he trained
Starting point is 00:20:53 a NeRF on them. And in his words, it NeRFs. So this is another kind of nice sanity check that the underlying geometry that Sora is learning, for some scenes, not certainly all yet, is actually pretty accurate. And so it's cool to see that this, again, just emerges automatically without inductive bias. So the next capability I want to talk about is this idea of long-range coherence. So this is one of my favorite samples. This is the Bling Zoo shop in New York City. It's both a jewelry store and zoo, saber-toothed tigers with diamond and gold adornments, turtles with glistening emerald shells, etc. And so again, this is all one continuous shot from Sora. We didn't stitch it together. And what's cool about this is even when you have these sort of
Starting point is 00:21:31 scene transitions, Sora kind of automatically, you know, figures out like the vibe of what you're going for. So in this case, you get this coherence of like, you know, the environment you're in. You see this like outdoor component at like the start of the scene and like it gradually like moves indoors. But it all creates this kind of coherent narrative, which is awesome that you don't have to like, you know, manually stitch together everything. It can kind of just like figure it out in context. Of course, you can also do long range coherence and like the notion of character consistency as we alluded to earlier. So this is the story of a robot's life in a cyberpunk setting. And you can see you get the same robot character across these different shots. So it really does understand this idea that, you know, if I have a long video with multiple cuts, I'm probably going to have some amount
Starting point is 00:22:17 of like characters that show up multiple times. It's not going to be an entirely new cast, you know, like every two seconds. And it just figures this out automatically. Object permanence is another big one. So, you know, in the past, video generation models have really struggled to keep objects in the scene under occlusions. And so, you know, this is an example sample where, even though this Dalmatian is getting occluded multiple times in the scene, Sora understands that that dog should still be there, even when the people pass. And this very simple capability that we take for granted used to be a very challenging problem for video generation systems. But again, you don't necessarily need any kind of inductive
Starting point is 00:22:57 biases specific to objects for this to emerge. You really just need kind of like the right fundamental training recipes and to scale these models up. And so one other capability we're excited about is this idea of interacting with the world and updating state. So kind of by definition, like you want a useful video generation system. At some point, it needs to be able to interact with objects in your scene and have those like interactions be meaningful. And by meaningful, I mean like they need to persist over time. So you know, in the simplest case, if I'm like drawing or painting in this case, some sakura petals, I would expect that you know, as I'm leaving brush strokes, like they actually interact with the canvas and stick
Starting point is 00:23:39 around. So we find that sometimes SORA can do this. This is probably one of like the flakiest capabilities of the model currently. But in this case, it does work. This is another example of an older man eating a burger. And at the end here, there's bite marks in the burger. So I think this is one of like the larger challenges for video generation systems moving forward is, you know, this idea that if I do something in the distant past, can the model really remember that and recall that and have it affect things in the future? So these are like kind of very simple examples of that, but there's still a long way to go and creating like, I think like really compelling examples, you know, where like a past conversation or something influences like what the system outputs,
Starting point is 00:24:23 like, you know, multiple minutes in the future, for example. And the last topic I want to cover here is this notion of digital world simulation. So, you know, when people talk about video generation models, of course, there's like a lot of excitement about this idea that, you know, we can learn the real world's physics. And I think that's extremely valuable and a very important direction. But what's cool is, you know, these systems are very general. So there's no need to like constrain ourselves to only learning about our world's physics. There's all kinds of other crazy worlds out there, like, you know, laptop operating systems or like video game consoles that Sora-like models could also learn from.
Starting point is 00:25:02 And you can have one model, which eventually is extremely generalist and is able to render out scenes in all of these different environments. And so one step towards this is Minecraft. So the prompt here is Minecraft with the most gorgeous high-res 8K texture pack ever. And this is just a straight output from Sora. It's not even particularly cherry picked, actually. It's pretty easy to get good samples here.
Starting point is 00:25:25 And you can see that Sora is able to implicitly control the player here with like an intelligible, if slightly boring policy while rendering out the full environment, rendering out NPCs like these pigs. And we think this is like a really cool, like extremely crude, you know, proof of concept of how Sora can do more than like, you know, just be used for creative purposes, right? It can really model whole environments and in the future be used to, you know, potentially extract information about like policies for... You know, implicitly this all lives somewhere in Sora's, like, activations and weights.
Starting point is 00:26:04 And it's cool to see that it kind of automatically learns these things, again, just by training on video data at scale. So here's one more sample with this prompt. So it chose a different texture pack for this one. But again, you see the same kind of things. You know, you got like a chicken and a pig. It's able to control the policies for those in addition to the character. As the character jumps around, it's able to render out this environment in pretty high fidelity. And so we're pretty excited to see, you know, all the kinds of knowledge that you can pack into this one model, not necessarily only real-world physics. Of course, Sora has a lot of issues, so it is kind of far from this ultimate goal of simulating everything.
Starting point is 00:26:44 And there are kind of fun failure cases, though. So everything about this scene is sort of messed up. So the woman looks way too happy. The hands in the background are kind of cursed. The candles are blowing in all the wrong directions. This is another one where a cup kind of spontaneously in the air and cracks in a really unrealistic way. So even pretty basic interactions like glass shattering, Sora does not really understand yet. And there's a long way to go. This is, I think, most people on the team's favorite failure case.
Starting point is 00:27:18 So the prompt is like archaeologists discovering a plastic chair. But the plastic chair is a bit sentient and starts like flying and seems somewhat possessed. So it's always fun, you know, when you have these models, where on your scaling curves, you know, they're not pushed all the way yet. It's always fun to see, you know, kind of like the correlations they don't yet understand about our world and how they take somewhat creative liberties. And this one's pretty self-explanatory for what's wrong. So Sora is currently in a research phase, and we do not have it in a product yet.
Starting point is 00:28:01 We work with red teamers and artists to really get a handle on, you know, what are the potential risks of a model like Sora, should it be deployed one day? And also, how can we make it as useful as possible for, you know, both existing kind of like artistic workflows, but also potentially entirely new ones as well. And so one quote from Shy Kids, which is a group that we gave access to Sora to: "As great as Sora is at generating things that appear real, what excites us is its ability to make things that are totally surreal." And so we really love this idea that, you know, Sora is not necessarily replacing elements of the artistic workflow,
Starting point is 00:28:46 but really enabling kind of entirely new processes that have not been possible before. And so I'll play this Shy Kids video now. I'm not sure if you guys can hear the audio. But if not, you can find this video online if you search for Shy Kids Sora. Well, they say everyone has something unique about them, something that sets them apart. Just in my case, you know, it's quite obvious what that thing is.
Starting point is 00:29:12 I am literally filled with hot air. Yeah, living like this has its challenges. Windy days, for one, are particularly troublesome. Or there was the one time my girlfriend insisted I go to the cactus store to get my Uncle Jerry a wedding present. What do I love most about my predicament?
Starting point is 00:29:32 The perspective it gives me. You know, I get to see the world differently. I float above the mundane and the ordinary. I see things a different way from everyone else. Yeah, I feel like it's because of that perspective. I'm reminded every day that life is fragile. We're all just a pinprick away from deflation. So I try to live life with a lightness, a buoyancy, joie de vivre.
Starting point is 00:29:57 I got a lot of ideas. Keep unist thankful. With any luck, I'll find a way to share them with everyone else. And so this video was made with a combination of just using direct model output and of course also more like traditional video editing workflows on top. So it's been really cool to see how artists have like embraced Sora and have begun to incorporate it. There's also been kind of some films that were at Tribeca actually, which were made in a similar way with Sora. And I've been really surprised by like the level of creativity with
Starting point is 00:30:37 kind of the current levels of capabilities that Sora has today. It's really cool to see the community like lean in and use these models. So that said, that's pretty much the end of the talk. I have a few extra samples here. But yeah, thanks again to Joanna for scheduling this. And happy to field any questions. I don't know if it's possible to communicate them over Zoom currently or not. But yeah, if not, that's about it.
Starting point is 00:31:08 So thanks a lot. Thank you, Bill, for this great talk. I think yes, we can definitely, well, if you can hear me, then we can do a Q&A. I can hear you, so I think we're good. Hello, thank you for the great talk. I wonder how far we are from, let's say, a video producer making a whole movie with zero actors. So maybe if, like, hypothetically, a video producer can upload characters, how they look like, and they can describe the scenes and tell you, oh, this character is now running away or on a bike, et cetera, with zero actors, can they actually do a full movie?
Starting point is 00:31:44 Yeah, good question. So I think there's like a technical answer and like a cultural answer. So on the technical side, I don't think that, you know, there's necessarily any blocker to really, you know, making like character consistency work over very long time horizons. That seems like just a very achievable problem in general. And so I think in the near term, it will be possible for people if they want to do that to be able to create these kind of like synthetic characters and kind of use them as they desire. Now, I don't know if people will really want to do that in the near term necessarily. We've been chatting with a lot of directors, for example. And a lot of them mention how like, you know, for kind of like very simple scenes,
Starting point is 00:32:33 how it's really convenient to use Sora's current capabilities. You know, for example, like have like a large crowd in the background, whereas maybe in the past that would be CGI driven. But you know, for like these really kind of like intricate and like meaningful like close up shots and when you're trying to develop more like an emotional connection with the audience, at least for like the near future, it seems like, you know, human actors definitely have like an edge on models like Sora today. So I suspect in the future there will be some kind of like mixing collaboration between the two. But yeah, it'll be interesting to see kind of like
Starting point is 00:33:09 when and where people choose to use totally like, you know, digital characters versus traditional actors. Hi, I have two questions. The first one would be, how big of a role did synthetic data play in the training process? So I can't answer any questions about training data, unfortunately. So yeah, sorry. Okay, then the second one would be about how much control does the user actually have over the camera angle and the trajectories? Is it just prompting or can you actually define a full trajectory or what's possible?
Starting point is 00:33:44 Good question. So currently, the only way that you can define notions like camera motion is either through text or through video conditioning, right? And so in the latter case, that would mean like if you seed generation with like a video where the camera is like already moving in the way you want and then you want to extend it out from there, you could kind of infer that via in-context learning to potentially get the right camera motion. Currently, there's not a more granular way to do camera control. I think this is something that, you know, we've definitely heard people want. And so it'll be interesting to explore alternative ways to more explicitly control those kinds of features. But right now, it would go through text primarily. Hey, so we've seen all this nice visual output. Can you share something about audio and consistency, maybe what you observed there? Or if you have any?
Starting point is 00:34:37 Yeah, this is a good question. So for Sora, we were really focused on trying to push the envelope with visual generation quality and we weren't very focused on, you know, for example, jointly generating audio. I think it's a really interesting direction to get, you know, extremely high fidelity, like joint video audio generation, but it's not something that we have with Sora currently. I think, you know, in the future, making these models more controllable and, you know, potentially like giving users all of like the modalities they want is certainly an interesting direction.
Starting point is 00:35:11 Sora can also generate images. Do you think in the future video generation models will be stronger than current text-to-image models and we will stop basically training just on images? Yeah, I think so. And part of the reason for that is, you know, there's a lot of information about the world which, you know, if you're training on like huge data sets of images, you can probably infer to some extent. But I think, you know, there are still some things that like slip through the cracks that you only
Starting point is 00:35:43 get by training on video data. So, you know, for example, the fact that like a model can really generate like an accurate fly through of a scene and like really understand occlusions, my guess is like that actually helps like image generation capabilities and like understanding how, you know, some fingers on a hand may be occluded by an object. But that doesn't mean that like humans often only have like two fingers. It just means that, you know, there's like a physical interaction here that you didn't necessarily grok by only training on image data or you didn't grok it efficiently. And you get these concepts much more either data or compute efficiently from co-training on video. So yeah, I suspect in the future video generation models will generally supersede image generation.
Starting point is 00:36:28 Is that already the case for Sora or not yet? So currently we haven't productized any of Sora's capabilities, including image generation. So today, you know, if you go to ChatGPT or something, it's using DALL-E 3 under the hood to do text-to-image. Thank you. I was wondering if you can tell us a little bit about the size of the model or, for example, what is the maximum length in time that it can generate or resolution or amount of pixels, something similar? Good question. Unfortunately, can't comment on that. Sorry.
Starting point is 00:37:05 And even something closer to the order of magnitude? So, for example, can we expect a user having a good model in the future or it will be something that only big clusters and big companies can afford? Yeah, that's a good question. I mean, I think I wouldn't be particularly surprised, you know, if like the evolution of video generation models ends up looking a lot like the evolution of language models. So, you know, there will be a variety of models of different like capability levels and sizes, pretty similar to the ecosystem we have now. There's like open source models, which at least historically have tended to be somewhat weaker
Starting point is 00:37:45 compared to like these larger scale closed-source models. But I'm curious to see how the whole ecosystem evolves, just a guess. All right. Thank you. Hi, thanks for the talk. I'm curious about how you are thinking about more sophisticated control like bullet time or in zone. Yeah, good question.
Starting point is 00:38:14 I think one thing we've heard from chatting with like directors and artists is, you know, they have like a very particular language that they use to describe certain kinds of like shots and like camera motion. And Sora out of the box is not so good at speaking that language. So a lot of kind of what we think about in terms of like improving users' interactions with this model is, you know, kind of like thinking about, you know, can we train the model to like use that same language? So it to some extent is kind of like a captioning problem, but I think the jury's still out there on, you know, the best
Starting point is 00:38:53 way to like make these models controllable and is like is that only through text or are there other kinds of inputs I think it's a very interesting space currently and we're still just kind of at the start of exploring it. Thank you. One more short question is there was interaction between character and hamburger kind of things. Is there any way to make more interaction from the user like hitting the hamburger so that hamburger will be squeezed so that, yeah, I think you get it.
Starting point is 00:39:29 Yeah, yeah, that'd be cool. I don't know if Sora today can do that kind of more complicated interaction. I think there's no fundamental reason why that shouldn't be possible and why further scaling up these models shouldn't be able to achieve that kind of capability. It's just, you know, even this kind of like biting a hamburger thing, this sort of phenomenon was something that took a while within like, you know, the research process of Sora to like emerge and seemed to require like at least a decent bit of compute for it to be like a non-trivial interaction. So I'm curious, like, you know, what level of scale you might need to fully smash a hamburger and have it be, like, physically accurate. I think, you know, we'll definitely get there eventually. It's just you never know where on the scaling curve you have to be for these capabilities to start popping up. Thank you.
Starting point is 00:40:23 Hey, great presentation. I was wondering, you've shown this nice Minecraft example where there actually was some kind of agentic behavior a little bit that was shown. Do you think Sora could actually help enable, as a world simulator, for instance, or as a foundation model, agents that interact with the real world, like make it better than what's there right now in like robotics, for instance? Yeah, definitely. You know, I don't know if like the current model is robust enough to be reliably useful to improve like real world policies.
Starting point is 00:40:55 But I think it's inevitable that one day these models will power these kinds of systems. There's just so much information about the world that you learn by training on large-scale video data that it seems inevitable that that knowledge should transfer to the real world at some point. Hi, I have a question about inductive bias. Have you ever tried to train the model with some specific inductive bias such as the physics or any rules in the video? No, we haven't. So from the onset of the Sora project, we were really focused on kind of training just like pure visual generative models of data with as few inductive biases as possible and really just ensuring that the foundation was solid for scaling. That was kind of like the core thesis of the project. So we haven't
Starting point is 00:41:48 explored incorporating inductive biases. I suspect, you know, for like certain kinds of like narrower use cases, it's possible that you can get some kind of win from doing that. And, you know, potentially if your model needs to be really small, but you don't need it to be extremely generalist, then potentially that could be a win. But we're really just trying to scale up the largest, most generalist model possible. And so to that end, we generally assume they'll be harmful at some point, which is why we haven't explored them too much. Thank you, Bill. Let's thank the speaker again. Thank you so much for the fantastic talk. In their original blog post, OpenAI describes Sora as a world simulator, and we have been
Starting point is 00:42:30 tracking the start of the summer of simulative AI. Google DeepMind is not resting, of course, having announced their Veo model at Google I/O this year with Donald Glover's endorsement. However, the focus at ICML is on Genie, short for Generative Interactive Environments, which is an 11 billion parameter foundation world model trained on unlabeled internet videos to generate action-controllable virtual worlds described through text, synthetic images, photographs, and even sketches. It is comprised of a spatiotemporal video tokenizer, an auto-regressive dynamics model,
Starting point is 00:43:08 and a simple and scalable latent action model. Genie enables users to act in the generated environments on a frame-by-frame basis, despite training without any ground-truth action labels or other domain-specific requirements typically found in the world model literature. More recently, DeepMind announced SIMA, their Scalable Instructable Multiworld Agent, and Genie 2, which extends
Starting point is 00:43:35 Genie 1 from generating 2D worlds, going into 3D worlds. Genie 2 is a world model, meaning it can simulate virtual worlds, including the consequences of taking any action, for example, jump, swim, etc. It was trained on a large-scale video data set and, like other generative models, demonstrates various emergent capabilities at scale, such as object interactions, water effects, directional lighting, reflections, complex character animation, physics, and the ability to model and thus predict the behavior of other agents. In particular, Genie 2 has long horizon memory, meaning it is capable of remembering parts
Starting point is 00:44:20 of the world that are no longer in view and then rendering them accurately when they become observable again, ensuring it generates new plausible content on the fly and maintains a consistent world for up to a minute. Finally, Genie's learned latent action space facilitates training agents to imitate behaviors from unseen videos, opening the path for training generalist agents of the future, which we will explore in the last part of this pod. But first, here is Google DeepMind's oral presentation of Genie.
Starting point is 00:44:52 Hello, everyone. Good morning. Thank you for coming. I'm Jack, and alongside Ashley, I'm super excited to be presenting our paper, Generative Interactive Environments, otherwise known as Genie. Genie was an amazing collaborative effort from this wonderful group of people at Google DeepMind.
Starting point is 00:45:12 Our long-term goal is to train embodied agents that can safely perform complex tasks with long horizon consequences in the real world. It's safe to say there's been amazing progress in our field in the past few years, but this still feels pretty far away. So what's missing? Fortunately, there's another pretty cool paper at ICML,
Starting point is 00:45:32 which has a nice way of thinking about this. In particular, they factorise agent capabilities by breadth and performance. And if you see in the bottom right cell, that's kind of what we want, general superhuman intelligence. Now, the good news is we've made some good progress on this grid, so we already have what they call emerging AGI, thanks to progress in foundation models. We also have superhuman agents in narrow domains.
Starting point is 00:45:57 In the case of AlphaGo and AlphaZero, these agents have already been used to augment human intelligence. Note, however, this is made possible by access to the simulator for the game of Go, which we don't have for general intelligence. So the main claim that we make in this work is to get to the bottom right corner, the key missing ingredient is a more general environment. So the main motivation for us is how can we possibly get this more general environment. In parallel, it's pretty clear that video generation is having a bit of a moment,
Starting point is 00:46:33 with a pretty packed-out room for this set of orals. And this area of research is now front and center of progress in AI. What's pretty incredible, though, is that by making use of large video data sets, these models are increasingly understanding the physical world in ways they couldn't before. And so as a result, many people are starting
Starting point is 00:46:51 to believe that these video models could actually be accurate world simulators. However, we believe that while these models may have world knowledge, they're not world models. Indeed, many of the text-to-video models that we currently see are only controllable at a very high level via text captions. You prompt the model and you receive a video clip. It may be consistent and beautiful,
Starting point is 00:47:14 but it's not therefore a world model because you can't take sequential actions in the environment to learn new behaviours. The challenge here is that we don't have data with action labels, and so the largest world model settings that we have are limited by action-labeled datasets, meaning we can only model existing environments, therefore not being able to generate new ones.
Starting point is 00:47:34 It doesn't seem to make sense to use tons of compute to just recreate existing games that we already have. So this is the problem we're working on solving in this project. We want to make use of the vast amount of unlabeled videos on the internet, and we want to train an action-controllable video model, otherwise known as a world model. So to summarize, the goal of Genie is to learn what we call a generative interactive environment, purely from videos that's playable by both humans and AI agents. How do we do this? Well, the main idea is to use a latent action space, learned in a fully unsupervised manner. Intuitively, these correspond to
Starting point is 00:48:09 clustering potential outcomes from a given frame of video. I'm now going to hand over to Ashley to talk about how we do this and show some of our results. Thanks, Jack. So the Genie model consists of three main components and we're typically training over a sequence of around 16 frames. So the first component is a video tokenizer, which takes in patches from that entire sequence and discretizes them into video tokens. The next is a latent action model, which is going to take in consecutive frames and discretize and compress them into what we call latent actions. And the main purpose of our latent action model is to try to encode what's going to change between our scenes,
Starting point is 00:48:51 such that given those latent actions, along with our prior frames, we can use them for predicting our next frames. And this is what's going to be important for controllability. And the final component of our model is going to be our dynamics model, which takes in our tokenized frames, along with our latent actions, and predicts next frame tokens. So in order to actually interact with our model during inference time, we can take an initial prompt frame,
Starting point is 00:49:18 take some actions, and then generate the next frame, plug that back into the model, and continue this. And importantly, because we're learning discrete latent actions, we can actually just plug in integer values to interact in this way. And we found that it was really important, in order to actually evaluate our model, to actually step into it and basically play it ourselves, because it's one thing to measure the sort of quantitative performance of our model using
Starting point is 00:49:45 approaches like FVD, but it's another thing to actually model controllability and be able to actually evaluate that. So actually interacting with it ourselves was very important. So Genie was trained on a data set that consisted of around 300,000 hours of video game footage, consisting mainly of 2D platformer games, but we did find it was important to filter this down to 30,000 hours to get a more high quality data set. So before training our main model on this, we ran a few scaling analysis experiments
Starting point is 00:50:21 that showed us that it was important to scale both model size and batch size. And once we did this, we found that we came up with our final 11 billion parameter Genie model with a batch size of 512. So now let's take a look at some of our results. And so I want to point out that all of the results are going to be showing out of distribution examples.
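As a rough illustration of the interaction loop described above (start from a prompt frame, choose an integer latent action, predict the next frame, feed it back in, repeat), here is a minimal Python sketch. The names toy_dynamics and rollout are hypothetical, and the toy dynamics function only rotates a list of tokens; it is a stand-in for a learned dynamics model, not Genie's actual implementation.

```python
from typing import Callable, List, Sequence

Frame = List[int]  # a frame represented as a flat list of discrete video tokens


def toy_dynamics(history: Sequence[Frame], action: int) -> Frame:
    """Hypothetical stand-in for a learned dynamics model.

    It rotates the most recent frame's tokens by `action` positions, so
    different integer actions visibly produce different next frames.
    """
    last = history[-1]
    shift = action % len(last)
    return last[shift:] + last[:shift]


def rollout(
    prompt: Frame,
    actions: Sequence[int],
    dynamics: Callable[[Sequence[Frame], int], Frame] = toy_dynamics,
) -> List[Frame]:
    """Autoregressive loop: each predicted frame is appended to the history
    and becomes part of the context for the next prediction."""
    frames = [list(prompt)]
    for action in actions:
        frames.append(dynamics(frames, action))
    return frames


if __name__ == "__main__":
    prompt_frame = [1, 2, 3, 4]
    # Same prompt, two different integer action sequences.
    print(rollout(prompt_frame, [1, 1]))
    print(rollout(prompt_frame, [2, 0]))
```

The point of the sketch is only the control flow: the user supplies plain integers, and the same prompt frame leads to different trajectories depending on which integers are chosen.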
Starting point is 00:50:42 So some of them might be, for example, coming from text-generated images. So this first video here is going to show some of the environments that Genie is capable of creating. And so I think the exciting thing here is it's showing that we can essentially step into our environments. This is showing real human interaction within these generated environments and take actions
Starting point is 00:51:07 and change the world that we're experiencing. And it's important to point out that this is one of the main differences between what you see in Genie versus video generation models. It's that interaction that we're able to actually take actions within our environments. And this is why we can consider Genie to be a foundational world model because we can take these actions within our generations. So this is showing another example.
Starting point is 00:51:32 So given the same initial prompt frame, what we can do is we can take a different series of latent actions and plug them into our model. And what you see is that we're going to get very different and diverse trajectories. Again, this is showing human interaction with the model. And again, this is because we've learned this latent action space in an unsupervised manner. The other important thing is going to be consistency. It's one thing to be able to generate a diverse set of trajectories, but if you're having
Starting point is 00:51:59 to figure out what your latent actions mean every time you have a new image, it's not really that useful. So we also wanted to measure how consistent our latent actions are. So given four different initial prompt images, we could plug in the same sequence of latent actions. And what you can see is that there are very similar trajectories and behaviors essentially happening across these different environments, which is telling us that indeed our latent action space, at least in these environments, is consistent. And again, just to point out, we were able to learn these latent actions without using any ground truth action labels or doing object detection or object segmentation or any sort of domain-specific information. So one sort of exciting and fun thing we found within the project was that we could actually plug in sketches, even though we were only training on 2D platformer games.
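To make the idea of discrete, consistent latent actions more concrete, here is a small Python sketch. It assumes, purely for illustration, that the change between two consecutive frames is summarized as a short feature vector and then snapped to the nearest entry of a small codebook (vector quantization); the codebook values, the 2D "position" features, and the function names are all hypothetical and are not taken from the talk.

```python
from typing import List, Sequence

Vector = Sequence[float]

# Hypothetical codebook of four latent actions; in a trained model these
# entries would be learned without labels, not written by hand.
CODEBOOK: List[List[float]] = [
    [0.0, 0.0],   # 0: roughly "no change"
    [1.0, 0.0],   # 1: roughly "move right"
    [-1.0, 0.0],  # 2: roughly "move left"
    [0.0, 1.0],   # 3: roughly "move up"
]


def quantize(vec: Vector, codebook: Sequence[Vector] = CODEBOOK) -> int:
    """Return the index of the codebook entry closest to `vec`
    (squared Euclidean distance)."""
    def sq_dist(a: Vector, b: Vector) -> float:
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return min(range(len(codebook)), key=lambda i: sq_dist(vec, codebook[i]))


def latent_action(prev_frame: Vector, next_frame: Vector) -> int:
    """Encode what changed between two frames as one integer action."""
    delta = [b - a for a, b in zip(prev_frame, next_frame)]
    return quantize(delta)


if __name__ == "__main__":
    # Two different starting points, similar change: same integer code.
    print(latent_action([0.0, 0.0], [0.9, 0.1]))
    print(latent_action([5.0, 5.0], [5.8, 5.0]))
```

Because the code depends only on the change between frames, not on where the frame started, the same integer tends to mean the same behavior across different prompts, which is the kind of consistency the speakers describe measuring.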
Starting point is 00:52:46 So, for example, on the left here, we see a sketch done by Richie from our team. In the middle, that's one of Jeff Clune's children made that, and on the right, I did that one, but don't judge me too harshly. And we can basically plug these images into our model and, again, create these environments. So we can see, for example, we're able to climb this ladder that Richie basically sketched down. And so I think it was in this moment when we really started to see the kind of creativity that Genie could enable. And so we also plugged in real world images, which is again very out of distribution from what the model was trained on. But for example, that's Jack's dog Doris on the left here.
Starting point is 00:53:23 And we can again sort of generate these environments and I guess interact with them even though we didn't train on anything that looked like this. Genie also works on real world data. So we trained a smaller model with 2 billion parameters on a robotics data set. And we again see that if we take different prompt images but plug in the same sequence of latent actions, we're getting similar behaviors, which is again evidence that the latent actions are consistent. We're also able to simulate deformable objects with this model here. And finally, while we haven't yet shown that we can train agents within the Genie model, we do show in the paper that we can take the latent actions learned from videos on the internet
Starting point is 00:54:08 and use those for labeling unseen videos, which allows agents to actually imitate from these. So this indicates that Genie can be used for training our generalist agents of the future. And speaking of the future, I'll pass it back to Jack to talk about future directions. Awesome. Thank you so much, Ashley. Right back to me. We want to emphasize that what we've shown here is that this is even possible. Before we started this project, the idea of training an action-controllable world model from videos seemed a bit like a pipe dream. And so as a result, this is the worst that Genie's ever going to be.
Starting point is 00:54:43 We're expecting to see rapid progress from here, which we think can have a huge impact in a variety of areas. So going back to our original motivation, we think that Genie presents a clear path to generating unlimited environments for training agents. And so for a more formal write-up of how we see this could fit into a framework towards getting to more general intelligence, come check out our position paper on Thursday in the oral session. Not only that, but as Ashley mentioned, we noticed something pretty magical happening while playing our model: it enabled a new form of creativity as people such as Jeff Clune's children, as previously mentioned,
Starting point is 00:55:17 were able to draw their own worlds and step in and play. And we think this is barely scratching the surface of what could be possible with this new form of generative AI. Okay, so to address the elephant in the room, so for those in academic institutions, thinking it's just another industry paper that used tons of compute that you can't possibly work on. Fear not, we've got something for you as well.
Starting point is 00:55:37 So in the paper, we have a case study where we show you can train your own much smaller Genie model on a mid-range TPU in just under a week. With this approach, you should be able to see some pretty consistent latent actions given different initial prompts in the CoinRun environment. As an example here, you see different actions from this model that we did train in a few days. And we're excited to see that this isn't just a wild goose chase. We have actually got some students that have been able to reproduce this. So come along to the controllable video generation workshop on Saturday to see their poster.
Starting point is 00:56:07 And finally, if 12 minutes wasn't enough for you, fear not. We've got a few other things going on this week. So we've got a couple of position papers. We've got the poster straight after this talk. And then we've also got some longer talks in workshops later in the week. And then many others in the team are here as well who would love to chat to you all. So yeah, that's a wrap. Thank you for your time. Thanks for showing this amazing work. So I have a question about technical details. So in Genie, you have two training phases, right?
Starting point is 00:56:38 So in the first phase, you're training an inverse model for the latent actions. And then in the second phase, you're training a prediction model for the video generation, right? But in the first phase, when you train the latent action model, you already have a dynamics model trained. So why is it necessary to train the second step? I'm just wondering. Yeah, that's a great question. So essentially what you're saying is why do we have a decoder for the latent action model that already predicts the next frame, and then subsequently train another one? We found that there are actually slightly different trade-offs for this decoder.
Starting point is 00:57:12 So we found, if you see in the paper, we predict in the pixel space rather than token space for the latent action model. We found that that really helped to get more controllable and consistent latent actions. And then that decoder itself is just predicting in the pixel space, so it would actually be pretty blurry if you were to use it as a generative model. Whereas we found that the MaskGIT objective wasn't best for learning latent actions. It just didn't lead to as consistent latent actions. So we have this dual approach where we have two different dynamics models, essentially, that we learn as part of the process.
Starting point is 00:57:44 But you're totally right that this isn't the most elegant solution and many of the team weren't overly ecstatic about it. But that's why we're saying this is the worst it's ever going to be. And hopefully some of you folks in the community can build a much more elegant solution in the next few months. Thank you very much. Really cool work that you guys are doing. I wanted to ask regarding the qualities that we can see on these world models. So on the videos we essentially saw some amount of like physics, so jumping and then falling down because of gravity. And then we saw platforms and saw some ladders. Do the world models ever generate other entities? Think of it like maybe enemies going back and forth that if you touch them, something happens? Or
Starting point is 00:58:23 like what other qualia have you guys observed? Yeah, that's a great question. So I would say in the sort of examples that we show, particularly out-of-distribution examples, it is very difficult for it to generate anything that's kind of exciting. We are able to move the character around, but typically you would just see it, I guess, repeating the patterns that it's seen in the background and that sort of thing. I think that's another sort of exciting direction for the future, is trying to figure out how to make it a little bit more,
Starting point is 00:58:52 the generation's a little bit more exciting and diverse. Yeah, thank you very much. Now as well, one last question. I actually wanted to ask you something. What do you think is sort of like the cool killer application that you see in the future if you could really scale this up and train this on anything? So I think there's quite a few applications and really it's subjective depending on your interests. So I think if you were, I personally think this could have impact in quite a few areas.
Starting point is 00:59:23 So you can imagine in some of the domains we use, it's already quite fun to interact with as someone in those settings, but I think it could have quite a large impact in areas such as robotics, because it's currently quite hard for robots to generalize to unseen scenarios, but you could generate a world model for any possible domain. And actually we've seen there's an open-source Genie model from 1X Robotics that works pretty well. And so I think that they obviously think so too, and they probably know more about that than me. And yeah, so I think there's a lot of potential applications, but we're just not focusing on one right now. All right, thank you very much and congratulations on the best paper.
Starting point is 01:00:05 Because the Genie team were accepting their best paper award in Vienna, we were able to catch them at their poster session live to tell a bit of the human story behind Genie. Over to you, Brittany. I am here with the Genie team. Generative Interactive Environments is the title of the poster. And I'm here with Jack. Jack, can you tell us a little bit about the origin story of the Genie project?
Starting point is 01:00:29 Sure thing. Yeah, firstly, thanks for the chance to speak to you. So basically, Genie is kind of a fusion of a few different areas of research. Myself and some others were working on open-ended learning and environment generation beforehand, and we were interested in world models and thinking about how we could scale them to internet videos. But obviously the key challenge with that is that internet videos don't have action labels. So if you want to train a model that takes actions as input to predict the future, you don't have the actions, so you can't train that way. And then on the other end of the spectrum, Ashley Edwards had been working for many years on inferring actions from videos for a different purpose, for directly training agents with behavior cloning. And so it seemed like a natural fit really to combine these ideas. And there were some pretty simple proofs of concept of people doing this at very small scale. But no one had really gone to the generative angle of getting an environment generator from a large-scale dataset. And so when we first spoke with Ashley a year and a half ago,
Starting point is 01:01:31 we were excited about the potential of combining these ideas to build something completely new. And yeah, I guess that's where we got to. Nice. And can you give a little bit of an overview, I guess, of how the work went, what results you saw, that type of thing? Sure, yeah. So we started basically working on this 2D platformers dataset.
Starting point is 01:01:51 So we have 280,000 hours of publicly available videos of 2D platform games. We found one important thing was to filter this down because a lot of the videos aren't very good quality. So we trained a classifier with hand labels that we labelled as a team on a small subset. And then we ended up with 30,000 hours of good quality videos. And then from that point, it was just a modeling problem. And so we did a lot of research on different approaches to get these latent actions. And what we ended up with is we trained in kind of a, I guess, slightly quirky way, which is that we predict pixels with a latent action decoder, and then that allows us to learn a discrete set of eight latent actions.
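As a rough illustration of what "a discrete set of eight latent actions" can mean in practice, here is a minimal sketch of nearest-neighbour quantization against an 8-entry codebook. The 2-D codebook values, the name `quantize_latent_action`, and the squared-distance rule are assumptions made for illustration; the interview does not specify how the codebook is represented or trained.

```python
import math

# Hypothetical 2-D codebook with one entry per discrete latent action.
# In a trained model these entries would be learned, not fixed on a circle.
NUM_LATENT_ACTIONS = 8
CODEBOOK = [
    (math.cos(2 * math.pi * k / NUM_LATENT_ACTIONS),
     math.sin(2 * math.pi * k / NUM_LATENT_ACTIONS))
    for k in range(NUM_LATENT_ACTIONS)
]


def quantize_latent_action(z, codebook=CODEBOOK):
    """Map a continuous latent z to the index and value of its nearest code."""
    def squared_distance(i):
        return sum((a - b) ** 2 for a, b in zip(z, codebook[i]))

    idx = min(range(len(codebook)), key=squared_distance)
    return idx, codebook[idx]


# A continuous latent inferred between two frames (made-up numbers).
idx, code = quantize_latent_action((0.9, 0.1))
print("latent action id:", idx)
```

At play time, a user would pick one of the eight ids directly instead of inferring it from a pair of frames.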
Starting point is 01:02:32 And then we separately train a dynamics model that is using MaskGIT, which is like a way of generating next frames. And we train that separately given the latent actions that are produced from the video. It just predicts the next frame conditioned on the actions. And then people were working on different things; the project was quite fast-paced and a few of us kind of switched and wore many hats, and we started all getting different results in different areas. And then roughly around last summer, so probably just under a year ago, we realized that when we combined a few of these ideas, we actually had something that worked pretty well.
Starting point is 01:03:05 And then we were really excited, obviously. So then we started working on seeing how the model scales. Because the key thing about this project is that if you can figure out how to generate worlds without action labels, then essentially there's nothing stopping you from using all of the world's videos, because there's no reason why you need to wait to do that. So we started saying, okay, how can we scale this approach and what does it do? And then we produced these plots, which you'll see in the paper, if you have time to look at that,
Starting point is 01:03:30 where we show that as you increase the model size from a few tens of millions to, in that plot, something like two billion, you just get increased performance every single time you increase the scale. And then the same thing with batch size, when you increase the number of examples the model sees. And so we realized that we had produced a scalable model. So we then decided to go for what in the end was an 11 billion parameter model. And then once we produced that, we just started all playing with it and seeing what we could do with it. And then finally, obviously, the goal of this was originally to get an environment for agents. That's how we kind of started, Ashley on the behavior cloning side and myself and others on the more auto-curriculum and open-ended learning side.
Starting point is 01:04:07 But we realized it was really fun to play with the model. And so actually, maybe the more interesting use case is how it enables new forms of creativity. And so there's some examples in the paper of things like drawings from one of our co-authors' children. They sent us photos of the drawings, and then we were able to prompt the model with those photos and play and move the characters in the photos around. And that's pretty cool, right, because you're enabling people to create their own worlds,
Starting point is 01:04:32 step into them and interact with them, which was not really what we first thought of when we started the project, but I think it tells you that if you do kind of ambitious, somewhat crazy stuff, then maybe new things will emerge. So that was pretty fun. There's also a picture of my dog there too, so that's another example I like. Nice. What would you see in terms of, I guess,
Starting point is 01:04:51 more near-term potential applications for this. A lot of the folks who listen to the podcast are kind of on the AI engineer builder side of things. Any creative ideas there? Honestly, this is going to sound a bit like a bit of a non-answer, but there's so many applications. So you can obviously see the ones in the paper. We have examples where we show generating 2D platformer-like
Starting point is 01:05:15 kind of short game experiences. But you also can see the model works on some robotics data too. And arguably that latter use case is maybe more promising in the short term. There's actually already been an open-source Genie model released on 1X's GitHub repo as part of their World Model Challenge. And so I think they're more expert in robotics than I am. But the fact that they think it's potentially a good direction for robotics probably speaks volumes. I think there's other use cases too. So things like maybe driving.
Starting point is 01:05:49 If you could generate scenarios for testing or even training autonomous vehicles, and then be able to interact in the world in any custom situation, that could be very valuable. But on our side, we're mostly just pursuing the fundamental research and not really focusing on one specific application. Yeah. And I imagine the world has already continued to move forward from a research perspective since you guys put this out there.
Starting point is 01:06:18 Is this a direction that you see yourselves continuing to pursue, or what have you been excited about lately? Yeah, so this work, I mean, ICML, the deadline is January, right? So it's already six months or so ago that we submitted this. And yeah, I guess most of the team are still working on it. Do you see coming out with a Genie V2 or a V3? Yeah, I can't speak exactly about specific releases, but hopefully we'll have something new at some point.
Starting point is 01:06:48 Has there been anything else that's come out in the research landscape that you feel has either reinforced kind of what you've worked on or contradicted it, on the flip side? How do you see it evolving? Yeah, it's a great question. So definitely reinforced. I think just after Genie came out, there's been a flurry of really amazing video generation results. So the first one was clearly Sora. I think they definitely took the space by the scruff of the neck and really pushed capabilities quite significantly. And that was really exciting to see. It's quite a different style of model in that theirs is text to video, so it generates entire clips, whereas Genie is like frame-by-frame level control.
Starting point is 01:07:27 But nonetheless, it does show you that with additional scale and, I guess, brilliant execution, you can get much more high-quality videos generated already than I probably thought was possible. And then since then, I think the floodgates have kind of opened in the space. So from our own colleagues, the Veo model came out. It was announced at I/O and was really impressive as well. And then other competitors, I guess, have done similar things. So it's a really exciting time for that space. I think there's not really been anything that's action-controllable like Genie.
Starting point is 01:08:00 But, yeah, it's definitely an exciting time for video generation. So I think it's a good space to be getting involved in now. Yeah, given how fast the community is moving, I think in a few years' time we'll have something pretty incredible. I just spoke with your VideoPoet colleagues as well. How do you see this work dovetailing with that work, or how do you kind of work together, I guess, on the future of what video looks like, or world models? So I think that they may be more interested in more cinematic video experiences and generating
Starting point is 01:08:33 entire clips, whereas Genie still remains quite fundamentally different because it's only generating one frame at a time, and it's kind of a video model but it's also kind of like an autoregressive image generation model in a sense. So it kind of sits in its own, like, it's kind of a new area. I guess a lot of researchers always claim that they're inventing a new area, but it is kind of a new area of research and hard to classify. It's a bit different. Also, a lot of us come from an RL background, so we're much more thinking about agents, which I think is quite different to all the video work, which is much more, I guess, focused on generative media and generating cinematic quality videos. But there's
Starting point is 01:09:13 definitely some overlaps in the architectures, infrastructure and these kinds of things, and we both want to use lots of compute, so I guess that's another thing we have in common. What do you make of all of the hype around the agent space? Do you see that continuing or do you see people getting tired of the agentic buzz? You're venturing into hot takes territory. So it's tricky because, I mean, a lot of us who worked in RL, right, we've been working on agents for a long time, and I think RL is often dubbed as reinforcement learning research, but really it's agent research for a lot of us. It's
Starting point is 01:09:46 agent research where RL is currently the best method to get agents. It seems like now that's shifted because people are starting with LLMs and then training them on top to get additional capabilities. But it's a lot of the same people that were doing RL research. So they've always been working on agents. It's just now it's called LLM agents; before it was tabula rasa RL. So I think, yeah, this hasn't changed a huge deal. It's just we're starting with base models rather than tabula rasa in maybe some kind of
Starting point is 01:10:15 toyish environments. It's kind of a natural progression of that line of work. I think it's exciting, but I think the goal of Genie is a bit different in that we're going more for embodied AI. We want agents that can interact in the real world over a long horizon. And for that, I just can't look past how you would need a simulator of the real world, which I don't think we're going to build by hand. So I think they're kind of complementary in a sense. I think the LLM agents will become more capable in doing long-horizon tasks in text-based substrates. But I think that to then
Starting point is 01:10:49 take long-horizon actions in the real world, some kind of VLM is going to need to be able to interact in the world, and we're not going to just be releasing them to do random exploration. So I think a real-world simulator will play into that at some point. Awesome. Thank you so much for the time. Believe it or not, Google also won a second best paper award at ICML for video generation, with VideoPoet, DeepMind's take on zero-shot video generation. VideoPoet is a simple modelling method that can convert any autoregressive language model
Starting point is 01:11:21 or large language model, LLM, into a high-quality video generator. It contains a few simple components. A pre-trained MAGVIT V2 video tokenizer and a SoundStream audio tokenizer transform images, video and audio clips with variable lengths into a sequence of discrete codes in a unified vocabulary. These codes are compatible with text-based language models, facilitating an integration with other modalities such as text. An autoregressive language model learns across video, image, audio and text modalities to autoregressively predict the next video or audio token in the sequence. A mixture of multimodal generative learning objectives are introduced into the LLM training framework,
Starting point is 01:12:12 including text to video, text to image, image to video, video frame continuation, video inpainting and outpainting, video stylization and video to audio. Furthermore, such tasks can be composed together for additional zero-shot capabilities, for example, text to audio. Let's cut to Lijun Yu speaking for the VideoPoet oral presentation. Good morning, everyone. This is Lijun Yu from Google DeepMind.
Starting point is 01:12:42 Excited to meet everyone in Vienna. This year, I believe many of you may have witnessed significant progress on video generation, especially with text-to-video diffusion models. Today, I'm going to talk about a completely different approach, which shows that diffusion may not be a necessary component. We appreciate that the award recognizes the contributions of this work. Now, please allow me to introduce VideoPoet, a large language model for zero-shot video generation.
Starting point is 01:13:18 This work wouldn't have been possible without our talented team, with members coming from diverse backgrounds and moving forward along different paths. The core contributors were Dan, myself, Xiuye, José, Jonathan, Bryan and Lu, along with many other VideoPoets. Reflecting on the progress so far, we realize that video generation has already come a long way from the early days of GAN models. In case you have never seen generated videos from a large-scale model by the definition of 2016, here are two examples for classes of golf and bib. Since then, people have scaled up GAN models and developed pixel-space autoregressive and diffusion models,
Starting point is 01:14:06 which were getting less affordable. Some works tried to model it as a foreign language of images or videos, but lossy discrete tokenization poses inevitable limitations. Later, latent diffusion became the dominant approach, given its appealing sample quality. Big companies and startups have ignited a race of scaling up compute and data. Now, nearly 10 years later, models can easily generate a video clip from a text prompt, like this skeleton drinking soda.
Starting point is 01:14:41 But is latent diffusion the only way to go as we embrace the LLM era? Absolutely not. In fact, this video is generated with VideoPoet, a purely LLM-based approach without diffusion. VideoPoet is a foundation model that takes inputs of text, image, visual dense signals, partial videos, audio, and their combinations. It is capable of text to video, image to video, video stylization, video editing, video to audio, and many other tasks. In short, VideoPoet is an autoregressive LLM that synthesizes videos with high-fidelity motion and matching audio from a large variety of conditioning signals. The diverse capabilities of VideoPoet are facilitated by defining a universal multimodal sequence-to-sequence problem.
Starting point is 01:15:41 The condition sequence includes task indicators, inputs from text, visual, and audio modalities, as well as output format controllers. The model generates the output sequence of visual and audio tokens in a fully autoregressive manner, just like a usual language model. In order to define the token space for each modality, we resort to a collection of unimodal tokenizers. The MAGVIT-v2 encoder and decoder define a bidirectional mapping between the pixel space and a compressed space of discrete visual tokens. It can tokenize images, depth, or optical flow, as well as corrupted or masked videos. SoundStream does similarly for the audio waveform. Although text tokens can be directly fed in, we use a pre-trained text encoder to extract text features to reduce the burden of learning human language from scratch.
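To make the condition-sequence and output-sequence framing concrete, here is a small sketch of packing visual and audio codes into one shared vocabulary using id offsets, for a video-to-audio example. Everything specific here is an assumption for illustration: the special-token names, the tiny vocabulary sizes, and the function `build_video_to_audio_example` are hypothetical and are not taken from the talk or the paper.

```python
# Hypothetical special tokens: a task indicator plus modality delimiters.
SPECIAL = ["<bos>", "<task:video_to_audio>", "<bov>", "<eov>", "<boa>", "<eoa>", "<eos>"]
SPECIAL_ID = {name: i for i, name in enumerate(SPECIAL)}

# Toy vocabulary sizes; real tokenizers use far larger codebooks.
VISUAL_VOCAB = 16
AUDIO_VOCAB = 8

# Each modality gets its own contiguous id range in the shared vocabulary.
VISUAL_OFFSET = len(SPECIAL)
AUDIO_OFFSET = VISUAL_OFFSET + VISUAL_VOCAB
TOTAL_VOCAB = AUDIO_OFFSET + AUDIO_VOCAB


def visual_id(code):
    if not 0 <= code < VISUAL_VOCAB:
        raise ValueError(f"visual code out of range: {code}")
    return VISUAL_OFFSET + code


def audio_id(code):
    if not 0 <= code < AUDIO_VOCAB:
        raise ValueError(f"audio code out of range: {code}")
    return AUDIO_OFFSET + code


def build_video_to_audio_example(visual_codes, audio_codes):
    """Return (condition, target) id sequences for one training example."""
    condition = (
        [SPECIAL_ID["<bos>"], SPECIAL_ID["<task:video_to_audio>"], SPECIAL_ID["<bov>"]]
        + [visual_id(c) for c in visual_codes]
        + [SPECIAL_ID["<eov>"]]
    )
    target = (
        [SPECIAL_ID["<boa>"]]
        + [audio_id(c) for c in audio_codes]
        + [SPECIAL_ID["<eoa>"], SPECIAL_ID["<eos>"]]
    )
    return condition, target


condition, target = build_video_to_audio_example([3, 15], [0, 7])
print(condition, target)
```

Other tasks would swap the task indicator and which modality sits on the condition side, while the model and vocabulary stay the same.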
Starting point is 01:16:35 The MAGVIT-v2 tokenizer defines the visual language. It resembles the quantized VAE with a temporally causal 3D CNN architecture processing pixels. This causal design enables joint training with large-scale image data and seamless support for long videos. For higher prediction bandwidth, we adopt a large vocabulary of over 200,000 words enabled by our scalable quantizer. The model is trained with both reconstructive and adversarial objectives. In a human rater study, our advanced video tokenizer achieves even better compression quality than VVC, the next-generation video codec standard. This tokenizer lays a solid foundation for high-fidelity generation of videos, especially for those with large motion.
Starting point is 01:17:22 Similarly, the SoundStream tokenizer defines the audio language, which adopts a causal 1D CNN on the waveform. It uses residual vector quantization to produce multiple levels of tokens, where VideoPoet uses the first four, and its quality is better than the Opus audio codec standard. Now that we have defined the multimodal token spaces, we can convert video datasets into discrete token sequences. Then we can use an out-of-the-box LLM transformer training infrastructure to learn these as foreign languages.
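For readers unfamiliar with residual vector quantization, the following toy sketch shows the general idea: each level quantizes whatever the previous levels left over, and keeping only the first few levels gives a coarser reconstruction. The 1-D codebooks and the function names are made up for illustration and do not reflect SoundStream's actual codebooks or dimensions.

```python
def nearest(vec, codebook):
    """Index of the codebook entry closest to vec (squared L2 distance)."""
    return min(
        range(len(codebook)),
        key=lambda i: sum((v - c) ** 2 for v, c in zip(vec, codebook[i])),
    )


def rvq_encode(vec, codebooks):
    """Quantize vec level by level; each level encodes the remaining residual."""
    residual = list(vec)
    indices = []
    for codebook in codebooks:
        i = nearest(residual, codebook)
        indices.append(i)
        residual = [r - c for r, c in zip(residual, codebook[i])]
    return indices


def rvq_decode(indices, codebooks):
    """Sum the selected entries; passing fewer indices uses fewer levels."""
    out = [0.0] * len(codebooks[0][0])
    for i, codebook in zip(indices, codebooks):
        out = [o + c for o, c in zip(out, codebook[i])]
    return out


# Hypothetical 1-D codebooks, coarse to fine.
CODEBOOKS = [
    [[-1.0], [0.0], [1.0]],
    [[-0.25], [0.0], [0.25]],
    [[-0.0625], [0.0], [0.0625]],
]

indices = rvq_encode([0.8], CODEBOOKS)
coarse = rvq_decode(indices[:1], CODEBOOKS)  # first level only
full = rvq_decode(indices, CODEBOOKS)        # all three levels
print(indices, coarse, full)
```

Using only the first levels, as the talk describes, trades some reconstruction fidelity for a shorter token sequence per audio frame.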
Starting point is 01:17:56 In VideoPoet, we adopt a decoder-only prefix LLM architecture, where bidirectional attention is applied on the condition sequence, followed by causal attention on the target output. Compared to a diffusion transformer of the same size, the VideoPoet framework has significant flexibility and efficiency benefits at both training and inference times. It can flexibly train arbitrary tasks between any modalities together, with variable lengths of condition and target sequences in a single model. With causal attention, the transformer learns the entire decoding trajectory for video
Starting point is 01:18:32 in a single training step. At inference time, we can leverage various types of existing acceleration techniques, such as KV caching, so that the entire decoding FLOPs are no more than one full forward pass. Video data comes from different sources in diverse formats. Text-to-video diffusion models usually require text-video pairs with high aesthetic
Starting point is 01:18:54 value, which may be scarce and costly to acquire. With our flexible design, VideoPoet can pre-train on a mixture of pre-existing data, where a large fraction remains unlabeled or noisily labelled. In this table, we have a large number of raw videos with audio from the public internet, some videos with noisy machine captions, and another set of videos with high-quality human captions. We also leverage image-text pairs to improve language alignment. After pre-training, we can have a second training phase of task-specific adaptation
Starting point is 01:19:29 with the corresponding high-quality dataset, such as for text to video. More details about the training data can be found in the paper. We have a large mixture of training tasks on these data, starting with self-supervised ones, such as unconditional generation of various modalities. With an autoregressive model, they also imply the corresponding continuation tasks for video, audio and both of them, as well as the image-to-video task. VideoPoet is trained to generate audio given a video or vice versa, and perform various types of video editing, such as inpainting, outpainting, and interpolation.
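The prefix-LLM attention pattern mentioned a little earlier in the talk (bidirectional attention over the condition sequence, causal attention over the target) can be written down as a boolean mask. This is a generic sketch of that pattern, not VideoPoet's actual implementation; the function name `prefix_lm_mask` and the convention that `True` means "query may attend to key" are assumptions.

```python
def prefix_lm_mask(num_condition, num_target):
    """Build mask[q][k]: True if query position q may attend to key position k.

    Positions 0..num_condition-1 are the condition (prefix); the rest are targets.
    """
    n = num_condition + num_target
    mask = []
    for q in range(n):
        row = []
        for k in range(n):
            if k < num_condition:
                # Every position sees the whole condition sequence,
                # so attention inside the prefix is bidirectional.
                row.append(True)
            else:
                # Target keys are visible only to the same or later targets.
                row.append(k <= q)
        mask.append(row)
    return mask


mask = prefix_lm_mask(num_condition=2, num_target=3)
for row in mask:
    print("".join("x" if allowed else "." for allowed in row))
```

Because condition and target lengths are just arguments, the same masking rule covers tasks with very different sequence lengths in one batch-construction routine.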
Starting point is 01:20:09 In addition, leveraging the captions, it learns to generate video, audio, and image from text. Video stylization is supported by depth or optical flow conditions. After the LLM backbone generates video tokens, we can optionally apply a latent super-resolution module before decoding to pixels. It uses the MAGVIT masked transformer with non-autoregressive decoding, which runs faster at small scale, with multi-axis windowed attention to handle long sequences at high resolutions. While VideoPoet has broad generation capabilities, many of the existing automatic benchmarks
Starting point is 01:20:47 are defined around text to video. Here we compare with SOTA methods on the commonly used MSR-VTT and UCF-101 zero-shot text-to-video evaluations. On metrics of CLIP similarity, Inception Score and FVD, VideoPoet performs favorably against prior models, which were specifically designed for text to video. As automatic metrics have become saturated and less indicative, we conducted a user study with human raters to compare zero-shot text-to-video generation in various aspects. We compare against prior works including Phenaki, Show-1, VideoCrafter, Runway, and Pika, as well as concurrent works such as W.A.L.T and Lumiere. On the axes of text fidelity, video quality, motion interestingness, and motion realism, VideoPoet is preferred to prior works, and preferred to concurrent works in the majority of cases.
Starting point is 01:21:41 This is a collection of text-to-video samples by VideoPoet, and we highlight their high-fidelity motion. More samples can be found on the project page. Here we demonstrate the image-to-video capability, which can be potentially applied for 3D rendering as well. In addition, video stylization and editing are natively supported. VideoPoet can generate the corresponding audio for video, where it understands the content. Here, we show a few examples where both video and audio are generated by VideoPoet. We hope our work can empower the community to explore in broader areas and greater depths.
Starting point is 01:22:38 With LLM-style foundation models, we could further leverage their generalization capabilities for in-context learning of a new motion, character, or object for video generation in a customized and controllable way. We can even think of how a new modality can be added into the model at inference time. On the efficiency side, we probably want to care about how video generation can run in a real-time streaming fashion.
Starting point is 01:23:04 This will not only enable interactive neural gaming, but may also facilitate neural user interfaces. Imagine a neural-based operating system: it could have no more blue screen crashes, but may reboot when it runs out of memory due to a long context. Further advancement would hopefully take us to a universal multimodal generative model that excels at text, video, audio, image, and beyond. Think about text to video as machine translation in 2018 when it first beat human performance. It took another five years before we had ChatGPT. I guess it will take less time before we can reason and generate across modalities with
Starting point is 01:23:46 LLM-level intelligence. Looking forward to seeing how it answers, can you show me how to tie this shoe with a single hand in a live video? In summary, VideoPoet represents a distinct approach to video generation. It challenges the diffusion monopoly with state-of-the-art visual quality, while offering multitask flexibility which goes beyond the text-to-video translation paradigm. It is a video-first foundation model with diverse generation and editing capabilities, building upon out-of-the-box infrastructure for native integration.
Starting point is 01:24:21 That concludes my talk today. Thanks for your great work, and actually I'm very curious about the instruction-following ability of VideoPoet. So do you take some measures to evaluate the instruction-following ability quantitatively or qualitatively, and how does it compare with the traditional classifier-free guidance of diffusion models? Yeah, that's my question. Okay, that's a great question. First of all, I don't think there exists a very good quantitative metric to measure that thing, but it's a very promising future direction people should work on to evaluate these video generation models. Secondly, we did use classifier-free guidance in our model as well, for autoregressive.
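The answer mentions using classifier-free guidance with an autoregressive model but does not give the formula. A common formulation, shown here purely as an assumption, extrapolates next-token logits from an unconditional pass towards a conditional pass before sampling. The function names and the example numbers are hypothetical.

```python
import math


def cfg_logits(cond_logits, uncond_logits, scale):
    """Blend logits: scale 0 gives unconditional, 1 conditional, >1 extrapolates."""
    return [u + scale * (c - u) for c, u in zip(cond_logits, uncond_logits)]


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Made-up next-token logits over a 2-token vocabulary.
cond = [2.0, 0.0]    # with the text condition
uncond = [1.0, 1.0]  # with the condition dropped

guided = cfg_logits(cond, uncond, scale=3.0)
print(guided, softmax(guided))
```

With a scale above 1, tokens favoured by the condition become more likely than under the conditional distribution alone, which is the usual trade of diversity for prompt adherence.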
Starting point is 01:25:08 It works. And then I think it is very tricky to fairly compare with the diffusion model beyond a system-level comparison, because they use different latents: one is continuous, one is discrete, and when you train on different ones, you have different reconstruction quality. Then, you can calculate perplexity for a language model; people don't really know how to do that for a diffusion model. I believe it really warrants further study on comparing these two methods systematically. Okay, thank you. Okay, great work and great team. Hi,
Starting point is 01:25:43 Lijun. Hi. So we are also working on the topic of video LLMs that have the capability of generating video. So my question: as we all agree, the video tokenization might be the bottleneck of, for example, the model, right? Yes. And so do you have any insights that you want to share about how to build some very capable video tokenization technique? Okay. So first of all, the video tokenizer consists of an encoder and a decoder, but it basically
Starting point is 01:26:29 learns a bidirectional mapping. Sometimes people use a diffusion decoder on the other side of the language model, but that means you are doing another generative model there. In VideoPoet, we only use a pair of encoder and decoder trained with reconstruction and adversarial objectives. It is a bidirectional mapping. In this case, it gets really tricky to train it for high-quality reconstruction because it will always be a lossy compression problem. And on the decoder side, you always have a one-to-many mapping. So the real help, I guess, in getting away from blurry reconstruction comes from the adversarial objective, which gets you sharp videos. Also, the 3D causal CNN in the MAGVIT-v2 architecture helps a lot, especially when coupled with autoregressive modeling.
Starting point is 01:27:20 So you have full temporal causality for the training. It is very friendly for auto-regressive decoding. Okay, thanks. So maybe we can have one more chat. Yeah, for sure. We'll come to the poster session. It's happening last minute. First of all, thank you for your work.
Starting point is 01:27:41 And I have a question about open-sourcing MAGVIT-v2. As far as I know, it was planned, but somehow MAGVIT-v2 has not been open-sourced yet. Do you have any plans for it? Okay. I might have answered this question more confidently if you had asked me one month ago.
Starting point is 01:28:04 But now as I joined the time, I don't really know. The good news for you is the MAGVIT version 1 tokenizer was already open-sourced like a year ago. Yeah, you can use that. And I think it will only take like another 100 lines of code to reproduce MAGVIT-v2. So then you have the tokenizer, the full reconstruction and adversarial training logic.
Starting point is 01:28:28 Yeah, I know it's simple to reproduce, but it's hard to train, same as all the VQ-VAE-like methods. It's very difficult to find the right hyperparameters. Yes, I agree with that part. It is tricky, and it takes some hyperparameter search, especially as it varies with the dataset statistics, where I think there's a lot of room for future improvement. I believe the current solution is not perfect. And even for now, although I have worked on so many video tokenizers, my real dream now is getting rid of them.
Starting point is 01:29:05 Last question. Yeah. Thank you. Thank you for your talk. So you presented a really interesting direction in multimodal foundation models. From what I see, the whole architecture or the approach in training is very much sequence continuation, right?
Starting point is 01:29:23 Yes. So I'm wondering if you are working on more capabilities or architectural components which help the model to generalize and, how to say, to connect the dots. Especially in video and audio signals, it is a very hard task for any model to see the structure behind the diversity. So I'm asking if you're thinking or working in this direction.
Starting point is 01:29:57 Okay, that's a great question. First of all, one of the advantages of VideoPoet is that we take the LLM training infrastructure out of the box. We take it for granted, and we actually made no modification to the model architecture and training recipe. So all you need to do is define your token space and curate your sequence datasets. I think that part requires some really smart designs. For example, you can have text-to-image as a prefix of text-to-video, and you can have video-to-audio tasks as the prefix of unconditional video and audio generation, stuff like that. So you help the model generalize across different tasks. All right, so thank you so much for the talk, and congratulations again on your best paper
Starting point is 01:30:47 award. Thank you. I'm here with Dan Kondratyuk to talk about the VideoPoet: A Large Language Model for Zero-Shot Video Generation poster. Dan is with Luma AI, which is one of the leading companies in the AI video generation space. Dan, can you give me an overview of how you started working on this and maybe a brief high-level summary of what it is you have here on the poster? Yeah, so we started this project mainly as a way of thinking about video generation from a foundation model perspective. So foundation models, typically when you think about them, I guess at the time,
Starting point is 01:31:26 they were all language models or visual language models. So they output primarily text, but we thought, what if we approach it from a video perspective? And this approaches the design from a very different perspective from how the current video generation models work, which are primarily diffusion-based. So we thought maybe we could envision a setup where we take an off-the-shelf language model. One of the things we changed the least in this project is the model: we just take a language model. We don't do anything special with it.
Starting point is 01:32:02 And our real innovation here is on the data side, like how you design the tasks as input to the language model. And the way we designed it is we translate all of our modalities, so text, video, image, audio, into one embedding space. That means you translate into one language that the language model can understand. Typically, when you think of language, you think of human language, natural language, the type of text that you can read. But you can actually think of images, video, and audio as a type of language too. So we have this tokenizer, which we called MAGVIT-v2, which takes, for instance, an image or video,
Starting point is 01:32:50 and translates it into a discrete sequence of tokens with a very large vocabulary. So there's a vocabulary of, say, 200,000 tokens. And that can be input directly into a language model. So this language model speaks the language of video in some respect. And all we do is just train it on hundreds of millions of videos. I think we trained on more than a billion images. And also some of our dataset had video and audio pairs. And depending on how you order things,
Starting point is 01:33:23 we have this bidirectional attention prefix. It just means that we input these, and the model has a way of incorporating all these modalities, this text, these images. We also have some alternative types of dense input prediction for stylization and audio. And depending on the order, what you input in the beginning, you can condition it to output different things. So for instance, you input text, you can output video. So depending on your description, say an astronaut starting to dance on Mars, it starts generating the output video based on how we
Starting point is 01:34:01 trained it. Similarly with output audio, we can, for instance, take an input image or video and try to generate accompanying audio, using audio tokens generated by SoundStream, which is a previous Google paper that did a kind of language modeling on these audio tokens. So our approach is primarily about how we combine these tasks together. We're not the first to show that you can use a language model for this type of generation, but we are a work that shows that you can actually scale this to a level that's competitive with existing works that do video generation. So you can see, because it's a
Starting point is 01:34:53 foundation model, it can do tons of tasks just based on how we were able to design it. For instance, you can do text-to-video, and we can do image animation: take an input Mona Lisa, and you just ask Mona Lisa to yawn, and all of a sudden, based on what you describe you want the image to do, it just does it, which is really cool. And then we can chain this with other tasks, like for instance, stylization. If we use something like depth and optical flow conditioning, it basically strips out all of the contents of the original video and conditions on just the depth and the optical flow. If you describe it like an oil painting of a snowman with a red hat opening their mouth to yawn,
Starting point is 01:35:39 then it just paints on top with the same motion as the original video. So that's another really cool thing that the model was able to do. And then we have a whole bunch of other tasks: it can even generate audio. It can outpaint videos, where we take an input video and try to paint more content on the bottom and top. So overall, we evaluated the results and we see that they are quite competitive with a lot of existing works. In fact, it exceeds most of the works that we tried, which is really cool. And a couple of things in particular the model does really well. One is prompt following.
Starting point is 01:36:20 Because we can train it like a language model, it's actually easy to scale with the existing infrastructure that we had. And also, it does pretty well on motion. You can look at existing image-to-video or text-to-video results. Compared to the other works that we tried at the time, it produced much bigger and more interesting motion in the video, rather than having something that moves very slightly, more akin to image animation. So that's the overview of the work. If there are any questions, I'm happy to answer. Yeah, so we're talking to an audience here of folks who, you know,
Starting point is 01:37:02 are AI engineers. They build applications, oftentimes using some of these models as the underpinnings of the AI part. So I'm curious if you would say that the work you've done and the language model approach is a better fit for some use cases versus others, and whether there might be other use cases where diffusion models are better. How would you think about the trade-offs? I think there are a couple of things.
Starting point is 01:37:25 If you want to do something that does very good pixel quality, I think diffusion models are still unmatched in this regard. And that's primarily because of the tokenizer. The tokenizer does extreme level of compression. So that's why we're forced to like generate at these pretty small resolutions and need a super resolution model to increase the fidelity. But with a diffusion model, you don't have that restriction. you can do diffusion over these latent tokens that are not as compressed.
Starting point is 01:37:55 And as a result, it's a bit easier to get these high-resolution, high-quality results. However, one problem with diffusion models is that they take a very long time to converge, a very long time to train. I think the language model approach is definitely quite a bit more efficient. We trained it only for a few weeks, and already it converged pretty well. And it scales the way existing language modeling approaches do. So you can easily predict what happens if we keep increasing the model size. As we see here,
Starting point is 01:38:36 the one-billion-parameter model is pretty good, but with the eight-billion-parameter model, we just increase to eight times more parameters and we get much, much better results. I suspect if we just keep increasing the model size, it'll keep improving. So that's also another nice result. Diffusion models do have a scaling property, but it's a lot harder to predict, I would say. So I think one of the nice things about language models is that a lot more research has been done on their scaling properties. And also, because the tokens are flattened into a 1D sequence, you get this multimodal representation, whereas a diffusion model typically only operates on one modality. Maybe you could try to do something like video and audio generation at the same time with a diffusion model.
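Editor's sketch: one simple way to picture tokens from several modalities being flattened into a single 1D sequence is to give each modality its own slice of a shared vocabulary and concatenate the segments in conditioning order. The vocabulary sizes, the offset scheme, and the names `flatten_example` and `modality_of` below are hypothetical and are not taken from the VideoPoet paper.

```python
import numpy as np

# Hypothetical per-modality vocabulary sizes (illustrative only).
VOCAB = {"text": 1000, "video": 4000, "audio": 500}

# Give each modality a contiguous id range inside one shared vocabulary.
OFFSETS = {}
_start = 0
for _name, _size in VOCAB.items():
    OFFSETS[_name] = _start
    _start += _size
TOTAL_VOCAB = _start

def flatten_example(segments):
    """segments: list of (modality, local_ids) pairs in conditioning order.

    local_ids may have any shape (e.g. a token grid); it is flattened and
    shifted into the shared vocabulary, then all segments are concatenated.
    """
    parts = []
    for modality, ids in segments:
        ids = np.asarray(ids)
        assert ids.min() >= 0 and ids.max() < VOCAB[modality]
        parts.append(ids.reshape(-1) + OFFSETS[modality])
    return np.concatenate(parts)

def modality_of(token):
    """Recover which modality a shared-vocabulary id belongs to."""
    for name in VOCAB:
        if OFFSETS[name] <= token < OFFSETS[name] + VOCAB[name]:
            return name
    raise ValueError(f"token {token} is outside the shared vocabulary")
```

Under this scheme a text-to-video example is simply a text segment followed by a video segment, and reordering the segments changes which modality acts as the condition.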
Starting point is 01:39:26 But I don't think you can generate all modalities combined at the same time with a diffusion model, at least not at the same level of quality. Text diffusion right now is really hard to do and has not reached the same level of performance as autoregressive models. So if you want a general foundation model to do everything all at once, I still would say the language model is at the top right now. But who knows? People are doing research in many different areas. If someone can crack text diffusion, I think you could also create a foundation model that does all these modalities. And is this work that you've been doing in the context of your role at Luma, and is it work that you plan to continue to push forward in that context?
Starting point is 01:40:10 So I recently left Google to join Luma to do some video generation that I was really excited about. So this is just a work that I worked on while I was at Google. I think there's some continuation of this work possibly in the future, like this general approach. Obviously, VideoPoet is still not out, and I think that's just a testament to how fast the field moves right now. It's just an incredibly competitive space. But I do think this general approach could surface in many different areas in the future. Who knows? Right now, language models and diffusion models are battling it out in this battleground.
Starting point is 01:40:50 So who's to say which one will win out in the end? Both approaches have been shown to work pretty well. They have their strengths and weaknesses right now, and more research is going into the space. So right now, I'm really excited about video generation going forward to build out these more general-purpose models. And at least for this work on VideoPoet, I'm also really excited about the future prospects of, you know, what's going to happen next. Awesome. Thank you so much. You may have caught that Dan Kondratyuk, the lead author of the VideoPoet paper, has left
Starting point is 01:41:25 DeepMind to join Luma Labs, which is responsible for the Luma Dream Machine model that went viral for turning popular memes into videos this year. To tie off our generative video discussions, we will bring in Tali Dekel's invited talk from the Text, Camera, Action! Frontiers in Controllable Video Generation workshop on Saturday, on the future of video generation beyond data and scale. As a reminder, all talks have public links. So if you want to see the videos she is talking about, click into the show notes. Hi, everyone. I'm Tali, and it's a great privilege to be here.
Starting point is 01:42:02 So today I'm going to talk about the great revolution that we are witnessing in generative AI, and especially in video generation. And as you know, models in this domain require a tremendous amount of training data and compute. But I'm hoping to convince you, based on my own experience and work, that the future of video generation goes way beyond just data and scale. It's going to be a high-level talk
Starting point is 01:42:29 that's going to cover different topics, but I do hope to also dive into technical details on the more recent works. So again, in the context of this workshop, I think it's redundant to say that, you know, we are all aware of the fact that the generative AI revolution has recently expanded to videos, and we are now not only able to generate these, whoops, mind-blowing still images, but we can also make everything move. And really, I think the past couple of years have shown dramatic,
Starting point is 01:43:12 start envisioning how movie production in the film industry might look in the near future, and think that we might be able to generate movies completely computationally. So maybe it will look something like this: we'll ask ChatGPT to help us with the script, like this, and it will generate the script for us. And then it will take the script and generate the movie completely computationally, like this. And maybe we'll then ask it to add some special effects, like a bullet time effect, and it will just do it completely computationally, without any real actors or cameras, just using generative AI.
Starting point is 01:44:13 Yeah, you can hear it outside. Okay, sorry. No more audio in this talk. And if you are too young, you probably don't know, but this is not a real generated video. It was taken from the Matrix movie that was produced sometime in the 90s. Okay, so I'm sorry to disappoint you, but I think we are very far from this future. And despite all the amazing progress that we are witnessing, state-of-the-art text-to-video models still exhibit some fundamental failure cases, even models like Sora. So, for example, they tend to fail to simulate real physical interactions in the world,
Starting point is 01:44:56 like this object here, which is supposed to be a rigid chair. And you can see it is floating in an unrealistic manner in space. In this example, the treadmill follows the person, also in a physically implausible fashion. And also, when we are dealing with more complicated scenes that involve multiple entities, objects tend to unrealistically appear and disappear spontaneously. And this basically tells us that video generation is still not solved. And furthermore, the costs of scaling up video models and developing these universal foundation models are just huge.
Starting point is 01:45:41 You know, a single model training run requires roughly 200K GPU hours on average, which translates to almost $280K, and this just translates to millions and millions of dollars to develop such models. And in terms of energy consumption, just generating half a second of video amounts to driving roughly four miles in an average car. And because of this cost, it leads us to the fact
Starting point is 01:46:15 that these video foundation models are being sealed away in industry, and there are only very few big players in industry that can develop and design such models. So what do we do in the research community in that case? And also, if we go back to our moonshot goal of, you know, generating films completely computationally, in order to do that we need explicit fine-grained control. We may want to exactly control the camera position, the character identity, their emotion, their positions, their movements. We may want to control lighting and also
Starting point is 01:46:54 sound and speech. And all of these controls are not currently provided by video foundation models. So my research journey in the realm of videos actually started on the other side of the spectrum, with single video models. And what do I mean by that? I mean that basically we have some neural-based framework that is overfitted to a single test video. So just as a NeRF, for example, is overfitted to a single 3D scene, in this case we have some neural networks that only observe this test video alone, without any additional data. And it turns out that you can do some pretty impressive things with these single video models. So for example, we showed how you could take this really busy and complex scene, and let's say you want to just focus your attention on a single
Starting point is 01:47:52 dynamic object. So we can actually remove all the rest of the moving people in this scene except this girl, and you can notice how not only do we remove the people, we also remove the complex deformation that occurs to the trampoline in this case. Here, this is a video of my son riding his bike for the first time, and I can take this video and stylize only the background, and you can see that everything moves consistently and physically correctly with the original scene. And these works are from 2021, before the big generative AI revolution. And we can also, you know, not only map texture onto rigid objects, we can also map texture onto deformable articulated objects. For example, we can add these flowers to the dress, and they are moving in a physically correct manner, as in the original video.
Starting point is 01:48:54 And again, the only information these models have about the world is just the single video, the input video on the top. Of course, their big disadvantage is that they don't have this rich and powerful prior knowledge about the world. So just to show the advantage of this in more detail, I think one of the big advantages of this approach is that it allows us to go way beyond just working with raw, huge pixel volumes. So we can design sophisticated and more advanced representations for real-world videos. So in Layered Neural Atlases, we wanted to support consistent video editing, and the key idea was to basically estimate from the video a unified set of canonical images. So given this input video, we estimate two atlas images, like you can see here, one for the background and one for the foreground, that represent either the entire background or
Starting point is 01:50:03 foreground for the entire video. And each pixel position from the original video is being mapped onto these Atlas images. And this allows to basically reconstruct the original video from this representation. And now the key advantage of this representation is that it allows to reduce this really difficult task of editing huge pixel volumes of real-world videos to editing a single 2D image.
Starting point is 01:50:31 So what you can do, you can just take these images, plug them in into any image editing framework, or just load it up in Photoshop and draw some stuff on it and then use the mapping to map it back to the original video. Sorry, the animation doesn't work. Okay, of course that I'm showing you a discrete set of images, but in practice everything is being implicitly represented through MLPs and through neural network.
Starting point is 01:51:01 So very briefly, each pixel position in the video is fed into an MLP that maps it into a 2D coordinate in this atlas space. So this is just a 2D coordinate between minus 1 and 1. And you have two such networks, for the foreground and the background. And each such position in this 2D unified space is fed into another MLP that predicts the RGB color at that position. And there is also another small MLP that predicts the visibility of each point,
Starting point is 01:51:39 how much it observes from the background versus the foreground, and this allows us to basically reconstruct the original color of the video at each position, and to train this entire framework completely in a self-supervised manner, where the driving loss is a video reconstruction loss. There are other terms in the objective function to make sure that this representation is interpretable, that the structure is preserved, and that the correspondences in the video are preserved. But basically, you can train these things end-to-end in a self-supervised manner. Here you can see the editing map.
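Editor's sketch: the forward pass described here (pixel position to atlas coordinate, atlas coordinate to color, a visibility value blending the two layers, and a reconstruction loss) can be written down with tiny randomly initialized MLPs. This is an illustration and not the Layered Neural Atlases code: the layer sizes, the tanh and sigmoid squashing, the use of one color network per layer, and all names are assumptions, and no training loop or regularization term is shown.

```python
import numpy as np

rng = np.random.default_rng(0)

def make_mlp(sizes):
    """Random weights for a small MLP (a stand-in for trained networks)."""
    return [(rng.normal(scale=0.5, size=(a, b)), np.zeros(b))
            for a, b in zip(sizes[:-1], sizes[1:])]

def mlp(params, x):
    for i, (w, b) in enumerate(params):
        x = x @ w + b
        if i < len(params) - 1:
            x = np.tanh(x)
    return x

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Mapping networks: (x, y, t) -> 2D atlas coordinate, one per layer.
map_fg, map_bg = make_mlp([3, 16, 2]), make_mlp([3, 16, 2])
# Atlas networks: 2D atlas coordinate -> RGB (one per layer is an assumption).
atlas_fg, atlas_bg = make_mlp([2, 16, 3]), make_mlp([2, 16, 3])
# Visibility network: (x, y, t) -> foreground opacity in [0, 1].
alpha_net = make_mlp([3, 16, 1])

def reconstruct(xyt):
    """xyt: (N, 3) pixel positions. Returns blended RGB and intermediates."""
    uv_fg = np.tanh(mlp(map_fg, xyt))   # atlas coordinates in [-1, 1]
    uv_bg = np.tanh(mlp(map_bg, xyt))
    rgb_fg = sigmoid(mlp(atlas_fg, uv_fg))
    rgb_bg = sigmoid(mlp(atlas_bg, uv_bg))
    alpha = sigmoid(mlp(alpha_net, xyt))
    rgb = alpha * rgb_fg + (1.0 - alpha) * rgb_bg
    return rgb, uv_fg, uv_bg, alpha

def reconstruction_loss(xyt, target_rgb):
    """Mean squared error between reconstructed and observed colors."""
    rgb, _, _, _ = reconstruct(xyt)
    return float(np.mean((rgb - target_rgb) ** 2))
```

Editing then amounts to changing what the atlas color lookup returns while keeping the mapping and visibility networks fixed, so the edit follows the original motion.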
Starting point is 01:52:24 Okay, so on the one hand, we have, again, these video foundation models. They come with a huge training cost. They are limited. We don't have much access to them in the research community. And they provide limited controllability. On the other hand, they can learn these really powerful, amazing space-time priors about our dynamic world. On the other side of the spectrum, we have the single video models that require only a few GPUs to train. They are accessible, and they allow us to be much more flexible and creative in the way we represent video content.
Starting point is 01:53:04 However, they do not have any prior knowledge about the world. So you can probably guess that the way I think we should go about videos is actually to combine the best of both worlds. And what do I mean by that? So on the one hand, we want to have this flexibility and this freedom to represent video content and to gain explicit control over what we are synthesizing. On the other hand, we want to fuse into this representation external knowledge learned from universal models.
Starting point is 01:53:45 And this is not restricted to just video models. We can integrate external information from an ensemble of foundation models that can provide us with motion priors, generative priors, and semantic priors. And my first attempt to do so was in Text2LIVE. So in Text2LIVE, we wanted to support text-driven editing. And I think, to the best of my knowledge,
Starting point is 01:54:14 it's the first method to demonstrate text-based editing for real-world videos. Again, this was ECCV 2022. And the key idea there was to use a pre-trained neural atlas representation of the video as a video renderer. We're going to have this representation, keep it fixed, and then replace the manual edits that we can perform on the atlas images with automatic edits described by text. And to achieve that, we combined this representation with a pre-trained CLIP model, which back then allowed us to do this for the first time.
Starting point is 01:54:55 And here you can see how we can perform localized and semantic editing on real-world videos, without any real generative model. This was just using CLIP. And again, I think that, you know, performing these localized semantic edits, and the type of edits that I showed you for removing dynamic content, is still a challenge even for big foundation models that are very powerful. But again, with all due respect to CLIP and this approach, with the rise of text-to-image models, we wanted to take this approach further and to think of how we can leverage stronger priors about the world.
Starting point is 01:55:38 And I think one of the main challenges in pursuing this approach of combining external knowledge with these sophisticated video representations is that most foundation models are basically black boxes to us. We do not understand exactly the priors that they learn and how these priors are internally encoded. So this approach poses this challenge
Starting point is 01:56:02 of how to distill learned priors from black boxes. And basically one of my research aims is to dive deep inside those foundation models and gain a better understanding of what they learn and of their internal representation. And if we can achieve that, then we can build much better algorithms on top of them. So with the rise of text-to-image diffusion models like Stable Diffusion, I was really amazed by the ability of these models to capture these really complicated signals about our visual world.
Starting point is 01:56:50 So just viewing these images, we can see that these models can learn priors about composition, about pose, about interactions between objects, appearance, and so on. So I was focusing on this aim of taking text-to-image models way beyond what they're meant to do, way beyond just generating images from text. And we had a line of works in the lab that were some of the early works in this space. So, for example, in Plug-and-Play, we conditioned the generation not only on text,
Starting point is 01:57:26 but also on a reference image, and the output image preserved the semantic layout of the original reference image. In MultiDiffusion, we extended pre-trained text-to-image models to generate images at arbitrary resolution and also to receive region-based text controls as input, like you can see in these examples. And in the context of videos, I was thinking,
Starting point is 01:57:54 how can we take these powerful priors that text-to-image models learn, and extend them to video synthesis tasks? So at NeurIPS, we introduced SceneScape, which allows not only generating beautiful scenery, but also generating 3D-plausible walkthroughs inside those scenes. And behind those videos, there is actually a real 3D mesh representation of the scene
Starting point is 01:58:19 that is being built. And in TokenFlow, we showed how you can not only synthesize static scenes, but actually edit real-world dynamic scenes. And I think, again, a huge bulk of work is doing that, adapting text-to-image models and expanding them in various ways. I think what's more unique in these works is that we insisted on keeping those text-to-image models fixed and strove to better understand the generation process and the internal representation, to make these black boxes more transparent and to utilize our understanding of them.
Starting point is 01:59:06 So I want to dive more deeply into some of the works. So let me discuss TokenFlow in more detail. And again, our goal in this work was to perform consistent video editing. And we started with this naive baseline of applying Plug-and-Play or a different method to edit each frame independently. And as you can see, the content is really inconsistent. It's not just at the level of high-frequency flicker. The content really changes from frame to frame, and there is really no reason to believe that the text-to-image model
Starting point is 01:59:51 would give us something else. So we wanted to dive inside the model and understand how these inconsistencies are being represented inside the model. So in order to do that, we take the original video frame by frame. We use some inversion technique to invert it back to the model. Then we can just extract some features from intermediate layers. And because those features are really high dimensional, we cannot make sense of them. So we use PCA to reduce them into three dimensions and visualize them as videos.
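Editor's sketch: the visualization step described here, reducing high-dimensional features to three dimensions with PCA so they can be shown as color, might look like the following. It assumes the features for all frames have already been extracted and stacked into one `(N, D)` array so that a single PCA basis is shared across frames; the function name and the min-max normalization to [0, 1] are illustrative choices and not details from the talk.

```python
import numpy as np

def pca_to_rgb(features, n_components=3):
    """Project (N, D) features onto their top principal components.

    Returns an (N, n_components) array rescaled per channel to [0, 1],
    so each token can be displayed as a color.
    """
    centered = features - features.mean(axis=0, keepdims=True)
    # Rows of vt are principal directions, ordered by explained variance.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    proj = centered @ vt[:n_components].T
    lo = proj.min(axis=0, keepdims=True)
    hi = proj.max(axis=0, keepdims=True)
    return (proj - lo) / np.maximum(hi - lo, 1e-8)
```

For a hypothetical feature tensor of shape `(frames, H, W, D)`, one would reshape it to `(frames * H * W, D)`, call `pca_to_rgb`, and reshape the result back to `(frames, H, W, 3)`; fitting one basis over all frames is what makes colors comparable from frame to frame.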
Starting point is 02:00:30 So here you can see the original video. And on the right-hand side, you can see the PCA reductions of features extracted across different levels of the UNet. And what we can easily observe is that these PCA visualizations depict a shared and consistent representation. We can see that the consistency in RGB is mirrored by a similar consistency in feature space for this video. So we wanted to look at this consistency in a more fine-grained manner.
Starting point is 02:01:12 So in order to do that, we looked at nearest neighbors. You take a feature at a certain position in one frame and just compute its nearest neighbors in all the rest of the frames. And what we saw is that those correspondences exhibit semantic and accurate matching across different frames, as you can see in these examples. And you can compute this nearest neighbor field densely. So if you are given two frames, you can take each feature in the source frame and compute its nearest neighbor in a target frame, and this will give rise to this dense nearest neighbor field, which we named token flow. So this provides us with semantic and accurate matching, but we also wanted to gain more information about what these features hold
Starting point is 02:02:09 in terms of information about the frames. And in order to do that, we checked how well we can generate the target frame from the features provided by a source frame. So this has been done by basically taking the source frame and the target frame, extracting their features, computing the token flow, and then just warping the source feature tokens. And now we can intervene in the generation process of a target frame. We basically do DDIM inversion to get the initial latent,
Starting point is 02:02:48 but then, for each feature of the target frame, we compute its nearest neighbor from the source frame, and we just swap the features. So we want to check how the generation of the target frame would be impacted by this swapping. And we observed that the target frame can be synthesized accurately from the source features, which means that those features are interchangeable for the model. Okay, so what happens now? Again, we applied this per-frame editing, and we saw that the consistency breaks in RGB. What happens to the features?
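Editor's sketch: the dense nearest-neighbor field and the feature swap described above can be illustrated on plain feature arrays. The talk does not say which similarity measure is used, so cosine similarity here is an assumption, as are the function names and the `(tokens, channels)` layout; extracting the features from the diffusion model and DDIM inversion are outside the scope of this sketch.

```python
import numpy as np

def nearest_neighbor_field(target_feats, source_feats):
    """For every target token, find the most similar source token.

    target_feats: (Nt, D), source_feats: (Ns, D).
    Returns an (Nt,) integer array of indices into source_feats.
    Cosine similarity is an assumed choice of metric.
    """
    t = target_feats / np.linalg.norm(target_feats, axis=1, keepdims=True)
    s = source_feats / np.linalg.norm(source_feats, axis=1, keepdims=True)
    return np.argmax(t @ s.T, axis=1)

def warp_source_to_target(source_feats, nn_index):
    """Replace each target token by its nearest source token."""
    return source_feats[nn_index]
```

If the target frame's tokens are just a rearrangement of the source frame's tokens, the field recovers that rearrangement exactly and the warped features equal the target features; real frames only approximate this, which is why the matches degrade for distant frames.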
Starting point is 02:03:35 Here you can see the feature visualization of this per-frame edited video, and you can see that the features depict the same inconsistencies as in RGB. So basically, consistent features give rise to consistent frames and vice versa. So our key idea in TokenFlow is that in order to achieve consistent editing, we want to achieve consistent features during the generation process. And the way we suggested to do that is by
Starting point is 02:03:59 enforcing the original token flow, the original feature matching of the original video, on the edited video. So you can see the edited video and the underlying features of that edited video. And just to summarize, this method works as follows: we take the original video, we do the DDIM inversion, we extract the features and compute the token flow. And then the generation process of the edited video is composed of two stages.
Starting point is 02:04:45 In the first stage, we sample some keyframes and jointly edit them with extended attention. This gives basically just rough global coherency between the frames. And then we extract the features of these edited frames, and we propagate them to the rest of the frames using the original token flow of the original video. And we repeat this process. Here you can see some generation results and a comparison to several methods. Again, I think since we published it, this work has generated a great body of follow-up works. You've seen the nice work on editing XT slices today. So these matching and token flow correspondences, they hold between
Starting point is 02:05:40 nearby frames, but indeed, when the frames are more distant from each other, those matches tend to be incorrect. So indeed our method would break for very complex motions where these correspondences would be difficult to achieve. Okay, so I guess I talked about how we can use text-to-image models beyond what they are meant to do, but the main limitation of just using text-to-image models is obvious. They only provide us with 2D information and we don't have any motion priors. And if we really want to model our dynamic world, we need to know something about how objects move, how they tend to move in the real world. We want to know priors about actions,
Starting point is 02:06:30 and that's something that text-to-image models cannot provide us. But again, I remind you all that we are in this amazing world where progress happens really fast, and now we have these powerful video models, and that really motivates using them and their understanding of motion in various applications. It could be generative tasks, but I don't think it has to be limited to that. Okay. So that brings me to the last work that I want to talk about, space-time features for text-driven motion transfer, which was presented at the last CVPR. And the motivation there was again the film industry and the, you know, big efforts and manual, professional work that are put into transferring motion from motion markers and so on to animation,
Starting point is 02:07:32 using this CGI type of animation. So in this work we wanted to achieve this computationally. So given an input driving video, like this dog jumping into a river, we want to be able to transfer it to dramatically different objects just using a simple text prompt, like you can see here. And you can see that the big difference between this setting and this task compared to, let's say, what we've done in TokenFlow, is that you must enable deviation
Starting point is 02:08:10 from the shape of the original objects in order to convey or to fulfill the target edit. In order to transfer the motion of this dog to a dolphin, one must change the shape of the dog dramatically and adapt the fine-grained characteristics of the motion such that it will be plausible and natural with the target object. Maybe the dolphin moves its tail in a certain way and so on. So we really need to distill the essence of the motion from the driving video, but be flexible enough to allow this adaptation of the content
Starting point is 02:08:47 in order to fulfill, to get a natural-looking edit. And for that, we must have a prior about how things are moving in the real world. So in this work, we used ZeroScope, one of the publicly available text-to-video models. You can see some samples from this model. So it's way, way far from, you know, state-of-the-art text-to-video models that keep getting better and better, but this model is still able to learn valuable information about our dynamic world. Okay, so just in the context of this work, we are not defining motion anymore as pixel-level correspondences, because again, we want to allow this flexibility and deviation from the shape of the object. So in our context, for this task, motion is defined as a sequence of semantic object parts' positions.
Starting point is 02:09:45 So you can think about an object as being, you know, just a set of parts, and about their general progression throughout the entire video. And again, in terms of related work, I think none of the existing methods is designed to enable this big deviation in the structure of the objects. So we followed TokenFlow and took a similar approach and asked ourselves how space-time information is internally encoded in this text-to-video model. And again, we want to dive deep into the features and understand them better. So in this case, our input is a video and we can directly invert it into the video model,
Starting point is 02:10:37 again using an off-the-shelf DDIM inversion technique, and extract features. And in this case, the features are four-dimensional. So F is the number of frames, M by N is the spatial dimensions, and D is the number of channels. And so here, instead of doing PCA visualizations and so on, we adapted a feature inversion technique. So I guess many of you are familiar with it in the context of understanding pretrained classifiers; it's a classic method. So the general idea is that we have some pre-trained and fixed model. We take our input, we feed it into the model and extract some target features.
Starting point is 02:11:20 And in order to understand better what these features encode, now we solve this optimization task where we want to optimize for an image, in this case, such that when we feed it into the model, it will give rise to the same target features. And in many cases, of course, you need to somehow regularize this optimized image to avoid adversarial solutions and so on. So in our case, our input is not an image. It's a video. We can feed it into the model, extract features.
Starting point is 02:11:53 And now the goal is to optimize for a new video, such that when we feed it into the text-to-video model, it will give rise to the same features. And if we solve this optimization task, so again, you can see the objective at the top and the original video on the left. You can see the feature inversion results from different seeds on the right. And you can see that we can accurately reconstruct the original video in terms of appearance, motion and so on.
Starting point is 02:12:26 And this is not what we want, because we want to allow much more flexibility, both in terms of shape and appearance. So how can we take these space-time features and build a descriptor out of them that will allow us this flexibility? Our first step towards removing this pixel-level dependency was to average out or reduce the spatial dimensions. So we basically take these features for each frame and just average pool them across the spatial dimensions.
Starting point is 02:13:01 So for each frame we have a D-dimensional vector, and so to describe the entire video, we have an F by D tensor. And now we can repeat our feature inversion experiment with those spatially reduced features. And we were really surprised when we got this result to see that even though we averaged out the information across space, you can see from this inversion that we still preserve the pose and accurate
Starting point is 02:13:37 movements of the woman in this video while allowing for more flexibility in the structure and appearance. And just in terms of intuition, again, those features are really high-dimensional, they live in this high-dimensional space. So even though we average them spatially, this information can still be preserved. Okay, so in the next step we said, okay, so let's use these features for editing. We're given some video, the original video. We can extract those spatial mean features from the original video and just use them as guidance during the generation process of the edited video. So you can see the equation up here, but we basically want to optimize the latents such that when we denoise them with the target text, in this case, a camel,
Starting point is 02:14:35 we want the resulting features, the spatially reduced mean features, to match those of the original video. We do that through guidance, through the generation process, and you can see here the result. So indeed, it allows for some flexibility. We can get different deviation in shape and in appearance, but still it looks kind of like a camel that was squished into the shape of the elephant. And so these features, although we average them, they still contain too much information about the original objects in the video.
Starting point is 02:15:17 And that led us to basically build the pairwise SMM difference matrix. And this idea is basically inspired by this entire line of work on self-similarity: we don't want to encode the absolute values of these features, but only encode how they relate to each other, all their pairwise relations throughout the video.
Starting point is 02:15:44 So basically, we take these D-dimensional features for each frame, and we build this F-by-F matrix in which each entry is basically just the difference between two spatially averaged features. And you can think about it as encoding some motion in this semantic space of features, because we are just encoding all their pairwise differences and deltas between all the frames. And now we want to again intervene in the generation process of the target video and use guidance, but this time we want to encourage the generated videos to have the same pairwise SMM difference matrix.
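The construction just described can be sketched in a few lines of plain Python. This is only an illustration under assumptions: the features are random stand-ins rather than activations from a text-to-video model, the function names are hypothetical, and the mean squared gap used as a distance is a placeholder, since the talk does not specify the distance used in the guidance objective. The sketch shows why the descriptor is more flexible than the raw averaged features: a constant offset added to every feature changes the per-frame means but cancels in the pairwise differences.

```python
import random


def spatial_mean(features):
    """Average features[f][m][n] (each a length-D list) over space: returns F x D."""
    result = []
    for frame in features:
        cells = [vec for row in frame for vec in row]
        dims = len(cells[0])
        result.append([sum(vec[k] for vec in cells) / len(cells) for k in range(dims)])
    return result


def pairwise_differences(smm):
    """Entry (i, j) is the D-dimensional vector smm[i] - smm[j]: an F x F x D array."""
    return [[[a - b for a, b in zip(fi, fj)] for fj in smm] for fi in smm]


def difference_loss(delta_a, delta_b):
    """Mean squared gap between two difference arrays (illustrative distance only)."""
    gaps = [
        (a - b) ** 2
        for row_a, row_b in zip(delta_a, delta_b)
        for vec_a, vec_b in zip(row_a, row_b)
        for a, b in zip(vec_a, vec_b)
    ]
    return sum(gaps) / len(gaps)


# Toy features: F=5 frames, a 2x3 spatial grid, D=4 channels.
rng = random.Random(0)
source = [[[[rng.gauss(0, 1) for _ in range(4)] for _ in range(3)] for _ in range(2)]
          for _ in range(5)]
# A constant offset on every feature stands in for a change of "appearance".
shifted = [[[[x + 10.0 for x in vec] for vec in row] for row in frame] for frame in source]

delta_source = pairwise_differences(spatial_mean(source))
delta_shifted = pairwise_differences(spatial_mean(shifted))
print(difference_loss(delta_source, delta_shifted))  # close to zero: the offset cancels
```

In an actual guidance loop, the second argument would come from the features of the video being generated, and the loss gradient would be taken with respect to the latents.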
Starting point is 02:16:28 So this will be our objective function during the generation process of the edited video. And now you can see that we can get a much better-looking camel and still preserve the motion in the original video. Here you can see some more examples. And I think, you know, if you look at transferring the motion from this kitten to bunnies, you understand that really we want to synthesize the bunnies here, and they need to move in a realistic manner, as bunnies tend to move, and that really, I think, exemplifies the need to have a motion prior. There are some more examples with more dramatic shape changes.
Starting point is 02:17:25 And some more examples on well-known videos. We also have a way of initializing the initial latent of the video. I'm not going to go into the details of that, but we use a combination of DDIM-inverted noise in the low frequencies with random noise at the high frequencies, and this allowed the method to be more robust and less sensitive to the exact seed that we are using in the optimization. And again, compared to previous methods, they really tend to preserve pixel-level correspondences and they're not able to fulfill the edit in a way that
Starting point is 02:18:10 is flexible enough. So how do we measure success here? In order to measure the fidelity to text, we can use CLIP score, but we wanted to somehow quantify how well we capture the motion of the original video. And again, we want to measure that under these dramatic shape changes, so we can no longer measure just pixel-level similarity between motions. So we suggested a different metric for that, and we suggested to measure the similarity based on the similarity of two sets of unaligned trajectories. So you can take an off-the-shelf tracker and just apply the tracker on the original video and on the edited video. And that provides us with these two sets of long-range trajectories. And now we can measure their similarity using the Chamfer distance,
Starting point is 02:19:08 where, for the distance between two tracks, we just use the correlation between the tracks. So each trajectory in one set finds its nearest trajectory, its most highly correlated trajectory, in the other set, and vice versa, and we sum those correlation values. So here you can see the evaluation of different methods. So on the y-axis, we have the motion fidelity score, so higher is better. And on the x-axis, we have the CLIP similarity score.
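A rough sketch of this Chamfer-style score, in plain Python and under assumptions: tracks are short lists of (x, y) points made up for the example rather than tracker output, "correlation" is taken to be the cosine similarity of the per-frame displacements (the talk does not give the exact definition), and the best matches are averaged rather than summed so the score stays in [-1, 1]. Function names are hypothetical.

```python
import math


def displacements(track):
    """Flatten the per-frame displacements of a track given as [(x, y), ...]."""
    flat = []
    for (x0, y0), (x1, y1) in zip(track, track[1:]):
        flat.extend([x1 - x0, y1 - y0])
    return flat


def correlation(track_a, track_b):
    """Cosine similarity of two tracks' displacements (0.0 if either track is static)."""
    a, b = displacements(track_a), displacements(track_b)
    norm = math.sqrt(sum(v * v for v in a)) * math.sqrt(sum(v * v for v in b))
    return sum(x * y for x, y in zip(a, b)) / norm if norm > 0 else 0.0


def motion_fidelity(source_tracks, edited_tracks):
    """Each track finds its best-correlated track in the other set, in both directions."""
    forward = sum(max(correlation(s, e) for e in edited_tracks) for s in source_tracks)
    backward = sum(max(correlation(e, s) for s in source_tracks) for e in edited_tracks)
    return 0.5 * (forward / len(source_tracks) + backward / len(edited_tracks))


source = [[(0, 0), (1, 0), (2, 1)], [(5, 5), (5, 6), (6, 7)]]
# Same motion at a different location and in a different order: tracks are unaligned.
moved = [[(x + 10, y - 3) for x, y in track] for track in reversed(source)]
# Same paths played backwards: the motion is reversed.
backwards = [list(reversed(track)) for track in source]

print(motion_fidelity(source, moved))
print(motion_fidelity(source, backwards))
```

Because each track only looks for its best match, no correspondence between the two sets is needed, which is what makes the measure usable when the edited object has a very different shape.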
Starting point is 02:19:44 So we want to be at the top right as much as we can. And you can see that our method provides the best trade-off between providing good motion fidelity and fulfilling the text. TokenFlow, which preserves the original motion with high fidelity, gets a better motion fidelity score, but pays in CLIP score because it cannot fulfill the edit fully.
Starting point is 02:20:15 SDEdit on the video model with a low noise level, again, is able to preserve the motion with high fidelity, but it cannot deviate much from the original content of the video. And if we use SDEdit with a high noise level, it's vice versa. It's the opposite: we can fulfill the edit, but we can no longer preserve the motion. And again, our method provides the better trade-off between these two ends. Of course, there are some limitations, so we are still bounded by the priors that can be provided to us by the text-to-video model. So if the target object cannot be fitted, in terms of the motion of the video prior,
Starting point is 02:20:57 to the motion of the source object, we will get deviation and this weird motion happening, as in this example. Okay, so just to summarize, I talked about the two ends of video generation, editing, and synthesis: the video foundation models on the one hand, the single-video models on the other hand. And I hope I managed to convince you that this approach of combining the two is effective and powerful. There are still tons of stuff to do in order to pursue this goal.
Starting point is 02:21:34 We still need to understand these huge foundation models and devise new smart representations in order to fuse this information into them. And there are lots of open questions on how to do that. I'd like to thank all my students and collaborators from Google and from Weizmann. And I'll continue to work towards breaking new ground in video analysis and synthesis tasks. And hopefully, in the future,
Starting point is 02:22:07 we will be able to generate even such professional effects using computational tools. Thank you. So you mentioned that the open, like obviously open source video models, there's a huge gap in performance compared to what, you know, we can see. What do you think there's still to be done that doesn't really require training? That would, sorry? That does not require training a model.
Starting point is 02:22:34 So what do you think, for example, in text-to-image models, we saw so many papers on different ways of controlling images. What do you think we can do in videos that would be similar? Yeah. So I think the last work I showed takes a first step in this direction. I think that when you see these generation results, it is evident that these models learn some useful representation about motion, about how things evolve over time. And I think utilizing the internal representation of text-to-video models
Starting point is 02:23:13 is still very underexplored, and there's tons of stuff to do there that won't require heavy training in order to adapt them or to leverage them for various downstream tasks. It could be generative tasks, but not only. I think there is great potential: as we all use, you know, pre-trained image features for various tasks, I think the way to go forward is also to use video features for downstream tasks. And in order to do that, I do think we need to understand these models much better. And I think there are also many open questions about how to gain control over video generation, what will be the correct interface, how intuitively you would want to even interact with videos. I think it was discussed here at different talks that just using text is not sufficient
Starting point is 02:24:09 in order to model our dynamic world, and we need to build new tools, new representations, new intuitive interfaces to interact with dynamic content, which are currently not there yet. Hi, thank you for the interesting talk. So my question is a bit of a follow-up to what you just highlighted, and more on the cost side of universal video models. So what would be your thoughts on like,
Starting point is 02:24:39 since we are in the early stages, do we anticipate like an order of two reduction in the cost? And it could be algorithmic, it could be on the architecture side. As you said, like, how do we control these models? That might even be the factor that takes us to like, the two orders of magnitude further. So what does the future look like compared to where we are today?
Starting point is 02:25:05 I think also it was discussed here in previous talks, but I really think that one missing ingredient in order to push the boundaries of video foundation models is compression. Like, how do you effectively represent or compress information across a video? Right now, I feel that, you know, the early stages of video foundation models are mostly doing the straightforward extensions that we can think about from the image domain. And building an effective video compressor, so that you can work in its latent space, I think that will be crucial for pushing the boundaries of video generation orders of magnitude more.
Starting point is 02:25:53 Yeah, and I believe we'll get there. It's just a matter of time. Yeah. Yeah, hopefully. Thank you. Thank you. That was the end of part one of this pod on Generative Video. In part two, we turn to exploring related topics in generative modelling and diffusion that we feel represent the most important work of 2024, that are also helpful building blocks for generative video.
Starting point is 02:26:21 First, we have two more DeepMind researchers. You may be observing a pattern in how much work DeepMind is putting into multimodal generative AI. Here is friend of the pod Sander Dieleman, who works on both DeepMind's Veo video generation model and Imagen 3. Over the past year, Sander has developed an intuitive interpretation of diffusion. Where traditionally diffusion models and autoregressive models are viewed as polar opposites, with different hardware utilization and inference paradigms, Sander's perspective of diffusion as spectral autoregression in the frequency domain caught the community's imagination this fall, and for the first time, Sander expands upon this in his workshop.
Starting point is 02:27:06 So I'm going to talk about an intuitive look at how diffusion models work, and specifically in the context of modeling audiovisual data, sort of in the spirit of the theme of the workshop. So it's roughly structured in four parts. So the first thing I want to do is explain how diffusion works from a geometric perspective, because I think this intuition is really valuable. And one thing that sort of bothers me about the diffusion literature is that it's, you know, as a beginner, it must be extremely confusing because there's so many different formalisms, so many different ways of saying the same thing. And I think this geometric perspective is sort of a nice way to tie it all together and link these things together.
Starting point is 02:27:50 And then in the second section, I'll try to highlight some other perspectives that I think are useful and maybe less well known. And then in the third section, I want to talk about diffusion guidance, which is a very powerful tool that is also very easily explained with this geometric perspective. And then finally, I want to talk a little bit about Imagen 3 and Veo, which are the models that I've been working on recently. So first, let's talk about a geometric perspective on diffusion models. So I don't need to repeat this probably, but we know that diffusion works with iterative denoising.
Starting point is 02:28:28 So we have some data distribution that we're trying to model. In the examples I'll show, this will be an image distribution, and we gradually add a bunch of noise, and then we try to remove it. That's diffusion models in a nutshell. So I'm gonna talk a little bit about this corruption process first. So we first define a way to destroy all the information
Starting point is 02:28:51 that is in the data distribution. And so I'm gonna take an example here from the training data. I'm gonna call that X naught, or X0. The index zero stands for a time step in the corruption process. So we treat this as kind of a temporal process, and at time step zero, we are in the data distribution. And then this process will proceed by adding small increments of Gaussian noise, which I've called delta here. So think of this as a tiny amount of Gaussian noise. And we just do that repeatedly. We add these small increments
Starting point is 02:29:18 repeatedly. And then at some time step T in the process, we can look at what our image looks like, and it will be a noisy image, right? And then if we keep doing that indefinitely, then eventually that noisy image is going to look like just Gaussian noise, and we're not going to be able to see anything from the original image in there. A very nice property of doing this with Gaussian noise is that if you have a lot of small increments of Gaussian noise, you can add them together into one larger increment of Gaussian noise.
Starting point is 02:29:48 And this allows us to simulate this process much more efficiently. And that's kind of a key idea behind diffusion model training, just that for any time step T in the process, we can write XT as our clean data X0, plus a scaled version of a standard normal variable. And the scaling factor sigma of T is what we're going to call the noise schedule
Starting point is 02:30:11 of the diffusion model. In practice, we make things slightly more complicated, but also slightly easier to work with, by not just adding noise at every step, but also slightly rescaling the input before we do that. So we introduce this extra scale factor alpha T, which is also dependent on the time step.
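The "many small increments add up to one large increment" property can be checked numerically. The following is a minimal sketch in plain Python, with assumed toy values: the data is a vector of zeros, the step size and step count are arbitrary, and alpha_t is simply set to 1 so that only the noise term is compared. Function names are hypothetical.

```python
import math
import random


def noise_step_by_step(x0, step_std, num_steps, rng):
    """Add num_steps small independent Gaussian increments to each entry of x0."""
    x = list(x0)
    for _ in range(num_steps):
        x = [value + rng.gauss(0.0, step_std) for value in x]
    return x


def noise_in_one_go(x0, alpha_t, sigma_t, rng):
    """Jump straight to time t: x_t = alpha_t * x0 + sigma_t * epsilon."""
    return [alpha_t * value + sigma_t * rng.gauss(0.0, 1.0) for value in x0]


def std(values):
    """Population standard deviation of a list of numbers."""
    mean = sum(values) / len(values)
    return math.sqrt(sum((v - mean) ** 2 for v in values) / len(values))


rng = random.Random(0)
x0 = [0.0] * 5000  # stand-in for clean data
slow = noise_step_by_step(x0, 0.25, 16, rng)
# Variances add, so 16 steps of std 0.25 match one step of std 0.25 * sqrt(16) = 1.0.
fast = noise_in_one_go(x0, 1.0, 0.25 * math.sqrt(16), rng)
print(std(slow), std(fast))  # both should be close to 1.0
```

The one-shot version is what makes training cheap: any noise level can be reached with a single draw instead of a simulated chain.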
Starting point is 02:30:34 And then another change that we'll make is we won't run this process indefinitely, because we don't have time for that. We're going to stop it at some time step capital T, where the image that we get is basically indistinguishable from Gaussian noise. But now the interesting part is the backward process, right? How do we run this process in reverse? Because that then allows us to do generative modeling. And again, this is going to be a gradual process where we add these increments, delta, but now these increments are not just random Gaussian noise. Now these increments actually require us to
Starting point is 02:31:09 understand something about the data distribution to know how to gradually remove this noise. And so I like to represent this geometrically. And before I proceed, I do want to express some words of caution. This is kind of a dangerous game what I'm going to do here. Because really this diffusion process is happening in the input space, right? In this case, in the pixel space. And if we think about image data as a vector space, then the vectors that represent the images are very high dimensional, right? Because you have lots of pixels. Each pixel has three color channels. These are very high dimensional vectors. I am going to represent these as two dimensional vectors because two dimensions is all I have on the screen.
Starting point is 02:31:52 This is dangerous because, as we know, it can be risky to draw conclusions from low-dimensional observations and generalize them to high dimensions. But in this instance, I think it's actually really quite instructive to look at diffusion in this way. So what does a diffusion model actually do? We start with some data point X0,
Starting point is 02:32:13 we add noise to it with that formula that I showed you before, some given amount of noise, depending on the time step T. And then we end up at a different point in space, XT, which is a noisy version of the image. And what the diffusion model is going to do is it's going to try to predict X0 from that XT. So we are in XT, and we try to predict where we need to go in space to get back to X0.
Starting point is 02:32:36 Now, this is a very difficult task. And the reason this is a difficult task is because, of course, the noise is obscuring some information that was in the original image X0, and we can't really recover that. So what we end up predicting is not X0 itself, but rather the expectation of X0 given XT, right? We're predicting sort of what are all the possible X0,
Starting point is 02:32:59 what are all the possible images that could have given rise to this particular noisy observation at time step T? And this is not a single image, but rather a sort of region of the input space. And what a diffusion model is going to do is predict the direction that we need to move in to get closer to that region of the input space. And effectively what we're predicting is the centroid of that region.
Starting point is 02:33:20 And if we try to visualize that prediction, if you try to visualize that centroid, it looks like a blurry image. And the reason for that is that this is kind of an average across many possible images X0. And the noise is kind of obscuring the high-frequency content of these images, but not the low-frequency content. So the result that we get is a blurry image. So how does the diffusion
Starting point is 02:33:46 sampling process proceed? Well, we just predict that direction that we need to move in, and then we take a small step in that direction. And you can kind of compare this to how we optimize neural networks, right? In optimization, we also predict an update direction, but then we only take a small step because really that prediction is only valid locally. And then one thing that we do here that we typically don't do in neural network optimization is we add a little bit of noise back. And there are theoretical reasons for doing this that I'm not going to go into, but the
Starting point is 02:34:15 intuitive reason for why this might be a good idea is that we're doing a sort of two steps forward, one step back thing, which is going to be more robust to any systematic errors in our predictions of this direction. Because of course we're doing this repeatedly in a loop and errors might accumulate. Not all sampling algorithms do this, of course, but some do. Okay, and then we just repeat the process. So now we're at a new point in space, XT minus 1, which looks like a slightly less noisy version of the image, and we just make a new prediction of X0. And as you can see here, that prediction is going to be slightly different.
Starting point is 02:34:48 Right? Because now it's pointing to a smaller region of the input space, because the noise is obscuring less information. So we can kind of make a better guess as to where we need to move. So we have this new prediction. As I said, again, this is kind of reflecting a smaller region of space that we need to move towards. And then the process just kind of repeats, so we add a little bit of noise again. We do this a while longer until eventually we reach time step zero, and then what we should end up with is a sample from our data distribution. We are probably not going to end up in the original X0, but we are going to end up in a sample from the data distribution. So that's kind of this geometric overview of the diffusion process. So everything I've explained so far assumes that a diffusion model predicts X0,
Starting point is 02:35:41 predicts the clean input. Now if you look in the literature, that's typically not what people are doing. Instead, a very common approach is to predict this quantity, epsilon, from the formula I showed you before, which is basically just a standard Gaussian noise variable. But it turns out that once you have a trained model, you can always convert a prediction for X0 into a prediction of epsilon and vice versa. And that's because of this linear relationship that we have. XT is given. XT is our input. And we know that XT is linearly related to X0 and epsilon. So if we have one of these quantities, if we predict one of these quantities, then we can convert that into a prediction for the other.
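Written out, the conversion is just a rearrangement of x_t = alpha_t * x0 + sigma_t * epsilon. A minimal sketch in plain Python, with made-up numbers for alpha_t, sigma_t, x0 and epsilon, and hypothetical function names:

```python
def eps_from_x0(x_t, x0_pred, alpha_t, sigma_t):
    """Solve x_t = alpha_t * x0 + sigma_t * eps for eps, given a prediction of x0."""
    return [(xt - alpha_t * x0) / sigma_t for xt, x0 in zip(x_t, x0_pred)]


def x0_from_eps(x_t, eps_pred, alpha_t, sigma_t):
    """Solve the same relation for x0, given a prediction of eps."""
    return [(xt - sigma_t * eps) / alpha_t for xt, eps in zip(x_t, eps_pred)]


alpha_t, sigma_t = 0.8, 0.6           # arbitrary example noise level
x0 = [1.0, -2.0, 0.5]                 # made-up clean data
eps = [0.3, -0.1, 1.2]                # made-up noise draw
x_t = [alpha_t * a + sigma_t * b for a, b in zip(x0, eps)]

print(eps_from_x0(x_t, x0, alpha_t, sigma_t))   # recovers eps up to rounding
print(x0_from_eps(x_t, eps, alpha_t, sigma_t))  # recovers x0 up to rounding
```

Both conversions divide by a schedule coefficient, so they become numerically delicate at the extremes of the schedule, where sigma_t or alpha_t approaches zero.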
Starting point is 02:36:21 And people have kind of taken this one step further, because you don't just have to predict X0 or epsilon. You could actually predict any linear combination of the two. And that gives rise to things like V prediction and the flow matching target, which is epsilon minus X0. For the same reason, predicting X0 is also equivalent to predicting XT minus one. And the reason I bring this up is that this is the kind of approach that is taken in the original denoising diffusion probabilistic models paper, the DDPM paper. The DDPM paper starts from saying, okay, we have this gradual corruption process, we're going
Starting point is 02:36:57 to invert it one step at a time, and then the natural thing is to predict the previous time step from the current time step. But as is shown here, actually, because of these linear relations, by solving a simple linear system, you can show that this is actually equivalent. This is equivalent when you have a trained model. It's not equivalent during training, which is a little bit tricky. So during training, this choice of prediction target actually affects the relative importance of the noise levels in the aggregated loss across all noise levels, and that is in turn going to affect the perceptual quality of the outputs. So that's why choosing this prediction target is actually important. But once you have a trained model, all these prediction targets are essentially equivalent. All right, so in summary, how does the diffusion training process proceed? So we take each training example X0, we sample a random time step T, then we corrupt X0 to get XT with this formula that I showed you
Starting point is 02:37:53 before. We don't have to run the process one step at a time. We can just do it in one go. And then we use our model to make a prediction for X0 or for epsilon or however we decided to parameterize the model. And then to train the model, we just minimize a squared prediction error. And this is just the MSE loss that we all know and love. So this is a very stable training objective, which is nice.
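The training recipe just summarized can be sketched as a single loss computation. This is a plain-Python illustration under assumptions: the cosine-style schedule is a placeholder (the talk does not specify one), the model is any callable taking (x_t, t) and returning a prediction of x0, and the two "models" below are hand-written stand-ins rather than trained networks. No gradient step is shown.

```python
import math
import random


def schedule(t):
    """Hypothetical schedule: returns (alpha_t, sigma_t) for t in [0, 1]."""
    return math.cos(0.5 * math.pi * t), math.sin(0.5 * math.pi * t)


def training_loss(model, batch, rng):
    """MSE between the model's x0 prediction and the clean x0, at random time steps."""
    total, count = 0.0, 0
    for x0 in batch:
        t = rng.random()                               # sample a random time step
        alpha_t, sigma_t = schedule(t)
        # Corrupt in one go: x_t = alpha_t * x0 + sigma_t * epsilon.
        x_t = [alpha_t * v + sigma_t * rng.gauss(0.0, 1.0) for v in x0]
        prediction = model(x_t, t)                     # model only sees x_t and t
        total += sum((p - v) ** 2 for p, v in zip(prediction, x0))
        count += len(x0)
    return total / count


batch = [[1.0, -2.0, 0.5]] * 8                         # toy dataset: one repeated point


def oracle(x_t, t):
    """Stand-in model that already knows the only data point."""
    return [1.0, -2.0, 0.5]


def identity(x_t, t):
    """Stand-in model that does no denoising at all."""
    return list(x_t)


print(training_loss(oracle, batch, random.Random(0)))    # 0.0
print(training_loss(identity, batch, random.Random(0)))  # strictly positive
```

With a real dataset no model can reach zero loss, because several clean inputs are compatible with the same noisy input; the best achievable prediction is their average.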
Starting point is 02:38:17 The reason we use MSE here, the intuitive reason why this is a good idea, is because really what we want to recover is that expectation from before, right? We can't predict X0 exactly, but we want to recover the expectation of X0 given XT, and that is precisely the minimizer of the mean squared error. And then for sampling, at each time step T, we can predict X0 or epsilon from XT with our model
Starting point is 02:38:43 and then just take a small step in the predicted direction to partially denoise XT to get XT minus one. And then, as I said, in some algorithms, we add back a little bit of noise. In some algorithms, we do not. Okay, so that's kind of the basis of this geometric perspective. Now I want to talk about a few other perspectives that I think are useful, that are maybe less well known. So one thing I'm going to skip is this sort of score matching perspective.
Starting point is 02:39:10 This is also linked to what I just explained, but I think that one is pretty well known nowadays. So I want to talk about a few other perspectives. And one is this way of looking at diffusion models as recurrent neural networks. So if we think of the diffusion sampling loop, we're kind of repeatedly applying this denoiser network that we've trained in sequence. And if you unroll that computational graph, that actually just looks like a much deeper neural network. And then you could ask, why don't we just train that
Starting point is 02:39:41 with backprop like we usually do? And the answer is, of course, it's very, very deep. It's often tens of thousands of layers. If your base diffusion denoiser model is 100 layers and you have 100 time steps, then this is going to be a 10,000-layer neural network. So you can train that with backprop through time. People have done that. What you get is actually called a continuous normalizing flow.
Starting point is 02:40:02 But you can do another thing, which is to train this with score matching, and then you don't have to backprop through this loop. You only have to backprop through one step of the denoising. So this gives you a way to look at diffusion models as kind of a deep recurrent neural network that was trained without backprop through time. Kind of a hack to train deeper networks, if you will. This is a perspective that I really like. So one question that I get a lot is, why do diffusion models actually work so well for image and video? Why did they kind of come in and take over essentially all generative modeling for all modalities except language? And so for images, there's this interesting spectral analysis that we can do that sheds some light on this. So we can calculate the spectrum of an image.
Starting point is 02:40:50 We can kind of summarize that in one dimension. And if you plot this spectrum on a log-log plot, what you get is a straight line, and that reflects that there's a kind of power law going on. So the amplitude, or actually rather the power, of a particular frequency in the image is proportional to that frequency raised to some negative power. Usually it's like around minus two.
Starting point is 02:41:13 And it seems to be some sort of law of nature, right? So you get this negatively sloping line for natural images. If you do the same thing for Gaussian noise, you calculate the spectrum. What you should get is a horizontal line. Because in Gaussian noise, all frequencies are present in equal measure.
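The two spectra can be reproduced with a toy one-dimensional example. This is a sketch under assumptions, not an analysis of real images: the "natural" signal is synthesized with amplitude 1/k at frequency k, so its power falls off as k to the minus two by construction, and a naive DFT from the standard library stands in for a proper FFT. Function names are hypothetical.

```python
import cmath
import math
import random


def power_spectrum(signal):
    """Naive DFT power |X_k|^2 for k = 1 .. N/2 - 1 (DC and Nyquist are skipped)."""
    n = len(signal)
    return [
        abs(sum(x * cmath.exp(-2j * math.pi * k * i / n) for i, x in enumerate(signal))) ** 2
        for k in range(1, n // 2)
    ]


def loglog_slope(power):
    """Least-squares slope of log(power) against log(frequency index)."""
    xs = [math.log(k) for k in range(1, len(power) + 1)]
    ys = [math.log(p) for p in power]
    mean_x, mean_y = sum(xs) / len(xs), sum(ys) / len(ys)
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return covariance / sum((x - mean_x) ** 2 for x in xs)


def natural_like_signal(n, rng):
    """Synthetic stand-in for an image row: amplitude 1/k at frequency k."""
    phases = [rng.uniform(0.0, 2.0 * math.pi) for _ in range(n // 2)]
    return [
        sum(math.cos(2.0 * math.pi * k * i / n + phases[k]) / k for k in range(1, n // 2))
        for i in range(n)
    ]


def mean_noise_power(n, draws, rng):
    """Power spectrum of unit Gaussian noise, averaged over a few draws."""
    total = [0.0] * (n // 2 - 1)
    for _ in range(draws):
        noise = [rng.gauss(0.0, 1.0) for _ in range(n)]
        total = [t + p for t, p in zip(total, power_spectrum(noise))]
    return [t / draws for t in total]


rng = random.Random(0)
image_slope = loglog_slope(power_spectrum(natural_like_signal(128, rng)))
noise_slope = loglog_slope(mean_noise_power(128, 8, rng))
print(image_slope)  # about -2: a downward-sloping straight line on a log-log plot
print(noise_slope)  # near 0: a roughly horizontal line
```

For real images the same measurement is usually made by radially averaging the two-dimensional power spectrum, which is what summarizing the spectrum in one dimension refers to.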
Starting point is 02:41:30 Right? Now the interesting thing that happens is when you superimpose these, because that's what we do in the fusion models, right? We add noise to images and you add them together and then you look at the spectrum and you get this hinge shape that you see on the third plot there. And if I increase the noise level, so if I increase the amplitude of the noise, then that hinge sort of shifts position. And what this is going to do essentially is it's going to obscure more and more of the high frequencies in the signal. But the low frequencies because they're more powerful, they're kind of kind of just, shut out above this noise floor, and so they're going to be preserved. And so based on this interpretation, I think it is fair to say that the fusion is kind of an approximation of spectral auto-regression. We're generating images from low frequencies
Starting point is 02:42:15 to high frequencies. And so this is true for images, this is true for video, and audio also follows this sort of power law, but it's obviously not necessarily true for other modalities such as language. This is not an idea I came up with. I actually got inspired by this paper from Severi Rissanen and his colleagues on generative modeling with inverse heat dissipation, where they do this kind of spectral analysis. And it's really important because the different noise levels actually correspond to different spatial frequencies in the image, in a way. And so that means that when we're re-weighting, rebalancing these different noise levels in our training objective, what we're actually doing is we're saying which spatial frequencies matter to us.
Starting point is 02:42:59 Like which spatial frequencies do we want the model to really understand well? And this actually means that diffusion loss is kind of a perceptual loss, right? Because we're emphasizing the frequencies that the human visual system is sensitive to, and we're deemphasizing the ones that we are less sensitive to. And I think that's one of the big reasons why diffusion models for images took off so rapidly, even if we didn't necessarily understand this at the time. All right, so one thing I also want to do a little bit is contrast autoregression and diffusion, because these are sort of the main generative modeling paradigms that are popular today.
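The hinge shape and the link between noise level and spatial frequency described above can be written down under idealized assumptions: the image spectrum is an exact power law, the noise is white, and the two are independent so that their powers add in expectation. The symbols A, alpha, sigma and f_c are illustrative and do not come from the talk.

```latex
% Power spectrum of image plus noise (idealized):
S_{\text{noisy}}(f) = A\, f^{-\alpha} + \sigma^{2}

% The hinge sits where the two terms are equal:
A\, f_c^{-\alpha} = \sigma^{2}
\quad\Longrightarrow\quad
f_c = \left(\frac{A}{\sigma^{2}}\right)^{1/\alpha}
```

Below f_c the signal term dominates and the image content survives; above f_c the noise floor dominates. With alpha around 2, f_c is proportional to 1/sigma, so doubling the noise amplitude roughly halves the highest frequency that is still visible, which is one way to read the statement that each noise level corresponds to a band of spatial frequencies.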
Starting point is 02:43:37 So we all probably know what autoregression is. You kind of turn everything into a one-dimensional sequence and generate that sequence one step at a time. With diffusion, we use this noising process, this corruption process. So these are just two different ways to do generative modeling, but they're both iterative. They're both using many network invocations to do generation. So they both use this kind of divide-and-conquer approach to generative modeling. And so for video specifically, there's kind of a continuum almost in some of these choices. So we could just model video autoregressively, and that would require taking the spatiotemporal volume and dividing it up into tokens, which would be like three-dimensional patches or voxels, and just choosing some order in which to predict these, right?
Starting point is 02:44:24 Because we need to turn it into a sequence. Then on the other end of the spectrum, we could just take this entire cube, this entire volume, and just model that with diffusion. But there's kind of a hybrid approach that seems to make a lot of sense for video specifically, which is to treat the temporal dimension autoregressively and use diffusion over the spatial dimensions. So that's what I'm showing in the middle here. And all of these approaches have sort of their own advantages and disadvantages.
Starting point is 02:44:49 So the autoregressive approach is nice because it would make training multimodal models very easy. So if we want to integrate this with large language models, right now that seems to be the way to go. But of course, these sequences will get very long, and so that means we run the risk of getting problems with error accumulation if we generate very long videos. Then on the other hand, with diffusion we have kind of robustness against this error accumulation and we have powerful methods for accelerating sampling, for example through distillation.
Starting point is 02:45:21 I believe also that guidance, while it's not exclusive to diffusion, you can apply it to autoregressive models as well, does seem to be, at least to me, more effective in the diffusion setting. But of course, working with these very large spatiotemporal cubes, sort of having to generate this in one go, can be quite unwieldy and can create quite a lot of memory pressure. So the hybrid approach could be seen in some sense as the best of both worlds, but it also has some advantages and disadvantages.
Starting point is 02:45:51 For example, if you want to do distillation, then this hybrid approach where you do temporal autoregression might actually cause issues with error accumulation again. But of course, one nice aspect of the hybrid approach is that we can reuse a lot of stuff that we've done for images, right? Because essentially, this is just an image-conditional image generation model, if you will. One more general trend that I want to talk about in generative modeling for perceptual signals is this sort of moving away from measuring likelihood in the input space. So back in the day when I started working on generative modeling, we had models like PixelCNN and WaveNet. These were just likelihood-based models in the input space. But they didn't really scale very well to larger inputs because likelihood is actually a very poor perceptual metric.
Starting point is 02:46:38 And it's precisely because it's putting way too much emphasis on these high frequencies that are perceptually less relevant. Of course, it works very well for language, as we know. But so the general trend for perceptual data, for audiovisual data, has been that for autoregressive models, we've started measuring likelihood not in input space, but in some latent space. We first learn latents to kind of make abstraction of a lot of this entropy that is not actually perceptually relevant. Like the individual blades of grass in a grassy texture, for example, don't need to be modeled by a likelihood-based model.
Starting point is 02:47:12 We just need to be able to paint with a grassy texture essentially. And then the same thing is kind of implicitly happening in diffusion models in a continuous way, because by re-weighting the noise levels, we're kind of also implicitly downweighting these less important frequencies. But of course with diffusion, we're also nowadays often using a latent space to kind of amplify this effect. And I want to talk a little bit more about that. Why that makes sense, why that is a good idea.
Starting point is 02:47:41 So visual perception, I think, works differently at fine scales and large scales. At very fine-grained scales, our perception of texture kind of makes abstraction of all these little details. Let's say I have an image of a dog playing in a field, you know, sky above, grass below. I can take that image and modify it in Photoshop by shifting that grassy texture one pixel to the left and show it to you again, and you won't be able to see what happened. It's too subtle. So perception is kind of making abstraction of these fine-grained details, and it's not actually necessary to model all these possible variations.
Starting point is 02:48:22 We just need to be able to generate one good one, right? And that's precisely what adversarial models give you, right? They don't really bother modeling all the modes of the distribution, but they can give you a few good ones. So this is just a really good match for fine-grained perception, whereas at the larger scale, we care a lot more about covering all the possible modes. And so there it makes sense to use something that's closer to a likelihood-based model or a diffusion-based model. Right. So the next thing I want to do is talk about diffusion guidance, which I call a cheat code for diffusion models because it allows them to perform way above their pay grade in a sense. Guidance allows us to trade off sample quality against diversity.
Starting point is 02:49:08 And it just generally makes diffusion models work a lot better. And so I want to revisit this geometric diagram that I was talking about before. So again, we have our clean input sample from the data distribution, x-naught, and then a noisy version of it at some time step t in the top right corner. And as before, our diffusion model will predict which direction we need to move in in the input space to move towards the data distribution. But now we're going to do something slightly different. We're going to do classifier guidance, which means we're going to take a classifier that is robust
Starting point is 02:49:45 to noisy inputs, and we're going to ask it to classify this noisy image, and we're going to take the gradient of these logits that we get with respect to the input. And what this is going to give us is a direction in input space that we should move in to make this image more likely to be classified as that particular class. So it's kind of amplifying the aspects of the image that make it adhere to that particular class. And this gives us a different direction in input space. And instead of following the direction we predicted with our diffusion model, we can actually superimpose these directions, just add them together, and then move in that direction instead.
Starting point is 02:50:21 And I want to kind of show you the underlying Bayesian perspective on this as well, which you can get simply by taking this formula for classifier guidance, which is expressed in terms of score functions, like the gradient of the log-likelihood. You can actually just undo this gradient operation and this log operation to see what happens in terms of probability. And that's what I'm showing on this slide. So actually what we're doing is we're taking an unconditional base model, an unconditional diffusion model, adding this classifier, p of c given x, and then combining those two to get a conditional model, right?
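The Bayesian reading described above can be spelled out in two lines. This is a standard identity rather than a transcription of the slide, and the notation (x_t for the noisy input, c for the class) is assumed.

```latex
% Bayes' rule, dropping the normalizer p(c), which does not depend on x_t:
p(x_t \mid c) \;\propto\; p(x_t)\, p(c \mid x_t)

% Taking the log and then the gradient with respect to x_t gives the
% classifier guidance update as a sum of two score functions:
\nabla_{x_t} \log p(x_t \mid c)
  = \nabla_{x_t} \log p(x_t) + \nabla_{x_t} \log p(c \mid x_t)
```

The first term on the right is what the unconditional diffusion model estimates; the second is the gradient obtained by backpropagating through the noise-robust classifier.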
Starting point is 02:50:54 So we can actually turn an unconditional model into a conditional one after training. But the real power of classifier guidance is unlocked when we introduce this scaling factor, which is called the guidance scale. So we're going to scale this gradient direction that we get from the classifier by some constant gamma. And what this is going to do is just say, like, make it look like a rabbit, like really make this image look like a rabbit. I want to get all the characteristics in that image that make it look like a rabbit.
Starting point is 02:51:27 So our new update direction is going to be this one. And so we're going to end up in a different point in space that is kind of following this new direction. And again, if we kind of look at the Bayesian perspective here by undoing this gradient operation and this log operation, what's happened here is the classifier probability is now raised to this power of gamma. And what does it mean when we raise a probability distribution
Starting point is 02:51:53 to a power and sort of renormalize it? That's tuning the temperature, right? That's something we do with autoregressive models all the time. We're actually just tuning the temperature. But what's interesting about guidance is that the temperature tuning is happening in the output space of a classifier and not in the input space of the generative model.
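To make the "raise to a power and renormalize" step concrete, here is a minimal sketch on a small categorical distribution. In the guidance setting this corresponds to a tempered conditional of the form p(x) times p(c | x) to the power gamma. The function name `temper` and the three-class probabilities are made up for illustration.

```python
import numpy as np

def temper(probs, gamma):
    """Raise a categorical distribution to the power gamma and renormalize.

    Equivalent to a softmax over the logits at temperature 1 / gamma.
    """
    probs = np.asarray(probs, dtype=float)
    powered = probs ** gamma
    return powered / powered.sum()

# Hypothetical classifier output p(c | x) over three classes.
classifier_probs = np.array([0.6, 0.3, 0.1])

sharpened = temper(classifier_probs, gamma=3.0)  # gamma > 1 concentrates mass on the top class
flattened = temper(classifier_probs, gamma=0.5)  # gamma < 1 moves towards uniform
unchanged = temper(classifier_probs, gamma=1.0)  # gamma = 1 leaves the distribution as is
```

A guidance scale above one plays the role of `gamma > 1` here: the classifier's opinion about the class becomes sharper, which is the sharpening at a high level of abstraction that the talk refers to.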
Starting point is 02:52:10 And personally, I think that's why it's so powerful, because we're able to tune temperatures at a kind of high level of abstraction. We're kind of sharpening this classifier distribution. So next let's look at the classifier-free version of guidance. So kind of doing the same thing here again, like looking at our diffusion model prediction. But now we're actually going to make two predictions.
Starting point is 02:52:33 We're going to make an unconditional one and a conditional one. And these are going to be slightly different; obviously the conditioning signal gives us a little bit of information about where in space we might need to move to draw samples from the distribution. The way we can achieve this in practice is by training a conditional generative model and then maybe dropping out the conditioning signal 10% of the time. And that gives us a model that can operate in both conditional and unconditional modes.
Starting point is 02:52:58 So we have these two predictions and we can look at the difference vector between the two, which I've called delta here. And this difference vector is the direction that we can move in to make samples look more like they belong to this class c. And again, we can do the same thing that we did in classifier guidance, which is to amplify this difference by some scale factor gamma, just to allow us to really hone in
Starting point is 02:53:20 on the characteristics of this class c. And then this gives us a new direction, which we should move in during diffusion sampling. And again, the sampling algorithm kind of proceeds as before, so we might optionally add some noise here. All right, now let's look at the Bayesian perspective again. This is very powerful because you kind of apply Bayes' rule twice, and effectively this vector delta corresponds to a Bayesian classifier, right? So this classifier probability
Starting point is 02:53:53 that we had before is now replaced by this ratio of p of x given c and p of x. But again, raised to this power gamma. So we're again tuning this temperature. And this is effectively what classifier-free guidance is. And this is a lot less prone to sort of adversarial directions in the input space than classifier guidance would be. I have these examples. They're quite old by now. So these are from the GLIDE paper, which
Starting point is 02:54:19 was sort of one of the first large-scale text-to-image models from OpenAI. But I really like these because they kind of show what the model looks like without guidance and with guidance, which is rare in modern papers. In modern papers, we only see samples with guidance. But here you can kind of really see just how much of an impact this has.
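The classifier-free guidance step described a few sentences earlier (two predictions, their difference delta, amplified by gamma) can be sketched as a single function. This follows the convention used in the talk, where gamma scales the difference vector; other parameterizations of the scale exist. The function name and the toy two-dimensional vectors are hypothetical, and in a real sampler both predictions would come from the same network, called once with and once without the conditioning signal.

```python
import numpy as np

def classifier_free_guidance(pred_uncond, pred_cond, gamma):
    """Combine unconditional and conditional predictions with guidance scale gamma.

    gamma = 0 recovers the unconditional prediction, gamma = 1 the conditional
    one, and gamma > 1 amplifies the direction that points towards class c.
    """
    delta = pred_cond - pred_uncond  # direction that makes samples look more like class c
    return pred_uncond + gamma * delta

# Toy predictions standing in for the model's two outputs at one sampling step.
pred_uncond = np.array([0.0, 1.0])
pred_cond = np.array([1.0, 1.0])

guided = classifier_free_guidance(pred_uncond, pred_cond, gamma=3.0)
print(guided)  # moves three times as far along delta as the conditional prediction does
```

In terms of probabilities, delta plays the role of the gradient of log p(x | c) minus log p(x), which is the log of the implicit classifier p(c | x) up to a constant, so scaling it by gamma is the same temperature-style sharpening as in classifier guidance.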
Starting point is 02:54:37 And you can also see the trade-off between diversity and quality. You can see the images come out looking much less diverse, but the quality is clearly improving. Another example from the same paper here with a slightly different prompt, again, sort of reducing the diversity in favor of making the images just look
Starting point is 02:54:58 a lot nicer overall. And I think nowadays, a lot of these state-of-the-art models that we're seeing, if you were to sample from them without guidance, I think you would be surprised at just how bad they are. These models are really relying on guidance to produce these incredible results that we've been seeing. So if there's anything you remember from this,
Starting point is 02:55:17 the main thing I want you to remember is that classifier-free guidance is just two applications of Bayes' rule. Or is it? There's an interesting recent paper by the NVIDIA group from Finland where they kind of call this into question a little bit and give some other intuition about why guidance might actually be working. I won't go into this here, but it's a very good paper.
Starting point is 02:55:39 I recommend taking a look at it. It came out last month, so it's very recent. All right, and then to wrap up my talk, I'm just going to briefly talk a little bit about Imagen 3 and Veo, which are the text-to-image and text-to-video models that we've been working on recently. Both were announced at Google I/O in May. Imagen 3 should be available shortly.
Starting point is 02:56:03 Veo is obviously a sort of more unwieldy model that might take a bit longer. But hopefully you'll be able to play with Imagen 3 soon. And I just have a few samples here from this model. So this is a latent diffusion model, which is kind of a change from our previous family of models. And you can kind of see that, yeah, it does a pretty good job at fine-grained detail, large-scale structure.
Starting point is 02:56:30 It's a very nice text-to-image model. All of these samples are on the DeepMind website on the relevant blog post. And hopefully we'll be able to share some more details about the inner workings soon as well. And then finally I also want to talk a little bit about Veo, which is our text-to-video model. This is probably looking a lot like what you would expect. So it's again a latent diffusion model. We have a text encoder that encodes a text prompt input, an optional encoder to condition it on frames with the image input, and then the diffusion operates in a latent space and
Starting point is 02:57:15 then we have a decoder that turns this back into pixels at resolutions up to 1080p and relatively long lengths. And then I have here, I don't know if this is going to play, but yeah, this is kind of a show reel of samples, which you may have seen before. Okay, that is supposed to move because it's a video. Okay, here we go. So this is just kind of a show reel of some samples of the Veo model. I don't know if the quality is quite coming across here, but it's producing high-quality video at 1080p. All right.
Starting point is 02:58:02 So to wrap up, one thing I want to highlight is that pretty much everything I've talked about today is on my blog. So I have a whole series of blog posts on diffusion models and on generative models in general, where I kind of try to build intuition. So it's not necessarily about theory and being mathematically correct. It's about building intuition for these models and how they actually work. And so most of the content from the slides here is kind of spread across a few of these different blog posts. Okay, that's it for me. So here's a link to my blog, also to my Twitter account and my email address.
Starting point is 02:58:39 If you have any comments or suggestions or questions after the talk, feel free to contact me. I'm happy to take questions now as well. Thank you. Yeah, I'm curious to hear about where you see the capabilities of these models going. That's kind of a vague question. Yeah, I mean, so I think bigger and better. Yeah, I think we're kind of early. Like I kind of compare it to what's happened in language modeling,
Starting point is 02:59:11 where we're kind of a bit further along in the scaling process. I would say on the video side and on the image side as well, I think we're quite early. So I would expect more big leaps. I have a question about latent diffusion models. I haven't seen them written down mathematically. Would you mind giving us some intuition? So if the input is fixed, like we're doing diffusion models on X,
Starting point is 02:59:34 it makes sense: you can add however much noise you want. You can try to reverse it. For latent, it means you're training a neural network, and from the latent values, you're doing the same process as the network itself is shifting, is training? So usually it's a two-stage process. So we're first going to learn some latent space that essentially compresses the input, right? Because one of the issues with generating very large images, very large videos, is that it just takes up a lot of memory.
Starting point is 03:00:04 And one of the key advantages of latent diffusion is that you can actually compress a lot of the redundancy out of this and still get a representation that's sort of learnable, right? This is also how it differs from sort of standard compression. You know, you have standard compression algorithms like JPEG and H.264 and whatever. They're really just focused on making things as small as possible. Here we're kind of trying to control a trade-off between how much we can compress while maintaining output quality and also how learnable the resulting representation is. Because if you compress too aggressively, that might get difficult. Like if you were to do entropy coding on the latent space or something like that, that might
Starting point is 03:00:45 actually make learning more difficult. So it's kind of an interesting twist on the compression problem because you have this trade-off. But it's generally a two-stage process. So you learn the latent space first, and then you freeze that, and then you just train a diffusion model as you always would, except that you're just extracting this feature representation and operating on that. Thank you. Great talk.
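The two-stage structure described in the answer above can be illustrated with a deliberately simplified sketch. Everything concrete here is an assumption made for illustration: a linear PCA projection stands in for the learned autoencoder, the data is random, the mixing rule in `add_noise` is one common choice of corruption, and no denoiser is actually trained. The point is only the order of operations: learn the latent space, freeze it, then do the usual diffusion corruption on latents instead of on inputs.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stage 1: learn a compressing latent space.
# A linear PCA projection stands in for the learned autoencoder (assumption).
data = rng.standard_normal((512, 64)) @ rng.standard_normal((64, 64))
mean = data.mean(axis=0)
_, _, vt = np.linalg.svd(data - mean, full_matrices=False)
components = vt[:8]  # keep 8 of 64 dimensions; frozen from here on

def encode(x):
    """Map inputs to the frozen latent space."""
    return (x - mean) @ components.T

def decode(z):
    """Map latents back to input space."""
    return z @ components + mean

# Stage 2: run diffusion as usual, but on the extracted latents.
def add_noise(z, noise_level, rng):
    """Mix latents with Gaussian noise; noise_level in [0, 1] (assumed schedule)."""
    eps = rng.standard_normal(z.shape)
    noisy = np.sqrt(1.0 - noise_level ** 2) * z + noise_level * eps
    return noisy, eps

latents = encode(data)
noisy_latents, target_noise = add_noise(latents, 0.5, rng)
# A denoiser would be trained to recover target_noise (or the clean latents)
# from noisy_latents; generated latents are mapped to pixels with decode().
reconstruction = decode(latents)
```

The trade-off mentioned in the talk shows up in the choice of how many components to keep: fewer dimensions means less memory but worse reconstructions, and a real learned latent space also has to stay easy for the diffusion model to fit.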
Starting point is 03:01:09 I'd love to hear your thoughts on current metrics, things that are missing, how we can better evaluate, particularly from a video generation, but in general diffusion models. I have mainly complaints and not many suggestions. It's tough, right? We don't have a lot of great metrics. We do a lot of eyeballing for image as well, especially also for video.
Starting point is 03:01:36 It's trickier for image, sorry, it's trickier for video than for image, because for image it's kind of easy to generate, say, 200 samples, put them in a grid, just take a quick glance at them and have a rough idea of what your model is doing. For video, it's a lot harder because everything is moving, right? So it's much trickier to kind of glance at things.
Starting point is 03:01:54 You kind of have to look more at individual samples. And then for audio, it's actually completely impossible, right? Because you just have to listen to them one by one. And this is a very persistent problem that, yeah, I haven't seen any great solutions for so far. Yeah, we use the classical metrics, FID, FVD, but we also know they are flawed in various ways,
Starting point is 03:02:18 that sometimes we can't trust them. But they're at least useful as canaries, right? They kind of tell us when something is seriously wrong, at least. So that's helpful. But yeah, definitely a very fruitful space to work in if you want to make an impact is to figure out how we evaluate these things, especially computationally, without involving humans in the loop.
Starting point is 03:02:40 Thanks. I would like to ask if you think that predicting human evaluation with a model is promising as a direction or not for evaluating these models? Quite possibly, yeah. I guess it kind of depends on what your human evaluation data looks like, but I think that's a promising direction. Did you try to scale, to train a model on a lot of human evaluation and see... And then sort of use it as a proxy, as a reward model in a sense. Yeah, I would say that's a valuable direction to move in. I have one concern with that, which is that you know, every metric when it becomes a target eventually ceases to be a good metric, right?
Starting point is 03:03:25 So it would be very interesting to kind of see how that applies there. And I think we should be careful about that. Thank you. Hi, we have seen that some of the diffusion models always produce, or often produce, data that's very close to their training data when you ask them to produce something. Do you have any ideas on how they might get more creative or general, and further away from the training data? I think probably the easiest way to solve that is to get more data.
Starting point is 03:04:00 Like if you have an order of magnitude more data, then something like that is an order of magnitude less likely to happen. But I think, so I don't deny that this is a problem, but I think we should also, you know, when diffusion models kind of rose to prominence, came onto the scene, I think one of the very impressive things was this kind of combinatorial generalization that they exhibit. So I think to some extent there's already a lot of sort of creativity on the part of these models and
Starting point is 03:04:32 sort of combining things in ways that don't exist in the training set, and I would expect with more data that that ability would improve. Hey man, thanks for the great talk. With some of the video models that have been released, if, for example, you have, like, say, water and waves sort of flowing, you can kind of, as a human watching it, see that the laws of physics aren't sort of strictly abided by in the same way that you'd see in real life. What do you think are some promising directions for sort of ensuring future video diffusion models
Starting point is 03:05:09 kind of more closely adhere to the physical laws of nature in that kind of sense? Scale is one. I think a lot of this sort of behavior is emergent, and with more data and more capacity the model will learn to do this. But maybe in the shorter term there's something we can already do to improve this, maybe by curating the data, maybe by building some physical priors into the model, although we do have to heed the bitter lesson here, where often it turns out that it's better to just let it learn and not try to meddle with that too much. But yeah.
Starting point is 03:05:47 Our second DeepMind speaker in this section is Ben Poole, who works on inferring 3D structure with 2D priors, which you can see is a key component in upgrading something like Genie 1, which is 2D, to Genie 2, which is 3D. He also introduces the neural radiance field concept, or NeRF, which is now incredibly popular for 3D environment simulation, and of course has implications for synthetic data in generative video. Ben combined NeRFs with score distillation from diffusion to create DreamFusion and ReconFusion. Let's tune into his invited talk at the Structured Probabilistic Inference and Generative Modeling workshop led by Yoshua Bengio.
Starting point is 03:06:30 Yeah, thank you everyone for being here so early. Thank you so much for the workshop organizers for the invitation to speak. And today I'm going to be sharing some of our work on inferring 3D structure with 2D priors. And so ICML has been really fun, but people keep asking me why I'm working on 3D generation. We've seen some amazing progress in video generative models. And as we scaled up the data and the compute, we often see that the quality improves. And then if you also look at some of the 3D consistency within these video models, they've also improved as we've scaled things up.
Starting point is 03:07:02 But the way that we consume content isn't always just staring at a flat screen. We have amazing new AR and VR mixed reality headsets. And the type of content that we want to consume is often interactive. You see some really creative, interesting scene, and you'd love to move around in it and see it from other angles. And it's not just moving around in it in VR headsets. Oftentimes the most fun way of exploring worlds is interacting with them, be it in video games or exploring on a mobile device.
Starting point is 03:07:28 And it's unfortunately really challenging to create this kind of 3D content. 3D modeling is really hard. I remember working on some of this in middle school and being immensely frustrated at the inability to create the seemingly simplest objects. And even once you have these 3D models, how do you interact with them and add them to worlds, light them, rig them?
Starting point is 03:07:45 It's all an extremely challenging and time-consuming problem. And it's not just about creating things. I think something that I find really frustrating is we've seen the amazing power of AI in a number of different domains, but I'm seeing all of you sitting in front of me. And as a human, I feel like I have this really innate sense of the 3D structure around me where objects are.
Starting point is 03:08:01 I know my water bottle is here and I can grab it. But it's really challenging for AI systems to have this kind of spatial intelligence. So if we can make more progress in building 3D priors and understanding the 3D world, I think it could really influence the direction that things are going in robotics as well. Luckily, we've seen amazing progress as well in 3D reconstruction. So here's an example from Zip-NeRF, which is a powerful method based off of NeRF, and you can capture an entire house and turn it into a 3D model that you can move around in and interact with.
Starting point is 03:08:29 And the quality of this and the photorealism of this often exceeds even our best video models today. And how do these methods work? The idea is that we have the space in front of us and we can parameterize it as a 3D volume. And at every point in this XYZ space, we can use a neural network that maps from a point in space to a density and a color. And there's all sorts of different 3D representations
Starting point is 03:08:49 that people are exploring these days, but the key idea is that you have a differentiable mapping from somewhere in space to a color, or the ability to query different points along a ray. And the way that we train the parameters of these neural networks that are representing the 3D world is that we can cast a ray into the scene from a known camera, and we can evaluate
Starting point is 03:09:07 a bunch of points along that ray using our neural network. This gives us a color and density along the ray that we can accumulate to get an RGB color. And the way that we train these neural networks for 3D modeling is that we have collected a bunch of images and we can see how well the image matches the prediction of this neural network. And I think what people don't realize about NeRFs is how data-hungry they are. So if I want to capture a Lego bulldozer on a table, I can't just take one picture. I have to go out and collect a huge handful of pictures that surround the object and view it from almost all viewpoints.
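The "accumulate color and density along the ray" step described above can be sketched with the usual volume-rendering quadrature. This is a minimal illustration, not the speaker's code: the function name `render_ray` is hypothetical, and the densities and colors are hard-coded here, whereas in a NeRF they would come from querying the neural network at each sample point.

```python
import numpy as np

def render_ray(densities, colors, deltas):
    """Accumulate per-sample density and color along one ray into an RGB value.

    densities: (N,) non-negative density at each sample point
    colors:    (N, 3) RGB color at each sample point
    deltas:    (N,) distance covered by each sample along the ray
    """
    alphas = 1.0 - np.exp(-densities * deltas)  # opacity of each segment
    # Transmittance: fraction of light that reaches sample i without being absorbed earlier.
    transmittance = np.concatenate([[1.0], np.cumprod(1.0 - alphas)[:-1]])
    weights = transmittance * alphas
    rgb = (weights[:, None] * colors).sum(axis=0)
    return rgb, weights

# Toy ray: empty space for two samples, then a dense red surface.
densities = np.array([0.0, 0.0, 50.0, 50.0])
colors = np.array([[0.0, 0.0, 1.0],
                   [0.0, 0.0, 1.0],
                   [1.0, 0.0, 0.0],
                   [1.0, 0.0, 0.0]])
deltas = np.full(4, 0.25)

rgb, weights = render_ray(densities, colors, deltas)
print(rgb)  # dominated by red: the empty samples contribute nothing
```

Training then compares `rgb` against the observed pixel for that ray (for example with a squared error) and backpropagates through this accumulation into the network, which is why the mapping from position to density and color needs to be differentiable.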
Starting point is 03:09:38 Their ability to generalize to unseen regions is basically nothing. It's really interpolating between known viewpoints. And once you do this, you can get high-quality 3D reconstructions that represent the color and also learn to some extent the 3D geometry depicted by the depth here on the right. So what happens if I haven't had my coffee this morning and I wake up a little bit early and I only take three photos, but I'm really curious to see what the scene might look like in 3D? Well, here's an example of the state-of-the-art 3D reconstruction methods on a three-view reconstruction. And what you can see is it matches really well at the observed images, but as I deviate away
Starting point is 03:10:13 from these images, we get really inaccurate predictions of what the world might look like. And if you think about building a robot that's going to go and grab that Lego bulldozer, that depth map and the 3D geometry looks hugely inaccurate. It's not going to be useful for any of these tasks. And in general, I think we're at this structured probabilistic modeling workshop. What's the problem that we're trying to solve? Well, we don't have access to the 3D world or even a lot of ground truth data in the 3D world. We just have the shadows of the 3D world.
Starting point is 03:10:39 We have the projections in our eyes, or we take out our camera and just see a two-dimensional image. But we would like to understand what that 3D world is so that we can reason over it. So we're really trying to solve this inference problem of, okay, what's the distribution over what could be in the 3D world given some set of observations? And there's often this kind of spectrum that goes from reconstruction, where you've collected a lot of data, you know exactly what should be there and you want to recreate it in a digital world, to maybe something that's a little looser.
Starting point is 03:11:06 Maybe you have a picture and I just want to hallucinate plausible 3D content. for what is a 3D scene that could be consistent just with that image? Or maybe I don't have a bulldozer in front of me, but I want to create it for my game or visualize it. Maybe I just want to describe it with text. So we have all these different ways of thinking about observations that we want to condition on, but there's a shared common goal. How do we create this 3D structure given these partial observations? And we've done some work across the spectrum.
Starting point is 03:11:31 So we started off working on text-to-3D with DreamFusion and then worked on few-view reconstruction with ReconFusion. And then more recently, we have some work, CAT3D, that enables us to do everything from text to single image to few-view reconstruction for 3D creation. I'll talk a little bit about each of these projects today. Okay, so why is 3D hard? I think I got into 3D mostly not necessarily because I cared about 3D and understanding the 3D world, but more because this was a problem that didn't feel like data could just solve.
Starting point is 03:11:58 Across the board on, you know, language generation, text generation, and image generation, we've seen that there's an incredible amount of progress just by collecting big data sets. But as we saw before, it's really hard to acquire ground-truth 3D models of the world. It's really expensive. It involves a lot of human effort. But let's say we do this and we collect a big data set. Now what? How do we represent it? We have all these different 3D representations.
Starting point is 03:12:18 We have splats, voxel grids, NeRFs. You have to pick one of those. And then once you pick one of those, you have to design an architecture that can scale up as you increase those datasets. But let's say you do this. Here's an example, a bit of an old example these days. Can people hear me okay?
Starting point is 03:12:32 I'm realizing them. Okay, great. So if you have a decent-size 3D dataset and you train a model on that dataset, you can get OK 3D models, but most of our 3D models are just of isolated objects, and it's hard to get the realism to be as high as, for example, the image samples that we get out of state-of-the-art text-to-image models. And I think the real problem here is that there's this huge gap. I've been presenting this for a while, and I think it is still very true.
Starting point is 03:12:59 There's a huge gap between the 3D data that we have access to and the visual world. And I think a lot of that is driven by the fact that everyone here has mobile phones in their pockets with cameras. But not all those cameras have depth sensors. And even if they have depth sensors, when you take a photo, you don't often take a video of the object that encircles it and captures all the different ways that you can imagine viewing it. And so the bet that we made was that, okay, maybe we could find ways, instead of building explicit priors in the 3D space, could we build priors in 2D? And if we have these priors in 2D, now we need to solve a more complicated problem because we can't just do inference over the 3D space. We don't have a prior there.
Starting point is 03:13:32 We need to be creative about ways of using these two-dimensional priors for 3D generation. And the general inductive bias, or the way that we're going to hack 2D priors into 3D, is that we're going to say, well, what is a good 3D model of the world? As a human, I often don't have the ability of knowing that the 3D world around me is really accurate and precise, but I can view it from different angles. And so the idea is that we're going to take this 3D model that we're trying to learn or do inference over, and we're going to render it from a bunch of novel viewpoints. And what does it mean for that 3D model to be a good 3D model?
Starting point is 03:14:05 Well, it just has to look good. And how do we measure how well it looks good? Well, we're going to look at the renderings and we're going to use a 2D prior to score this amount of goodness. And so here we have like a bear playing a guitar. So you might imagine, okay, if maybe from one view it looks good, that might be insufficient for the 3D model. But if every way that I look at that 3D model looks good, then maybe it's a good 3D model of a bear. This opens up a number of questions and problems and research directions for you to solve
Starting point is 03:14:31 what 2D prior, conditioned on what information, how do we actually measure goodness? I think there's been a lot of tremendous work in probabilistic modeling on, you know, what does it mean for an image to look good? And I think we still don't have a really great sense of what that metric is or how to optimize it across all different kinds of probabilistic models. Another big problem is which views. Some objects I can put a camera over here, but then depending on where things are in the scene,
Starting point is 03:14:53 it might be really challenging to think about where do I want to evaluate how good this model is. I don't want to put the camera inside of the object, for example. And then also which 3D representation? Nowadays we have a plethora of choices. You know, you could use splats, you could use NeRFs, you can use all these different things. And the 3D representation you use might change depending on the setting that you care about. So who here doesn't know about diffusion models?
Starting point is 03:15:19 Oh, wow, that's great. So the general gist of a diffusion model is that it's a way of modeling high-dimensional continuous distributions, and we pair a simple destructive process where we take that data and we add more and more noise to it. And eventually we've degraded all the structure that's present in the initial image in this case. And then what we're learning to do is how do we reverse this process
Starting point is 03:15:39 and slowly introduce more structure back into the data. And diffusion models are great if what you care about is sampling. So you've trained on this big dataset, for example, of 2D images, and you want to sample 2D images. But in 3D, we don't actually care about sampling 2D images. What we really want to do is back out and infer some kind of 3D structure. And one approach for this is you can think about, well, we're building parameterized images. There's some parameters of the NeRF or a generative model, and we can use those to create an image, and then we'd like to evaluate how good is that image.
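As a minimal sketch of the destructive (forward) process described above, the snippet below noises a stand-in image at a few noise levels. It assumes a DDPM-style variance-preserving parameterization, where `alpha_bar` close to 1 means almost clean and close to 0 means almost pure noise; the talk does not specify a schedule, and the function name `forward_noise` is hypothetical.

```python
import numpy as np

def forward_noise(x0, alpha_bar, rng):
    """Draw x_t ~ N(sqrt(alpha_bar) * x0, (1 - alpha_bar) * I).

    Assumption: variance-preserving (DDPM-style) noising; the talk does not
    commit to a particular schedule.
    """
    eps = rng.standard_normal(x0.shape)
    x_t = np.sqrt(alpha_bar) * x0 + np.sqrt(1.0 - alpha_bar) * eps
    return x_t, eps

rng = np.random.default_rng(0)
x0 = np.ones((64, 64))  # stand-in "image" with obvious structure (all ones)

# As alpha_bar shrinks, the mean drifts toward 0 and the spread toward 1,
# i.e. the structure in x0 is progressively destroyed.
for alpha_bar in (0.99, 0.5, 0.01):
    x_t, _ = forward_noise(x0, alpha_bar, rng)
    print(alpha_bar, float(x_t.mean()), float(x_t.std()))
```

A denoiser trained to predict `eps` from `x_t` is what the reverse process, and the score used later in the talk, are built on.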
Starting point is 03:16:07 And so what we're missing here is a loss function that we can use to score these generations or renderings. And if we do have that loss function and it's differentiable, then we can backpropagate into the image and then back from the image to the parameters of that generative model. And the idea that we proposed in DreamFusion was built around this idea of probability density distillation. And so we called it score distillation sampling. And I guess another way of thinking about diffusion models is that they learn a sequence of marginal distributions that start from a clean data point and map to noisier and noisier data distributions. And these noisier distributions are often simpler. They're smoother than the initial data density. And what we want to do is maybe pick out a single mode of this complicated data distribution.
Starting point is 03:16:49 So here you can see P of X is the complex data density defined by the diffusion model. And we just want to infer one mode of that distribution. And the hope is maybe that the mode might be a good looking sample. And we do this not just at one noise level in the diffusion model. We can average it across all these different modes. And this allows us to learn a loss function that is applicable to any differentiable image representation. And here what's nice, while we don't have explicit access
Starting point is 03:17:16 to the marginal distribution in diffusion models, we do have access to the gradient of its log density, which is all that we need to evaluate this loss function. So in DreamFusion, we combine the score distillation loss with a 3D differentiable representation from NeRF. So if you want a peacock on a surfboard, you start with a randomly initialized NeRF, and you can iteratively optimize with the score distillation loss.
Starting point is 03:17:39 And over time, that builds up a 3D model that looks good from all these novel viewpoints. And at the end of the day, after optimizing this 3D model, you get out hopefully a high-quality 3D asset that you can use in different ways. And what's cool is we didn't have to use any 3D data to create these text-to-3D generations. And on top of that, we maybe don't have any 3D at all
Starting point is 03:18:00 for a lot of these categories. So if you collected a 3D dataset, almost all of these kinds of text-to-3D generations might be out of distribution. But the more I played with these text-to-3D systems, the more it's like gambling, where you come up with a text prompt, you hit go, you wait a while, and the results stink, and then you do it again and again and again.
Starting point is 03:18:15 And it's not a very fun form of control, and it doesn't allow you to ground these 3D generations in the world, especially if I take a photo, I don't want to take that photo, describe it with text, and feed it to a text-to-image model. I would like a better way of grounding it in real scene content. So in some follow-up work, ReconFusion, we tried to generalize this method
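For reference, the score distillation gradient can be written out as below. This follows the form given in the DreamFusion paper rather than anything shown in the talk, so the notation is an assumption: x = g(θ) is the rendered image, y the text prompt, ε̂_φ the pretrained denoiser, w(t) a noise-level weighting, and α_t, σ_t the noise schedule.

```latex
% Noised rendering at level t
x_t = \alpha_t\, x + \sigma_t\, \epsilon, \qquad x = g(\theta), \quad \epsilon \sim \mathcal{N}(0, I)

% Score distillation gradient (averaged over noise levels t and noise \epsilon)
\nabla_\theta \mathcal{L}_{\mathrm{SDS}}
  = \mathbb{E}_{t,\epsilon}\!\left[\, w(t)\,\bigl(\hat{\epsilon}_\phi(x_t;\, y,\, t) - \epsilon\bigr)\,
    \frac{\partial x}{\partial \theta} \right]

% Link to the log-density gradient mentioned in the talk
\hat{\epsilon}_\phi(x_t;\, y,\, t) \approx -\,\sigma_t\, \nabla_{x_t} \log p_t(x_t \mid y)
```

The last line is why only the gradient of the log density is needed: the denoiser output stands in for the score, and the renderer Jacobian carries it back to the 3D parameters.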
Starting point is 03:18:33 from conditioning on text to conditioning on images. And if we go back to our bulldozer example, here's the original 3D reconstruction of this bulldozer model from three images. And if we apply a method that uses a generative prior at these novel views, you can see that we can accurately recover novel views and decent geometry from just three input images. So how does this work? It's very similar to the earlier work,
Starting point is 03:18:59 but we're going to augment a 3D reconstruction pipeline with a model that's not conditioned on text to describe what this novel view should look like, but it's conditioned on images. And so what should a novel view of a scene look like? Well, we often have one or a few images of what that scene is. And so what that novel view looks like should be very informed by what the existing content is in the other 2D pictures that I captured.
Starting point is 03:19:22 So what might this image look like? And the idea was to train a new kind of diffusion model that was conditioned on the set of input images and their camera poses. And then given some novel target pose, we want to predict this novel view. So it's still an image diffusion model. It's only producing one novel view. What should it look like from over here? But now you condition on one or many different inputs that you have of the scene.
Starting point is 03:19:45 So I can take our lazy three captures of some image and then turn it into a 3D model. And the architecture that we use to condition on the set of input views was PixelNeRF, which is an image-based rendering method. And this was inspired by earlier work like NeRFDiff and GeNVS. As input, you have a set of input images and their camera poses. You pass it through the PixelNeRF to get some rendered features at the target camera pose. And then you combine this as input into a typical text-to-image latent diffusion model, where we replace the text features with now CLIP embeddings of these different image inputs. And now, unfortunately, unlike the existing work that we did before,
Starting point is 03:20:23 DreamFusion on text-to-image, here we need data that doesn't just have text and image annotations. We need sets of pictures and their camera poses. So we're way more restricted in terms of the kinds of datasets that we can pull in this novel view synthesis setting. Here we trained on a combination of RealEstate10K to get some real-world scenes, CO3D and MVImgNet, which are often orbits around objects but in context, and then a bunch of synthetic
Starting point is 03:20:45 renders from 3D models from Objaverse. And if you apply these methods to real-world scenes, you can see that you can get out decent novel view synthesis predictions. But one issue is that these images are predicted independently. We aren't modeling the correlations between views that happen when you have one 3D model. And so we had to design a procedure that could take these inconsistent 3D predictions or inconsistent 2D predictions and turn them into one consistent 3D model.
Starting point is 03:21:14 So here on the top you can see the results of the 3D reconstruction and on the bottom the samples. And so this is similar to DreamFusion, where we don't know exactly what the novel view should look like. So we have to generate a bunch of samples or use the optimization procedure to resolve these difficulties. I think the big problem with all these iterative optimization-based methods for 3D generation is that they're really slow. DreamFusion takes around half an hour to create a 3D asset; for ReconFusion it was around an hour. And what could you do in that hour? You probably could have just gone out and taken more pictures of the thing that you were trying to capture. So it doesn't seem like a great practical solution for improving the efficiency and our ability to capture the 3D world. And if you're a robot, you don't want to wait an hour before you move your hand to reconstruct the 3D system.
Starting point is 03:21:56 Another thing that we didn't actually show in ReconFusion was what happens if I put a single image into the system. And one of the issues with the ReconFusion work was in areas of uncertainty where you didn't know what should be in the scene. You often end up with blurring. And this is because those independent image observations would often conflict. And while we use these optimization procedures to resolve them, you're kind of fighting against this aspect of averaging out all these different ideas for what it might look like from this novel view. So here are like some single image results for a hydrant event. So in our next work, CAT3D, for Create Anything in 3D, the hope was that we could address these problems of hallucinating novel content effectively.
Starting point is 03:22:38 And the main idea behind this method was to address this problem of independence. We know that if I have a 3D model or if we have a video of something, the frames are correlated. And so we would like to model these correlations and not just resolve them post hoc in our 3D extraction procedure. So here are some example samples from Reconfusion, where we have three input images and then we have these independent output images. And we can resolve them, but it's a very slow process. And the main idea of this work was building on the amazing success of video diffusion models
Starting point is 03:23:08 for jointly modeling the correlation between multiple images. The model that we trained took a set of observed views as inputs. You could have a single image or a set of images. You also have to have their camera poses. And we encode both the image into a latent space and the cameras using a ray representation, which is kind of representing which kind of and the corners of the image that you're generating. And then we also have a set of targets.
Starting point is 03:23:32 We have where we would like to create outputs. And it's not just one place. We want to create a whole set of image outputs, and we want those outputs to be correlated, such that they could be realized from a single 3D model. So we have the set of observed and unobserved views. We also add a mask to indicate to the video model which of these are observed and which of these are unobserved.
Starting point is 03:23:51 And then we get out not just one view, but a whole set of views that we can decode back to an image. And if we train this model on the exact same dataset that we trained ReconFusion on, we can see that this model is successful at learning correlations across images, and the resulting samples that we get out are already pretty consistent. But they're not perfectly consistent, and they don't allow that kind of interactivity that we might want from a real 3D model. So what did we do? All we did is we took a single input image or set of input images.
Starting point is 03:24:17 We generate samples using this multi-view latent diffusion model that gives us a generated set of views, and then we just feed that to a 3D reconstruction pipeline. And there's some additional tricks that you need, like using a robust loss, which allows for some reconciling of these different details across different views. But this whole procedure now just takes a minute instead of an hour. And here's some examples comparing the ReconFusion results to the CAT3D results. All these are conditioned on three images, and you can see not only is it faster, but also if you look especially at the backgrounds, you get way higher quality hallucinations in regions where you actually had uncertainty.
Starting point is 03:24:52 What's cool is this works on images and single images, unlike the ReconFusion work. So here's a picture of a very cute golden retriever puppy, and we can take the single picture, and then we can render it and create a 3D model that works from novel views. And here, if you just, for example, had an RGB and a depth map and tried to warp, you wouldn't be able to have the same degree of freedom for moving around and visualizing the scene. This is my grandma's dog, Woola. And it doesn't just work on real-world images. You can also use text-to-image models to first cascade a text-to-image generation with an image-to-3D generation. So here's a factory robot assembling intricate electronic components with precision. Here's a goblin of some kind, some other creatures, and it even works on some small-scale
Starting point is 03:25:35 scenes. I think what's really fun about this is I've been really sick of just staring at like 360 spins of objects for the past two years, but now we can turn these into real interactive 3D models. And I'd encourage everyone to check out the website and play around with this. It feels fundamentally different when you get to interact with something than when you just have a video that's playing right in front of you. There are several important bits to get this working. I mentioned the robust loss. I think a huge open question is how do you decide where to put the cameras? It should really depend on what content is in the scene. And right now, we have kind of a few discrete sets of camera trajectories that we chose for different scenes,
Starting point is 03:26:10 but it would be great to find ways of learning where to place the camera as well. The way that you do the camera conditioning can impact the quality of the results in this multi-view latent diffusion model. And what's nice is, because we kind of have the set-based representation versus the ordered representation in video, we can come up with different and more efficient sampling strategies to create a number of frames in parallel. So, like, what's left? I think this is maybe, you know, an interesting toy, but not useful yet. I think one of the biggest issues is we moved away from just using large-scale text
Starting point is 03:26:38 and image data to requiring posed multi-view data. And if you are aware of, like, using some of these state-of-the-art systems for posing, they don't often work when you have a lot of dynamics in the scene. So I think it's still an unsolved problem of how do we actually scale up these methods and get accurate camera poses if we want to train camera conditioned latent video diffusion models. The recovered geometry is often inaccurate, even though the novel views look good. And as I said, the camera trajectories don't consider the image content. I think one of the biggest issues is that the scene and input are often assumed to be static.
Starting point is 03:27:08 And there really is no such thing as like static 3D videos. If you look at a lot of the datasets, as I move through the scene, I cast a shadow into the scene that changes as I move around it. And that's present in these datasets as well. So we really need to find models that can work with dynamic scenes as well as static scenes. So what were some takeaways from this work? I think that in the CAT3D work, we found that separating 2D priors from this 3D inference process by first sampling and then reconstructing was a really flexible and efficient framework. Unfortunately, it does require more expensive multi-view or video models to generate those correlated samples.
Starting point is 03:27:40 And something that I've been frustrated by is these optimization-based inference methods like score distillation and variational score distillation. They can handle uncertainty. They allow you to express an uncertain prior from what these novel views should look like. but they're way slower, they're lower quality, and it's more complicated. And I still think there's a big gap between the sample quality you get out when you naively sample from these models
Starting point is 03:28:00 and when you use the optimization-based approach for sampling. So I think there's still room for a lot of innovation in inference methods. And I think the other thing that people don't talk about in the 3D space is, these 3D models are useless. They often don't have good enough geometry. They don't estimate the material properties. If you turn them into meshes,
Starting point is 03:28:18 the topologies are not actually useful. They have baked in lighting. So if we actually want these to be useful, as for example assets in a game, there's still so much more work to be done. So with that, thank you very much for your time and happy to take any questions if we have time remaining. I have a question. There is a recent work from ICML last year, multi-diffusion. They generate panoramas. Would it be possible to combine something like this, some of this approach for like 3D scene?
Starting point is 03:28:50 Because it's also about consistency, generating different scenes, and so on. Yeah, I think the multi-diffusion work is very cool. It allows you to take a lower dimensional model or a model on a small number of pixels and extend them. And there you can think about the way that they resolve differences between, for example, different frames as averaging within the diffusion process.
Starting point is 03:29:10 There are some people who have tried this for 3D as well, where you can think about resolving this inconsistency in 3D over the course of the diffusion process. But it's often a little bit more finicky, because to update that 3D representation might require multiple steps of optimization, where you can't analytically solve for this update and average them together.
Starting point is 03:29:27 But yeah, I think it's really cool to think about how to combine different ways of doing conditioning and guidance and sampling and diffusion processes to do some of the kind of enforce more of the consistency at sampling time as opposed to this just let the model do whatever it wants sampling wise and then do something on top of the samples. Thank you so much.
Starting point is 03:29:47 Does anyone have any questions? Hi. I think from all these three works, it seems that from DreamFusion to ReconFusion to CAT3D, the more multi-view you are modeling in 2D, the better 3D you get. Is the conclusion that maybe you just do everything in multi-view, like a trigger. If you have a model that can generate 200 images at once, you don't need to have any optimization, right?
Starting point is 03:30:19 Yeah. So would that be the future? Is that extreme or maybe something in the middle? Yeah, I think what's very frustrating and I think very broken about these models is there's this back and forth. So, you know, how much of the structure do you put into the multi-view prior or the video prior or 2D prior? And how much of the structure do you kind of extract afterwards? And it seems very bizarre to me that we put all this effort into training these 2D priors. We train them until they're 3D consistent. And then only afterwards do we touch anything in 3D.
Starting point is 03:30:47 And going from like ReconFusion to CAT3D, we removed 3D structure in the diffusion model and it got better. And so I think these existing methods don't really support the kind of like real-time interactive generation. You know, maybe I want to start capturing the scene and have it fill in the details and iteratively update some 3D structure. And we don't really have methods that do that right now. It feels a little bit broken. Ideally, you could build one system that gave you the 3D outputs and could learn from solely image-based data. And there's some cool work like Viewset Diffusion, RenderDiffusion, that tries to build diffusion models that have that 3D structure inside of it. But so far, the performance of these methods hasn't been as good because we don't really know how to scale them and train them on large, you know, bigger datasets like we do for more of these pixel-based models.
Starting point is 03:31:29 So I'm not sure which way it goes, but I hope we have more hybrids and find ways of incorporating that 3D structure into the 2D models as well. Hello, thank you for your talk. I think we always see in the literature this effect, particularly in text-to-3D generative models, where you have these oversaturated colors when you're generating assets. I was wondering if you have any more intuition about what might be causing this. Yeah, it's a great question. I think my initial intuition was that this just comes down to two things. One is that we have all these tricks for diffusion sampling and also for these distillation approaches that are built around guidance.
Starting point is 03:32:07 And so you're going in a direction that matches the text prompt better and moves away from the unconditional prior. And if I have a frog, frogs are often green. And so in the data set that we have, they might be biased to green things. So putting a green background there might often lead to a higher density mode, but that might not be a good sample. So I think a lot of the issues with this kind of oversaturation and contrast come from how broken and bad the loss functions are, combined with how hacky it is to use classifier guidance for solving these problems. And that's kind of why we've moved to more of these sampling-based approaches, is that they just work better.
Starting point is 03:32:40 You don't have to worry about the artifacts that you get out of optimization. But I think this is like a gap. I wish that we had a better explanation for why these artifacts appeared, because you do see them with classifier-free guidance as you crank it up, that you get oversaturation and over-contrast, but not nearly to the degree that we get in these optimization-based methods like score distillation. It's very frustrating. Yes, thank you.
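The classifier-free guidance mentioned in this answer is commonly written as an extrapolation from the unconditional noise prediction toward the text-conditioned one. The sketch below uses that common convention, in which a scale of 1 recovers the conditional prediction; the function name and the toy vectors are hypothetical, and real systems apply this to denoiser outputs at every sampling step.

```python
import numpy as np

def guided_noise(eps_uncond, eps_cond, scale):
    """Classifier-free guidance: move away from the unconditional prediction,
    in the direction of the conditional one, by a factor `scale`."""
    return eps_uncond + scale * (eps_cond - eps_uncond)

# Toy predictions that differ only in the first coordinate.
eps_uncond = np.array([0.0, 1.0])
eps_cond = np.array([1.0, 1.0])

# Larger scales push further along the (cond - uncond) direction, which is
# the "cranking it up" that the answer associates with oversaturation.
for scale in (0.0, 1.0, 7.5):
    print(scale, guided_noise(eps_uncond, eps_cond, scale))
```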
Starting point is 03:33:03 Believe it or not, there were other research labs than DeepMind represented at ICML. We stay in the generative modeling workshop, but transition to Ricky T. Q. Chen of Meta AI, a.k.a. FAIR, who presented the most approachable explanation of the flow matching technique in generative modeling that we have yet heard. So I wanted to give a brief talk on flow matching, or this sort of idea of flow matching, and applying it to various different kinds of domains, from Euclidean to Riemannian to discrete domains. We recently had a paper that we put out called discrete flow matching, which basically uses the sort of flow matching recipe, I'll call it,
Starting point is 03:33:41 as a way to motivate a way to construct generative models over discrete domains. But really, it seems like you can use this recipe, this sort of very abstract notion of a way to build a generative model, and apply it to any sort of domain. So let's get started. The goal of this talk is to discuss a few different application domains, but also I want to say that in all of these, there is one very simple process to build a generative model over these domains,
Starting point is 03:34:14 and they share the same sort of underlying principles. And the idea is, I think more people are familiar with the Euclidean space. So here on the top left, we start with parameterizing some velocity. And with that, if we transport the particles according to that velocity, we also change the distribution of those particles according to some law. And here we can, yeah, we applied this to material generation. We also applied this in the discrete domain to code generation and text generation just to scale it up and see what happens.
Starting point is 03:34:46 So I'll put the end, the last slide here at the very beginning, which is maybe a quick preview of the flow matching recipe. I'll call it, at least for this talk. I'm not sure if I'll call this later going forward. But here we just want to define conditional velocities, very simple velocities, conditioned on x_1, that generate x_1. So with these u_t's and this transport formula on the left, if I start from x_t and advance to x_{t+h}, then if I follow this velocity, then I
Starting point is 03:35:16 basically transform our particles according to this p_t given x_1. And in particular, p_t at time equal to 1 is going to become a Dirac distribution centered at x_1. So if this is the case, that is, I can create these conditional velocities that basically just generate a single data sample, then basically the learning problem becomes just, I just want to learn the expected velocity, and that's it. So given an x_t, this expectation is over x_1, which is sampled from some data distribution. Turns out if you learn this expected velocity and follow that with the same transport rule that you have on the top left,
Starting point is 03:35:51 that allows us to generate from the data distribution that we trained this expected velocity from. And this relationship is because of something called the continuity equation, and it's actually because of this linearity of the divergence operator as well. And I'll get to these, you know, I'll unravel this explanation from the very basics. But just to sort of set the scene, let's start with the Euclidean setting again, which is what most people are familiar with. So assume we have some samples from a data distribution, q of x_1, and we're going to construct these conditional probability paths, p_t of x given x_1, such that they basically converge to a delta at x_1, right? And if you think about all these probability paths,
Starting point is 03:36:30 and we're just going to marginalize that over the q data distribution on the bottom, right, this marginal probability path is going to generate the data distribution at time one, starting from whatever noise distribution that we've set. In particular, time one is the data distribution. Now it turns out that if you just look at the velocities that generate these conditional probability paths, that is, if I were to follow these velocities,
Starting point is 03:36:55 then I create samples that are marginally from this p_t given x_1. Now, I want to also marginalize the velocity, in the sense that I take the conditional velocity and I take an expectation over p_{1|t}. So p_{1|t} is the conditional distribution, right, of x_1 given x at this current time point.
Starting point is 03:37:15 You can think of this as like a responsibility, if you're familiar with Gaussian mixture models. Basically, we weight the conditional velocities by this weighting p_{1|t}, and that's how we define the marginal velocity. This is the thing that we're going to fit to, right, in that expectation. And it turns out there's a really simple
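A toy 1D sketch of this marginalization: the marginal velocity is the average of the conditional velocities, weighted by the posterior over x_1 given the current x. The setup is an assumption for illustration and is not taken from the talk's slides: the data distribution is uniform over a few points, the conditional path is Gaussian with mean t·x_1 and standard deviation 1 − t, and the conditional velocity is (x_1 − x)/(1 − t). All function names are hypothetical.

```python
import numpy as np

def conditional_velocity(x, x1, t):
    # Velocity that carries a particle toward the single data point x1,
    # assuming the path x_t = t * x1 + (1 - t) * noise.
    return (x1 - x) / (1.0 - t)

def marginal_velocity(x, t, data):
    # Posterior weights p(x1 | x_t = x) under the assumed path
    # p_t(x | x1) = N(t * x1, (1 - t)^2) with uniform weights on `data`.
    log_w = -0.5 * ((x[:, None] - t * data[None, :]) / (1.0 - t)) ** 2
    log_w -= log_w.max(axis=1, keepdims=True)  # for numerical stability
    w = np.exp(log_w)
    w /= w.sum(axis=1, keepdims=True)
    u = conditional_velocity(x[:, None], data[None, :], t)
    return (w * u).sum(axis=1)  # expectation of the conditional velocities

def sample(x0, data, n_steps=200):
    # Euler transport x_{t+h} = x_t + h * u_t(x_t); stop one step short of
    # t = 1 to avoid dividing by zero in the conditional velocity.
    x = np.array(x0, dtype=float)
    h = 1.0 / n_steps
    for k in range(n_steps - 1):
        x = x + h * marginal_velocity(x, k * h, data)
    return x

rng = np.random.default_rng(0)
data = np.array([-1.0, 1.0])  # toy "dataset" of two points
samples = sample(rng.standard_normal(2000), data)
print(float(np.mean(np.abs(np.abs(samples) - 1.0))))  # mean distance to the nearest data point
```

In this toy case the posterior weights can be computed in closed form; in practice they cannot, which is why a network is regressed onto the conditional velocities instead.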
Starting point is 03:37:32 sort of explanation for linking this behavior between the marginal probability and the marginal velocity. And to get there, we need to start thinking about, okay, how do we connect the velocity to the probability of the distribution along which we're transporting these samples? And that relationship is from this continuity equation, right? So in particular, this continuity equation says,
Starting point is 03:37:54 at a certain point x, the change in probability at that point is related to the negative divergence of this velocity field times that probability. So what is the divergence? The divergence is, at that certain point, let's take a small, people usually take a ball or you can take a hypercube around that point, and then we look at all the outflow from that area, subtract the inflow, right? So how much mass am I losing from this point? Sorry, from this area. You take that area to be infinitesimal, and the divergence is basically the continuous approximation to how much mass am I losing from this current x. Right? So the change in
Starting point is 03:38:30 probability is going to just be the negative, right? So if I lose mass, the probability goes down. So that is the very basic relationship between velocity and probability. And if we assume the continuity equation and we can find a velocity that generates a conditional probability path, then, basically, this is the three-line proof for flow matching in Euclidean space, right? So the first line is just by definition: we're defining the marginal p_t as just a mixture of conditional p_t's,
Starting point is 03:39:04 and then we apply the continuity equation, assuming that we have these conditional u_t's in hand that can generate these conditional probability paths. The third line is just an interchange of the divergence operator, it's a linear operator, with the integral, right? It's because of this exchange that we can basically move the integral inside, and now we just have this definition
Starting point is 03:39:22 for u_t, which is the marginal velocity field. And we've said, okay, well, this is the form of the continuity equation, and so this marginal velocity field must be generating the marginal probability path. It's a really simple three-line proof. And we're going to be seeing the same proofs for other domains as well in this talk.
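For readers following along without the slides, the three-line argument described above can be written out roughly as follows. This is a reconstruction from the spoken description, not a copy of the slide; the notation (q for the data distribution, p_t(x | x_1) and u_t(x | x_1) for the conditional path and velocity) is assumed.

```latex
\begin{align*}
p_t(x) &= \int p_t(x \mid x_1)\, q(x_1)\, dx_1
  && \text{(definition of the marginal path)} \\
\partial_t p_t(x) &= -\int \nabla \cdot \big( u_t(x \mid x_1)\, p_t(x \mid x_1) \big)\, q(x_1)\, dx_1
  && \text{(continuity equation, per condition)} \\
&= -\nabla \cdot \Big( \underbrace{\int u_t(x \mid x_1)\, \frac{p_t(x \mid x_1)\, q(x_1)}{p_t(x)}\, dx_1}_{u_t(x)}\; p_t(x) \Big)
  && \text{(exchange divergence and integral)}
\end{align*}
```

The bracketed term is the conditional velocity averaged under the posterior p1 given t, so the last line is again a continuity equation, now for the marginal pair (p_t, u_t).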
Starting point is 03:39:42 But just to complete the picture, here's what we did for flow matching. We're going to just directly regress a v_t, which is a neural network parameterizing the velocity, onto the conditional u_t's. And this gives the optimal solution, which is the marginal velocity. The way that we proved it is we basically looked at the gradient,
Starting point is 03:39:59 and the gradient is the same in expectation as that of the intractable flow matching loss, which directly regresses onto this marginal u_t. Okay, so I'll just show some examples. This is, you know, flow matching applied to text-to-image generation. You have some text. You know, people like to look at these figures. So all right, it's starting to get a little bit more interesting now. Okay, so a lot of people here in the audience like structures, so we're not going to stay in Euclidean space for too long. There's a need to consider non-Euclidean structure
Starting point is 03:40:30 as the entire panel's discussion was about. So I won't really go into the motivations anymore. I don't think there's a need to do this. There's a lot of different domains that have imposed structure, and we really want to be modeling that type of structure, maybe explicitly in the generative model itself. In particular, if you think about Riemannian manifolds, that is, manifolds where locally,
Starting point is 03:40:53 we can basically have a kind of first-order approximation to the manifold. Locally, it's a Euclidean space, right? So people call this a tangent plane. At every location, there's a tangent plane. And because this tangent plane is just Euclidean, we can also just define vector fields on this Euclidean space, and everything from the Euclidean definitions for continuous flow matching, so continuous normalizing flows, kind of just naturally extends to a Riemannian manifold setting. And in particular, we can just replace the continuity equation's divergence
Starting point is 03:41:24 with the Riemannian divergence. I won't go into any of the details on this, but for the sake of just being complete, let's look at the continuity equation again. So assuming that we have a u_t that satisfies the continuity equation with that Riemannian divergence, I've just replaced that in the second line. So here the integral is over, it's not the same volume, right? There's a different volume element. And it's a different manifold. There could be boundary conditions.
Starting point is 03:41:54 I'm just sweeping all that under the rug in this notation. And the only thing that's important is now I can still exchange this divergence and integration, and basically there's still a u_t that is the expectation of the conditional u_t's. So there's a lot of, you know, complexity in here. It's not clear how to even find a conditional u_t. I won't really go into that in too much detail. So one thing we did at Meta was we basically applied this to material generation.
Starting point is 03:42:22 The idea is that you want to generate a crystal or a material, which you represent as an infinite set of atoms that's repeating in every single direction. So the way we represent it on a computer is we represent only a unit cell. And we just assume this unit cell just repeats in all directions. So basically, this unit cell has a periodic boundary condition.
Starting point is 03:42:43 That's the manifold we're working with. Now, a material is basically, we call it stable when it can actually be synthesized in the real world. It's the most basic, but the most important property, and it's not clear how to even check if a material is actually stable. People usually rely on a database. But we basically applied Riemannian flow matching to try to generate materials, given a set of stable materials,
Starting point is 03:43:03 and see if we could generate something that's novel and also stable at the same time. One thing I'll say about this is that it was actually surprisingly hard to combine manifolds with equivariance at the same time. So, for example, when people use, let's say, a point cloud, right? So you want to impose some sort of translation invariance. The way people do this is you kind of just remove the mean,
Starting point is 03:43:25 right? You take the mean of the point cloud, you just subtract it, and then when you define the flow or the diffusion process, you basically take a zero-mean noise variable, and so the path basically is always zero mean. There's no zero mean in this periodic boundary condition, because there's no origin, right? It's a periodic space. So we actually had to basically project the velocity field to make sure it doesn't actually move the mean.
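A minimal sketch of the idea just described, under stated assumptions: fractional coordinates live in [0, 1) and wrap around, and the "projection" is taken here to be subtracting the average velocity over atoms so that a step applies no net translation. The talk does not spell out the exact projection used, so `remove_mean_velocity` and `euler_step_torus` are hypothetical helpers for illustration only.

```python
import numpy as np

def wrap(frac_coords):
    """Map fractional coordinates back into the unit cell [0, 1)."""
    return np.mod(frac_coords, 1.0)

def remove_mean_velocity(velocity):
    """Subtract the per-axis average over atoms (assumed form of the projection)."""
    return velocity - velocity.mean(axis=0, keepdims=True)

def euler_step_torus(frac_coords, velocity, h):
    """One Euler step on the flat torus using the mean-free velocity."""
    return wrap(frac_coords + h * remove_mean_velocity(velocity))

# Two atoms in a 3D unit cell; values are arbitrary illustration data.
frac = np.array([[0.9, 0.1, 0.5],
                 [0.2, 0.95, 0.5]])
vel = np.array([[1.0, 0.0, 0.0],
                [0.0, 2.0, 0.0]])
projected = remove_mean_velocity(vel)
stepped = euler_step_torus(frac, vel, h=0.5)
```

On a periodic cell there is no canonical "mean position" to subtract, which is why the sketch constrains the velocity rather than the coordinates.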
Starting point is 03:43:50 There's a few tricks in here that were actually a bit surprising, but I think it's worth thinking about, like I said, especially if you're interested in structure, how you combine different types of manifolds with different types of equivariances. It's not actually like an orthogonal combination. But anyway, so yeah, we took a look at this, and a material is represented by three components: a unit cell, which is basically like a deformation of a 3D space, 3D coordinates. And then we have these fractional coordinates, which define where the particles are, where the atoms
Starting point is 03:44:21 are inside this unit cell. And then we have the atom types for each of these atoms. So Riemannian flow matching was great on the continuous variables, which are these unit cells and coordinates. It was basically, you know, we could say it's state of the art, or at least it improves a little bit
Starting point is 03:44:35 over diffusion baselines. So that's conditioned on the atom types; we just tried to find a stable conformation. But if we wanted to do de novo generation, that is, we want to generate a completely new material from scratch, including the atom types, it was okay, it was not great. I would say it was on par with some of our baseline LLM approaches,
Starting point is 03:44:55 but it wasn't a big change. So that was a little bit disappointing, I think. But yeah, so maybe I'll explain a little bit. So here, the atom type, we basically represented as a continuous embedding, so embedding into continuous space. And then we just did regular Euclidean flow matching to try to learn the atom type.
Starting point is 03:45:13 And then during sample generation, we would just take the nearest neighbor and say, okay, this is the atom type of my sample. But that didn't work very well. So we wanted to go a bit further and say, okay, well, can we just take what we learned from flow matching and apply it directly to discrete space, in the sense that we don't assume any sort of metric, we don't
Starting point is 03:45:31 assume any sort of continuous space, we just have a bunch of different possible values for my samples, and, you know, see what happens. So, yeah, the answer is yes, and our work is not the first, definitely not. There's a really interesting work from Campbell et al., also presented at ICML, Generative Flows on Discrete State-Spaces, which basically models a continuous-time Markov chain. And it's really nice. A lot of the things that I'm going to go over are sort of also in that paper, just maybe with a slightly different spin on it. So, okay, so the first thing is, what is the velocity?
Starting point is 03:46:13 So I said we're going to just take the expected velocity. But what even is a velocity for transporting a discrete sample x_t? So in the continuous case, we just added a little offset to the particle itself. In the discrete case, we're going to add a little offset to the probability. So we're going to start from a Dirac distribution centered at x_t, and then we're going to modify that distribution and then sample from it.
Starting point is 03:46:36 So as long as we, sorry, this is like a preview. I'll actually prove that this is correct. I'll justify this Euler sampling in a few slides, but this is a preview to try to understand what velocity is. So, oh sorry, and people also call this the rate matrix, for the continuous-time Markov chain fans. So here the updates are independent per dimension
Starting point is 03:46:55 or token, as we say. And as long as we can find a velocity that basically implies that this new variable, this x_{t+h}, is following the probability path, then this is our definition of a velocity. So again, in the continuous case, here are some visualizations. We basically model for each dimension, each coordinate, some change, and then we're going to just take all the changes at the exact same time, right? So we're just going to move the particle according to that vector field. In the discrete space, again, we have these axis-aligned moves. This is sort of visualizing the discrete space on a grid, but, you know, it's just a visualization, right?
Starting point is 03:47:31 We don't impose any sort of neighboring information. But basically, for each coordinate or each token, there is a set of possible values that we can move to, and this u_t is essentially a change in the probability mass from x_t to some other states. This will be a bit more clear. So if you stare at this equation for a bit longer, right, there are some additional constraints. In the continuous case, the velocity could just be anything. We could just always move, like, we're assuming Euclidean space, no boundaries, nothing. But in this setting, u_t needs to satisfy certain constraints. Because we already start with a PMF there. The delta at x_t is itself a PMF, right?
Starting point is 03:48:08 It's just 1 at x_t and 0 everywhere else. It's already 0 when the point is not x_t, and so the only way to make the right-hand side a valid PMF is to make sure u_t is non-negative when x_i is not equal to z_i. I'm talking about this constraint here. And the other constraint is we need to make sure the normalization constant stays the same, right? So here, it normalizes to one,
Starting point is 03:48:32 or sums to one, right? We want to make sure this is a valid PMF, and so we need to make sure that this u_t sums to zero, right? And so what that implies is basically, you know, if this point is x_t, it needs to be negative, and if it's something not x_t, it needs to be positive.
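The sampling rule and the two constraints just described can be sketched as follows. This is an illustrative NumPy version, not the speaker's implementation: `euler_step` is a hypothetical helper, and `u_t` is assumed to be given as an array with one row of rates per token.

```python
import numpy as np

def euler_step(x_t, u_t, h, rng):
    """Sample x_{t+h}: per token, draw from delta_{x_t} + h * u_t.

    x_t: (n,) integer tokens; u_t: (n, d) velocity (rates) for each token
    over its d possible values. Both constraints from the talk are checked.
    """
    x_t = np.asarray(x_t)
    n, d = u_t.shape
    other = np.ones((n, d), dtype=bool)
    other[np.arange(n), x_t] = False
    if (u_t[other] < 0).any():
        raise ValueError("velocity must be non-negative away from x_t")
    if not np.allclose(u_t.sum(axis=1), 0.0):
        raise ValueError("velocity must sum to zero over values")
    probs = np.zeros((n, d))
    probs[np.arange(n), x_t] = 1.0      # Dirac delta at the current token
    probs += h * u_t                    # small offset to the probability
    if (probs < 0).any():
        raise ValueError("step size h is too large for these rates")
    probs /= probs.sum(axis=1, keepdims=True)
    # Tokens are updated independently of each other.
    return np.array([rng.choice(d, p=probs[i]) for i in range(n)])

rng = np.random.default_rng(0)
x_t = np.zeros(10_000, dtype=int)                  # every token is in state 0
u_t = np.tile([-2.0, 1.0, 1.0], (10_000, 1))       # with h=0.1: move 20% of the mass
x_next = euler_step(x_t, u_t, h=0.1, rng=rng)
moved_fraction = float((x_next != x_t).mean())
```

With these made-up rates, each token stays put with probability 0.8 and moves with probability 0.2, so `moved_fraction` should land near 0.2.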
Starting point is 03:48:48 That's it. I'm just saying there's some additional constraints on this. And so velocity, basically, here is just modeling the transport of probability from one state to another. If you're at this current state, you have some positive mass, and I say, OK, well, I want to move 20% of the mass currently here to some other states,
Starting point is 03:49:07 then for each particle, I'm going to flip a coin: with 20% probability I'm going to move, with 80% probability I'll stay, right? Something like that. So OK, let's actually derive that. That's just a sort of preview for what a velocity would be that coincides with the continuous-time Markov chain equations. So let's start again with the continuity
Starting point is 03:49:25 equation, right? But here we're going to try to define a discrete divergence. So divergence is again outflow minus inflow: it's the amount of mass moving outside this domain, outside this node, let's say, at this point, minus the amount of mass that's moving in. So in the discrete case we can basically just assign values to edges on a graph, and each edge will just denote, you know, how much mass is moving from one node to another, right? When we compute the divergence at a certain point, we're just going to take all the outflow minus the inflow. So let's say this v is our flux or current.
Starting point is 03:50:04 This is the scalar function that is defined on the edges of the graph, right? And we want to sum over all the domain, right? Not axis-aligned: this z is just over the entire discrete space, where each coordinate has d possible values and there's n different discrete variables. Here's where we make an assumption. We're going to assume the single-token-change graph, really, basically that v is only defined, or it's only non-zero, when x and z differ by one token, by one coordinate, right? If they differ by more than one, it's just zero.
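To make the outflow-minus-inflow picture concrete, here is a small numerical sketch on a toy state space. The rates are random placeholders and all names are illustrative; the flux is taken to be probability times rate, and edges exist only between sequences that differ in exactly one token, as assumed in the talk.

```python
import itertools
import numpy as np

d, n = 3, 2                                  # d values per token, n tokens
states = list(itertools.product(range(d), repeat=n))
num_states = len(states)
rng = np.random.default_rng(0)

def hamming(a, b):
    """Number of positions at which two sequences differ."""
    return sum(ai != bi for ai, bi in zip(a, b))

# rate[i, j]: rate of moving from state i to state j (single-token changes only).
rate = np.zeros((num_states, num_states))
for i, x in enumerate(states):
    for j, z in enumerate(states):
        if hamming(x, z) == 1:
            rate[i, j] = rng.random()

p = rng.random(num_states)
p /= p.sum()                                 # an arbitrary probability mass function

flux = p[:, None] * rate                     # mass per unit time flowing i -> j
divergence = flux.sum(axis=1) - flux.sum(axis=0)   # outflow minus inflow per state
dp_dt = -divergence                          # discrete continuity equation

edges_per_state = (rate > 0).sum(axis=1)
```

Each state has only n * (d - 1) outgoing edges instead of d**n - 1, which is the point of not modeling a complete graph, and `dp_dt` sums to zero because mass that leaves one state arrives at another.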
Starting point is 03:50:39 We're explicitly making this assumption that the velocity, or this flux right now, can only modify one token at a time. I just think this is slightly different from previous treatments of this, where they define the continuity equation or the Kolmogorov equation in 1D, and then sort of justify why to do it in high dimensions. Here we're going slightly in reverse. We start with the continuity equation defined in high dimensions, and we're just going to make an explicit assumption
Starting point is 03:51:04 because we don't want to be modeling basically a complete graph over this really large space. All right, so we're going to assume this flux takes the same form as in the continuous setting. So v is going to be p times u, the probability times the velocity. So we're going to make that assumption here as well. And if we do some algebra, we arrive at this equation
Starting point is 03:51:23 for the divergence at a certain point X. So this is our definition for a velocity field. So there's two ways to think about this. The continuity equation is a way to, it's a way to build in the relationship between the velocity field and the probability. I'm right now thinking about it as a way to define the velocity. So given the probability, how do we define the velocity field?
Starting point is 03:51:49 So there's two things, right? So one thing is, if we have a u_t that satisfies this discrete continuity equation, then the parallel Euler sampling that I just described a few slides ago is now justified. In particular, if we take the first-order approximation of p_t, so p_t itself is going to be just the expectation of an indicator function, right? Here again, we're going to look at this as an expectation as well. So we're just sampling from p_t, and then this is the term inside.
Starting point is 03:52:19 If we do a little bit of algebra and move all of the terms that are little-o of h or higher outside, we arrive at this expression, which is the Euler sampling. So for each coordinate, we're going to just sample independently for that coordinate, and the rest is just little-o of h. So by assuming this continuity equation, we've justified the Euler sampling. And also, if we assume this, then we can also prove that the marginal velocity, this flow matching recipe, also holds. So everything is the same.
Starting point is 03:52:51 The first line is the definition of p_t, it's the mixture of the conditional p_t's, the discrete continuity equation shows up, and then we're going to just exchange what happens inside with this sum over x1, and the sum over x1 gives us the marginal velocity. So again, if we take conditional velocities that satisfy this, we define the marginal velocity, and then if we have access to the marginal velocity, we can transport using this Euler sampler. That's what we're saying. So how do we actually define,
Starting point is 03:53:24 that was a little bit abstract. So let's actually define these discrete paths and maybe look at a few special cases that are very strong in practice. So first off, we're going to just say, let's define the marginal p_t as a marginalization over a bunch of conditional p_t's, and then each conditional p_t is independent in each dimension.
Starting point is 03:53:46 So as a special case of our framework, we basically work with an arbitrary mixture of m different distributions, but let's consider two different distributions. So conditioned on x0 and x1, x_i only has two possible values. It's either going to be x0 or x1, and it's just a mixture between the two, with some probability kappa_t. There is another special case that works really well, which is x0 is going to just be the completely masked state. So given a sequence, I'm going to just start from the completely masked sequence, and then I slowly unmask each token until it reaches the x1 distribution.
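The two-point mixture path described above is easy to sketch: each token of x_t is independently taken from x1 with probability kappa_t and from x0 otherwise. In the snippet below, `MASK` is a hypothetical sentinel id chosen for illustration, and the mask path is simply the special case where x0 is all mask tokens.

```python
import numpy as np

MASK = -1  # hypothetical id for the mask token

def sample_conditional_path(x0, x1, kappa_t, rng):
    """Draw x_t given (x0, x1): each token is x1 w.p. kappa_t, else x0."""
    take_x1 = rng.random(x1.shape) < kappa_t
    return np.where(take_x1, x1, x0)

rng = np.random.default_rng(0)
x1 = np.array([4, 8, 15, 16, 23, 42])        # a "data" sequence
x0 = np.full_like(x1, MASK)                  # the completely masked source
x_start = sample_conditional_path(x0, x1, kappa_t=0.0, rng=rng)
x_mid = sample_conditional_path(x0, x1, kappa_t=0.5, rng=rng)
x_end = sample_conditional_path(x0, x1, kappa_t=1.0, rng=rng)
```

At kappa_t = 0 the sample is fully masked, at kappa_t = 1 it equals x1, and in between every position holds either its mask or its final token.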
Starting point is 03:54:24 That's with some probability kappa. Now this is very effective. It's a little bit unsatisfying that it actually is effective, but it's a very effective probability path, and it's used by many works. It's related to masked language modeling, obviously. So yeah, I just want to bring up that these works also do a very good job at generalizing this construction a little bit to learn different components. Okay. So, yeah, but let's not worry too much about the mask state. Let's work with a mixture of two deltas for now.
Starting point is 03:54:54 And one thing that we proved in this paper is that basically there's a conditional velocity, and then if you marginalize it, you get the marginal velocity that is currently highlighted in gray here. And, okay, it's similar to the continuous case, where you can always add, like, a divergence-free term to the flux, and there's an infinite number of velocities for the same probability path. It's very similar here. Basically, we can express the velocity either in terms of p1 given t
Starting point is 03:55:21 or p0 given t. The first one makes sense when we want to solve things forwards in time. We're going to predict x1 and then we transport forwards in time. The second one makes sense when we want to transport backwards in time, so we predict with p0 given t
Starting point is 03:55:34 and then transport backwards. Now, what's interesting here is that both of these velocity fields satisfy that continuity equation. So we can always plug either in. In particular, we can use any combination of these two, multiplied by some coefficient.
Starting point is 03:55:47 So what we actually do in practice, or what actually works in practice, is we basically take a very large step with the forward-time velocity field, and then we take a small step backwards. So this is kind of similar to predictor-corrector steps, where you take a step and then you kind of do some extra compute to sort of change the variables themselves
Starting point is 03:56:08 but without changing the marginal probability. So, if you take the mask setting, this gives us a little bit more flexibility: let's not just unmask, which is all the forward process will do. Once you unmask, you cannot mask again. But if you add in the reverse-time velocity, you also add the ability to remask and then unmask again. So it adds a little bit of, you know, corrector sampling. All right. So another interesting thing is this is very analogous to the continuous case that we've been looking at. So in the continuous case, if you look at the probability path as being a convex combination of x0 and x1, usually people
Starting point is 03:56:43 write it in terms of the denoiser or the epsilon prediction. So here again we also have very similar things. This is the denoiser, this is the epsilon prediction, except now it's predicting the whole distribution rather than just an expectation. Here are just some examples to show, you know, we tried this at scale. We basically trained a 1.7-billion-parameter model to try to do some text completion, to try to beat LLMs at their own game, which we kind of failed, but we gave it an honest try. So the first thing is, given some docstring,
Starting point is 03:57:13 we're going to generate the code. And basically, this is a more trustworthy way of evaluating a large language model. It's not just based on abstract type. This code will actually either run and succeed or it'll fail. The interesting thing is, because we have a completely non-autoregressive model, right?
Starting point is 03:57:34 We can do any sort of code infilling. We don't need to condition on the left and then generate the right side. We can just condition on arbitrary things that we want. So this is one property that's beyond just LLMs, but there's no good benchmark for trying to do this. And on the right side is just the illustration of the sampling process.
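A toy sketch of pure-unmasking sampling with infilling, under several assumptions that are not from the talk: a linear schedule kappa_t = t (so a still-masked token is unmasked with probability h / (1 - t) in each Euler step), a hypothetical `MASK` id, and a stand-in `uniform_posterior` in place of a trained network that predicts p1 given t.

```python
import numpy as np

MASK = -1  # hypothetical id for the mask token

def uniform_posterior(x_t, t, vocab_size):
    """Placeholder for a learned p(x1 | x_t); here simply uniform."""
    return np.full((len(x_t), vocab_size), 1.0 / vocab_size)

def unmask_sample(x_init, vocab_size, steps, rng, posterior=uniform_posterior):
    """Forward-only sampling on the mask path, assuming kappa_t = t."""
    x = np.array(x_init, copy=True)
    h = 1.0 / steps
    for k in range(steps):
        t = k * h
        # On the last step everything still masked must be revealed.
        p_unmask = 1.0 if k == steps - 1 else h / (1.0 - t)
        probs = posterior(x, t, vocab_size)
        for i in np.flatnonzero(x == MASK):
            if rng.random() < p_unmask:
                x[i] = rng.choice(vocab_size, p=probs[i])
        # Already-unmasked tokens, including the given context, never change.
    return x

rng = np.random.default_rng(0)
prompt = np.array([5, MASK, MASK, 7, MASK, MASK, 2])   # context at arbitrary positions
completed = unmask_sample(prompt, vocab_size=10, steps=8, rng=rng)
```

Because conditioning just means starting with some positions already unmasked, the same loop handles left-to-right completion and arbitrary infilling. Several tokens can be revealed in the same step, which is the parallel-unmasking case a corrector step is meant to help with.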
Starting point is 03:57:54 So here it's just pure mask. There's no remasking, there's no corrector step. We're just using a masked p0 and then just unmasking it. So here's more of a sanity check that we did on OpenWebText, learning a language model. In particular, equation nine is the mask probability path, which is what most people do,
Starting point is 03:58:15 except we tune the scheduler and we tune the corrector step as well. And equation 10 is some combination of a mask distribution, a uniform distribution, and then the delta at x1. And here we see that equation 10, this mask mixed with uniform, actually does better than the pure mask setting,
Starting point is 03:58:33 but really only at the low NFE setting, right? At high NFE, the mask is okay, right? So the reason for this is, for the mask case, if you just unmask one token at a time, I think it'll be correct. But if you sometimes unmask two things at a time in parallel, you can end up maybe with incorrect samples. And so you really need to correct that,
Starting point is 03:58:53 or maybe allow a uniform probability path where the noisy state includes some other state, and the model will learn to correct itself later on. And again, yeah, these are just numbers for the code generation for the 1.7-billion-parameter model. We also tried discrete flow matching for image generation. So there's no quantized Gaussian, there's no metric at all. We just took, again, the mask case, and tried it.
Starting point is 03:59:19 Yeah, it seems to be slightly worse than continuous flow matching, but, you know, it's almost at 3 FID, which is pretty good. So, yeah, so that's the end of the talk. Like I said, this is the last slide of the talk. I just put that at the very beginning. So as long as we can define velocities that transport particles according to some p_t, where p_t arrives at x1 exactly, then if we learn the expected or the marginal velocity, and then plug that back into the transport equations,
Starting point is 03:59:46 we're going to get the distribution that we fit this marginal velocity to. And that's it. That's the recipe, and it seems to hold for the discrete setting as well. And here are my research collaborators at Meta. Some of them are also in the audience. So if you have any questions, feel free to ask them as well. All right, thank you. Now that we understand the flow matching objective, we turn now to its most famous application this year in Stable Diffusion 3, presented in the paper Scaling Rectified Flow Transformers
Starting point is 04:00:18 for High-Resolution Image Synthesis. This paper also won a Best Paper Award at ICML, and here is Patrick Esser, one of the original co-authors of Stable Diffusion under Robin Rombach, accepting it. Hello everyone. My name is Patrick Esser. I'm presenting our work, Scaling Rectified Flow Transformers for High-Resolution Image Synthesis. All of this is the result of a great team of people who are also here and who are all on this slide. So currently we observe a big hype about scaling, and it's really tempting to say that we can simply solve all of our problems by throwing enough money at it.
Starting point is 04:01:04 And indeed, we would say the effectiveness of scaling can't be denied. Increasing the model size, the number of training examples, and the overall compute resources that we put into the training consistently improves the model performance. We've first seen this for language models, but we also observe similar trends for image generation, which is what we actually consider in our work. But of course, scaling isn't free, right? It really massively increases both the development costs, because we have to put all the resources into training, but also increases the operational costs because sampling is getting more
Starting point is 04:01:45 demanding with the bigger models. So really, to avoid burning money, we constantly have to keep improving the efficiency of both training and sampling. The three key questions that motivated our work were essentially: first, given that there are currently quite a few different formulations around diffusion models and flow matching variants, which of those are the most effective? The second question addresses the architecture, since for the task of text-to-image synthesis, which we have in mind,
Starting point is 04:02:20 we really have to deal with two different modalities and it's unclear which architectural design choices work best here. Finally, when we talk about scaling, we usually have to measure the progress, and we also derive the scaling laws based on simple metrics such as a validation loss, but ultimately they're really only a proxy
Starting point is 04:02:44 for the downstream performance we're interested in, which might be quality, and we want to evaluate whether they are an accurate proxy for those properties. So let's start with flow matching and friends. The common goal of these methods is basically to learn a vector field, which will be parametrized by a deep neural network, and that should generate a probability path between two distributions. In our specific case, we then usually consider one of
Starting point is 04:03:17 those distributions to be a simple known distribution, such as a standard normal distribution, and the other one to be a data distribution of images. The common starting point to learn such a vector field is then to define a so-called forward process, which essentially just defines a trajectory between two samples from the distributions we're looking at. From this process, we can then derive
Starting point is 04:03:44 a tractable regression objective, the so-called conditional flow matching loss, which allows us to recover a vector field that then actually generates a path between the distributions. The overall paradigm here is fairly general, and for specific choices of the forward process, we can actually recover a wide range of existing formulations and variants, including EDM, DDPM, and others. One of those variants is the rectified flow formulation, and arguably this is to some degree the simplest choice you can make for the forward process because it's just a
Starting point is 04:04:25 linear interpolation between the two samples. This then also leads to a very clean conditional flow matching loss. And overall, it really makes it very elegant and easy to work with. Also remember that sampling in this framework then essentially consists of integrating the learned vector field. And because of that, straight paths, as we define them in this forward process here, are actually really desirable, because if they are straight, we could actually integrate them in a single step, which would massively improve our sampling efficiency.
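A minimal NumPy sketch of the pieces just described: the linear forward process, the regression target it implies, and Euler integration of a vector field. The time convention (t = 0 is the start sample, t = 1 is the end sample) and all function names are assumptions for illustration; the paper's own convention and parameterization may differ.

```python
import numpy as np

def interpolate(x0, x1, t):
    """Rectified-flow forward process: a straight line between the two samples."""
    return (1.0 - t) * x0 + t * x1

def target_velocity(x0, x1):
    """Time derivative of the straight path; constant along the trajectory."""
    return x1 - x0

def cfm_loss(v_pred, x0, x1):
    """Conditional flow matching loss: mean squared error to the target velocity."""
    return float(np.mean((v_pred - target_velocity(x0, x1)) ** 2))

def euler_integrate(velocity_fn, x_start, steps):
    """Integrate dx/dt = velocity_fn(x, t) from t = 0 to t = 1 with Euler steps."""
    x = np.array(x_start, dtype=float)
    h = 1.0 / steps
    for k in range(steps):
        x = x + h * velocity_fn(x, k * h)
    return x

rng = np.random.default_rng(0)
x0 = rng.standard_normal((4, 2))
x1 = rng.standard_normal((4, 2))
straight_field = lambda x, t: x1 - x0        # an idealized, perfectly straight field
one_step = euler_integrate(straight_field, x0, steps=1)
```

For this idealized field a single Euler step lands exactly on x1, which is the sense in which straight paths would allow one-step sampling. A learned marginal field is generally not perfectly straight, so more steps are needed in practice.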
Starting point is 04:05:00 So the conditional flow matching objective actually does not recover perfectly straight paths, even if we define a forward process like this. But at least empirically, we have seen results that, compared to the vector fields you derive from diffusion formulations, they usually have less curvature, which makes them more sample efficient. And they also come with other nice theoretical properties, like a straightening effect, which makes them really promising candidates
Starting point is 04:05:32 for further improving the sampling efficiency. So overall, this really just makes rectified flows attractive candidates for efficient text-to-image synthesis. But so far, before this study, they have really mainly been considered in benchmark settings, and it remained a bit unclear how well they would actually perform in practice for more difficult tasks such as text-to-image synthesis.
Starting point is 04:05:59 If we look at the conditional flow matching objective, it always involves a distribution over the time steps of the trajectory. And the classical rectified flow formulation really only considers a uniform distribution over the time steps. But since during training we do a Monte Carlo estimation of this objective,
Starting point is 04:06:23 it can really affect the optimization that we perform. And if we look at the loss and also the forward process, how we defined it, then we'll actually really quickly see that at the endpoints of the trajectory, t equals zero and t equals one, the optimal solution really just involves an estimate of the mean of the two distributions. So we would expect this to be a very simple task in comparison. And because of this, we actually started considering different time-step distributions, which put less weight on the endpoints of the trajectory and focus more on the interior. And similarly, this is actually also a big part of the success of diffusion models for modeling images, because it allows us to control where exactly
Starting point is 04:07:15 in that trajectory we're putting the most weight, and that way we can actually focus on the parts of the trajectory where the perceptually relevant aspects of images emerge. So to explore whether we can also benefit from this for rectified flow formulations, we explored various time-step distributions that allow us to shift where that focus is. To then understand which of those formulations is the most efficient, we actually just performed a study across 61 different variants overall. And here we included many existing formulations, such as the epsilon prediction with a linear schedule,
Starting point is 04:07:59 which is the one used in stable diffusion, for example, a V prediction with linear or cosine schedules. We also include EDM and the existing rectified flow formulations. But then besides those, we also included variants, especially of EDM and rectified flow, where we varied the hyperparameters of the time step distributions that are involved. And if we then collect and evaluate those results,
Starting point is 04:08:24 we actually see that the classical rectified flow formulation formulation with a uniform time step distribution, does indeed perform very strongly in the regime of sampling with few steps. But if we, for example, one of the strong baselines that emerged from this was an abs linear predict scheme and compared to that, it actually performs worse when we sample with more steps.
Starting point is 04:08:49 And really in contrast, what we saw is that by introducing this particular time-step distribution, the logit-normal distribution, we end up with a variant of rectified flow that actually performs better than all existing variants, both in the regime where we sample with few steps and where we sample with many steps. Then after looking deeper into the generative process, we also considered architectural choices for text-to-image synthesis. Overall, the goal was to focus on transformer-based architectures because of their good scalability
Starting point is 04:09:26 properties, but it wasn't directly clear to us how we can best integrate the two different modalities, text and images, which are required for our task. So in one of our ideas, we introduced the MM-DiT block, which generally follows the design of a DiT block, but it actually uses two separate sets of weights, one for each of the two modalities. To exchange information between the two modalities, we still use a full joint attention operation. A similar idea has also been used actually in vision language models. And what we observed from comparisons is really that this performed the strongest.
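The idea of separate weights per modality combined with one joint attention operation can be sketched roughly as follows. This is a simplified illustration under stated assumptions, not the block from the paper: it uses a single head, random weights, tiny dimensions, and leaves out the modulation, MLP and output projection layers that a full transformer block would have. All function names are hypothetical.

```python
import math
import random


def linear(tokens, weight):
    """Multiply each token vector (a row) by a weight matrix."""
    return [[sum(x * w for x, w in zip(tok, col)) for col in zip(*weight)]
            for tok in tokens]


def softmax(row):
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def random_weights(rng, dim):
    """One q/k/v projection set; each modality gets its own copy."""
    return {name: [[rng.gauss(0.0, 0.5) for _ in range(dim)] for _ in range(dim)]
            for name in ("q", "k", "v")}


def joint_attention(text, image, text_w, image_w):
    # Each modality is projected with its own weights ...
    q = linear(text, text_w["q"]) + linear(image, image_w["q"])
    k = linear(text, text_w["k"]) + linear(image, image_w["k"])
    v = linear(text, text_w["v"]) + linear(image, image_w["v"])
    # ... but attention runs over the concatenated sequence, so every
    # text token can attend to every image token and vice versa.
    scale = math.sqrt(len(q[0]))
    out = []
    for q_row in q:
        weights = softmax([sum(a * b for a, b in zip(q_row, k_row)) / scale
                           for k_row in k])
        out.append([sum(w * v_row[j] for w, v_row in zip(weights, v))
                    for j in range(len(v[0]))])
    # Split the joint sequence back into the two modality streams.
    return out[:len(text)], out[len(text):]


rng = random.Random(0)
dim = 4
text_w, image_w = random_weights(rng, dim), random_weights(rng, dim)
text = [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(3)]
image = [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(5)]
text_out, image_out = joint_attention(text, image, text_w, image_w)
```

Because the attention weights are computed over both streams at once, changing the image tokens changes the text outputs too, which is the information exchange the talk refers to.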
Starting point is 04:10:04 In some of the comparisons, we compared this to a simpler approach where we use a DiT architecture and simply directly concatenate the two modalities. And we also considered UViT and DiT variants where instead we use a cross-attention mechanism to incorporate text conditioning, because that had been very successfully applied in U-Net-based architectures. But overall, we saw that this multimodal design really offers the best performance. So after figuring out efficient formulations and different architectures, it is time to scale. To get a clean signal for the progress during the scaling, we evaluate the validation loss
Starting point is 04:10:51 at fixed time steps. Comparing such a metric is really only meaningful if we stay within a single formulation. But if we are in that case, it actually provides a very clean signal and also a very efficient way to evaluate the model and derive scaling laws from it. But ultimately, this validation loss really can only serve as a proxy for performance, because ultimately we are interested in things like human preferences, prompt following, sample quality, etc. And at this point, it was really unclear whether we can just rely on this validation loss
Starting point is 04:11:30 being an accurate proxy for those downstream performance measurements. While there has been more work on this in the language domain, this was not the case in the image domain. To answer this, we then performed the scaling study and evaluated the correlation between the validation loss and downstream metrics, where we consider both automatic image evaluation metrics such as GenEval as well as human preference ratings. Our results then showed that the improvements predicted by the scaling laws and the validation loss actually translate into quality improvements for text-to-image synthesis. We saw similar results also in different modalities like video synthesis, and overall it
Starting point is 04:12:14 makes us confident that further scaling will indeed improve content creation capabilities with generative models. So during model development and scaling, we also had a few additional learnings that we share in our work, which I will quickly go over. One of the problems that emerge as you scale is training instabilities. And here it was really helpful to learn from existing works. One thing we found particularly helpful was QK normalization, which stabilized training.
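As a rough illustration of why normalizing queries and keys can help with stability: if both vectors are normalized before the dot product, the attention logit stays bounded no matter how large the raw activations grow. The sketch below uses a plain RMS normalization without a learnable scale, which is an assumption made for brevity; the function names are hypothetical.

```python
import math


def rms_normalize(vec, eps=1e-6):
    """Scale a vector so that its root-mean-square is (about) 1."""
    rms = math.sqrt(sum(v * v for v in vec) / len(vec) + eps)
    return [v / rms for v in vec]


def attention_logit(q, k, qk_norm=False):
    """Scaled dot-product logit for one query/key pair."""
    if qk_norm:
        q, k = rms_normalize(q), rms_normalize(k)
    return sum(a * b for a, b in zip(q, k)) / math.sqrt(len(q))


# Activations that have grown large during training (made-up values).
q = [100.0, 100.0, 100.0, 100.0]
k = [100.0, 100.0, 100.0, 100.0]

raw_logit = attention_logit(q, k)
normed_logit = attention_logit(q, k, qk_norm=True)

# With normalization the logit magnitude cannot exceed sqrt(len(q)).
print(raw_logit, normed_logit)
```

Bounded logits keep the softmax from saturating, which is one common explanation for the stabilizing effect.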
Starting point is 04:12:49 Another point that I quickly want to mention is that we reiterated the story that scaling improves performance, but it's also important to note that blindly following this quickly becomes inefficient. One of those cases is if we again consider the time-step distribution: we actually have to adjust that to different resolutions. If we don't do that, we lose a lot of performance, and you might say that we could simply scale up the base model, but the cost you would have to pay for that would be tremendous
Starting point is 04:13:21 compared to if you properly fixed the problem. A similar result is related to aligning with human preferences, which gives a very quick, cheap, but effective boost in preference scores. With this, we then really obtained a high-quality model that worked well across different resolutions and aspect ratios, with good prompt understanding and the ability to spell. This was also reflected in human evaluations against other existing models. And with that, so long, and thanks for the attention.
Starting point is 04:13:56 One of the most underrated applications of diffusion is in speech synthesis. There was also great work at ICML this year on speech, and with the rise of ChatGPT's voice mode, there is a lot of demand for learning about the fundamental problems and techniques. Here, we will simply dip into two oral presentations on speech that we would highlight: NaturalSpeech 3: Zero-Shot Speech Synthesis
Starting point is 04:14:23 with Factorized Codec and Diffusion Models by Ju et al., and Speech Self-Supervised Learning Using Diffusion Model Synthetic Data by Gao et al. Hello everyone, this is Zeqian Ju from the University of Science and Technology of China. Today I'm very delighted to share with you the exciting progress in the field of zero-shot TTS: NaturalSpeech 3, zero-shot speech synthesis with factorized codec and diffusion models. Let's begin by illustrating the difference between zero-shot TTS and traditional multi-speaker TTS through a practical example. In the scenario of the traditional multi-speaker TTS task, the user may request the model to read a transcript in the style of a certain speaker, let's say speaker number
Starting point is 04:15:10 one. This implies that the model should mimic the voice characteristics of the speaker. Here speaker number one should be included in the training data. A significant limitation of this approach is its inability to extend to unseen speakers. In contrast, zero-shot TTS offers a more flexible solution. The user can supply an audio clip as a reference to guide the generation process. For example, if the user submits her voice like this. Marine engineering proves especially authoritative.
Starting point is 04:15:51 Although this reference speech is short and unseen in training, our robust zero-shot TTS system can generate speech with convincingly similar output. For the past 10 years, Conseil had gone with me wherever science beckoned. Zero-shot TTS is now revolutionizing the way we think of voice synthesis. These advanced models use vast, varied datasets at training, capturing the nuances of numerous speakers and a diverse range of acoustic environments. And as a result, the model can utilize knowledge gained at training to generalize
Starting point is 04:16:33 to a certain speaker at inference through a prompt-based generation method. The key to zero-shot TTS lies in the concept of scaling up. Traditional TTS systems rely on clean data from the recording studio. These datasets typically consist of less than one thousand hours of recorded speech. Now zero-shot TTS systems harness the vastness of the internet, using large-scale crawled data from the web. This approach uses a dataset with a total
Starting point is 04:17:07 duration exceeding 60,000 hours of speech. On the other hand, the scaling of the acoustic model is also remarkable. From a modest beginning with less than 50 million parameters, current zero-shot TTS systems have expanded to 300 million to 1,000 million parameters. And such scaling up also promotes a transition in data representation. Previous TTS systems often rely on human-prior-based representations such as the mel-spectrogram. In contrast, zero-shot TTS systems now use data-driven representations,
Starting point is 04:17:54 such as the representations derived from a codec. Here is an example codec. It uses residual vector quantizers (RVQ) to generate multiple representations in a coarse-to-fine manner. Although previous systems achieved great success, they still fall short in speech similarity, speech quality and speech prosody.
Starting point is 04:18:23 These limitations stem from the intrinsic complexity of speech. For example, a short speech clip, while seemingly very simple, carries a rich amount of information, including timbre, content, prosody, recording environment, and so on. This information is important to the overall naturalness of the speech.
Starting point is 04:18:48 Motivated by this, we emphasize the importance of factorization, since modeling complex information directly, such as the raw waveform or mel-spectrogram, is difficult. Besides, factorization is also non-trivial, since the RVQ structure fails to effectively disentangle information across RVQ levels. NaturalSpeech 3 applies factorization in both data representation and speech generation. For data representation, we apply a factorized codec, which can decompose the speech signal into different speech attributes while ensuring high-quality reconstruction. For speech generation, we apply a factorized diffusion model. It is a unified diffusion framework to hierarchically generate each speech attribute in each subspace.
Starting point is 04:19:42 For FACodec, we consider four speech attributes, that is, timbre, prosody, content and acoustic details. We first apply a timbre extractor to obtain a global timbre vector, then we apply three factorized vector quantizers to represent speech attributes in each subspace. For better disentanglement, we introduce the following techniques: an information bottleneck, which can limit the representation capability of each token; supervision to include the intended attributes; gradient reversal to remove redundant information; and detail dropout to remove unnecessary information from the detail
Starting point is 04:20:30 codes. For factorized diffusion models, we apply discrete diffusion in each subspace to sequentially generate speech attributes in the order of duration, prosody, content, and detail. The timbre need not be predicted, since this global vector is accessible through the prompt audio. In the forward process, we randomly mask out certain tokens in the sequence. And in the reverse process, the model
Starting point is 04:21:00 learns to recover tokens gradually under the guidance of context and conditions. To facilitate in-context learning, we prepend the speech attribute prompt to the sequence as a prefix. This prompt acts as the condition, and it remains unchanged during the diffusion process. In the scenario of zero-shot
Starting point is 04:21:32 TTS, the speech attribute prompts are derived from the same audio. As a byproduct, this prompt mechanism also offers great controllability, since we can select different speech attributes from various sources to tailor the output speech to meet specific requirements. We evaluate the zero-shot TTS capability on LibriSpeech and an emotional TTS dataset, RAVDESS, in terms of similarity, robustness and overall quality. The impressive results demonstrate that NaturalSpeech 3 not only outperforms strong baselines, but also achieves human-level naturalness.
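The forward process described earlier in the talk, where tokens are randomly masked while the prepended prompt stays unchanged, can be sketched as follows. This is only an illustration under stated assumptions: the `MASK` id, the token values, the fixed mask ratio and the function name are all hypothetical, and the learned reverse process that recovers the masked tokens is not shown.

```python
import random

MASK = -1  # hypothetical id for the mask token (assumed not to be a real token id)


def forward_mask(tokens, prompt_len, mask_ratio, rng):
    """Mask a random subset of the target tokens; never touch the prompt prefix."""
    noisy = list(tokens)
    target_positions = list(range(prompt_len, len(tokens)))
    n_mask = round(mask_ratio * len(target_positions))
    for position in rng.sample(target_positions, n_mask):
        noisy[position] = MASK
    return noisy


rng = random.Random(0)
prompt = [11, 12, 13]               # attribute tokens taken from the prompt audio
target = [21, 22, 23, 24, 25, 26]   # attribute tokens to be generated
tokens = prompt + target

# A higher mask_ratio corresponds to a later (noisier) step of the forward process.
noisy = forward_mask(tokens, prompt_len=len(prompt), mask_ratio=0.5, rng=rng)
print(noisy)
```

At inference the target would start fully masked and be filled in gradually by the model, conditioned on the untouched prefix.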
Starting point is 04:22:27 We also test the reconstruction capability of our FACodec against strong codec baselines on the LibriSpeech test sets. It also demonstrated that our FACodec can reconstruct speech with high fidelity using these disentangled speech attributes. Here are some demos. The first row is the three-second prompt randomly cut from an entire speech. And the second row is the NaturalSpeech 3 output speech.
Starting point is 04:23:02 For case one. The standard made to hold another oil cup. So this is the prompt, and our NaturalSpeech 3 can generate a similar sentence using this three-second prompt. There was an average cost per lamp for meter operation of 22 cents a year, and each meter took care of an average of 17 lamps. For case two, this is the prompt. Is it not clear that there is just as much of the pencil left?
Starting point is 04:23:36 And this is the output result. There were only four stationers of any consequence in the town, and at each Holmes produced his pencil chips, and bid high for a duplicate. Our NaturalSpeech 3 can also generate emotional TTS in a zero-shot manner by prompting with emotional speech. If you prompt NaturalSpeech 3 with a sad audio like this.
Starting point is 04:24:13 Dogs are sitting by the door. And NaturalSpeech 3 can generate sad speech like this. Why fades the lotus of the water? If you prompt the model with a calm audio like this. Dogs are sitting by the door. And the NaturalSpeech 3 output will sound like this. Why fades the lotus of the water?
Starting point is 04:24:43 And third one, if you prompt with a disgusted speech like this. Dogs are sitting by the door. And the output will sound like this. Why fades the lotus of the water? Our model is also capable of manipulating attributes by manipulating the corresponding speech prompts. So here is a demo. The first column is the original setting,
Starting point is 04:25:19 that is, both the duration prompt and the other prompts are derived from the same audio, that is, the zero-shot TTS scenario. Had she enjoyed the experience? And the other prompts are the same. Had she enjoyed the experience? And the generated speech will sound like this. The examination and testimony of the experts
Starting point is 04:25:38 enabled the commission to conclude that five shots may have been fired. If we just slow down the duration prompt, the prompt will sound like this. Had she enjoyed the experience? And the generated speech will sound like this. The examination and testimony of the experts enabled the commission to conclude that five shots may have been fired.
Starting point is 04:26:05 If we just speed up the prompt, it will sound like this. Had she enjoyed the experience? And the generated speech will sound like this. The examination and testimony of the experts enabled the commission to conclude that five shots may have been fired. We can also derive the duration prompt from another new audio clip. This will only manipulate the duration attribute and will not affect other attributes
Starting point is 04:26:35 since the other prompts are kept the same. Dogs are sitting right at the door. And the generated speech sounds like this. The examination and testimony of the experts enabled the commission to conclude that five shots may have been fired. If you are interested in our work, please scan the QR code for more samples. That's all for my presentation.
Starting point is 04:26:59 Thank you. Good afternoon, everyone. My name is Yang Zhang from IBM Research. Today I'm going to introduce our paper, Speech Self-Supervised Learning Using Diffusion Model Synthetic Data, which is a joint collaborative work with UIUC and UCSB.
Starting point is 04:27:16 I'm from IBM Research. So let me start by briefly going over some background about speech self-supervised learning, or speech SSL. Speech SSL, just like SSL in other domains, assumes that we have a large unannotated corpus. We can then use this corpus to pre-train a speech representation network, which can then be fine-tuned to downstream tasks on a small annotated corpus. The key to the success of speech SSL is that we actually need to assume that we have a large unannotated corpus. In most cases, the number of hours of this pre-training data set should be at least 1,000 hours. However, in many cases, obtaining such a large data set is not as easy as it seems.
Starting point is 04:28:09 Here, I'm borrowing a figure from a recent research effort on collecting speech data sets for over 1,000 languages. The horizontal axis shows the languages and the vertical axis shows the number of hours collected for each language. As can be observed, for most languages, the number of hours is below 1,000. This means that obtaining a large pre-training data set
Starting point is 04:28:35 is simply infeasible in many cases. So in cases where the amount of pre-training data is limited, it becomes crucial to maximize the information extracted from the limited data set. Therefore, we raised the following research questions. Do the existing SSL techniques extract enough information from the limited pre-training data set? Could we further extract the information that is possibly overlooked by the existing SSL techniques? So in this paper, we propose DiffS4L,
Starting point is 04:29:10 which is a speech self-supervised learning method that augments the limited pre-training data set using a diffusion model. More specifically, assume that we only have a small unannotated data set, say less than 100 hours. What DiffS4L essentially does is augment the dataset using synthetic data, which is then used to perform pre-training with the standard pre-training techniques. The entire data augmentation process consists of three steps. In the first step, we use this small dataset to pre-train an initial speech representation network. Note that this initial speech representation network is of poor quality because of the limited dataset size,
Starting point is 04:29:57 but it is sufficient for our purpose. Once this speech representation network is trained, then for each speech utterance drawn from this small data set, we can obtain its initial speech representation. As the second step, we feed this initial speech representation together with the speaker embedding, which is also extracted from the original speech, to a diffusion model. We then train this diffusion model to reconstruct the original speech. The diffusion model is also trained only on this small unannotated corpus. So after this diffusion model is trained, we can then use the diffusion model to
Starting point is 04:30:43 generate synthetic data to form this large synthetic data set. However, rather than directly feeding the initial speech representation and the speaker embedding as is to the diffusion model, we first pass them to a modification module. In this way, we can ask the diffusion model to generate a good variety of speech that is different from the original speech. So the remaining question is how we actually modify these speech representations. So note that speech is a rich information source.
Starting point is 04:31:19 It contains many levels of information, including content information, speaker information, prosody information. Therefore, the synthetic speech should also contain enough variations across all these dimensions. So we designed the following four levels
Starting point is 04:31:38 of variations in the synthetic speech. Level one is the original speech itself. Here is an example. And sharing her house, which was nearby. So this is the original speech. In the second level, we feed the initial speech representation and the speaker embedding as is to the diffusion model. In this way, the diffusion model would generate something that is almost the same as the original speech. However, since the conditioning does not control everything, the output speech would be slightly different from the original speech,
Starting point is 04:32:13 particularly in terms of prosody. So let me play this audio. Please pay attention to the prosody on "which". And sharing her house, which was nearby. Now, this "which" has a rising tone, whereas in the original speech, it had a falling tone. And sharing her house, which was nearby. So this is the second level.
Starting point is 04:32:37 In the third level, we change the speaker embedding to a different speaker. In this way, the output speech would still have the same content, but with a different speaker's voice. So here is the example. And sharing her house, which was nearby. It now becomes a different male speaker. Finally, in Level 4, in addition to changing the speaker embedding, we also partially mask out some of the speech representations. In this way, the diffusion model is forced to fabricate some new content.
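The three synthetic levels can be summarized as a small modification step applied to the conditioning before it is passed to the diffusion model. The sketch below is a hedged illustration: the function name, the use of `None` as a mask value, the uniform random choice of masked frames and the `mask_frac` default are assumptions, since the talk does not specify how the masking is done.

```python
import random


def modify_conditioning(reps, speaker, level, other_speaker=None,
                        mask_frac=0.5, rng=None):
    """Return the (representations, speaker) pair fed to the diffusion model.

    level 2: keep everything as is (output differs mainly in prosody)
    level 3: swap in a different speaker embedding
    level 4: swap the speaker and mask part of the representations
    """
    if level == 2:
        return list(reps), speaker
    if level == 3:
        return list(reps), other_speaker
    if level == 4:
        rng = rng or random.Random()
        n_mask = round(mask_frac * len(reps))
        masked = set(rng.sample(range(len(reps)), n_mask))
        # Masked frames carry no content, so the model has to invent some.
        return [None if i in masked else r for i, r in enumerate(reps)], other_speaker
    raise ValueError("level 1 is the original speech; only levels 2-4 are synthetic")


# Placeholder data: 8 representation frames and two speaker labels.
reps = ["f0", "f1", "f2", "f3", "f4", "f5", "f6", "f7"]
level2 = modify_conditioning(reps, "spk_a", level=2)
level3 = modify_conditioning(reps, "spk_a", level=3, other_speaker="spk_b")
level4 = modify_conditioning(reps, "spk_a", level=4, other_speaker="spk_b",
                             rng=random.Random(0))
```

In a real pipeline the frames and speakers would be vectors rather than strings; the strings here only keep the example readable.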
Starting point is 04:33:16 That's why we call this type of speech novel content speech. And now, my Lord, bare heaven is thy. So, as you can hear, the output speech sounds almost like nonsensical babble. This implies that the diffusion model is unable to fully capture the grammar structure or the word structure in the original language. However, as we will show, even this seemingly nonsensical babble would still help the performance of the pre-training. To test the performance
Starting point is 04:33:52 of DiffS4L, we use LibriSpeech 960 as the pre-training data set, which contains 960 hours of speech in English. We consider two different pre-training settings. In the low-resource setting, we sample only 100 hours of real speech. We then augment it to 960 hours of speech in total by adding 430 hours of level 2 plus 3 speech and 430 hours of level 4 speech.
Starting point is 04:34:24 In the high-resource setting, we use all of the 960 hours of real speech. We then augment it to 2,400 hours of total speech. We use two standard pre-training techniques for training our initial speech representation, as well as the final formal speech pre-training, which are wav2vec 2.0 and HuBERT. And we compare four different data augmentation techniques: no data augmentation at all, wav2vec-Aug, WavLM, and our proposed DiffS4L. We test the pre-trained models in a number of downstream tasks.
Starting point is 04:35:03 In the first task, we try English ASR, or English automatic speech recognition, where there are only 10 hours of labeled data for fine-tuning. The results show that DiffS4L can significantly reduce the error rate for both wav2vec 2.0 and HuBERT. Moreover, DiffS4L can achieve further error reduction when it is combined with WavLM.
Starting point is 04:35:32 Finally, note that the results in these two boxes are both pre-trained on 960 hours of speech. The only difference is that the results in the red box were pre-trained on 960 hours of augmented speech, whereas the results in blue were pre-trained on 960 hours of real speech. So as can be observed, the gap between them is already very small. To evaluate the performance beyond speech recognition, we chose the SUPERB benchmark, which
Starting point is 04:36:04 contains eight different tasks other than ASR. And the results still show that DiffS4L is able to achieve the best performance in almost all the tasks for both the low-resource setting and the high-resource setting. Finally, to test the performance beyond English, we chose 13 more languages, including some of the high-resource languages and some of the lower-resource languages. And the results still consistently show that
Starting point is 04:36:35 DiffS4L is able to reduce the error rates. The final experiment I would show is to investigate whether all four levels of variations are helpful. To test this, we go back to our low-resource setting, and then we fix the original speech to 100 hours and the total amount of speech to 960 hours, but we vary the proportion between level 2 plus 3 speech and level 4 speech. And here are the results of English ASR under different dataset compositions,
Starting point is 04:37:12 where the leftmost point corresponds to no level 2 or 3 at all, and the rightmost point corresponds to no level 4 babbles at all. As can be observed, the best performance is achieved somewhere in the middle where all four levels of speech are present. That means all four levels, including the level 4 babbles, are beneficial to the pre-training. To summarize our findings, we find that the diffusion model is able to capture information in speech that is complementary to what SSL learns. Therefore, our proposed DiffS4L can significantly improve the SSL performance in various downstream tasks and languages. We also find that the synthetic speech with different levels of variations are all conducive to SSL,
Starting point is 04:38:04 even the seemingly nonsensical babbles. With that, I'll close the talk today. Thank you very much for your attention. That was the end of part two of this pod. In part one, we explored video generation and world simulation, and in part two, we explored further diffusion and generative modeling methods across NeRFs, flow matching, rectified flow transformers, and speech. In part three, we turn the generative text-to-video paradigm on its head and check in on the state of computer vision. First, we have the OG vision foundation model, DeCAF, which this year won the most prestigious Test of Time Award,
Starting point is 04:38:46 being first presented at ICML 10 years ago. Here is UC Berkeley Professor Trevor Darrell, advisor of the DeCAF paper and also originator of the Caffe deep learning library, accepting the award on behalf of the team. Thank you very much. It's quite an honor to be here and thank you for the generous introduction. And it's really quite exciting to be able to talk about the impact of the DeCAF work and the Caffe work. And that introduction was very generous. I think really the broadest claim we would make is that DeCAF democratized access to this class of tools, and that's what led to the transformational change in the field. Of course, as we'll mention and mentioned at the time in the talk,
Starting point is 04:39:34 this builds on the work, of course, of AlexNet and other papers. Look, the paper is DeCAF, and the title was a mouthful: A Deep Convolutional Activation Feature for Generic Visual Recognition. And what is an activation feature? I thought I would start off by translating that into 2024 speak. So today we would probably think of the F in DeCAF as being foundation model. Now, I'm not sure we ever needed to define the term foundation model in our field, but since it's been defined and it's broadly used now,
Starting point is 04:40:11 I look back and think of the DeCAF paper as in fact perhaps one of the original or broadest foundation models in vision or deep foundation models. And so that's essentially the main retrospective point as we looked back to see the impact of this work. And we're honored and even pleasantly surprised that it was selected for a Test of Time Award. In one slide, what was the DeCAF paper? Really, it was maybe one of the simplest papers we've ever published in my group. Essentially, we took the results of AlexNet, showed the effectiveness of this model as a pre-training model in vision,
Starting point is 04:41:01 showed that if you took frozen activations, frozen features, and computed activations on those features, you could have essentially state-of-the-art performance across a wide range of tasks. So this is in some sense the OG foundation model in vision. Take the output of convolutional layers, freeze it, train a linear classifier, maybe train even an SVM back in the day on those models, and boom, you get state-of-the-art performance.
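The recipe of freezing a pre-trained feature extractor and training only a simple classifier on top can be sketched on toy data as follows. Everything here is an illustrative stand-in: the hard-coded `FROZEN_W` plays the role of pre-trained convolutional layers, the 2D points play the role of images, and the classifier is a small logistic regression rather than the SVM mentioned in the talk.

```python
import math
import random

# Stand-in for pre-trained layers: fixed weights that are never updated.
FROZEN_W = ((1.0, 0.0), (0.0, 1.0), (1.0, 1.0), (-1.0, -1.0))


def frozen_features(x):
    """Compute ReLU activations of the frozen layer for one input point."""
    return [max(0.0, w[0] * x[0] + w[1] * x[1]) for w in FROZEN_W]


def sigmoid(z):
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def train_linear_probe(features, labels, lr=0.1, epochs=20):
    """Logistic regression on top of the frozen features (only w, b are learned)."""
    w, b = [0.0] * len(features[0]), 0.0
    for _ in range(epochs):
        for f, y in zip(features, labels):
            grad = sigmoid(sum(wi * fi for wi, fi in zip(w, f)) + b) - y
            w = [wi - lr * grad * fi for wi, fi in zip(w, f)]
            b -= lr * grad
    return w, b


# Toy "downstream task": two well-separated clusters of 2D points.
rng = random.Random(0)
points, labels = [], []
for _ in range(100):
    label = rng.randint(0, 1)
    center = 2.0 if label else -2.0
    points.append((rng.gauss(center, 0.5), rng.gauss(center, 0.5)))
    labels.append(label)

features = [frozen_features(p) for p in points]
w, b = train_linear_probe(features, labels)
predictions = [1 if sum(wi * fi for wi, fi in zip(w, f)) + b > 0 else 0
               for f in features]
# Accuracy on the training points themselves, just to show the probe fits.
accuracy = sum(p == y for p, y in zip(predictions, labels)) / len(labels)
print(f"training accuracy of the linear probe: {accuracy:.2f}")
```

The point of the design is that the expensive part, the feature extractor, is reused unchanged across tasks, and only the small classifier is trained per task.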
Starting point is 04:41:32 I think the main things that the DeCAF paper did that were impactful and exciting was visualize and show why the model was working. I think that insight maybe is so common today. We all understand that, but the vision community especially didn't appreciate it. AlexNet was this amazing result, but people thought it was a special case. It's just going to only be for that one task. And the fact that it started to work for everything, and that our paper and other papers like it demonstrated that
Starting point is 04:42:08 and that it could be used in this pre-training slash fine-tuning way, that was what really revolutionized the community. And the last bullet here, of course, is probably the most important. I think the thing that made DeCAF important, and maybe this was a tipping point in the community, I think most papers that you would get accepted to ICML or CVPR prior to 2014, you would never get a paper that didn't have an algorithm or a model.
Starting point is 04:42:42 I mean, even data wasn't enough to get a paper accepted back in the day. And the fact that this paper, of course, was accepted and now recognized for its impact, the reason it had impact was really the open source release and the broad dissemination of the work through that channel. And that's now commonplace, perhaps even dominant in our field today. but it wasn't back prior to 2014. So I'd like to sort of acknowledge how the community has changed
Starting point is 04:43:11 and the community standards. DeCAF was part of the Caffe ecosystem, which really was a dominant force in deep learning, one of the preeminent deep learning frameworks between 2013 and 2018. And again, the reason Caffe was important, it wasn't the first deep learning framework ever. There were some unique architectures, but really it was the democratization of the models and access to the models.
Starting point is 04:43:45 And the emphasis on heterogeneous compute, you could either use a CPU or a GPU. And really the industry-standard code base that worked well in academia and in industry. And it was really the platform for the first widely deployed use of Nvidia GPUs, and that certainly has had a lot of impact. We had the first model zoo that we can find. And the timeline is shown here. DeCAF came out in 2013, 2014.
Starting point is 04:44:20 DeCAF was the frozen pre-trained sort of foundation model version of Caffe. Caffe itself was a deep learning framework that eventually was merged into PyTorch, that allowed people to train their own models and has had great impact. And the impact is evidenced by the honor the team has. This is one of three Test of Time Awards the Caffe ecosystem has received this summer. And that really is a remarkable observation. The DeCAF paper here at ICML, the R-CNN paper at CVPR, and the Caffe system paper, which was at ACM Multimedia, also has had the honor of a Test of Time Award that will be presented later this year. I think DeCAF might be the most
Starting point is 04:45:12 important of these three papers, which is maybe surprising in retrospect, because at the time, I'm not sure the DeCAF paper was viewed as being as important as these other two papers, or at least as the Caffe system itself. But I'm going to walk through some of the old slides and then tell you a little bit about some of the current observations and thoughts about why and how this work looks sort of from a historical light. And what does the pre-training paradigm mean for the present and the future? So these are the old slides from 2014, presented at ICML 2014 in Beijing, China. And this is what the world of computer vision looked like, you know, in the 2000s and early 2010s,
Starting point is 04:46:04 really starting in the late 1990s, 1998, 1999, when the machine learning revolution took over in computer vision. But for a good decade, it was the pathway seen on this slide with handcrafted features, words like SIFT and HOG and LLC and SURF that may or may not even be known to the community today. But then we had our wonderful Caffe cat, which of course detecting cats on the internet was
Starting point is 04:46:37 the paradigm of its day. But for several decades, and not unnoticed by informed researchers at the time, but yet largely unappreciated by the CVPR and even ICML community, to be honest, was the progress in convolutional representation learning and deep learning, as it was later called, the work that goes back to Fukushima and the Neocognitron, the Rumelhart, Hinton and Williams seminal paper in the PDP book, which I encourage people to go back and look at. And of course, the work of Yann LeCun and then AlexNet in 2012, finally showing that this paradigm did scale and this paradigm was going to take over.
Starting point is 04:47:33 But I think even in 2012 and 2013, vision people were basically acknowledging this is working for object recognition, but certainly it wouldn't work for other things. It wouldn't work for, you know, fine-grained object recognition. It wouldn't work for complicated transfer learning. It wouldn't work ultimately for segmentation and other things. And as we see, and as the DeCAF paper helped convince people, that was not going to be the case. In fact, this paradigm was going to take over the field and did take over the field.
Starting point is 04:48:13 And the DeCAF paper was essentially the simplest foundation-style model or pre-training paradigm you could advocate. And very simple, even in the day, which is: let's just take the frozen AlexNet model, which we're going to provide for you, a user who downloads the code from the Berkeley website, and just take slices of the model and compute activation features, compute the representations that are formed from the pre-trained AlexNet model, and see how it does. And the DeCAF paper did this,
Starting point is 04:48:57 and asked a series of questions about the quality of these representations and started maybe the first demonstration that these models are learning something more than they're trained on or more than the literal task that they were trained on, that they are actually learning the latent knowledge that's encoded in those tasks. They're learning semantic hierarchies and things like that. Again, those are concepts that we take as obvious and common sense today, but certainly in the vision community in 2012, it hadn't yet been accepted. And that because these representations generalized, they actually, sorry, because they capture
Starting point is 04:49:39 this latent semantic representation, they generalized to other tasks effectively and different layers had different performance. So the DeCAF paper was one of the first to show visualizations, for example using t-SNE, on these deep learned representations. And just these visualizations here show that if you looked at higher layers of the DeCAF representation or higher layers of the AlexNet representation, you saw the ability of these models to capture
Starting point is 04:50:10 latent semantic features, which were called super labels here in the paper, the models were never trained on these super labels, and yet there they are emerging as a part of the representation. And again, I think this is the thing that really surprised the vision community, that you didn't explicitly supervise the model on this signal, but it emerged from the representation.
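A minimal sketch of the frozen-features-plus-linear-classifier recipe described above, in pure Python. Everything here is a toy stand-in and an assumption for illustration: `frozen_layer1` and `frozen_layer2` are hypothetical fixed functions playing the role of an earlier and a deeper slice of a pre-trained network (they are not AlexNet layers), and the four-point XOR-like dataset is invented. The point it tries to show is only the mechanism: the feature extractor is never updated, a linear probe is trained on top, and different layers can give different downstream performance.

```python
import math

# Four toy inputs with an XOR-like label: 1 when both coordinates share a sign.
INPUTS = [(1.0, 1.0), (-1.0, -1.0), (1.0, -1.0), (-1.0, 1.0)]
LABELS = [1, 1, 0, 0]

def frozen_layer1(x):
    # Hypothetical "early layer": passes the raw inputs through unchanged.
    return [x[0], x[1]]

def frozen_layer2(x):
    # Hypothetical "deeper layer": also exposes an interaction feature.
    return [x[0], x[1], x[0] * x[1]]

def train_linear_probe(features, labels, steps=200, lr=0.5):
    # Logistic regression by batch gradient ascent on the log-likelihood.
    # Only w and b are trained; the feature extractor stays frozen.
    dim = len(features[0])
    w, b = [0.0] * dim, 0.0
    for _ in range(steps):
        grad_w, grad_b = [0.0] * dim, 0.0
        for f, y in zip(features, labels):
            z = sum(wi * fi for wi, fi in zip(w, f)) + b
            p = 1.0 / (1.0 + math.exp(-z))
            for j in range(dim):
                grad_w[j] += (y - p) * f[j]
            grad_b += y - p
        w = [wi + lr * g / len(features) for wi, g in zip(w, grad_w)]
        b += lr * grad_b / len(features)
    return w, b

def probe_accuracy(extract, inputs, labels):
    # Compute frozen features once, fit the probe, and score it on the same data.
    features = [extract(x) for x in inputs]
    w, b = train_linear_probe(features, labels)
    correct = 0
    for f, y in zip(features, labels):
        score = sum(wi * fi for wi, fi in zip(w, f)) + b
        correct += int((score > 0) == (y == 1))
    return correct / len(inputs)

if __name__ == "__main__":
    print("probe on layer 1:", probe_accuracy(frozen_layer1, INPUTS, LABELS))
    print("probe on layer 2:", probe_accuracy(frozen_layer2, INPUTS, LABELS))
```

In this toy the probe on the earlier slice cannot separate the classes, while the probe on the deeper slice can, which mirrors the observation that the choice of layer matters for transfer.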
Starting point is 04:50:39 And when you would then look to see what was the performance, of course, AlexNet was crushing object recognition, but there was a view that object recognition was now, okay, that's just machine learning. It's not part of computer vision. And computer vision are these other things, fine-grained part recognition or segmentation, domain adaptation and things like that. These representations just started to crush all the tasks, right, by moving from the prior
Starting point is 04:51:07 best feature, which in this era was called SURF, when looking at the state-of-the-art domain adaptation challenges in computer vision, which was the Office dataset that was released from our lab actually back in 2010. You could see the numbers double just by changing the underlying representation for some relatively fancy domain adaptation technique. But even more exciting from the perspective of this paper, perhaps less exciting from the perspective
Starting point is 04:51:40 of the domain adaptation researchers, the baselines went up by a factor of three or four. Just these underlying representations had an ability to translate that really crushed all the fancy mechanisms that people had been proposing for domain adaptation. So that was a sobering moment for many researchers. And across a number of different tasks in computer vision, for example, fine-grained recognition where you want to recognize individual species of birds, and there was a notion that you may want representations that
Starting point is 04:52:16 can also localize parts and have some interpretability, the DeCAF model out of the box did better than the prior art and, when integrated into straightforward techniques for localizing parts, had further improvements in performance. And the last example in the paper that I'm going to highlight here today is on scene recognition. Here as well, the model was never trained on these labels for outdoor, indoor, man-made, or natural, and yet these super labels emerge in the representation when you visualize it with t-SNE. So there are more results.
Starting point is 04:52:58 I encourage you to look back at the paper, but I'll just close with, close this part of reviewing the original talk by noting that I think the main impact here, those observations were important, but the open source dissemination turned out to be the impactful part. I mean, there were other papers that came out
Starting point is 04:53:19 shortly after DeCAF or maybe around the same time, that also were showing transfer performance. But ultimately, the DeCAF and the Caffe open source release led to wide adoption of these techniques very quickly in the community. And there was this great website and a cute cat that you could look at, of course, back in the day. And at the time, it was just considered remarkable
Starting point is 04:53:45 that these techniques that just a year before, people imagined you needed 10,000 CPUs to try and run deep learning. There was a great article in the New York Times you can go find on how the Google data centers were taking thousands and thousands of CPUs to run certain deep learning algorithms. And there was just a belief that, like, no way anybody but Google can do that. And so people were just doing whatever they do. And then suddenly the next year, through DeCAF and related efforts for democratization,
Starting point is 04:54:16 almost anyone could get most of the performance of these models, at least for inference. And then soon with Caffe and the advent of GPUs and GPU acceleration of these models, everyone could train a model of this size. And so that's an exciting point. And maybe we're going to have a similar moment in the future. Right now, there's a perception that you can only have tens of thousands of GPUs
Starting point is 04:54:41 to train LLMs. Who knows what the architectures will be in the future, and what the next iteration of a transformative architecture change like Caffe will be for foundation models. So, DeCAF showed back then the surprising effectiveness of transfer using frozen or relatively frozen AlexNet features. It was a pre-training, fine-tuning paradigm. DeCAF was the precursor to Caffe, which became the de facto standard for deep learning in academia and industry. But maybe, why did I say the DeCAF paper might be the most important one? I mean, Caffe had a lot of impact, but actually, the way I'm presenting the DeCAF paper now
Starting point is 04:55:27 as a kind of foundation-ish model isn't what people were most excited about in 2015 to 2020. In fact, if you wanted to get your paper accepted during that era, you had to put end-to-end in the paper or in the title somewhere, right? The cool thing was I could now back-propagate all the way from the task. And if you weren't doing that rapidly by around 2016, you weren't considered in fashion. So I don't think activation features the way they were, you know, explained in the DeCAF paper, or foundation models, if we relabel that today, were in vogue for several years. And the Caffe system allowed you to now train your own model, get your own GPUs,
Starting point is 04:56:10 and do this sort of end-to-end training. But as we know, roughly in the early 2020s, pre-training returns. And actually, in some sense, is now the dominant paradigm. And we see this from BERT and CLIP, and now a perception that if you keep scaling your data and the model, the underlying representations are just going to get better and better and better. And that's the way to go. We don't think it's the appropriate path to really fine-tune
Starting point is 04:56:43 from scratch for each task. We want to leverage everything we see across the underlying tasks. And maybe I'm oversimplifying what people were thinking in 2016 to 2020, but I think you get the gist. So we see now this pre-training paradigm very dominant in the field. DeCAF was primarily a pre-train plus fine-tune approach, as are contemporary LLM and LoRA models. Now, pre-train and then prompt, of course, is dominant in language and vision and language. And in vision, there's very early work along these lines as well.
Starting point is 04:57:24 Since I have a test of time talk, I'll plug my own group's work on this, on visual prompting that we had at NeurIPS 2022 and large vision models that we had at CVPR. You can look at these approaches. No language in those models, but they're still foundation-ish models or pre-training plus prompting approaches to vision. And we see the unreasonable effectiveness of pre-training continuing to this day. Many models that are having a lot of impact.
Starting point is 04:57:55 A lot of vision and language models coming out very fast from companies. I wouldn't want to now compete in building the next vision and language model from Berkeley. As I mentioned earlier in the talk, until there's the next revolution of architectures and maybe we're not going to need 10,000 GPUs,
Starting point is 04:58:12 and we only need a couple of these, whatever the next great model is. I'm looking forward to seeing what that will be. But even now, I think that's still an open playing field if we consider vision and action, or vision and action and language. And so I'll just maybe close, pointing to some of the work in that space.
Starting point is 04:58:32 There are many papers coming out from many different labs right now in this direction. I'll point to two in my lab. One that includes language, as it pre-trains a vision and language and action model, and one that doesn't explicitly include language but also handles humanoid locomotion. If you're interested in the LLARVA paper,
Starting point is 04:58:54 we've taken the LLaMA and LLaVA base and added action pre-training into it, where we literally prompt the robot to have a particular control scheme, a particular task, and describe a trace of trajectories that are desired and that can then be performed. And the sort of fundamental approach of action or locomotion as next token prediction
Starting point is 04:59:22 is explicitly formalized in our paper humanoid control as next token prediction. No underlying language model here, but just pre-training on lots and lots and lots of human and humanoid action data, some of which are taken from the wild, some of which are generated in simulation, just straight up transformer, and the model can then walk around new environments, including San Francisco. With that, I think there's one or two minutes left for maybe a question or a conversation,
Starting point is 04:59:53 which I'm happy to engage in. And again, I think the team just wants to very much thank the community and the program chairs for this honor and looking forward to all the research in the future from the community. Thank you. Thank you so much. So I think there's time for maybe one question.
Starting point is 05:00:14 Well, there's time for one question. So if you have a question, you can come up to one of the microphones there. But then I think because we are basically between the two poster sessions, we want to make sure we get over there too. So maybe we'll have one person, if you want to come to one of the things. We'll give them a second because, you know, people always do it.
Starting point is 05:00:29 The thing to do normally as a chair is to ask it yourself. But I won't take that. I won't take that. I'll let someone from the audience do it. So to a certain extent, this sort of notion of prompting, I guess, okay, I'll put it this way. To what extent do you think fixing features and putting a linear head on top of those features, which we see is very different from prompting, in a sense, in current mechanisms?
Starting point is 05:00:55 To what extent do you think that that's just sort of a side? And prompting kind of is just the way we specify linear heads these days, or to what extent is language really something fundamentally different when it comes to vision language models that is going to enable another step change the way that deep networks enabled a step change so many years ago. I think you asked two different questions. One I took to be the difference between fine-tuning and prompting and the other I took to be language versus not language, or at least I'll try and answer those two. And I think, and I'll try and do it quickly.
Starting point is 05:01:39 I think I could also have added some slides at the end. I think there's an exciting time in the community and many cool papers coming out right now about the sort of mechanisms of in-context learning and prompting task vectors and function vectors and how we can interpret and then maybe even patch or extrapolate these models. So I think we're going to, I don't, I think that's unsettled. I think we're going to see in the coming years papers that define that paradigm that are going to show maybe more formal connections between fine-tuning and prompting in the architecture. That's very hot work right now. And then is language special?
Starting point is 05:02:19 I sort of don't think it is. I also think the word language is complicated because I don't know whether, I think in the community right now, if I say a language model to the broader, like the press, they're going to assume there's text in there. I'm not sure if I say the word language model to this community, whether you're going to assume that or not.
Starting point is 05:02:37 In the past, I didn't always assume that. I thought I could have a language model on a series of tokens that were just vision tokens. And that's why we call that large vision model, a large vision model. I think that's a large vision model. It's also a language model. So if language is just the process of having tokenized something and then predicting it, I think that general paradigm is going to be very high impact across all areas of intelligence,
Starting point is 05:03:02 including those that use text or what we normally call language, and those that don't, like motor control. That last audience member question was an incredible segue into our next talk, which is a retrospective on how LLMs and computer vision converged from Lucas Beyer, who was one of the lead authors of the Vision Transformer paper at ICLR 2021. This would count as yet another DeepMind talk on this pod, prizes for counting how many we have featured this episode, except that the entire ViT team has just left Google to set up the new OpenAI Zurich office. Let's have a look, eh? Get it? Look at how
Starting point is 05:03:42 Lucas views the progress of the VLM field. I will talk about computer vision in the age of LLMs. And in this talk, I will focus on each part. I will focus a lot on the data side of things. However, yesterday evening I decided to completely redo my talk. And so I apologize if some parts are not smooth or if sometimes I'm surprised by my next slide. So one thing that happened recently in computer vision, or recently like four or five years ago now,
Starting point is 05:04:15 is that suddenly language has become the API for all vision models and things. And by API, I mean like input, output to the model, like how you communicate with it, basically. In the distant past, classification is the most canonical task, but what most vision tasks looked like is that you pre-train your model on a large database of images labeled with, typically, classes, because that's what's easy and quick to label. And then maybe there are a lot of classes, so you cover a lot of concepts,
Starting point is 05:04:47 but they are not really attached. Like it's the class ID number, and that's it. And this way, your visual model learns to understand a lot of things, or at least to classify them into these classes. And then you transfer it to any task of interest, which could be again classification, but maybe much more focused, like flower classification, or other things.
Starting point is 05:05:13 And again, there you had like a labeled dataset, smaller, typically, much smaller. And then you fine-tune on that, and then you get your model that you actually care about in the end. Then with the appearance of CLIP and, at almost the same time, ALIGN, things changed. Like they showed how to do pre-training on not class-labeled data, but pairs of images and text that you can find very easily. And then also how to use this model, not even fine-tune it, but just prompt it basically, or like give it a few options in free-form text and then it tells you which option is the most likely representing the image. And that way you don't even need to, like,
Starting point is 05:06:01 the API is what I mean here changed, like it's not integers of classes anymore and it doesn't need to match or anything. And it's just free-form text. So that was very nice. And in terms of data, it means that it changed. Like vision datasets classically were like this list of classes. And then for each class, you go and collect, typically via image search, a bunch of images. And that's it. So you have this very regular structure or information content. And even worse, who makes up these classes? Like, it's a PhD student just sitting there: I'm going to make a dataset about blah, blah, blah. So let me think about the classes. Fun fact is COCO, which is like probably the second most widely used computer vision
Starting point is 05:06:45 dataset from the time where we did it with classes. Who knows how the list of classes in COCO was created? Yeah. Very few old school vision people here. It was the senior professor on the project, who asked his teenage kid, American kid, what are common objects in your mind. That's why COCO classes are like frisbee, football, pizza and that kind of stuff. So yeah, you can see already like how bias comes into these datasets, right? The model only learns about the things that the person creating the dataset, which is usually one random person that just happens to be deciding things, thinks of. But this is what the data now in modern times, when we learn from image and text combined, looks like.
Starting point is 05:07:45 It's like just a random collection of images, typically from the web, and text somehow attached to them, like typically alt text or the title of the page it's from or things like that. And then you have some random shit like this that is completely uninformative, thumbnail for version as of 21 blah blah blah. This is legible, yeah. So this is a kind of useless supervision signal, but you also get very, very detailed stuff that you would never come up with if you were to create a list of classes, like this one, Frankfurt Airport Skyline 2017, or London Barge Race or things like that, right?
Starting point is 05:08:21 So this completely changes what the models can learn, right? They are exposed to a lot more noise and useless stuff, but at the same time also to a lot more detail that you would never come up with in the classic way of creating datasets. All right, and then a little advertisement: CLIP was the first model doing this, and then a couple years later, our group made this SigLIP, which is a variant of CLIP, and they are also open models, so you can download it and use it, which, just after a few years of experience with this, is significantly better. And the cool thing is, again, like, as with CLIP, but now even better, you can prompt it with
Starting point is 05:09:02 free-form text and be as specific as you basically want, as far as you can express it with text. Here's a couple examples. Are these visible-ish? Here's a couple of cool examples of pictures we took ourselves, so they are not possibly in the training dataset, and the model doesn't possibly know about them. For example, this one is me and the colleague.
Starting point is 05:09:23 We both have a coffee-themed t-shirt. Mine, I think, said, I need coffee or something, and the colleague's is just the molecule of caffeine. And then the model fires 100% on the text, a photo of two guys in need of caffeine, but it fires only 4% on a photo of two guys in need of water. So this stuff works nowadays. And a classical thing in computer vision, at least until recently, is a pet peeve of mine. People always say like, oh, computer vision models are not robust.
Starting point is 05:09:52 It will recognize a cow, but you put it on the beach, it will fail completely. Not at all. This has been solved for years. Like here, cow on the beach is 99%, cow is also 36%, but cow in prairie is only 1%. So this stuff works with CLIP and SigLIP and models like that. When we did SigLIP, we released a separate checkpoint that we trained on all the languages.
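A hedged sketch of why the prompt scores quoted above (99%, 36%, 1%) need not sum to 100%. SigLIP scores each image-text pair independently with a sigmoid, whereas CLIP-style zero-shot classification normalizes across the candidate prompts with a softmax. The three-dimensional embeddings, the temperature, and the bias below are invented for illustration and are not values from any released checkpoint.

```python
import math

def cosine(u, v):
    # Cosine similarity between two embedding vectors.
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def sigmoid_scores(image, prompts, temperature=10.0, bias=-7.0):
    # SigLIP-style scoring: each prompt gets its own independent probability.
    # temperature and bias are hypothetical values chosen for this toy.
    return [1.0 / (1.0 + math.exp(-(temperature * cosine(image, p) + bias)))
            for p in prompts]

def softmax_scores(image, prompts, temperature=10.0):
    # CLIP-style scoring: probabilities are normalized across the prompts.
    logits = [temperature * cosine(image, p) for p in prompts]
    top = max(logits)
    exps = [math.exp(l - top) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Toy embeddings over three made-up axes: (cow, beach, prairie).
IMAGE = [1.0, 1.0, 0.0]  # a cow on a beach
PROMPTS = {
    "cow on the beach": [1.0, 1.0, 0.0],
    "cow": [1.0, 0.0, 0.0],
    "cow in prairie": [1.0, 0.0, 1.0],
}

if __name__ == "__main__":
    vectors = list(PROMPTS.values())
    for name, s, m in zip(PROMPTS, sigmoid_scores(IMAGE, vectors),
                          softmax_scores(IMAGE, vectors)):
        print(f"{name:18s} sigmoid={s:.2f} softmax={m:.2f}")
```

With independent sigmoid scores, a generic prompt such as "cow" can receive a moderate probability alongside a near-certain specific prompt, which is the pattern described in the talk.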
Starting point is 05:10:20 And we basically tried to show that here just from some examples, we didn't really thoroughly evaluate it in the paper. We just released it and tried to show that it does learn multiple languages, like just from this web data, images and text, the web is international. So you just learn about all the languages for free essentially, if you use them. So we didn't do anything special like translation or anything, and the model can learn, like, not just the cow on the beach, but also in other languages, like une vache sur la plage or eine Kuh am Strand,
Starting point is 05:10:51 and the other languages I cannot pronounce. But they all say the same thing, right? And here we even tried to show some culture-specific things. I think this was my Chinese colleague Xiaohua who came up with this. So I think only Chinese people will understand this. This dish in Chinese, it's called ants on the tree, I think, or ants on the branch or something like that. Ants climbing a tree, exactly. And so here, this is just the interesting thing. If you ask this model in English and, like, the literal translation of the dish, ants climbing a tree, it just doesn't get it. It just fires on the picture of
Starting point is 05:11:32 ants climbing on a tree. And not on this dish. But if you ask, I cannot say this, but one of these is like ants climbing on a tree in Chinese. If you ask that, it totally gets that you're not talking about like literal ants climbing on a tree, but the dish, and it's like that. All right. So, yeah.
Starting point is 05:11:55 So we released this separate multilingual model, but back then we trained only a small version of the model. And recently we also trained the larger version of the model, and this is new, like we released this somewhat silently, just in this Colab. So now if you're interested in international SigLIP or CLIP-like models, there is a large one available that is pretty good. But why did we release two separate models? Why not just the one international and that's it? Well, because it turns out that training on English-only data helps a lot on scores that people
Starting point is 05:12:34 including some of my colleagues care a lot about, which is just ImageNet zero-shot and a few other English benchmarks. So just here, an overview of a broad range of recent CLIP-style papers, or papers that do CLIP and look at data. It typically looks like this. If we train on raw data, it's bad. English-only subset of the data is better. Some more filtering, it's better and better.
Starting point is 05:13:00 And the measurement is done typically in ImageNet zero-shot score. And this is across, like I intentionally don't write which paper because I don't want to blame any individuals. Except here I can say this is the original CLIP paper already. It says we get our queries from English Wikipedia, so English only. Then when you see papers using LAION, they usually use LAION-2B, but if you look at the citation, the LAION paper is LAION-5B. So what is that? Well, LAION-5B is actually 2B English, 2B non-English and 1B don't know. So typically people just use the 2B English LAION subset. And then here is another work where we can go through steps of filtering, right? We see first basic filtering.
Starting point is 05:13:43 Like the first basic filtering is the caption language being English. And then a typical thing is filtering by CLIP score. So you keep only the data that CLIP already understands, and, as I mentioned before, CLIP was trained on English-only data. And then there's more. Then there is, like, keep only the data where one of the words in the text is from the ImageNet-21k list of classes. Or even better, not even the text.
Starting point is 05:14:18 Like, keep only images which are similar to ImageNet images. And the thing is, this more heavy English and ImageNet-tailored filtering is what works best, as measured on ImageNet, but also on other benchmarks which are similar to ImageNet. Having said all of these negative things, I need to call out one positive paper and not hide its name. InternVL. I like this a lot.
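The filtering stages listed above (caption language, CLIP score, class-name match) can be sketched as a chain of simple predicates. This is an illustrative toy only: the records, the `lang` and `clip_score` fields, the 0.25 threshold, and the three-word `CLASS_WORDS` set standing in for the ImageNet-21k class list are all hypothetical and are not taken from any of the papers discussed.

```python
# Hypothetical web-scraped image-text records (image content omitted).
RECORDS = [
    {"caption": "a cow on the beach", "lang": "en", "clip_score": 0.31},
    {"caption": "une vache sur la plage", "lang": "fr", "clip_score": 0.12},
    {"caption": "thumbnail for version as of 21", "lang": "en", "clip_score": 0.05},
    {"caption": "frankfurt airport skyline 2017", "lang": "en", "clip_score": 0.29},
]

# Tiny stand-in for a class-name list such as ImageNet-21k (assumption).
CLASS_WORDS = {"cow", "dog", "cat"}

def keep_english(records):
    # Stage 1: keep only captions tagged as English.
    return [r for r in records if r["lang"] == "en"]

def keep_high_clip_score(records, threshold=0.25):
    # Stage 2: keep only pairs an existing (English-trained) model already scores highly.
    return [r for r in records if r["clip_score"] >= threshold]

def keep_class_mentions(records, class_words=CLASS_WORDS):
    # Stage 3: keep only captions that mention at least one benchmark class name.
    return [r for r in records if class_words & set(r["caption"].split())]

if __name__ == "__main__":
    stage1 = keep_english(RECORDS)
    stage2 = keep_high_clip_score(stage1)
    stage3 = keep_class_mentions(stage2)
    for name, stage in [("raw", RECORDS), ("english", stage1),
                        ("clip score", stage2), ("class words", stage3)]:
        print(f"{name:12s} {len(stage)} records")
```

In this toy, the French caption is dropped at the first stage and the detailed "Frankfurt Airport skyline" caption is dropped at the last one, which illustrates how each stage narrows the data toward English and benchmark-like content.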
Starting point is 05:14:43 Like they specifically show, okay, look, we use LAION, the English, but also LAION multilingual and also a Chinese dataset. So that was nice. All right. So this was all about CLIP and about this specific filtering stage, one of the next talks from Orgeline will give more details about the effects of this. But part of the vision community has moved on, and I think all should move on
Starting point is 05:15:12 past CLIP or beyond CLIP and even SigLIP, because there are some things that, no matter how good your data is, how high quality the caption, how descriptive the caption, the CLIP contrastive loss just doesn't learn. Actually I just assumed everybody knows how CLIP works, but who knows how CLIP works, the training? Okay, more or less everybody. That's good. So take this example. You have the image of a cat and a dog and the caption is pretty much, well, it could be more detailed, but it's pretty much perfect, like a cat sitting left of a dog. Now these go through the encoders, right? And then they are trained to be most similar versus other, like,
Starting point is 05:15:57 captions or pairs in the mini batch. However, now let's think what does the model need to learn to perfectly satisfy this objective at training time? It depends on what else is in the batch. If there is no other picture of any cat or any dog in the same mini-batch, the model just needs to learn, for example, cat, and that's good. It matches it to this image, perfect, done. The loss is perfectly satisfied. Or alternatively, it just needs to learn dog, and then it's done. This is the only image that contains a dog.
Starting point is 05:16:30 It doesn't need to learn to match any more than that. And models are lazy, just like me. They learn the minimum amount necessary to solve the task. Now, if there happens to be another picture of a cat in the same mini-batch, if it's not sitting, then the model now only needs to learn either cat sitting or, probably easier, cat and dog. Like it just matches the words cat and dog with this image. There's no other image with both a cat and a dog, and it's done.
Starting point is 05:16:58 It doesn't need to learn more. You see where I'm going, right? To learn left of, what needs to be in the exact same mini-batch is the same thing, like a cat and a dog, but the other way around, with the perfect caption and no other shortcuts to match them. This is just not going to happen. So this is like an inherent disadvantage or a limitation of CLIP-style learning. And SigLIP suffers from that too. So with some colleagues we set out to find a fundamentally better learning objective that fixes that, and there is a quite simple one that does, which is just captioning.
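A toy illustration of the shortcut argument above. It does not compute the actual CLIP loss; it only checks whether an order-blind "bag of objects" representation is already enough to match every caption to its own image within a mini-batch. The vocabulary, scenes, and captions are invented for this sketch.

```python
VOCAB = ["cat", "dog", "bird", "car"]

def bag_embedding(objects):
    # Order-blind embedding: one slot per object, no notion of left or right.
    return [1.0 if word in objects else 0.0 for word in VOCAB]

def caption_objects(caption):
    # Hypothetical text encoder shortcut: keep only the object words.
    return [word for word in caption.split() if word in VOCAB]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def batch_is_solved(batch):
    # batch: list of (objects in the image from left to right, caption).
    # Solved means every caption's best-scoring image is unique and is its own.
    for i, (_, caption) in enumerate(batch):
        text = bag_embedding(caption_objects(caption))
        sims = [dot(text, bag_embedding(scene)) for scene, _ in batch]
        best = max(sims)
        if sims[i] != best or sims.count(best) > 1:
            return False
    return True

EASY_BATCH = [
    (["cat", "dog"], "a cat sitting left of a dog"),
    (["bird"], "a bird on a branch"),
    (["car"], "a car on a road"),
]

# Same batch plus the mirrored scene, which is the only case that forces
# the model to understand "left of".
HARD_BATCH = EASY_BATCH + [(["dog", "cat"], "a dog sitting left of a cat")]

if __name__ == "__main__":
    print("easy batch solved by bag of objects:", batch_is_solved(EASY_BATCH))
    print("hard batch solved by bag of objects:", batch_is_solved(HARD_BATCH))
```

The order-blind shortcut solves the ordinary batch and only fails once the mirrored pair appears together, which is the rare coincidence the talk says contrastive training would need.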
Starting point is 05:17:37 So encode the image and then pass the image encodings into a decoder that should decode the caption. When you decode, it's just like a language model, the loss is next token prediction. So here you do have a loss that says say left, don't say right, don't say above, don't say below, don't say any of the other words in your vocabulary, say left. So the model has to learn this. And then there is, I'm not going to go into any more details here, but the original CLIP paper showed in their figure one that this is so much less efficient to train this way. And in our paper, we also go into a lot of detail on that. It's not
Starting point is 05:18:20 much less efficient actually. Right. And then we evaluated this. So then we found that actually we are not the only ones thinking about this CLIP limitation and there were already multiple benchmarks that specifically measure this CLIP limitation. The first one we saw was ARO, which stands for attribution, relation, and order, like three things that CLIP models are not really incentivized to learn much, so they designed a benchmark to test exactly that, and ignore the numbers in the bottom. So when we train a CLIP-style model, we get these numbers. When we train the captioning-style model on otherwise exactly the same setup, like both we optimize a lot and they're trained on the same data.
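A small sketch of the next-token loss mentioned above for the captioning objective, at the single decoding step where the correct word is "left". The four-word vocabulary and the logit values are hypothetical. The cross-entropy compares the target against every other word in the vocabulary, so a model that cannot tell left from right pays a loss of at least log 2 at this step.

```python
import math

def next_token_loss(logits, target):
    # Cross-entropy for one decoding step: -log softmax(logits)[target].
    top = max(logits.values())
    log_z = top + math.log(sum(math.exp(v - top) for v in logits.values()))
    return log_z - logits[target]

# Hypothetical decoder outputs after the prefix "a cat sitting ...".
ORDER_BLIND = {"left": 2.0, "right": 2.0, "above": -2.0, "below": -2.0}
ORDER_AWARE = {"left": 4.0, "right": -2.0, "above": -2.0, "below": -2.0}

if __name__ == "__main__":
    print("order-blind loss:", round(next_token_loss(ORDER_BLIND, "left"), 3))
    print("order-aware loss:", round(next_token_loss(ORDER_AWARE, "left"), 3))
```

Unlike the contrastive case, this penalty applies to every training example that contains the word, regardless of what else happens to be in the mini-batch.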
Starting point is 05:19:09 This is just so much better. This is worlds better. This is also worlds better than the bottom numbers, which are a few ideas of how to fix this in CLIP, which I call Band-Aids. And this is just so much better to train a captioner. Some of these, like ordering, it's just perfect. Like, actually no, not this one, next one. But here is an example from the paper, from this ARO paper. I am currently hiding the image. So the way the benchmark is constructed is you have an image and two possible captions. And you need to find which one is the correct one and which is the wrong one. And the captions are designed to specifically differ in either an attribute or relationship, like left of, right of, or ordering. This is an example from the paper itself. The horse is eating the grass or the grass is eating the horse. Can you guess without even seeing the picture which one is probably correct and matches the picture here, right?
Starting point is 05:20:11 So this is an issue. You probably hopefully guessed this way around. So this is an issue with the benchmark. I don't even need to reveal the image, but just for the sake of completeness. And this is just a screenshot from the paper. So we identified this as an issue and so we also trained a blind decoder, just a captioner that never sees the image, on our same pre-training dataset of images from the web, or image text from the web.
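The blind-baseline idea above can be sketched with a tiny text-only model that never sees an image and simply picks whichever caption looks more plausible as language. This is a stand-in under stated assumptions: the talk describes a blind captioning decoder trained on web text, whereas the sketch below uses an add-one-smoothed trigram model on a four-sentence invented corpus.

```python
import math
from collections import Counter

# Invented stand-in for caption text seen during pre-training.
CORPUS = [
    "the horse is eating the grass",
    "the cow is eating the grass",
    "the grass is green",
    "the horse is running",
]

def ngrams(tokens, n):
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def train_trigram(corpus):
    # Count trigrams and their two-word contexts, with start-of-sentence padding.
    trigrams, contexts, vocab = Counter(), Counter(), set()
    for sentence in corpus:
        tokens = ["<s>", "<s>"] + sentence.split()
        vocab.update(tokens)
        trigrams.update(ngrams(tokens, 3))
        contexts.update(ngrams(tokens[:-1], 2))
    return trigrams, contexts, len(vocab)

def log_prob(sentence, model):
    # Add-one smoothed trigram log-probability of a caption (text only).
    trigrams, contexts, vocab_size = model
    tokens = ["<s>", "<s>"] + sentence.split()
    return sum(math.log((trigrams[t] + 1) / (contexts[t[:2]] + vocab_size))
               for t in ngrams(tokens, 3))

def blind_choice(caption_a, caption_b, model):
    # Pick the caption the language model finds more plausible, ignoring the image.
    if log_prob(caption_a, model) >= log_prob(caption_b, model):
        return caption_a
    return caption_b

if __name__ == "__main__":
    model = train_trigram(CORPUS)
    print(blind_choice("the horse is eating the grass",
                       "the grass is eating the horse", model))
```

If a text-only scorer like this can pick the right caption, the benchmark item is measuring language plausibility rather than image understanding, which is the shortcoming discussed for the first benchmark.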
Starting point is 05:20:38 And of course that nails the task too. So this is a shortcoming of this first benchmark, but again, we were not the only ones to notice that. Other people noticed that too and created a new benchmark that is supposed to measure the same things, which is called SugarCrepe. And it looks kind of like this. Here's an example. It seems to not have these obvious shortcuts. Just one example is here, this picture and then a yellow tennis racket has a blue tennis ball on it or a blue tennis racket has a yellow tennis ball on it. Like both of them are quite plausible. And same with the cake and flowers and things like that.
Starting point is 05:21:18 And then it also goes into more details of what things are tested. But the same story here. So here, I think we didn't include it in the paper because the authors of the benchmark already did this blind baseline and showed that it got random accuracy. So we don't need to redo that. But here's the same story. Like these captioning models are significantly better than the equivalent CLIP models or than even the best CLIP models on almost everything. Right. So I think this is the future of pre-training models,
Starting point is 05:21:53 or something like that, but we should move past crypt. Right. Even more. Vision models, just like language models, I know getting more and more or becoming more and more complicated systems. Everything I said before is like pre-trained one model and then we can use it, maybe zero shot or fine-tuning. But what most models do now is training in multiple stages.
Starting point is 05:22:22 And for VLMs, it's basically almost the same as for language models. From our side, we started this a few years ago with the series of papers and models called PaLI. I'm actually curious who knows PaLI. Okay, like half. Probably the people doing vision do, and other people don't. So the PaLI model kind of looks like this, or rather it was a whole series of papers and models, and this is an animation from the first paper. So you just have an image and text as input and then text as output. And the text that is input basically is the task. Like, what do you want? It's often a question you want answered about the picture or an instruction, like generate a caption of this
Starting point is 05:23:09 image in Romanian or things like that. And then these just go to a transformer and are trained together. I will talk about how they are trained in a bit. And then with this kind of model interface, you can do a lot more things than just with CLIP or just with the captioner model, right? You can now ask questions, and because it's free-form language and not a list of classes, you can ask quite pointed questions. Like, you can ask how many coins are there and then it will say 12, but you can also ask how many one dollar coins are there and it can say two, and other things. And then, yeah, let's skip that. Right, but then it's only text out and, okay, language model people here will be happy with that, but vision people will
Starting point is 05:23:57 be like, no, there's so much more to vision than text out, right? But text is a bit more universal than you might think. For example, one classic vision task is detection, to create bounding boxes with coordinates, right? This is very easily encoded as text. It's kind of legible. So how do you encode a bounding box as text? Well, just as the coordinates of the two corners, and just as plain integer numbers, for example, right? And the integer numbers are not pixels
Starting point is 05:24:27 because then it's sensitive to the image size, but like fractions of the image, and then times a thousand such that you have integers. Right? So you can actually do a lot of classic computer vision tasks with this text out API too. Even more, I didn't put it on this slide, but you can also create segmentation masks as a text output.
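To make the encoding just described concrete, here is a minimal sketch in Python. It only assumes what the talk states (two corners, fractions of the image size, times a thousand, written as plain integers). The function names, the x-before-y coordinate order, and the space separator are illustrative assumptions, not the actual PaLI or PaliGemma format.

```python
def box_to_text(box, image_width, image_height, bins=1000):
    """Encode a pixel-space box (x_min, y_min, x_max, y_max) as plain integer text.

    Hypothetical helper: coordinate order and separator are assumptions.
    """
    x_min, y_min, x_max, y_max = box
    fractions = (
        x_min / image_width,
        y_min / image_height,
        x_max / image_width,
        y_max / image_height,
    )
    # Scale fractions to integers and clamp so border coordinates stay in [0, bins].
    ints = [min(bins, max(0, round(f * bins))) for f in fractions]
    return " ".join(str(v) for v in ints)


def text_to_box(text, image_width, image_height, bins=1000):
    """Decode the text back to (approximate) pixel coordinates."""
    x_min, y_min, x_max, y_max = (int(token) / bins for token in text.split())
    return (
        x_min * image_width,
        y_min * image_height,
        x_max * image_width,
        y_max * image_height,
    )


# The same object at two image resolutions maps to the same text,
# which is the point of using fractions instead of pixels.
small = box_to_text((64, 32, 320, 240), 640, 480)
large = box_to_text((128, 64, 640, 480), 1280, 960)
print(small, "|", large)
```

Because the text depends only on relative position, the same target string can be reused when the image is resized for a higher-resolution training stage.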
Starting point is 05:24:49 How does that work? Well, you can train a mask encoder, typically a VQ-VAE, that can compress a mask into a short code of a few tokens out of a small vocabulary and then can decode that. And then you just concat this vocabulary to your language vocabulary, and that's it. So it is actually a very universal API. You can do many vision tasks with it.
Starting point is 05:25:13 And again, because it's using language and not, as in classic vision segmentation and detection, the list of 80 COCO classes, you can be quite precise in what you want, right? Detect the right hand and it only gives the right hand. Detect the left hand. Only gives the left hand. Let's not discuss what is the left hand and the right hand.
Starting point is 05:25:35 According to the training data. Okay. Yeah, one issue is that we had the whole series of three papers about PaLI models showing all of this is possible and this can get better and better, but then times have changed and nowadays people are like, oh, nice paper, where is the model? Give me the model. Otherwise I will forget about it in a week. So yeah, this is a good question. And so we did the fourth PaLI model, which is called PaliGemma, and this one is also open. So you can just go and download it and use it for almost all purposes. We had earlier like some
Starting point is 05:26:16 licenses say don't use it for evil stuff. We had to use such a license too. But you can essentially use it for anything. What it looks like is pretty similar to the previous one, just slightly different because now language models are all decoder-only, so we use a decoder-only language model, the 2-billion-parameter Gemma, and then an image encoder. Yeah. And then let's go to the interesting part, the training. This slide I copy-pasted from another presentation.
Starting point is 05:26:52 Let's just ignore the left-hand side. It's not important for this talk. The pre-training works in multiple stages. And this is quite similar, I believe, to language model pre-training now as well. So first is stage zero, which is unimodal pre-training. So the image encoder is pre-trained by itself. We did it with the SigLIP image encoder. You can use a CapPa image encoder.
Starting point is 05:27:16 You could use a DINO image encoder, anything, as long as it's a good general image encoder. The language model is trained by itself. In this case, we use Gemma because we are at Google. You can use Llama, you can use anything else. So for this you pay zero cost because you just download existing ones. Then you do stage one, which we call multimodal pre-training. So that's when you stick both of them together.
Starting point is 05:27:45 And then you train them on a mixture that looks like image and text in and then text out. And I will show you the mixture later. Then in computer vision, it's often important for the model to also understand higher resolution images. So typically we train on 224 by 224 images for traditional reasons, but it's also a sweet spot. At 224 you can recognize a lot, but not everything, and it's relatively efficient. But then there is usually a resolution increase stage, which is shorter training at higher resolution, like 448 by 448, for example,
Starting point is 05:28:23 because it's more costly, but you can see more details. Especially if you have images with text, like pictures of documents or whatnot, then you may really need that. Right, and all of these are basically the pre-training, and then there is another stage, which is transfer. So the pre-training tasks, you will see shortly, are mostly designed to teach the model as many skills
Starting point is 05:28:51 It doesn't really, in this stage, you don't really care about the interface being nice, about it understanding user intent well or things like that, just about putting raw knowledge into the model. And then you have a transfer stage, which is usually also shorter, where you fine tune usually the model to what you actually want. And this can be different for different people
Starting point is 05:29:16 or companies or projects. And this can include training on mixtures of many things; supervised fine-tuning or instruction tuning is part of that too. But it typically doesn't have the goal of giving new knowledge to the model, just of making it focused on the thing you care about. So the pre-training mixture, in this case for PaliGemma, looked like this: basically a bunch of tasks that force the model to learn some things. The one obvious one is captioning. Prefix means what is input, like the prompt to the model or the task description. So for example, we have "caption" and then the language. So caption in Chinese, for example. And then the model needs to predict the caption in Chinese. From the raw data, the raw collection of image-text pairs from the web, we can just run language detection on the caption, right? And then we know
Starting point is 05:30:18 the language at training time and we can put it here. Or for example, if we have pictures that have text in it and we know about the text that's in the picture, which we can know for example with existing OCR systems, then we can ask the model to just read the text that's on the images. So the prompt would just be do OCR. That is one task. You see that teaches the model a different skill than describing the image in the caption, right? And then a question answering, including some specific questions which you can generate.
Starting point is 05:30:58 For example, if you have an existing pretty good classifier that tells you which kind of classes or objects are in your image, you can run that and then generate synthetic questions like these ones, like how many chairs, for example. or is chair in the image or things like that. And then there was another paper previously that showed you can also turn them around, like generate the question that would give this answer. And this is a different skill set that the model needs to solve this. So this is also good to add to the pre-training.
Starting point is 05:31:37 And then we also added detection and segmentation. The detection labels and segmentation labels are pseudo-labels. So they come from a good detector model or from a good segmenter model. So this is kind of what the mixture looks like. But this is not really how you want to give the model to the user to use it, right? You don't want the user to first have to type "answer en" and then the user's question. So that's where this fine-tuning step comes in.
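As a rough illustration of how such a mixture can be assembled from raw data plus existing tools, here is a small Python sketch. The prefix strings ("caption", "ocr", "answer") follow the spirit of what is described above, but their exact spelling, the helper names, and the dictionary layout are hypothetical. The detected language, the OCR text, and the class list are assumed to come from existing external systems and are passed in as plain arguments.

```python
def caption_example(caption, detected_lang):
    # Language comes from a language detector run on the web caption.
    return {"prefix": f"caption {detected_lang}", "suffix": caption}


def ocr_example(ocr_text):
    # Text comes from an existing OCR system run on the image.
    return {"prefix": "ocr", "suffix": ocr_text}


def counting_example(detected_classes, target, lang="en"):
    # Classes come from an existing classifier or detector (pseudo-labels).
    count = detected_classes.count(target)
    return {"prefix": f"answer {lang} how many {target} ?", "suffix": str(count)}


def question_generation_example(question, answer, lang="en"):
    # The "turned around" task: given the answer, produce the question.
    return {"prefix": f"question {lang} {answer}", "suffix": question}


mixture = [
    caption_example("un caine pe plaja", "ro"),
    ocr_example("EXIT"),
    counting_example(["chair", "table", "chair"], "chair"),
    question_generation_example("how many chair ?", "2"),
]
for example in mixture:
    print(example["prefix"], "->", example["suffix"])
```

Each task shares the same image-and-prefix-in, text-out interface, so adding a new skill to the mixture only requires a new way of producing (prefix, suffix) pairs.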
Starting point is 05:32:12 And we don't need to go through this whole list. It's just to say, we fine-tuned this on a lot of different data sets. It works really well, and for fine-tuning, you don't need a lot of fine-tuning data, because it's mostly about rewiring the syntax to be aligned with what the task needs. Okay. And then the final step, which you also know from language, though we actually did this at the same time as RLHF, but in vision, is to have a last step of RL tuning of the model to optimize for what you really, really want, because supervised fine-tuning still usually doesn't optimize for what you really want.
Starting point is 05:32:58 Let's see, how can I give an example for that? Right, let's go back to this example here. If you do supervised fine-tuning on a dataset like that for detection, your training objective is to predict each of the tokens precisely, one after another, right? But when you do detection, in this task, for example, say the target is 298 here: if you predict 299, it would count as completely wrong, right? It's like you predicted the wrong token, so you're wrong, that's it. But in detection, that's not really what we care about.
Starting point is 05:33:39 If the box is like one pixel more to the left, that's totally fine. All right. What we rather care about is, for example, to not have one extra box where there should not be any at all, which in terms of tokens would be the same amount of error as getting four of the box coordinates off by one pixel. All right. So what is trained typically in supervised learning is not what you really care about. Was that example kind of clear? Yeah.
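The mismatch between the token-level objective and what detection cares about can be shown with a few lines of Python. This is a toy sketch under stated assumptions: boxes are written as four integers (x_min, y_min, x_max, y_max), the token error is a plain exact-match count, and intersection-over-union stands in for the detection metric. The specific coordinates are made up for illustration.

```python
def token_errors(target_box, predicted_box):
    """Count coordinate tokens that differ (what exact-match training penalizes)."""
    return sum(str(t) != str(p) for t, p in zip(target_box, predicted_box))


def iou(a, b):
    """Intersection over union of two (x_min, y_min, x_max, y_max) boxes."""
    inter_w = max(0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0, min(a[3], b[3]) - max(a[1], b[1]))
    intersection = inter_w * inter_h
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - intersection
    return intersection / union if union else 0.0


target = (100, 100, 298, 300)
off_by_one = (100, 100, 299, 300)   # predicts 299 instead of 298
far_off = (100, 100, 800, 300)      # also exactly one wrong token

for name, box in (("off by one", off_by_one), ("far off", far_off)):
    print(name, "token errors:", token_errors(target, box), "IoU:", round(iou(target, box), 3))
```

Both predictions have exactly one wrong token, yet one overlaps the target almost perfectly and the other does not, which is why a task-level score is a better thing to optimize in the final stage.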
Starting point is 05:34:07 I hope. Okay. So then also in vision what we can do is the last step of RL tuning. So first and yeah, this was almost exactly the same time as RLHF paper. So first you do the supervised training or the supervised fine tuning or pre-training because that just works really well and that does give you a reasonably good model, a reasonably good approximation of what you want. So that is the maximum likelihood training, which means basically just imitate the training data.
Starting point is 05:34:45 So you can also never get better than your training data, or than the best part of your training data, with this. But then once you have this model that does reasonably well at your task, you can sample predictions from it, and you can now define the reward. And the reward does not need to be differentiable. That's the nice part. You just need to give a number, like, is this prediction good or bad? And it can be arbitrarily complicated to get this number. It can even be gotten by asking a human to give a number.
Starting point is 05:35:17 Then you have RLHF, for example. Or it can be by going through a very complicated metric. The people familiar with detection know mAP is the metric that pretty well describes what we want in detection, but it's definitely not differentiable and it's quite complicated. But you can just compute this and give a score to the samples. And then you do RL, which basically means: okay, model, give me like two samples, and then I give a score to both samples, and for the one which is better, which scores higher, I say, model, sample this more often and that one less often.
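The "draw two samples, score both, push up the better one" loop can be sketched as a tiny REINFORCE-style update in plain Python. This is a toy under loud assumptions: the "model" is just a softmax over four fixed candidate outputs, the reward is a lookup table standing in for a black-box metric such as mAP or a human score, and each sample uses the other sample's reward as its baseline. It is not the training recipe from the talk, only an instance of the general idea.

```python
import math
import random


def softmax(logits):
    peak = max(logits)
    exps = [math.exp(value - peak) for value in logits]
    total = sum(exps)
    return [e / total for e in exps]


def rl_tune(logits, reward_fn, steps=2000, lr=0.1, seed=0):
    """Policy-gradient tuning with two samples per step (toy sketch)."""
    rng = random.Random(seed)
    logits = list(logits)
    for _ in range(steps):
        probs = softmax(logits)
        first, second = rng.choices(range(len(logits)), weights=probs, k=2)
        for sample, other in ((first, second), (second, first)):
            # The reward is only ever evaluated, never differentiated.
            advantage = reward_fn(sample) - reward_fn(other)
            for i in range(len(logits)):
                # Gradient of log p(sample) with respect to logit i.
                grad_log_prob = (1.0 if i == sample else 0.0) - probs[i]
                logits[i] += lr * advantage * grad_log_prob
    return logits


# Hypothetical black-box scores for four candidate outputs.
rewards = [0.1, 0.2, 0.9, 0.3]
probs_before = softmax([0.0] * 4)
probs_after = softmax(rl_tune([0.0] * 4, lambda index: rewards[index]))
print([round(p, 3) for p in probs_before])
print([round(p, 3) for p in probs_after])
```

In this toy, every candidate was already reachable before tuning; the update only shifts probability mass toward the candidate with the highest score.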
Starting point is 05:35:53 And you keep doing this. And this is the way you can align the model to do exactly the task or the part of the task that you actually care about. So not just copy what is in the data, which is what the pre-training, the supervised training, does. It was relatively clear in language that you can do this, because in language it is super common now to have models that you can sample from. In computer vision, this used to be completely uncommon.
Starting point is 05:36:26 All of the classical computer vision models, like Faster R-CNN, DeepLab, YOLO, and so on, are not models you can sample from. So you cannot do RL on them, right? Because you cannot get two samples from the model and say which one was better, which one was worse. It's only recently, with this unification of models and this style of models like PaLI, and there have been a few others, like Unified-IO is another good example,
Starting point is 05:36:50 that you can actually have vision models that can sample multiple reasonable solutions, and then you can do RL on top of them. So that's why this only happened recently. Yeah, and here are just a few examples showing that it works pretty well. We did it for detection. So the left is the base model, and for those who know detection, it gets a COCO mAP of 39, which is okay but not great. And then you do a little bit of RL tuning with the mAP metric as the score and then you get much better mAP, and indeed in the detections you actually catch a lot more things. And 54 is a pretty good COCO mAP.
Starting point is 05:37:32 And we also did it with panoptic segmentation. And just to demonstrate that you really just need to define clearly what you really want and then come up with some score for it, we did this silly example of a colorization model, so grayscale image in, color image out. And it's also generative, so you can sample from it. And then we just arbitrarily define a
Starting point is 05:38:03 metric that computes the flashiness of the image, and then we RL tune a bit towards that metric, and then indeed it generates flashier images. Right. Then one last thing about this, to show what kind of happens in the RL tuning. It's really not teaching the model any new things or anything. It's just making it sample more the things that you like, that you score highly, and sample less the things
Starting point is 05:38:35 that you don't like, that score badly. So here are a few plots. They are all a little bit hard to digest, but I will try to walk you through. So we have the model before, which means before RL tuning, and after is after RL tuning, and on the Y axis is the reward of the task, whichever it is. And we get a lot of samples from the model before and a lot of samples from the model after. I think here we got 10,000 samples.
Starting point is 05:39:04 And then we just sort them, and what you see here is that before, you had a lot of low quality samples, and after RL tuning, you told the model, and this is literally what RL tuning is, right: this sample is bad, less of that. So you have way, way, way fewer low reward samples and you start off with sampling by default many more high reward ones. However, the raw model before RL tuning also had a very few high reward samples, like this little green dotted line, right? So it's not that the RL tuning makes the model better, it just makes it sample these good parts more often, right? The original model was able to be just as good as the RL tuned model, but just very, very rarely. Let's see. Oh, yeah, and this one is just to show that the likelihood of a sample is not enough.
Starting point is 05:40:02 You really need to have your score that you define. So here, right, from left to right, we draw more and more samples. On the left, it's like just, let's say, two samples here. Then what the curve shows is the highest reward across these two samples. So what was the reward of the one of the two samples which scored the highest? And we can see the same story again, basically. Oh no, wait.
Starting point is 05:40:35 Sorry, I misspoke. Let's rewind. Here you have two samples. And here you see the reward of the sample with the highest likelihood. And before RL tuning, that is not really good. The thing is, even if you get many samples before RL tuning, 100 samples or 10,000 samples, and you pick the one with the highest likelihood, you're not getting better samples in terms of the reward, because the likelihood is not really aligned with your reward yet, right?
Starting point is 05:41:11 So you sample more and more things, but they are not in the high quality region, and this is what the reward tuning does: it reweights the likelihood of samples to sample many more high quality samples. All right, and that was too much, I write too little data in the end. So this is the end of it. Thank you. We're going to bring part three to an early end here. Brittany had one more vision-related paper to highlight, MLLM-as-a-Judge:
Starting point is 05:41:44 Assessing Multimodal LLM-as-a-Judge with Vision-Language Benchmark, which we appreciated for practical AI engineer use, but unfortunately we had to cut it for time. You can see their oral presentation in the show notes. Last but not least, we combine parts 1, 2 and 3 across world simulation, generative modelling and vision to check in on the field of reinforcement learning and robotics, which took almost as big of a stage as video generation at ICML this year. For a natural transition from vision to robots,
Starting point is 05:42:17 we turned to Ashley Edwards, who was on the Gato and Genie team at Google DeepMind, but is now at Runway, emphasizing the deep connection between generative video and the world simulation that is essential for diffusion and robotics. So yeah, today I'm going to be talking about how we can learn actions, policies, rewards, and environments from videos alone. So just as a little bit of a disclaimer, I'm going to be talking about a lot of my prior works, some of which I thought I would never talk about again, others I thought I would never talk about at all. But a lot of them have motivated the kind of research that I've been working on these days. So I thought it would be kind of fun just to go back and look at some of the history that led me here.
Starting point is 05:43:01 So I think we've probably seen iterations of this kind of slide throughout this entire conference. But I think we know by now that there's been a lot of progress made in text-to-video generation. And one question we might be asking is, how the heck did we get here? I mean, I think just this past year alone, we've seen so many innovations. I hope that many people during this conference will be discussing this, but it won't be me. Instead, I'm going to be talking about how did I end up getting here? My research background is actually in reinforcement learning, but suddenly I found myself in the controllable video generation space. So this is why I wanted to talk about some of my older works, because I wanted to see, like, how did I end up getting here?
Starting point is 05:43:43 Maybe some of the things that I was working on are still relevant today. So in order to answer this question, I'm going to take us back to the summer of 2016, where I got to spend a summer in Japan. And so my main focus here was to actually work on this robot here. So I started off actually as a robotics major. And what I wanted to do here was essentially try to train this robot to learn sign language gestures from videos. And this is when I really started getting interested in how we can train agents from videos, because coming from a reinforcement learning background, I started to get kind of annoyed with having to always come up with a reward function for training our agents.
Starting point is 05:44:26 And every time we had a new environment, we had to come up with a new reward function. And so I was really interested in how we can come up with a more general way of representing tasks, and that could be done through videos. And so when I arrived at the university, so this was at Waseda University, I realized that the hands on the robot weren't actually working. And so I wasn't actually going to be able to teach it hand gestures from videos. But this robot was actually a very expressive robot. I think it was actually kind of a comedian robot, and so it had a lot of different facial expressions that it could make. And so instead of teaching hand gestures, I decided that, okay, well, fine, I'll try to teach it facial expressions.
Starting point is 05:45:06 So if you look at what humans look like, they don't look anything like this robot. And so the thing that I was trying to figure out here was how we can actually teach a robot, yeah, this robot in particular, to mimic a facial expression like this when the features look very different. And again, this was in 2016. And so, I mean, we had like a few examples, like one GPU and that sort of thing. So we didn't have a bunch of examples for trying to learn a representation here. And so what I wanted to try to do is figure out how we can get the feature space of the robot
Starting point is 05:45:42 to look more like the feature space of the human. And so one thing that we realized was that if you look at the sort of shape of motion over time coming from these facial expressions, and in general any kind of motion, there is a bit of a structure. So this here is showing something called a motion template, which essentially takes a sequence of frames, sort of concatenates them and averages them over time. So you can see where the motion has happened as well as when the motion happened in time. So this is what this representation is showing. And the nice thing is that this representation is kind of domain agnostic. So you can see on the left, for example, you can see the motion of the robot. On the right of that one, you can see the motion of a human.
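A motion template of this kind can be approximated in a few lines of Python. The sketch below is one possible reading of "average the frames over time so you can see where and when motion happened": it thresholds frame-to-frame differences and averages the resulting motion masks with a weight that grows with time, so recent motion appears brighter. The threshold, the linear weighting, and the nested-list image format are assumptions for illustration, not the exact formulation used in the work described.

```python
def motion_template(frames, threshold=10):
    """Time-weighted average of motion masks for grayscale frames (lists of rows)."""
    if len(frames) < 2:
        raise ValueError("need at least two frames to measure motion")
    height, width = len(frames[0]), len(frames[0][0])
    template = [[0.0] * width for _ in range(height)]
    steps = len(frames) - 1
    weight_sum = sum(range(1, steps + 1))
    for t in range(1, len(frames)):
        for y in range(height):
            for x in range(width):
                moved = abs(frames[t][y][x] - frames[t - 1][y][x]) > threshold
                if moved:
                    # Later motion gets a larger weight, encoding "when".
                    template[y][x] += t / weight_sum
    return template


# A bright dot moving left to right along the top row; the bottom row is static.
frames = [
    [[255, 0, 0, 0], [50, 50, 50, 50]],
    [[0, 255, 0, 0], [50, 50, 50, 50]],
    [[0, 0, 255, 0], [50, 50, 50, 50]],
    [[0, 0, 0, 255], [50, 50, 50, 50]],
]
template = motion_template(frames)
print([round(v, 2) for v in template[0]])
print(template[1])
```

Because the template depends only on where and when pixels change, not on what the moving object looks like, a robot face and a human face performing the same expression can produce similar shapes.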
Starting point is 05:46:26 And then again, so we had two different tasks. One is smiling, one is surprised. Again, this was like a workshop paper back in the day of, you know, it's like whatever. I thought this was kind of cool. But it's like the most, like the best kind of results here. But essentially what you can see is that the shape is similar across the, you know, similar across these different tasks. And so it kind of learns how to smile and kind of learns how to make the surprise face
Starting point is 05:46:48 because we're trying to basically mimic the motion that you see here, rather than mimicking the actual features that you would see in a human versus the robot, if that makes sense. So I guess one other thing about this work was that we essentially had to hand-specify our reward function. We were using HOG features to compare the human's
Starting point is 05:47:15 motion template to the robot's motion template. And it was a single task. So we were trying to learn a facial expression from a robot to a human. But after this, we started getting more interested in how we can learn representations across multiple environments rather than focusing on this single task. And so this is when we started working here. So we were trying to actually learn behaviors from videos.
Starting point is 05:47:34 And so in this work, what we did was we essentially got, actually, yeah, we had a giant data set of publicly available internet videos back in 2017, and it was actually showing video game play-throughs, mostly consisting of speed runs. But what we wanted to see was if we could try to infer the behaviors that were taking place across these environments, because you can imagine in these video games, you might see characters moving to the left, moving to the right, and that sort of thing. And so the idea was that if we could infer those behaviors, then we can use them for generating a sort of controller for agents, to say, like, when I see this new scene I want
Starting point is 05:48:14 to generate what I want you to do. Again, this was a workshop paper, so we didn't get to that second part, but we did get to actually trying to generate these motion templates. So all of these are showing: given an initial scene, let's generate the motion templates, so that I can generate new motion templates given unseen scenes. And so this is showing some of the results here. On the top you see the video game generations coming from training on that data set, and these are unseen environments. You can see it's kind of starting to extract the motion happening across these different scenes.
Starting point is 05:48:46 It's probably kind of hard to see, to be honest. But the other interesting thing that we found was that we could use that same model that had been trained on video games, and it actually worked really well at segmenting out animals from unseen environments, and we had only trained on video games. But this was kind of one of the emergent behaviors that you see: by predicting motions, things that are going to change over time, you're actually able to sort of extract out these different characters.
Starting point is 05:49:17 So one other interesting thing here was that, instead of trying to predict a single mode, so instead of having your loss being on your next frame generation, we found it was useful to actually try to predict multiple futures. So essentially what you see here is, on the left, this is our initial frame,
Starting point is 05:49:36 and on the right of that, you see all the different kind of generations that are happening. So if you squint enough, you can see, for example, that you can predict moving to the right or moving to the left or moving up or down for each of these different scenes. And we found that this was happening consistently. The way that we trained this was essentially to try to take each of these generations and minimize the loss between one of those,
Starting point is 05:49:59 the closest one to the ground truth generated frame. So we're trying to cluster over our different future predictions. But the interesting thing to take away here was that these different kinds of motions that we're seeing actually represent actions. And so I think the thing that we started to figure out was that actions are kind of a shared representation across these different scenes. And so rather than trying to explicitly represent those through like the motion template that we tried before, we wanted to see if we could actually just try to infer actions alone from the videos. So that was the motivation behind our work, ILPO, where essentially we're going to try to actually learn actions and policies from videos alone. So the way this worked was essentially, so imagine you have an initial frame like this.
Starting point is 05:50:51 You might see in your data set, and again, we're going to be trying to learn from videos and train agents to learn to imitate from those alone without actions, but there you might see, for example, a transition that looks like moving to the right or looks like jumping in the air. And so what we were trying to learn here was something called a latent action, which is just going to be essentially the notion of what causes this transition to occur. So we know that something causes them. We don't actually know the action labels of these. We're going to try to learn them from the data. And then we're going to have a latent policy that's going to be defined as the likelihood of the expert
Starting point is 05:51:26 taking some latent action in any given state. So essentially the way that we learn this is imagine in our data set we see these two sequences here. So let's say the expert moved to the right, for example. What we're going to do is we're going to learn a generative model to again predict each possible next state, given your initial state here. And essentially what we're going to try to do, again, is we're going to try to cluster over all of those potential next frames by looking at the generation that's closest to the one that was actually shown in the data.
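The clustering step described above, where only the predicted future closest to the observed one is penalized, can be written as a small Python function. This is a sketch under assumptions: states are plain vectors, the generative model's outputs are passed in as a list with one prediction per latent action, and squared error is used as the distance. The function names are hypothetical and the real work operates on learned networks rather than fixed vectors.

```python
def squared_error(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def min_over_latent_actions(predicted_next_states, observed_next_state):
    """Return (index of the closest prediction, its loss).

    Only the winning latent action's prediction would receive a training
    signal, which is what lets each latent action specialize to one kind
    of transition (for example "move right" versus "jump").
    """
    losses = [squared_error(p, observed_next_state) for p in predicted_next_states]
    best = min(range(len(losses)), key=losses.__getitem__)
    return best, losses[best]


# Toy 2D "states": one prediction per latent action from the same start state.
predictions = [
    (1.0, 0.0),  # latent action 0: looks like moving right
    (0.0, 1.0),  # latent action 1: looks like jumping
    (0.0, 0.0),  # latent action 2: looks like staying still
]
observed = (0.9, 0.1)  # the expert mostly moved right in the video
best_action, best_loss = min_over_latent_actions(predictions, observed)
print(best_action, round(best_loss, 4))
```

No action labels appear anywhere in this loss; the latent action index is inferred purely from which predicted future matches the video.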
Starting point is 05:52:02 So we're going to again see this sort of min loss that says, let me look at all my latent actions. I'm going to find the one that looks closest, or the generation that looks closest to the ground-truth one. So we're clustering over our future frames here. And then what we're going to do is try to learn a policy over all of those different transitions that we can see. And so the way that we can do this is, let's say, for example,
Starting point is 05:52:24 in our data set, we observe that half the time, for example, the expert moves to the right, half the time they jump in the air, and they never stay still. So we're going to try to learn a policy that ends up looking like this. So if you're going to average over all of those future frames, you might see something that looks like that. And so we're going to try to learn a policy that effectively weights all of the different futures coming from our generative model. So that if we were to
Starting point is 05:52:49 take the expectation under that policy, you would end up having an average generation that looks like the expected generation coming from our expert, or the expected future coming from our expert generations. And so that's essentially how we can train the policy. So each of these different weightings over the future is actually saying, what is the likelihood that I would take latent action zero in this state, or take latent action one in this state, for example, and we can train it in this way. So yeah, so this is actually showing that just after 200 steps of interacting with the environment, our model is able to adapt really quickly. And the reason for this is that we're actually learning this policy from the videos before we have placed the agent in the environment.
Starting point is 05:53:34 And so we can take some steps from the environment samples and use those for actually adapting our latent actions to the real ones that you can take in the world. So one thing to take away from that work is that we can actually represent our actions through the next frame generations that are taking place. Of course, it's assuming that your dynamics are deterministic; well, let's say that they are. But basically, each of these next frames is representing the kinds of actions that you can take in the world. So we took this idea sort of in a different direction where we could say essentially, let's say we have a reward function. We can now try to learn a value function, an optimal value function, from videos alone, even if you have suboptimal data. So for example, if you have demonstrations like this coming from videos where the expert isn't really an expert, but they're running into things and doing suboptimal things, but sometimes they do
Starting point is 05:54:35 run into the goal. And so the idea here is that usually in reinforcement learning, you can learn an optimal policy from suboptimal data, but it gets a little bit trickier in videos because you don't have access to actions. So the idea here was to do this instead of learning an action, or sorry, a policy over actions, which you would typically see in reinforcement learning.
Starting point is 05:54:54 If you do RL, I know this is a video generation sort of environment here, but some of you might be familiar with a diagram like this, where essentially you have an agent running around the world, it's taking actions, trying to maximize this long-term expected reward, and it has a policy over states. The idea behind this work was, instead of having a policy over states, sorry, yeah, instead of learning a value function over states,
Starting point is 05:55:17 you would learn a value function over state-next-state pairs. So essentially what we have is this value function. Sorry, I think I messed that part up. You would usually have a value function over state actions. Now we're learning a value function over state-next-state pairs. We're learning a policy now over states rather than a policy that's going to tell you which action to take. And then the benefit of that is that you can actually learn this in an optimal way when you have suboptimal data. So this is a lot of different stuff that I'm showing on the screen.
Starting point is 05:55:47 But the main takeaway, again, is that we're learning this policy over states, learning a value function that says what is the value from transitioning from one state to the next rather than what is the value of taking an action in a given state. And then we can essentially try to train this policy, that's telling us what state we want to transition to by maximizing our value of moving from one state to the next. So the other thing that we need to do eventually during, when we're actually interacting with the environment, is that we're going to have to figure out where the actions come from so we can also learn an inverse dynamics model. So that's what that's showing there.
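The idea of a value function over state-next-state pairs, a policy over states, and a separate inverse dynamics model can be sketched on a toy problem. This is a minimal illustration under stated assumptions, not the speaker's implementation: the five-state line world, the tabular update, the names `learn_state_pair_values`, `state_policy` and `fit_inverse_dynamics`, and the handful of labeled interactions are all hypothetical.

```python
from collections import defaultdict

GAMMA = 0.9
GOAL = 4

# Action-free (state, next_state) pairs, as might be extracted from suboptimal
# videos of an agent wandering back and forth on a 5-state line (hypothetical data).
video_transitions = [(0, 1), (1, 0), (1, 2), (2, 1), (2, 3), (3, 2), (3, 4)]


def reward(next_state):
    # Assumed reward function: 1 for reaching the goal, 0 otherwise.
    return 1.0 if next_state == GOAL else 0.0


def learn_state_pair_values(transitions, sweeps=50):
    """Tabular value iteration over (state, next_state) pairs instead of (state, action)."""
    successors = defaultdict(set)
    for s, s_next in transitions:
        successors[s].add(s_next)
    q = defaultdict(float)
    for _ in range(sweeps):
        for s, s_next in transitions:
            if s_next == GOAL:
                future = 0.0  # the goal is terminal
            else:
                future = max((q[(s_next, s2)] for s2 in successors[s_next]), default=0.0)
            q[(s, s_next)] = reward(s_next) + GAMMA * future
    return q, successors


def state_policy(q, successors, s):
    """Policy over states: pick the next state with the highest pair value."""
    return max(successors[s], key=lambda s_next: q[(s, s_next)])


def fit_inverse_dynamics(samples):
    """Map an observed state change to the action that produced it."""
    return {s_next - s: action for s, action, s_next in samples}


# A few labeled (state, action, next_state) samples from real interaction (hypothetical).
labeled_interactions = [(0, "right", 1), (1, "left", 0), (2, "right", 3)]

q, successors = learn_state_pair_values(video_transitions)
inverse_dynamics = fit_inverse_dynamics(labeled_interactions)

plan = []
state = 0
while state != GOAL:
    next_state = state_policy(q, successors, state)
    plan.append((state, next_state, inverse_dynamics[next_state - state]))
    state = next_state
```

Even though the recorded behavior moves left as often as right, the greedy policy over states walks straight to the goal, and the inverse dynamics table is only consulted at the end to turn each chosen transition into an action.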
Starting point is 05:56:23 So again, what we're learning is like given suboptimal data, we can actually learn optimal generations. So this is showing plans coming from our policy over states here, saying, what state should I move to that's going to maximize my value. And one interesting thing here to take away is that this is actually basically like a video generation model. Like we're trying to generate next frames that tell us how are we maximizing our value. And this is given just random generations, like random rollouts from behaviors, we're actually able to generate optimal trajectories.
Starting point is 05:56:59 This also works in reinforcement learning. But yeah, I'll skip over that because we're doing video generation. But the other thing is, so this required us to actually have a reward function. So one other thing we are interested in is like how we can actually learn from videos when we don't have a reward function. Can we actually get agents to learn from these, from this sort of data? And so I guess one of the things that we can observe is that usually when you have videos, there's a sort of ordering to how like the trajectories are happening. Like, typically, you would have expert data that's telling you good things to follow.
Starting point is 05:57:39 And so what we can do is say, at the end of the video, we're going to say that that's a reward of one. And everything that you, if you backtrack in time, it gets sort of discounted, just like you would see in like a reinforcement learning trajectory. And so we can use this sort of idea to learn a value function that tells us how good behaviors are in our videos. And that's essentially what we do. So given a sequence of frames, we can say you get a reward of one at the end and then we can backtrack that over time and that's our value function. We can use that for basically training a reinforcement learning agent again, basically replace the sort of bootstrapping step with our learned value function
Starting point is 05:58:17 and then essentially try to train your policy in a supervised way here. But you can see essentially we trained this model over a bunch of different videos of pouring. You can see over time that the values increase. And so this is essentially telling you you can learn a value function in this way. You can even use this for training reinforcement learning agents again because, okay, fine. I have a reinforcement learning background. We do this sometimes. But you can see the agent is able to actually learn even though it was trained over videos alone.
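The "reward of one at the end, discounted as you backtrack" labeling can be written down in a couple of lines. This is a minimal sketch assuming a fixed discount factor; the function name `video_value_targets` and the value of `gamma` are illustrative choices, not details from the talk.

```python
def video_value_targets(num_frames, gamma=0.95):
    """Value target for each frame: 1.0 at the last frame, discounted going backwards."""
    return [gamma ** (num_frames - 1 - t) for t in range(num_frames)]


# Targets for a 4-frame clip with gamma = 0.5, chosen so the numbers are easy to read.
targets = video_value_targets(4, gamma=0.5)  # [0.125, 0.25, 0.5, 1.0]
```

A value network would then be regressed onto these per-frame targets, which is why the predicted values rise over the course of a pouring video.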
Starting point is 05:58:47 Okay. So essentially what we showed is that we can actually learn actions and rewards and policies from videos. And so I guess what's left, and this is sort of what led me into this sort of controllable video generation regime where we're now trying to learn environments from videos, was the idea behind Genie, where we're going to try to learn a generative interactive environment from videos alone that's playable by both humans and AI agents. And so I guess a lot of the previous work that I was doing was really interested in, like,
Starting point is 05:59:19 how we can use these videos for training the agents themselves. But I was lucky actually to meet people like Jack and Tim from a team who had an open-endedness background. And they said essentially, well, we don't only need to learn policies, we can actually learn entire environments and we can place agents within those environments and get them to learn from that. And so this is what led to our Genie work, which we presented here. And so essentially, the idea behind this work was that we can learn three different main things. One was a tokenizer over our video, so we represented those using a discretized VQ-VAE model. We had a latent
Starting point is 05:59:59 action model, I think this was probably the most important component where we could essentially take in sequences of frames and try to infer the changes such that you could predict the future using that latent action representation. And then you can plug that into a dynamics model for predicting the future. And this is where the controllability is coming from. It's coming from a latent action model that's telling you how things are going to change over time. And this is what led to our final results where we essentially
Starting point is 06:00:29 found that if you take some text generated images, you can plug them into our model and interact with them as if they're a real environment. And again, we were training over a giant data set of platformer games here. So I guess the reason that I actually didn't spend too much time talking about Genie is because I know there's been a few workshop talks already, and we talked about it already
Starting point is 06:00:52 at the conference. But I was wondering, like, how did I end up getting into this kind of research? And I think the idea is that you can actually use these environments for training agents of the future. And hopefully we can potentially like learn policies, learn latent policies, learn reward functions in the way that we discussed before. So yeah, I think that's the main thing that I have. I wanted to also talk about, I mean, point out all my collaborators here. There's been a lot of really great researchers that I've got the opportunity to work with.
Starting point is 06:01:23 But yeah, that's all. Thanks. I think I probably have a lot of time for questions. In more complicated environments, actions alone wouldn't be able to represent all of the dynamics. How do you think we can disentangle actions without supervision in this case? Without supervision? So I think you can probably, if you have like a notion of reward, for example, or a notion of, or if you can try to like learn a policy, for example, you might be able to extract like what are the most likely kind of actions that are going to happen versus the dynamics.
Starting point is 06:01:59 But I think it's hard without supervision to disentangle these. Like in our case, you can probably control the crowds if you wanted to. But I think maybe you can use something like text or that sort of thing to add in additional information. But I think it's also scale. Yeah. So if you wanted to scale, let's say, Genie to real-world videos, what would be the major architectural and kind of ideological changes to do that? Yeah, that's a good question.
Starting point is 06:02:31 So the Genie model was pretty general. So there wasn't anything in there that said that we were explicitly training on 2D platformer games. We also had experiments where we got it to work on robotics data. So I think probably just scaling the architecture size, the bitter lesson as usual, and adding in more data would hopefully enable it to learn from that. I think that you could probably also change different components of the architecture itself, using the current state-of-the-art techniques. It is surprising, or rather not at all surprising,
Starting point is 06:03:05 how many of the answers to workshop questions are just this one word. Scale. We challenge you to go through a day at NeurIPS without mentioning the bitter lesson once. As for the audience member's question about action generation and behavior cloning, Brittany was walking the poster sessions
Starting point is 06:03:23 and found a possible answer from NYU. I'm here with Seungjae Lee, also known as Jay Lee, to talk about his poster work on the VQ-BeT model, which is actually one of the spotlight posters being featured here at the ICML Conference. The description is a scalable behavior generation model for efficient multimodal behavior prediction in complex tasks. That is quite a mouthful, so it would be very helpful to have you maybe explain for us a little bit more what exactly it is that you've worked on here.
Starting point is 06:03:54 Okay, nice to meet you. And actually our work started from a question: how could we use a very powerful LLM-like token prediction framework for behavior generation tasks? So the main concern about this question is that the action data is in the continuous space. It's not similar to the language that we use, which is really easy to tokenize. So what we do is we use a VQ-VAE, a vector quantizer, to quantize the continuous action data into a discrete representation and use that discrete representation as a tokenizer of the LLM-like architecture, so that we can predict the behavior based on current observation. Very, very interesting.
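Turning continuous actions into discrete tokens by vector quantization can be illustrated with a nearest-neighbor lookup. This is a minimal sketch under strong assumptions: the two-dimensional actions and the hand-picked `CODEBOOK` are hypothetical, whereas in a VQ-VAE the codebook vectors are learned from data.

```python
import math

# Hypothetical 2-D action codebook. In a VQ-VAE these vectors are learned;
# here they are fixed by hand purely for illustration.
CODEBOOK = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (-1.0, 0.0), (0.0, -1.0)]


def quantize(action):
    """Return the index (token) of the codebook vector closest to the action."""
    return min(range(len(CODEBOOK)), key=lambda i: math.dist(action, CODEBOOK[i]))


def dequantize(token):
    """Map a token back to its continuous codebook vector."""
    return CODEBOOK[token]


continuous_actions = [(0.9, 0.1), (-0.2, 0.95), (0.05, -0.02)]
tokens = [quantize(a) for a in continuous_actions]  # [1, 2, 0]
reconstructed = [dequantize(t) for t in tokens]
```

Once actions are integers like these, a token-prediction model can be trained to predict the next action token from the current observation, in the same way a language model predicts the next word.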
Starting point is 06:04:50 And how did you arrive at this area of research? What is kind of the background or the origin story for this project? Yeah, actually, my personal background is more close to reinforcement learning. But after, I mean, after, I think nowadays, there are many accessible large action data. And I found that it is really hard to train a behavioral cloning agent using a very traditional way. I mean, it is really hard to train good policy with a traditional way with a large data set. So we need a better architecture, which can leverage the LLM-like architecture. So that was the starting point of our research.
Starting point is 06:05:38 And how did you handle the data set collection problem? Because I know with a lot of the applications we're seeing on the robotic side of things, it seems today that data is a bit of the bottleneck more so than anything else? Yeah, actually it is a really good question since getting a data set is really expensive and I mean it's a really important point in robotics. So most of our environments, I mean those kind of environments are open-sourced environments so you can just download most of the data set, and for some of them the data set was collected by VR equipment with humans, and for our
Starting point is 06:06:18 real-world experiments, we gathered it with very small manipulation equipment by ourselves with an iPhone. So, yeah. So you bootstrapped the data set in part yourself, and then it looks like you did a bunch of work maybe on the simulation side of things as well? Yes, actually we first validated our framework in simulation, and then with some consolidated results, we moved on to the real-world experiments. And the strong point of our model is that our model is really lightweight, so it does not need a large data set.
Starting point is 06:06:56 We only need 45 demos for each task in a real-world scenario. So it only takes one or two hours of gathering by humans, so it's not that difficult, yeah. And can you talk a little bit about the performance results that you've seen with this model, since the listeners at home don't have the benefit of the poster in front of them? Actually, you mean the performance of our model? Yeah, I would say that, you know, there are very famous diffusion-based models. I would say that our performance is quite similar to those kind of diffusion-based models, but the inference time is really fast, about 20% of the diffusion model.
Starting point is 06:07:29 So, you know, inference time is really important in robotics. So we could say that you could do more than 100-hertz control with GPU, and more than 20 hertz on CPU only. So, yeah. So the performance is good enough compared to the diffusion-based policies, but inference time is much better than those kind of baselines. Got it.
Starting point is 06:07:59 And you mentioned that you published this toward the end of last year. Have you continued to work in this problem area, or how has your research evolved since the publication? Actually, we believe that the future direction should be scaling up this architecture. I mean, for the more generalizable agents. For example, the agents that could do some tasks based on the language instructions. So our objective would be scaling up. Very exciting. And you did this through your work at Seoul National University, and then you went over and worked with NYU folks as well? Yeah, actually, I was a master's student at Seoul National University. And I mean, I emailed the people at NYU.
Starting point is 06:08:44 We started collaborating from last summer. Very exciting. Well, thank you so much for the time walking through this. I appreciate it. Thank you. That was a great spotlight poster from Seungjae Lee, and we also recommend his professor, Lerrel Pinto's talk on Building General Purpose Robots,
Starting point is 06:09:01 which we link to in the show notes. Brittany had one more robotics paper to highlight, PIVOT: Iterative Visual Prompting Elicits Actionable Knowledge for VLMs. But we are skipping it in the interests of time and to not keep adding to our already overflowing Google DeepMind publication counter. By far one of the biggest names in reinforcement learning and robotics is Professor Chelsea Finn,
Starting point is 06:09:27 now founder of the $2 billion startup Physical Intelligence, who gave not one, not two, not three, but four talks at ICML on her lessons on robotics. We are highlighting her keynote here, but we also recommend checking out her colleague Sergey Levine's talk on robotic foundation models. My name is Chelsea, and I do research on both machine learning algorithms, as well as on applications of machine learning to robotics. And because I work on both of these two things, I think that robotics has provided a perspective on my machine learning research that's a little bit different than the average machine
Starting point is 06:10:06 learning researcher. And today I'd like to share a little bit about that perspective and what that perspective has brought to my machine learning research. So the first thing that I'll mention is that I think that my robotics work, even though it's not necessarily exactly aligned with core machine learning algorithms, it's often indirectly led me to problems that are relevant in applications beyond robotics. So for example, about 10 years ago, I started working on end-to-end neural network training for robots. This included things like training a robot to put a block into a shape sorting cube or to use a spatula to lift an object into a bowl. And in both of these cases, we were training a neural network to map from images from the
Starting point is 06:10:49 robot's cameras to torques applied to each of the motors of the robot. We were training neural networks that had an entire 92,000 parameters. While this might seem not particularly interesting or not particularly new, at the time, this is something that was actually quite different from the typical approach to robotics. And after I started working on training these policies, to control robots with neural networks, I was a bit frustrated by the fact that we had to train a neural network from scratch
Starting point is 06:11:20 every time we wanted to train the robot, even though we were typically training the robot to do lots of different tasks rather than just one task. And this led me to be interested in this question of whether robots could learn a new task more quickly by leveraging their previous experience instead of training from scratch. That led me to work on few-shot learning and meta-learning, which ended up actually having quite relevant use cases in other applications like in education and in drug discovery.
Starting point is 06:11:51 And there's another example of robotics work leading me to relevant problems. In this initial work, the robots were learning policies that were specific to one spatula or one shape sorting cube or one environment. And I became very interested in whether we could leverage broad data sets to improve the generalization of robots. And this led me to be thinking about how can we develop machines that can generalize broadly and potentially even be able to generalize beyond their training distribution. And this led me to work on datasets, but also to work on robustness to distribution shift,
Starting point is 06:12:22 which led to a benchmark that we developed called WILDS, that actually studies distribution shift in a wide range of real applications and has been used quite widely in the machine learning community. So from there, in this talk I'd like to share a little bit about what working on robotics has taught me about machine learning. And to start off, let's talk about a few facts about machine learning in the context of robotics. The first is that machine learning is quite data-hungry,
Starting point is 06:12:51 and at the same time, we don't have existing data sets on the internet of robots controlling themselves to do different tasks. We don't have the equivalent of Wikipedia for how to control motors to tie shoelaces or to open a water bottle. Furthermore, we don't have an easy way to interpret or ensure the safety of machine learning policies
Starting point is 06:13:15 applied to robots, and this has serious implications when robots have a real possibility of directly harming humans in the physical world. And lastly, compared to other leading approaches to robotics like optimal control, we lack formal guarantees of what a machine-learning-based policy would do. And so because of these shortcomings of machine learning in the context of robotics,
Starting point is 06:13:41 you might expect me to say that maybe machine learning isn't solving real application, real applications like robotics and it's fundamentally problematic. But is that actually true? Let's look at an example. So say that we want a robot to tear off a piece of tape and put it on a box. This may seem like a fairly simple task, but this is actually a task that is incredibly difficult
Starting point is 06:14:05 for traditional robotics approaches, because traditional approaches will typically try to model the entire scene, including how the tape will adhere to the canister and to the fingers of the robot, how it will tear when spread across the metal part of the canister, and how to control all 14 of the motors on this robot in order to accomplish the task. It turns out that for this task that is seemingly extremely difficult
Starting point is 06:14:33 for traditional approaches, we can actually use machine learning to address it. So we can develop a teleoperation interface, specifically Tony, a student in my lab, developed a teleoperation interface that we call ALOHA that allows you to puppeteer the robot to solve a wide range of different tasks. And once you develop this teleoperation interface,
Starting point is 06:14:53 it means that you can collect data to train a machine learning based policy to solve a wide range of different tasks, including the really challenging task of tearing off tape and putting it onto a box, as well as other tasks like putting on a shoe. In this case, it's a machine learning policy that's mapping the images from the robot's cameras
Starting point is 06:15:12 to all the 14 joints, And it's doing so with a transformer trained end-to-end on demonstrations collected with teleoperation. And we can use machine learning not just for these fairly complicated tasks, but we can also do it for mobile manipulation. So we can develop a teleoperation interface for an entire mobile robot with two arms, use that to collect data, and again use a transformer-based architecture to train the robot to do challenging tasks like on the top,
Starting point is 06:15:42 make a piece of shrimp by pouring oil on the pan, putting the shrimp into the pan, flipping the shrimp, and serving it. And on the bottom, putting a pot into a cabinet. And so again, we're finding that machine learning is able to solve fairly complicated robotics tasks. And beyond these kinds of robots, we can also do something like this for surgical robots.
Starting point is 06:16:04 So surgical robots are incredibly difficult to control. This is the Da Vinci surgical robot, and we can use machine learning in a fairly robust way, to, again, train policies for complicated tasks like tying a knot and picking up a needle and handing it over to the other surgical tool. And finally, we can also do this with full-size humanoid robots where if we develop a teleoperation interface, which is a little bit harder to do in this case,
Starting point is 06:16:30 but we can train a shadowing-based teleoperation approach. And then use this to train, again, transformer-based policies, in this case, to control robots to do pretty challenging tasks that involve controlling all of the different degrees of freedom, including both the arms and the legs of the robots. And so going back to my question before of whether machine learning is solving real problems, I do think that machine learning has been making real advances that advance applications and really useful problems in the real world.
Starting point is 06:17:11 Supervised learning works really well. We've seen significant advances in architectures, learning algorithms, and optimizers. And we also have reliable engineering practices for debugging if something isn't working, debugging if a policy isn't working or if another model is not achieving the performance that we want and ultimately improving the performance. So now you might ask if machine learning is making real advances, why don't we have robots out in everyday environments solving real problems yet? And a lot of people for that question will refer you to Moravec's paradox, which states that the things that are most intuitive for humans, like basic motor control, are the things that are often most challenging for machines.
Starting point is 06:17:56 And this could explain why robotics is further behind than applications like debugging complex code or translating between two pieces of text. But in my work, I've actually found that this isn't perhaps quite the most direct explanation. I think the explanation is actually that the things that lack abundant data are often the things that are most challenging for machines. And this is because scenarios that lack abundant data
Starting point is 06:18:25 we're not able to directly apply machine learning to and directly try to identify patterns from large amounts of data. And this can include both data scarce applications as well as just scenarios that are novel that aren't represented well in the training data.
Starting point is 06:18:42 So this isn't just things like robotics that don't have a corresponding Wikipedia and so forth. It's also, even within applications that do have a lot of data, there's scenarios that they encounter that aren't represented well in the data, and that's exactly where machine learning algorithms often struggle, and as a result, our machines often struggle. And so perhaps instead of trying to, I don't know, take some sort of approach that tries to combine traditional methods or machine learning or something, I think that actually robotics just needs more of what makes machine learning thrive. Essentially, we need to find more ways to get data for applications like robotics.
Starting point is 06:19:22 And so this is really the core question that I want to talk about today is, how can we get good data for a wide range of problems in a cheap and inexpensive way? So how can we basically handle data scarcity without skimping on data? And I'll talk about a few different ways to do this. The first is finding ways to augment data with cheap and natural to provide supervision. The second will be to leverage data sources beyond the particular target application. And the third will be to incorporate data from test time in addition to the typical training data set. And I'll spend the most time on this first point because it's a little bit different from some of the ideas that have become more commonplace in machine learning.
Starting point is 06:20:12 Great. So to start out by talking about kind of cheap and natural to provide supervision, let's look at how we currently supervise machines. So we currently will take a training data set, train a model, evaluate that model, and to evaluate it will actually, ideally actually look at how it does in a real situation by talking to it or by running a robot and so forth. And inevitably the model often won't work well in some scenarios,
Starting point is 06:20:42 And the best course of action, assuming that you've optimized it well, and the architecture is well tuned, is to collect and label more data. And specifically collect and label more data in the scenarios that are struggling. And so this would involve going out, getting examples, getting labels for those examples that cover those scenarios that is not working well. And this is really expensive and very human intensive. And if it were cheaper, we would be able to iterate on the cycle more, on the model, and we probably end up with a stronger model.
Starting point is 06:21:17 So that's one shortcoming with kind of a typical supervised learning approach. And the second is that input-output pairs are also a little bit weird in some settings. Say that we wanted a robot to cook a meal. The way to apply supervised learning in this case would be to collect examples of how to move the arms of the robot, how to move the motors as a function of the inputs. And this is a little bit weird compared to just trying to teach the robot naturally, the kinds of things that it should do, like making sure that the water is hot enough before putting pasta in, or setting a timer to make sure that it's been cooked for long enough.
Starting point is 06:21:54 Or as another example, say that we want to train a system to make a medical diagnosis. The typical supervised learning way to do this would be to have examples of symptoms and then have kind of examples of the diagnosis as a result. But instead, perhaps the more intuitive way would actually be to teach the machine about how diseases actually manifest in humans and patients. And so this is kind of bringing us to the idea that perhaps we might be able to train machine learning models in a more data efficient way
Starting point is 06:22:30 if we were able to incorporate natural to provide supervision. And so one thing you might think about here is instead of providing labels, what if we use human feedback? Reinforcement learning from human feedback has been quite successful where instead of providing input-output pairs, we'll look at an input-output pair as a set of them and say, this is better, this diagnosis is better than this one, or this pasta tastes better than this pasta. And this can require a lot less supervision because you don't actually have to write out or actually provide the exact motor torques. But it still requires many labeled examples, many examples of an outcome and what is preferred. And so is it possible to give machines far less supervision, but still allow them to improve?
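To make the "this one is better than that one" form of supervision concrete, here is a minimal sketch of a pairwise preference loss. The Bradley-Terry formulation used below is a common choice for learning from comparisons and is an assumption here, since the talk does not name a specific loss; `preference_loss` and the example scores are hypothetical.

```python
import math


def preference_loss(score_preferred, score_rejected):
    """Negative log-likelihood that the preferred output wins under a Bradley-Terry model.

    Equivalent to -log(sigmoid(score_preferred - score_rejected)).
    """
    return math.log1p(math.exp(-(score_preferred - score_rejected)))


# The loss shrinks as the model scores the preferred output further above the rejected one.
undecided = preference_loss(0.0, 0.0)   # equals log(2)
agrees = preference_loss(2.0, 0.0)      # smaller than log(2)
disagrees = preference_loss(0.0, 2.0)   # larger than log(2)
```

Each labeled comparison supplies only a single bit of supervision, which is consistent with the point above that this style of feedback still needs many labeled examples.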
Starting point is 06:23:19 So we're going to look at this both in a robotics example as well as a more standard image classification example. Let's start with the robotics example. So we're going to be looking at long-horizon bimanual tasks. The goal, for example, might be to put all the objects into the bag. And it's really expensive to collect demonstrations that cover all of the possible scenarios that the robot might end up in. And the form of natural supervision that we're going to be considering here is just verbally telling the robot how it might handle or how it might improve in situations rather than trying to collect a ton of demonstrations for the scenarios that it's struggling in. And so specifically, say the robot is going about the task, and it's struggling on this part of the task of putting the sponge in the bag. What we'd like to be able to do is we'd like to be able to tell the robot at this part, you should use the sponge to open the bag wider, because right now,
Starting point is 06:24:13 the bag is not kind of open very widely. And ideally, it'd be able to use just this verbal snippet of text to both improve on the fly to be able to figure out how to solve the task in that scenario, as well as how to then take that data and actually improve the policy and improve its ability to handle new situations like that in the future. So we'd like to be able to use this high-level language supervision, both on the fly and for future improvement.
Starting point is 06:24:48 So how do we do this? If we want a robot to be able to improve from high-level language corrections, we need a way to connect what the robot is doing with language. And so to do this, we're going to train a hierarchical policy, a high-level policy and a low-level policy, where language is the interface between those two policies. And so more specifically, we'll take the observation.
Starting point is 06:25:12 this will be fed into a high-level policy that then predicts language corresponding to a skill, like pick up the sponge or put the Sharpie into the bag. And then this language command will be fed into a low-level instruction-following policy that takes as input the robot's observations and outputs how to move the motor commands. This kind of hierarchical approach is not new.
Starting point is 06:25:37 It's actually been done in a wide variety of prior works, and so it's not what we're introducing here. The key insight of what we're going to do here is that we can actually update the high-level policy only with language supervision, because its output space is language. It's kind of a skill that the robot should do next. And because of this, if the low-level policy can follow a wide range of instructions, then we can actually improve this full system just by updating the high-level policy and just by giving it language feedback. Specifically, we can do something like the DAgger algorithm, the data set aggregation algorithm, on the high-level policy and freeze the low-level policy. So specifically what this is going to look like is we'll intervene, we'll tell the robot what we want it to do.
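The structure described here, a high-level policy that emits language, a frozen low-level instruction follower, and verbal interventions that both override the current instruction and get aggregated for later fine-tuning, can be sketched with toy stand-ins. Everything below is an illustrative assumption: real policies are neural networks over images, whereas here observations are strings, the high-level policy is a lookup table, and the names `run_step` and `finetune_high_level` are hypothetical.

```python
DEFAULT_SKILL = "pick up the sharpie"


def high_level_policy(observation, learned_rules):
    """Toy high-level policy: predict a language instruction for the observation."""
    return learned_rules.get(observation, DEFAULT_SKILL)


def low_level_policy(observation, instruction):
    """Toy frozen instruction follower: a stand-in for the motor-command output."""
    return f"motor commands for '{instruction}'"


def run_step(observation, learned_rules, correction_dataset, human_correction=None):
    """One control step. A verbal correction overrides the high-level output
    on the fly and is aggregated for later fine-tuning (DAgger-style)."""
    instruction = high_level_policy(observation, learned_rules)
    if human_correction is not None:
        instruction = human_correction
        correction_dataset.append((observation, human_correction))
    return instruction, low_level_policy(observation, instruction)


def finetune_high_level(learned_rules, correction_dataset):
    """Update only the high-level policy on the aggregated language corrections."""
    updated = dict(learned_rules)
    for observation, correction in correction_dataset:
        updated[observation] = correction
    return updated


rules, dataset = {}, []
before, _ = run_step("sharpie under bag", rules, dataset)
during, _ = run_step("sharpie under bag", rules, dataset,
                     human_correction="move towards the camera")
rules = finetune_high_level(rules, dataset)
after, command = run_step("sharpie under bag", rules, dataset)
```

In this toy run the high-level policy keeps repeating its default skill before any feedback, follows the spoken correction immediately when it is given, and produces that correction on its own after fine-tuning, while `low_level_policy` is never modified.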
Starting point is 06:26:23 In this case, maybe it should kind of rotate the tape in order to put it into the bag. And this intervention, this language command, will override the high-level policy, and that intervention will be fed into the low-level policy instead of what the high-level policy is predicting. And then that will allow it to on the fly be able to leverage these interventions. And we'll also aggregate these interventions into a data set and use this to update our high-level policy. So it actually also learns how to improve from these corrections in the future. And so we're freezing the low-level policy and updating the high-level policy by supervising it just on the language corrections that the human is providing. We gave this a fun name,
Starting point is 06:27:11 Yell At Your Robot, or YAY Robot, because you can articulate your corrections or frustrations with the robot to help it improve. And what can this do? So let's look at some videos of fully autonomous policies on the robot. And we'll start just with the base policy before doing any language corrections. And so this policy is trying to put the objects into the bag and it'll make mistakes. Like in this case, instead of putting the Sharpie into the bag, it put it underneath the bag. And it struggles to be able to recover from that. It'll also make other mistakes. So here it's trying to pick up the Sharpie. And the high-level policy output is shown here on the top left. And we're
Starting point is 06:27:53 actually finding that the high-level policy isn't ever issuing corrections like go lower or maybe rotate the gripper in this case. It just keeps on telling the policy to try to pick up the Sharpie. Now, after we fine-tune on language corrections, we find that it's able to autonomously correct for mistakes. So here it's making the same mistake as before by putting the Sharpie under the bag, and then it's trying to self-correct. It then makes a mistake again, and then it's self-correcting again to try to move towards the camera,
Starting point is 06:28:19 go higher, and then put the Sharpie into the bag. And by self-correcting, it's able to solve that part of the task successfully. It also learns to self-correct for grasping, where it'll self-correct to move to the right after it made a mistake of grasping too far to the left. And when trying to put the sponge into the bag, we'll also see it just change strategies completely.
Starting point is 06:28:42 So here it's trying to, in some ways, kind of shove the sponge into the bag and is doing so unsuccessfully. And now the high-level policy is going to tell it to instead try to release the sponge and sort of poke it into the bag instead. And this helps it get it into the bag more successfully. And as a result of the robot's ability to self-correct from just this language supervision, we find that the robot is better overall at doing long-horizon tasks. And so this video is pretty long because the task is quite challenging, so I won't play all of it. But we get a sense that despite this task being quite challenging
Starting point is 06:29:20 and having all sorts of scenarios that we don't necessarily have demonstration data for, we find that by leveraging this very cheap language supervision, the robot is able to perform the task a lot more successfully, even though this task is quite long. Cool. Then there's one more thing I wanted to highlight from this system, which is that, instead of just correcting after the robot has made a mistake, we can also actually proactively correct the robot
Starting point is 06:29:50 when we think it might make a mistake in the future. So this is a different task that we train the robot to do, which is to make trail mix. The grad students were quite happy about all the trail mix that ended up in the lab as a result of this. And we see that right here, I pause the video, the robot is, it looks like it's actually about to accidentally pour a whole bunch of peanuts onto the table
Starting point is 06:30:13 because the scoop is behind the bag instead of inside the bag. And right here, because we kind of notice that it looks like it might be about to make a mistake, we can intervene, and instead of telling it to continue by moving the scoop into the bag and presumably then trying to pour into the bag, we can interrupt the robot
Starting point is 06:30:31 and correct it and tell it to move the left arm to the left, go higher, move the scoop into the bag, and then allow it to continue autonomously to pour into the bag. And so this is an example of how in real time we're able to improve the performance by proactively
Starting point is 06:30:47 preventing the robot from making a mistake. And after fine-tuning, we find that it also learns this sort of proactive, corrective behavior where it notices that, in this case, with cranberries, it was about to make a mistake there. It didn't successfully get the scoop into the bag, and it then corrects itself to move the scoop into the bag successfully. Cool. So those are a number of qualitative examples. Quantitatively, we also see a large gain in performance just from verbal corrections.
Starting point is 06:31:17 So the dark orange bar here shows the success rate on average after fine-tuning on just language data, whereas the gray bar shows the policy before language corrections. We see a 20% improvement in performance. And this closes a lot of the gap to this light orange bar, which is the performance if we use human corrections on the fly to override the high-level policy. Lastly, it's worth mentioning that
Starting point is 06:31:45 there's still a lot of room for improvement, even when we're using oracle high-level human corrections. And so this suggests that the low-level policies have room for improvement. So to summarize, you can productively yell at your robot to help it actually accomplish tasks. But more importantly, the robot can improve just with language feedback, without demonstrations, by fine-tuning this high-level policy.
Starting point is 06:32:13 And this is a lot more data-efficient. It's a lot more data-efficient to simply tell it to pick up the sponge or move to the right than to actually collect demonstrations with teleoperation. And then, of course, this approach relies on a performant instruction-following policy, and so you're not completely out of the woods in terms of having to collect some low-level data on the robot. Great. So this is an example of how we can use natural supervision to augment data and get much better performance in a very cheap way. Can we do something similar for other machine learning systems beyond
Starting point is 06:32:49 robotics? So say that we want to perform a classification task, an image classification task, based on the species of the bird. And we train a model to do this, and here I'm going to be visualizing the predictions that the model is getting right and the predictions that the model is getting wrong. And if we contrast the correct predictions with the incorrect predictions, one thing we might notice is that there's a little bit of a pattern, which is that a lot of the incorrect predictions, not all of them, but a lot of them, have trees in these examples.
Starting point is 06:33:25 There's a lot fewer trees on the predictions on the left. And so it would be nice if we could just verbally tell the model to pay less attention to trees. So just like how we told the robot kind of corrections, like go lower in these situations, or take a different strategy in these other situations, if you could simply verbally tell the model to correct its behavior here, we'd be able to correct the model far more efficiently
Starting point is 06:33:50 than collecting additional images and labels for those images. And so we tried to develop an interface that would allow humans to verbally correct machines in that way, where we first train an initial model, we allow people, including non-experts, to describe failures using natural language of that model, and then correct for those model failures just by using the language feedback.
Starting point is 06:34:20 So how this works is first we'll present the correct predictions and the incorrect predictions just like I showed before. And so in a didactic example where we're trying to classify squares versus ovals, where the model might be paying attention to color when it shouldn't be, this would look something like this,
Starting point is 06:34:37 where you would contrast the examples on the left and right and then try to describe verbally how the model is making a mistake by paying too much attention to the color red or the color blue, or, in the Waterbirds case, that perhaps the model is paying too much attention to the trees in the background.
Starting point is 06:34:53 And then once we visualize these model failures, we'll then ask a person to describe the model failures and then also help them understand whether that description is something that the model can actually understand and use to improve itself. And so we developed a web interface to allow users to look at these examples. We're using CLIP in the background
Starting point is 06:35:13 to help understand if the model is actually able to connect that verbal concept to the images that are in its data set. We can then compute the similarity between the text prompt and each image to figure out whether or not that text prompt is separating the correct examples from the incorrect examples.
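To make the separation check concrete, here is a minimal sketch under stated assumptions: text and image embeddings are plain NumPy arrays (in practice they would come from a model such as CLIP), and the score is defined as the best accuracy a single similarity threshold achieves at telling misclassified images from correctly classified ones. The function names and this exact scoring rule are illustrative assumptions, not necessarily what the interface in the talk computes.

```python
import numpy as np


def cosine_similarity(text_emb: np.ndarray, image_embs: np.ndarray) -> np.ndarray:
    """Cosine similarity between one text embedding and each image embedding."""
    text_emb = text_emb / np.linalg.norm(text_emb)
    image_embs = image_embs / np.linalg.norm(image_embs, axis=1, keepdims=True)
    return image_embs @ text_emb


def separation_score(similarity: np.ndarray, is_error: np.ndarray) -> float:
    """Best accuracy of one threshold on similarity at predicting model errors.

    A score near 1.0 means the phrase cleanly splits correct from incorrect
    predictions; a score near the majority-class rate means it does not.
    """
    best = 0.0
    for threshold in np.unique(similarity):
        predicted_error = similarity >= threshold
        accuracy = float(np.mean(predicted_error == is_error))
        # Allow the phrase to describe either the failures or the successes.
        best = max(best, accuracy, 1.0 - accuracy)
    return best
```

A description that scores low under a rule like this is one the user would want to reword before it is used for retraining.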
Starting point is 06:35:32 And if it is separating those examples, then we can use that to improve the model. And so if it gets a high error score, then we can directly start to use that for training. If the user finds that they aren't able to describe something that the model understands and can use to separate these concepts, then the user can iterate on their description to try to describe it in a way that the model can interpret. And so then once we have this text feedback, we'll take a very simple approach that was presented in a previous work called DFR, where we're simply going to balance the data across these different groups. So for example, if we're finding that the model is paying too much attention to trees, then we're going to balance the images that have trees in them with the images that don't have trees in them.
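One simple way to balance data across groups is to reweight examples so that each (class label, has-concept) group carries the same total weight. The sketch below does exactly that; it is an assumption about how the balancing could be done (reweighting rather than subsampling), and `balanced_weights` is a hypothetical helper, not code from the DFR or Clarify projects.

```python
from collections import Counter

import numpy as np


def balanced_weights(labels, has_concept) -> np.ndarray:
    """Per-example weights giving every (label, concept) group equal total mass.

    labels:      class label of each training image
    has_concept: whether the described concept (e.g. "trees") is present
    """
    groups = list(zip(np.asarray(labels).tolist(),
                      np.asarray(has_concept).astype(bool).tolist()))
    counts = Counter(groups)
    num_groups = len(counts)
    # Each group sums to 1 / num_groups, so the weights sum to 1 overall.
    return np.array([1.0 / (num_groups * counts[g]) for g in groups])
```

The resulting weights can be passed to a weighted sampler or a weighted loss during retraining, so that the concept is no longer predictive of the label.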
Starting point is 06:36:15 And this will decorrelate the data such that it no longer is incentivized to pay attention to trees. And then once we decorrelate the data, we'll then kind of retrain or fine-tune on that decorrelated data to get a model that stops paying attention to that piece of feedback. And then, if desired, you can in principle also iterate on that process to then identify any new model errors that popped up. Great. So in our experiments, we tried to identify first if non-experts could actually identify and describe model errors in a way that led to improved robustness of the model.
Starting point is 06:36:51 And second, whether or not we could scale this sort of approach even to very large data sets, to be able to cheaply provide supervision that can identify model failures in these large-scale situations. And so in the first case, we recruited 26 participants on a crowd platform with very minimal qualifications, like being native English speakers and so forth,
Starting point is 06:37:14 and so likely people that are not machine learning practitioners. And we had them interact with Waterbirds and CelebA. And what I'm showing here is all of the different verbal descriptions that each participant identified for describing model failures. And the black line is showing the performance of Yunho, the lead student researcher on this project, or the error score that he was able to get for his reference phrases. And what we can see here is that in a lot of the cases, the human non-expert participants are able to identify
Starting point is 06:37:47 model failures fairly accurately compared to Yunho. In these examples, they're able to identify the correct concept underlying the model failure, but they might have some suboptimal wording compared to the wording that Yunho used. There's a number of examples where they find basically the same phrase that Yunho used. And then there's also examples where actually the non-experts provided better descriptions of model failures than Yunho's reference prompt. Then lastly, there's also a few cases, specifically four cases, where participants struggle to identify the correct model failure. And then once we have these descriptions, then the question is, how cheap is this supervision? And can we use this cheap supervision to actually improve the model?
Starting point is 06:38:35 So first we found that on average, these non-experts were taking two to three minutes to give feedback to the model. So this is pretty fast and a lot faster than collecting additional labeled data. And second, we found that if we use their descriptions to rebalance the data and retrain, we were able to get a model performance shown in yellow that is a lot more robust than simply training on the original data set or zero-shot prompting approaches. And so we see, in this case, specifically,
Starting point is 06:39:07 a 7 to 10% improvement over training on the initial data set, just with two to three minutes of additional supervision from a non-expert. And then beyond these somewhat simple data sets, what about data sets like ImageNet? We didn't run a user study on this specific thing, but we found that using this interface, Yunho is able to fairly quickly identify model failures
Starting point is 06:39:34 on ImageNet. And so, for example, here are some examples all from the same class, actually from the sliding door class, where the model is doing very well on these images and very poorly on these images. And as you might notice, these examples on the right have a high similarity with cars
Starting point is 06:39:52 and a lower similarity with cars on the left. And so it's kind of struggling to classify sliding doors if they're sliding doors on cars. And he's able to identify model failures on 31 different classes in ImageNet and able to do so in a relatively short period of time. And with data reweighting, he was able to improve the performance of the model
Starting point is 06:40:16 on the minority split of the data while preserving overall performance. And so with this sort of approach, we found that we're able to give verbal feedback based on an initial trained model, and because it's based on the model that's already been trained, similar to the robotic setting,
Starting point is 06:40:34 it's actually easier to target model failures efficiently rather than simply trying to out-of-the-blue guess the kinds of supervision that the model might need. And second, the verbal feedback that we're giving, it's no longer data point level. It's actually at a more global concept level. And this means that the verbal feedback is especially cheap because with a single sentence or a single phrase,
Starting point is 06:40:56 we're able to actually address a broader class of model failures rather than providing individual examples. And then importantly, I'll also mention that in this most recent work, we're only identifying and correcting one kind of model failure. This means that the scope of the work is quite limited, but it would be really exciting to see if we could use this kind of high-level verbal feedback for other kinds of model failures and develop a more general approach for improving models just with verbal feedback.
Starting point is 06:41:27 And so kind of the takeaway from both the YAY Robot work and the Clarify interface is that natural supervision like language supervision, if the model can use it well, can be far cheaper and sometimes even more informative than collecting a large number of labeled examples. And so it's a useful tool to have when we don't have a great deal of initial training data. Great. And so now another example of some data that's essentially out there, where we just need algorithms to be able to use it well,
Starting point is 06:42:05 is data from other sources beyond the target application. And so specifically one natural thing to do to improve generalization for a particular application is to leverage internet data, to leverage models trained on text and images. One very common way to do this is just to use, for example, an encoder pre-trained on ImageNet. And we find that, at least in robotics applications, it does improve performance somewhat compared to just training from scratch. And especially we can do well on tasks and scenarios that are seen in the training data set. But when evaluating on generalization to unseen objects, backgrounds, and environments,
Starting point is 06:42:50 there's still a really substantial gap compared to the things that it saw during training. And yet the internet has really vast training data, and so we expect that maybe we could do better than this. Specifically, maybe if we could more closely connect the pre-trained model with the downstream task, we might be able to more effectively leverage all of the rich knowledge that exists in internet data. So specifically what we're going to do is we're going to take a visual model. Instead of taking a model trained just on ImageNet classification, we'll take a model trained for visual question answering.
Starting point is 06:43:26 And we can formulate the downstream tasks, specifically the robotic control problem, as a visual question answering problem. And so instead of having it output continuous values, we're going to frame it as a question, what should the robot do to do a task, like to pick up the chips or to move a bottle upright? And then we'll likewise also frame the output of the model as a series of tokens, similar to the output of a VQA task. And these tokens will correspond to different language actions, like how to translate and rotate the gripper of the robot. And if we formulate essentially this downstream task just like the tasks that are seen during pre-training,
Starting point is 06:44:09 perhaps it will be able to leverage the pre-training data more effectively and understand how to generalize robotics tasks similar to how it generalizes these VQA tasks. And so once we have this data, we'll use the same architecture, specifically a pre-trained vision language model. And you can either fine-tune it just on the robot VQA tasks, or on a combination of the robot tasks and the existing internet VQA data that the vision language model was pre-trained on. And so it will output these language tokens that will then be converted into robot actions to be run on the robot. So essentially we're posing robotic control
Starting point is 06:44:50 as a visual question-answering problem and defining tokens corresponding to robotic actions. And we'll refer to this kind of fine-tuned model as no longer a vision-language model, but a vision-language-action model, in the sense that some of the tokens are now representing actions. So now if we go back to this example of how well it does if we're just using a pre-trained
Starting point is 06:45:15 ImageNet encoder, we find that models that use this sort of vision-language-action recipe are actually able to generalize far better than the model that is pre-trained just on ImageNet classification. So essentially by connecting the pre-trained model and the downstream task, we're able to get better generalization. Now, what does this look like for more recent state-of-the-art models? So we can also compare state-of-the-art models that use standard pre-training or no pre-training to recent vision-language-action models like RT-2-X and OpenVLA.
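The idea of representing robot actions as output tokens, described above, can be sketched as a uniform discretization of each continuous action dimension into a fixed number of bins, with the bin index serving as the token. The bin count of 256, the action bounds, and both function names are assumptions made for illustration; the talk does not specify how any particular model tokenizes actions.

```python
import numpy as np

NUM_BINS = 256  # assumed vocabulary size per action dimension


def actions_to_tokens(action, low, high, num_bins: int = NUM_BINS) -> np.ndarray:
    """Map a continuous action vector to one integer token per dimension."""
    action = np.clip(np.asarray(action, dtype=float), low, high)
    scaled = (action - low) / (high - low)           # now in [0, 1]
    tokens = (scaled * num_bins).astype(int)
    return np.minimum(tokens, num_bins - 1)          # keep the upper edge in range


def tokens_to_actions(tokens, low, high, num_bins: int = NUM_BINS) -> np.ndarray:
    """Map tokens back to continuous actions using each bin's center."""
    centers = (np.asarray(tokens, dtype=float) + 0.5) / num_bins
    return low + centers * (high - low)
```

With bounds of -1 and 1, the round trip loses at most half a bin width per dimension, which is the precision cost of treating control as next-token prediction.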
Starting point is 06:45:53 And we'll be doing this on evaluations that focus on generalization. And what we find is on two different robot platforms, the vision-language-action models shown in red and green do substantially better on average than the models that don't use this vision language model pre-training and don't use this formulation that formulates the downstream task very similarly to the pre-training task. So again, even with these state-of-the-art models, we see this trend that generalization improves significantly if we connect the pre-trained model with the downstream
Starting point is 06:46:30 task. And so going back to trying to handle data scarcity, we can leverage data that already exists, data from the internet, that's easy to get, and we can leverage it much more effectively if we connect the pre-trained model with the downstream task. Great. And then lastly, I want to talk about incorporating data from test time. So specifically thinking about whether,
Starting point is 06:47:00 if we are in a new situation that's not represented well in our training data, can we adapt on the fly? And I think this is a really important problem, because when machine learning systems are faced with the real world, there's a vast number of objects, vast number of configurations and scenarios that these machine learning models will be faced with. And I don't think we can even hope to anticipate
Starting point is 06:47:20 every possible scenario that these machine learning models are faced with. And because we can't anticipate it, then maybe instead we can just adapt after the fact when we see more data from that situation. So for example, say that we're trying to open a door. Maybe this is a new door that we haven't seen before. If we're trying to do this, we might make a mistake and might need a retry. And it turns out that this is a video of a human opening this door, and it was quite subtle,
Starting point is 06:47:51 but the human actually did make a mistake and adapt very quickly. So let's replay the video, and specifically we see that the human puts the key into the door, actually puts it in the wrong place right here, and then continues by taking the key back, and then putting it in the correct place. And so even humans are making mistakes and adapting. And so if even humans, which in many ways are sometimes a gold standard compared to machine learning, are adapting,
Starting point is 06:48:21 can we develop machines that can adapt in a similar way? So let's look at this in the context of a robotics problem. This is a scenario that's unseen to the robot. And so the robot is here, and its goal is to get to over here. And if it's trying to approach this problem, and it makes a mistake, can it actually retry? So the robot only gets this first-person observation right here. Without any context, if it hasn't actually attempted the task,
Starting point is 06:48:48 maybe from this observation it will try to crawl under and see where it gets from there. And then maybe if it tried to crawl and then realized that it was very close to an obstacle, then maybe it should try a different strategy. So with this context, with this previous history of what it's tried in the past, maybe it should try with the same current observation to do something different like turning left or turning right.
Starting point is 06:49:15 And so this is exactly what we'll do. We'll take these recent attempts and we'll combine them with a model that's known to be fairly good at adapting from recent attempts. Specifically, in this case, we'll use a vision language model. We'll pass these recent attempts and the robot observation into the model. We'll then have this select a skill for the robot to do and then output actions.
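The loop described here, where recent attempts plus the current observation go into a vision language model that then selects the next skill, can be sketched as follows. This is a text-only toy under loud assumptions: the observation is a caption string rather than an image, `SKILLS` is an invented skill list, and `fake_vlm` is a hypothetical stand-in that simply avoids skills already marked as failed; none of it is the speaker's system or a real model API.

```python
from typing import Callable, Sequence, Tuple

# Hypothetical skill vocabulary for a legged robot.
SKILLS = ("walk forward", "crawl", "climb", "turn left", "turn right", "back up")


def build_prompt(history: Sequence[Tuple[str, str]], observation: str,
                 skills: Sequence[str] = SKILLS) -> str:
    """Put the recent (skill, outcome) attempts in context with the observation."""
    lines = ["Available skills: " + ", ".join(skills), "Previous attempts:"]
    if history:
        lines += [f"- {skill}: {outcome}" for skill, outcome in history]
    else:
        lines.append("- none")
    lines.append(f"Current observation: {observation}")
    lines.append("Which skill should the robot try next?")
    return "\n".join(lines)


def select_skill(vlm: Callable[[str], str], history: Sequence[Tuple[str, str]],
                 observation: str, skills: Sequence[str] = SKILLS) -> str:
    """Ask the model for a skill and map its reply onto a known skill name."""
    reply = vlm(build_prompt(history, observation, skills)).strip().lower()
    for skill in skills:
        if skill in reply:
            return skill
    return skills[0]  # fall back to a default if the reply is unusable


def fake_vlm(prompt: str) -> str:
    """Stand-in model: pick the first skill not already listed as failed."""
    for skill in SKILLS:
        if f"- {skill}: failed" not in prompt:
            return skill
    return SKILLS[0]
```

Without history the stand-in keeps choosing the same first skill; with failed attempts in context it moves on to a different one, which is the behavior the talk attributes to using history.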
Starting point is 06:49:37 And ideally the vision language model should leverage what the robot has tried before and pick appropriate skills after it's made some mistakes. And so if we do this, we find that exactly on this scenario before, which is unseen to the robot, if we don't use history and don't allow it to adapt from its mistakes, it often makes the same mistake over and over again. Whereas if we do use in-context learning, it is able to try something different and adapt on the fly based on what it has seen in this test environment. And likewise, here's another setting, an outdoor setting.
Starting point is 06:50:13 This is actually quite challenging because there's this step that is quite unstable in front of the robot. And at this point in the video, the robot actually can't even see that its back legs are stuck on the step. And so it's trying to walk forwards. And if it doesn't have history, it doesn't know that walking is being unsuccessful in this scenario. But with history, it's able to figure out that it should go backwards and try to climb over the step instead of just trying to walk over it. And we also see quantitatively that leveraging test time information, leveraging these images that the robot sees at test time, improves the robot performance by more than 50%, both in terms of success rate and in terms of the time it takes to complete
Starting point is 06:50:54 a test scenario. Cool. So the takeaway is that in-context learning greatly improves the adaptability of the robot, and in turn this improves its resilience and performance in unseen situations. There's also limitations and future work with this, as with any research, including all the research that I presented. In this case, it's not necessarily clear what the best way is to ground language to the low-level locomotion policies. And also, in many cases, we might not want to use the language abstraction as the way to retry, and as the way to connect with vision language models, and so there might be interesting ways to expand on that. Great. So for this last part, we found that incorporating data and information
Starting point is 06:51:39 from test time can make up for lack of representative training data. And all of the examples that I covered in the talk are examples where more data is out there: it either already exists, or it's pretty easy to get. And we just need algorithms that can leverage things like natural supervision, pre-trained models, and test time data in order to effectively handle these new situations
Starting point is 06:52:04 or these situations that aren't covered well by the training data. Now, I'll also mention that along these three directions, there are also, I think, exciting directions for future work. I talked about one way to leverage cheap natural language supervision. But I think that in the future, maybe we can operationalize entirely new learning regimes that leverage natural supervision in a general purpose way. Moreover, I showed how we can connect pre-trained models with downstream tasks by making the downstream task look more like the pre-training problem.
Starting point is 06:52:36 But maybe in the future we could actually change pre-training in a way that makes it easier to connect with all sorts of downstream tasks. And then lastly, I showed how we can adapt at test time in a robotic scenario to make up for lack of representative training data. But there's all sorts of examples and applications in machine learning where we're interfacing at the end of the day with a human or with some other environment. And so can we also allow machines in non-robotics examples
Starting point is 06:53:03 to adapt on the fly and retry when they're interacting with a person or interacting with some other environment like a web environment? Great. And so then the last thing I'll also mention is that I discussed a number of different creative ideas for leveraging different sources of data and different sources of supervision.
Starting point is 06:53:24 There's also this question of what if we also have broader training data? I think that all of these are quite interesting even when you have broader training data. And we've seen from the regime of large language models that there's a lot of things that are quite exciting to try and do when you also have a large training data set. In the context of robotics, we've also been starting to study this problem. And so back in March of this year, I co-founded a company to help actually try to see what happens when you do try to scale up data and models in the context of robotics to try to tackle a broad range of real-world use cases and robot platforms.
Starting point is 06:54:02 And some initial results are here where we find that we can do actually pretty cool tasks, even with data that was only collected since March of this year. And then the last thing that I'll mention is that I talked a lot about finding new forms of data like natural supervision or data at test time, and these are things that are actually quite widely applicable and make the overall problem easier. But a lot of our machine learning benchmarks
Starting point is 06:54:31 actually aren't necessarily designed for these kinds of ideas or these kinds of algorithms that leverage different forms of supervision or data. And so it may actually be the case that in some scenarios, those benchmarks might actually be harder than the problems that they're trying to represent. Because they don't necessarily allow for you to use other forms of supervision or data. And so perhaps by understanding the context surrounding different real applications that we're trying to study, we might find new and interesting ways to find data or new and interesting
Starting point is 06:55:01 problem settings and also make more progress as a whole. Great. So I'll leave you with that. I'd like to mention that all the work that I presented was done with a really fantastic set of collaborators. I'd especially like to highlight the students that led the work that I presented. Yunho led the Clarify work. Lucy led the YAY Robot work. Annie, Alec, Andy and Govind led the test time adaptation work. And Mujin, Carl and Sid led the OpenVLA project, and happy to take questions. The last thing we want to highlight in this epic seven-hour coverage
Starting point is 06:55:35 of ICML 2024 is the new position paper track that encourages researchers to step back from individual papers to make arguments relevant to their entire field. Here is Younghyo Park, arguing that automatic environment shaping is the next frontier in RL, which we think has been the implicit argument we have been developing through the papers and talks we have been exploring this episode. Hello, everyone. Thank you for being here. My name is Younghyo Park, and I'm excited to present our position, automatic environment
Starting point is 06:56:06 shaping is the next frontier in RL. This is joint work with my colleagues Gabe and Pulkit from the Improbable AI group at MIT. To give us some context before we start, me and Gabe both come from a robotics background. And as a grad student working on robotics, I always dream about a magical box that can automatically create a robotic controller for me by simply specifying the robot environment and task I want. And I call this magical box automatic behavior generator. And before I move on, I want to emphasize the word automatic here. It means that this box should only be powered by time and compute, not by human effort.
Starting point is 06:56:45 This magical box, if realized, will serve as a core tool, enabling robots to autonomously generate behaviors on the fly, even after its deployment to people's houses. But I want to ask you all, do you think we're being a bit overly ambitious? Is our dream, this magical box, too good to be true? Well, if you think about it, this is what reimbursement learning is promising us in some sense. Reversion learning in theory is a generic purpose, automated, optimal control solver that can produce working controllers for any MDP setting. However, from a practical viewpoint who is trying to use RL as a tool to train robots, this claim is not necessarily true.
Starting point is 06:57:28 Although RL itself does not require human effort during its training process, we want to point out that there is a very heuristic, labor-intensive process that are required to make RL work in practice. And that is what we call environment shaping. When an oral algorithms fails to find a solution in practical scenarios between the choice of fixing the RL algorithm and shaping the environment to make it work, practitioners typically tend to choose the latter. The core problem of such practice is that it heavily relies on human effort. Domain knowledge for the task, intuition, and sometimes a bit of luck, is crucial to get things right. A very well-studied example of environment shaping that you might already know about is the reward-shaping problem. We all know that RL agents love to hack the reward when they can, so engineers typically go through the process of shaping the reward to prevent it.
Starting point is 06:58:24 In fact, I would say that this is the biggest reason why some people in our community hate RL so much, and I completely understand. The process of reward shaping is definitely not fun to do. Unfortunately, I want to point out today that reward is not the only thing that we usually shape. Robotics engineers carefully shape nearly every component of the environment to make RL work in practice. And again, the only optimizer that is currently known to work the best for this problem is graduate student descent, a process entirely relying on human effort. All that being said, what am I arguing today? First, I argue that the community should start prioritizing research to automate the heuristic process of environment shaping.
Starting point is 06:59:11 At the same time, we also need better RL algorithms that don't require heuristic environment shaping in the first place. And to do that, I argue that we should be benchmarking our RL algorithms on unshaped environments without any task-specific heuristics included. To better back up our argument, from now on, I'll try to give you some examples of the heavy heuristics that are involved in popular robotics RL environments and show how crucial they are to make RL work. As an example environment to analyze, we chose IsaacGymEnvs, one of the modern benchmark environments containing diverse robotics tasks. Let's first talk about
Starting point is 06:59:48 action space shaping. In the context of robotics, action space shaping is the process of choosing how to convert the action predicted by the policy to an actual command that can be sent to the motor. An unshaped action space will thus look very simple. We are just letting the policy directly predict feasible motor commands. However, most RL environments apply a bunch of task-specific heuristics to shape the policy outputs before they get passed to the motor. This example code you just saw, for instance, applies diverse scaling, clamping, moving average filters, and a PD controller at the end to finally convert the policy outputs to motor commands. The problem with this kind of shaping process is that it not only is very task-specific, but it also introduces
Starting point is 07:00:35 a bunch of extra knobs and hyperparameters to tune. Unfortunately, this action space shaping is a necessary evil for RL algorithms. We have tested that PPO, for instance, completely fails to solve these tasks if we remove such shaping. And our findings are similar for the observation space as well. Observation space shaping is basically a feature engineering problem, selecting the relevant states from what's available from the simulation to create an observation for the policy. For instance, for the task of opening a door using a manipulator, an unshaped observation space will be a simple concatenation of every raw simulation state that is available. However, typical RL environments go far beyond the simple concatenation. They introduce multiple hand-engineered
Starting point is 07:01:21 task-specific terms, and they often convert certain states with unique properties, like rotations, to a different representation that is known to be better for neural networks to process. And such processes are also very crucial to make RL algorithms work in practice. We can break the RL by just removing those handcrafted terms from the observation. Although I'm skipping the other examples of environment shaping due to time constraints, you can take a look at our paper for more comprehensive examples. Now that we have learned about the details of environment shaping and how it affects RL performance, let's talk about how we can automate this environment shaping process.
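The difference between shaped and unshaped action and observation spaces described above can be sketched roughly as follows. This is a minimal illustration and not the IsaacGymEnvs code shown in the talk: every name, gain, and state key here is hypothetical, and an exponential moving average stands in for the moving average filters mentioned by the speaker.

```python
import math
from dataclasses import dataclass, field


def unshaped_action(policy_out, torque_limit):
    """Unshaped: policy outputs are used directly as torques, limited only to the feasible range."""
    return [max(-torque_limit, min(torque_limit, a)) for a in policy_out]


@dataclass
class ShapedActionSpace:
    """Hypothetical shaping pipeline: scale -> clamp -> moving average -> PD controller."""
    action_scale: float = 0.5  # knob: policy output to joint-angle offset (rad)
    max_offset: float = 1.0    # knob: clamp on the offset (rad)
    smoothing: float = 0.8     # knob: weight of the new target in the moving average
    kp: float = 20.0           # knob: PD proportional gain
    kd: float = 0.5            # knob: PD derivative gain
    _prev_target: list = field(default_factory=list, init=False)

    def to_torque(self, policy_out, default_pos, joint_pos, joint_vel):
        targets = []
        for i, a in enumerate(policy_out):
            offset = max(-self.max_offset, min(self.max_offset, a * self.action_scale))
            target = default_pos[i] + offset
            if self._prev_target:  # smooth against the previous target
                target = self.smoothing * target + (1 - self.smoothing) * self._prev_target[i]
            targets.append(target)
        self._prev_target = targets
        # PD controller turns joint-position targets into torques
        return [self.kp * (t - q) - self.kd * qd
                for t, q, qd in zip(targets, joint_pos, joint_vel)]


def unshaped_observation(state):
    """Unshaped: plain concatenation of every raw simulator value, in a fixed key order."""
    obs = []
    for key in sorted(state):
        obs.extend(state[key])
    return obs


def shaped_observation(state):
    """Shaped: selected states, a hand-engineered relative term, and a re-encoded rotation."""
    hand_to_handle = [h - g for g, h in zip(state["hand_pos"], state["handle_pos"])]
    yaw = state["hand_yaw"][0]
    return state["joint_pos"] + hand_to_handle + [math.sin(yaw), math.cos(yaw)]
```

Even in this toy version the shaped action path exposes five tunable constants, which is the "extra knobs" problem the talk points to.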
Starting point is 07:01:57 Automating environment shaping is a challenging problem for many reasons. One of the major problems is that there is no compact way of parameterizing the vastly diverse ways of doing environment shaping. If we assume a fixed functional form for everything, we can try extracting the coefficients and do some classical hyperparameter optimization on top of it. But this is a very limiting way of representing these shaping functions. Therefore, people have recently started to think about a more flexible way of representing these shaping operators. One of them is to use Python code itself as a way of representing these functions. This allows us to view environment shaping as a code optimization problem using large language models.
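Treating a shaping function as Python source that a proposer rewrites and an evaluator scores could be sketched as below. This is a toy under stated assumptions, not the method from the paper: `propose_candidates` is a random stub standing in for a large language model call, `evaluate` is a stand-in for a full RL training run, and the template, the target value 0.3, and all function names are hypothetical.

```python
import random

# Hypothetical shaping function expressed as Python source code.
TEMPLATE = (
    "SCALE = {scale!r}\n"
    "\n"
    "def shape_action(a):\n"
    "    return max(-1.0, min(1.0, SCALE * a))\n"
)


def compile_shaping_fn(source):
    """Turn candidate source code into a callable shaping function."""
    namespace = {}
    exec(source, namespace)
    return namespace["shape_action"]


def propose_candidates(parent_source, n, rng):
    """Stub for an LLM: perturbs the parent's SCALE constant to produce n new sources."""
    namespace = {}
    exec(parent_source, namespace)
    base = namespace["SCALE"]
    return [TEMPLATE.format(scale=base + rng.gauss(0.0, 0.2)) for _ in range(n)]


def evaluate(source):
    """Stub for RL training: scores how close the shaped output is to a made-up target."""
    fn = compile_shaping_fn(source)
    return -abs(fn(1.0) - 0.3)


def optimize_shaping_code(initial_source, iterations, samples, seed=0):
    """Sampling-based search: propose candidates, score them, keep the best so far."""
    rng = random.Random(seed)
    best_source, best_score = initial_source, evaluate(initial_source)
    for _ in range(iterations):
        for candidate in propose_candidates(best_source, samples, rng):
            score = evaluate(candidate)
            if score > best_score:
                best_source, best_score = candidate, score
    return best_source, best_score
```

Because candidates are whole source strings, a real proposer would be free to change the structure of the function and not only its coefficients, which is the flexibility that a fixed functional form lacks.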
Starting point is 07:02:40 This paper called Eureka is a good example of using large language models as a sampling-based optimizer to automate the reward shaping process. So we have conducted some experiments to see whether the proposed automation method using LLMs can be extended to other shaping components. And as you can see over here, models like GPT-4 were able to successfully shape the action and observation spaces with performance similar to humans. However, interestingly, when we asked GPT to shape multiple components jointly at the same time, the performance dropped dramatically. And this can be a critical problem, since our experimental findings suggest that optimizing individual components one by one in a sequential manner often leads us to locally optimal performance. All that being said, I believe we still have a long way to go to fully automate the process of environment shaping. Now that we have discussed all the aspects of environment shaping, let's discuss the path forward.
Starting point is 07:03:35 Recall that I was advocating for research focused on either automating environment shaping or developing better RL algorithms. To support both directions of research, we have created a code base, which basically contains a collection of unshaped robotics environments that people can test their RL algorithms on, with nice little APIs and tools to facilitate the research of automating environment shaping. And before I wrap up my talk,
Starting point is 07:04:02 I want to discuss possible counterarguments that people might have against ours. Going back to the beginning of my talk, I shared the dream I have, creating a magical box that can automatically generate closed-loop controllers for robots. And then I kind of implied that reinforcement learning will be powering this magical box in the future.
Starting point is 07:04:23 However, I think some people might disagree with this. Especially considering the resurging popularity of doing manual data collection and imitation learning, some people might think that our dream of the magical box will be realized not by automating RL, but by training some huge foundation model that consumes all these datasets collected by all these companies. However, I still believe in the power of RL as a tool to generate robust, generalizable, and especially superhuman behaviors that cannot be easily achieved with imitation learning. And also, the behaviors generated by RL pipelines can be used to
Starting point is 07:05:01 train those foundation models as well. Therefore, I argue that making RL easier to use will enable a virtuous data cycle for training better embodied intelligence. And with that, I would like to wrap up my talk today, and I'm happy to engage in exciting discussions about our... Thank you. And that's a wrap for ICML 2024 Part 1, our coverage of Generative Video WorldSim, diffusion, vision, reinforcement learning, and robotics. We're busy preparing for Latent Space LIVE! at NeurIPS 2024 in Vancouver, so grab your tickets at lu.ma/LSLive and see you there.
