a16z Podcast - Text to Video: The Next Leap in AI Generation

Episode Date: December 20, 2023

General Partner Anjney Midha explores the cutting-edge world of text-to-video AI with AI researchers Andreas Blattmann and Robin Rombach. Released in November, Stable Video Diffusion is their latest open-source generative video model, overcoming challenges in size and dynamic representation. In this episode, Robin and Andreas share why translating text to video is complex, the key role of datasets, current applications, and the future of video editing.

Topics Covered:
00:00 - Text to Video: The Next Leap in AI Generation
02:41 - The Stable Diffusion backstory
04:25 - Diffusion vs autoregressive models
06:09 - The benefits of single step sampling
09:15 - Why generative video?
11:19 - Understanding physics through AI video
12:20 - The challenge of creating generative video
15:36 - Data set selection and training
17:50 - Structural consistency and 3D objects
19:50 - Incorporating LoRAs
21:24 - How should creators think about these tools?
23:46 - Open challenges in video generation
25:42 - Infrastructure challenges and future research

Resources:
Find Robin on Twitter: https://twitter.com/robrombach
Find Andreas on Twitter: https://twitter.com/andi_blatt
Find Anjney on Twitter: https://twitter.com/anjneymidha

Stay Updated:
Find a16z on Twitter: https://twitter.com/a16z
Find a16z on LinkedIn: https://www.linkedin.com/company/a16z
Subscribe on your favorite podcast app: https://a16z.simplecast.com/
Follow our host: https://twitter.com/stephsmithio

Please note that the content here is for informational purposes only; should NOT be taken as legal, business, tax, or investment advice or be used to evaluate any investment or security; and is not directed at any investors or potential investors in any a16z fund. a16z and its affiliates may maintain investments in the companies discussed. For more details please see a16z.com/disclosures.

Transcript
Starting point is 00:00:00 When I first sampled this model, I was actually shocked that it works so well. The pure improvements in performance in text understanding of these models... Is it possible to derive something like a physical law from such a model? I think a really important part of this is the fact that these models have been accessible to everyone. It learns a representation of the world. Today, many people are familiar with text-to-text and text-to-image AI models. Think ChatGPT or Midjourney. But what about text to video?
Starting point is 00:00:33 Well, several companies are working to make that a reality. But for many reasons, it's a lot harder. For one, their size. Just think, you'll often find text files in the kilobytes. Images? Maybe a few megabytes. But it's not uncommon for high-quality video to be in the gigabytes. Plus, video requires a much more dynamic representation of the world that incorporates the physics of movement, 3D objects, and more.
Starting point is 00:00:57 Imagine the hand challenge in text-to-image, but in this case, it's hands squared. But this is not stopping the researchers behind Stable Video Diffusion, which was released on November 21st as a state-of-the-art open-source generative video model. And today, you'll get to hear directly from two of the technical researchers behind that model, Andreas Blattmann and Robin Rombach. Robin, by the way, is also the co-inventor of Stable Diffusion, one of the most popular open-source text-to-image models. So in today's episode, together with a16z General Partner Anjney Midha, you'll get to hear firsthand what makes text-to-video so much harder, the challenges like selecting the right datasets that enable realistic representations of the world, applications where this technology is actually already being put to use, plus what the video editor of the future
Starting point is 00:01:47 might look like, and how constraints continue to spur innovation and ultimately keep this field moving. And if you like this episode, our infrastructure team is coming out with a lot more AI content in the new year, but in the meantime, you can go to a16z.com slash AI for our previous coverage. All right, the first voice you'll hear is Anjney, then Robin, then Andreas. Enjoy. As a reminder, the content here is for informational purposes only, should not be taken as legal, business, tax, or investment advice, or be used to evaluate any investment or security
Starting point is 00:02:20 and is not directed at any investors or potential investors in any a16z fund. Please note that a16z and its affiliates may also maintain investments in the companies discussed in this podcast. For more details, including a link to our investments, please see a16z.com slash disclosures. This is a conversation I've been super excited about for a while. Maybe we can start with just a brief overview of your team, your research lab, and for listeners who are unfamiliar, maybe spend just a couple minutes talking about what stable diffusion is and what stable video diffusion is. Absolutely.
Starting point is 00:02:58 Thank you for having us. What is stable diffusion? Stable diffusion is a text to image model, a generative model that means you type in a text prompt and it generates an image based on that. In particular, stable diffusion is, as the name suggests, a diffusion model. Diffusion models are a type of generative models, which has been super successful recently for image generation. and it's based on a technique that we developed while we were still at the university.
Starting point is 00:03:25 So me and Andreas and Patrick and Dominic, all in the same team now at Stability. We are a multimodal company and our specialty is to produce and publish models, try to make them as accessible as possible. That includes publishing weights and making foundation models for all kinds of modalities, not only images, but also video available and enabling research on top of that. So we have seen that stable diffusion was super successful, I would say much more successful than we initially anticipated. And there are like hundreds, if not thousands of papers that are building on top of that.
Starting point is 00:04:06 We in particular, our group is focused on visual media. So that is images, that is videos, and stable media diffusion that you just introduced. It's kind of the next iteration. It's our first step into the video domain. when we published a model that can take in an image and turn that into a short video clip. Maybe we could spend a couple minutes on a brief overview of diffusion models.
Starting point is 00:04:29 That might be helpful. How do diffusion models differ from other types of generative models and techniques like auto-aggressive models, if you could just give us a little bit of context before we dive in? Fusion models are really the to-go models right now for visual media images and videos.
Starting point is 00:04:44 They're kind of different to auto-aggressive models because they don't represent data as a sequence of tokens, which we know from auto-aggressive models. And since images and videos are composed as a pixel grid, this is really a good beneficial property. Also, they favor perceptually important details, which is inherently baked into these models because their learning objectives are tuned to favor these important aspects of images
Starting point is 00:05:11 as we perceive it as humans. And that is what we actually want, right? but they also have some commonalities with auto-aggressive models. Both order-aggressive models as well as diffusion models, they are iterative in their nature, but as opposed to order-aggressive models, which iteratively generate token by token, or word-by-word for language,
Starting point is 00:05:34 these models gradually transform in small steps, so they gradually transform noise to data. One point to add to the difference, Maybe in diffusion models, you train the model on, initially you used like a thousand different noise levels between data and like pure noise. But the interesting thing is that at sampling time, you can actually use less steps. You can use like 50 steps. We have published a distillation work a week ago that actually shows that you can go as low as one sampling step, which is, I would say, a big advantage of these diffusion models. For folks who may not be familiar with why a single-step sampling breakthrough is important,
Starting point is 00:06:15 could you say a little bit about what benefits that leads to for creators or users of the model? Oh, yeah, absolutely. I think the most intuitive thing is that you actually see what happens while you type in your text prompt. So think of like this text image model. You type in your prompt. One and a half years back, you had to wait for like a few seconds, maybe even up to a minute. Now you see what happens. And the quality is even better than what we had like with the first iteration of stable diffusion.
Starting point is 00:06:41 So super exciting actually to see that kind of trajectory, these kind of developments. Like when I first sampled this model, I was actually shocked that it works so well. To keep pulling on that thread for a bit, if we rewind the clock back to a year and a half ago, which is when you guys first put out stable diffusion, between then and now, what has surprised you most about image models that you didn't expect. The pure improvements in performance in text understanding of these models in spatial compositionality
Starting point is 00:07:15 of what these models can do just by typing in a single prompt you can describe a scene really, really fine-grained and it gives you a highly detailed visual instantiation of it. The developments has been huge we published SDXL in June
Starting point is 00:07:30 and even then it was a huge improvement in visual quality in prompt following. also other models which we see right now is like most reason DALI 3 is like a huge improvement still but also as Robin said that there has been a lot of different Sampras proposed to make these models faster
Starting point is 00:07:46 and faster and faster and right now we're getting really close to 50 steps performance and even one step. This is a huge improvement and I think a really important part of this is the fact that these models have been accessible to everyone. So open sourcing
Starting point is 00:08:02 a foundation model as stable diffusion initially that led to a whole lot of research on these models, which was, in retrospect, extremely important to do this. I think otherwise we wouldn't have seen the improvements we saw until now. Even before that, I was surprised that text image with diffusion models work so well just before we published a model. Like when I first saw this myself, we had this latent diffusion approach that we developed at the university.
Starting point is 00:08:27 I mean, we got a machine with like 80-gibite A-100s just after we put it in archive, and then, yeah, immediately you started working on hey, we want to have this text image model, but not train it on one GPU. Let's use all of, like, let's use our little cluster with 80 gigabyte A100s. We trained this latent diffusion model on 256, but 256 pixels. It was the first time that we had to deal with large scale data loading and these kind of things. And then using this model, combining it with classifier-free guidance, which is a sampling technique that further improves sample quality at basically no cost.
Starting point is 00:09:02 I was like really surprised that we could do this. on our own and achieved a pretty good model, I would say. And then, like, two days later, Open AI published Dolly 2, and all the hype was gone, but it was a pretty nice experience. You know, something you mentioned, Andreas, is that the fact that you guys chose to release stable diffusion as an open source model resulted in this crazy ecosystem exploding around your research, which is just something that doesn't happen as quickly or as fast with models that aren't open source.
Starting point is 00:09:36 And so in the last year and a half, one of the things that's been really fun, at least for me to watch, is all the really surprising things that developers and creators have done with the base model that you guys put out. You've provided folks a set of Lego blocks that they can mix and match in different ways, things like ControlNet that give people more controllability, allowing your community to build their own front end.
Starting point is 00:09:59 And out of all of that, I'm sure, came a ton of requests. As you guys were prioritizing all those asks that came in from the world in the community, why was stable video the thing that you guys decided to prioritize? Video is an awesome kind of data because to solve that task, to solve video generation, a model needs to learn much about like physical properties of the world or the physical foundations of the world. There is so much without knowing about, for instance, 3D scenes, you cannot generate a camera pan around an object or you cannot make an object move. If a person turns around, the model needs to hallucinate how this person looks from behind, right?
Starting point is 00:10:42 So to know so much about the world by just including that additional temporal dimension, this is what fascinated me most on working on videos. It's also really next level of computational demands because you have an additional dimensionality, which makes everything much harder, I think. And yeah, I think we like challenges. That's why we probably focused on doing that. Yeah, something that's not known about you guys is by background, originally, I believe you're physicists. Yeah, I'm a physicist, but I haven't done much physics in a while, unfortunately. I'm originally a mechanical engineer, but that is also really related to physics, and I was always inspired by physics and really fascinated by it.
Starting point is 00:11:19 Well, both of your backgrounds academically were spent studying the physical world. And I just think it's poetic that your primary interest in generative modeling came from trying to understand at some deeper level of the physical world. And it seems like that seems to have motivated at least some of the intuition and the research around your approach to stable video. Yeah, absolutely. I fully agree. And we're just scratching the surface with the kind of video models that we have right now. Having something like we are seeing in language modeling, but trained on pixels on videos will probably give like super interesting downstream behavior. not only like generating videos, but also understanding of the world. Like, is it possible to derive something like a physical law from such a model?
Starting point is 00:12:02 I don't know. Or such a model is also always predictive. So you can start with an image or with a sequence of images and try to predict what happens next, of course. And then I think like also coupling this with other modalities such as language will maybe provide a way to ground like these models more in the physical. the world. I think that's a good segue into what is the main focus of today's conversation, which is generative video. To folks who are early users of stable diffusion, stable video was a much awaited sort of natural progression from the original model. Just take us back a little bit
Starting point is 00:12:40 to the original sort of conception of the project. How long have you guys been working on video modeling? I would say roughly half a year. And like for this model that we just put out, I think the main challenge was that we actually had to scale the dataset and the data loading. So if you train a video model on a lot of GPUs, you suddenly run into problems that you did really have had before. Like loading high resolution videos is just like a difficult task if you do it at scale. Also only decoding videos is really hard. Like a data loader has to transform like spites that loads into a suitable representation for the model. And to do so, you have to do a lot of computational work to transform it into a suitable input.
Starting point is 00:13:23 sample for the generative model. And this is competitionally really expensive. And since we have so fast GPUs right now, it was really like the CPUs, we're just like in the beginning too slow. Building an efficient data pipeline for video was really a challenge, which we, I think, solved really well, which probably took most of the time we spent on building and scaling these video models. And actually, there's like interesting bugs that you can encounter during training.
Starting point is 00:13:50 So we had one where you have your data and then you have. add noise to that data that the model tries to remove, right? And if you do that on a video, you add noise to each frame of the video, and then we had a bug where we added, like, different amounts of noise to different frames in the video, which just complicates the learning task unnecessarily. Things like this, it's just like one line of code that can go wrong. What was the biggest difference between the image model research and your video work? Because noise sampling and noise reduction, these are sort of different.
Starting point is 00:14:23 diffusion techniques that are shared across images in video, but it would be helpful to understand what were unique to the video challenge. First of all, the pure dimensionality of videos. I mentioned that before with this like additional dimension. This introduces, of course, a higher GPU or memory consumption. And this was really a challenge. So for diffusion models, it's really important to have a high batch size because you can approximate the gradient, which thrives, the learning much better if the batch sizes is higher, especially for diffusion models,
Starting point is 00:14:56 it's like a really an important thing to have a really high batch size, but have to increase your number of GPUs, which again introduces new challenges in terms of scaling, in terms of like redundancy in your training pipeline, if something breaks somewhere
Starting point is 00:15:14 in one GPU, it will just like throw down the entire training. And the more GPUs you add to your cluster, And the more chip you use you train, the higher the probability will be that somewhere there's just like a, say, a hardware failure, even, which also happens. This additional dimensionality just introduces these new scaling challenges, which are really, really interesting to come by, I would say. Well, that's very helpful. I think one of the most valuable things that your guys' research has done for the industry is that you often share in very excruciating detail some of the infrastructure challenges that came with training. And I think since scaling models at the magnitude that you guys are as a relatively new infrastructure challenge, I think it's very, very helpful for other researchers to be able to hear the sort of nuts and bolts that you had to figure out. out, right, to get these models out. Then there's a whole other set of data related challenges
Starting point is 00:16:13 that aren't about the data pipeline per se, but it's about the representation of the data set curation, the data set mixture. Could you guys just talk a little bit about how you approached picking your data set for this release and what was your intuition and what kinds of data were most important, how you wanted to filter it, and ultimately what ended up being your most important learnings around the data set when it came to training stable video? We actually spent a lot of time talking about this in the paper that we just put out. What we also define in this paper is that we can divide this training process into three stages. And the first is that we actually train an image model. So for training video models, it's usually just helpful to reuse
Starting point is 00:16:54 the structural spatial understanding from image models when there are like powerful image models. We should reuse for then training the video model. And then there's next steps. So having a image model, like Stability Fusion, for example, you have to get this additional knowledge about like the temporal dimensionality and about motion, right? So for that, we train on the large data set that we still have to create a bit. So we don't want, let's say, optical characters. We don't want like text in the video. We want nice object motion. We also want nice camera motion. So we have to filter for that. And yeah, we do this in like two regimes. We train on A lot of videos in the first stage, and in the second stage, we train on a mock-rated,
Starting point is 00:17:38 very high-quality, smaller data set to really refine the model. And it's similar to image models where you retrain on a large dataset and then refine on a high-quality dataset. There was a paper recently that Meta put out that also describes just this process for image models in detail. One of the largest open questions in video generation for a while has been structural consistency, right, of 3D objects. when the camera is spanning around a person or a car or any subject to make sure that it stays
Starting point is 00:18:08 and looks like the same subject from various angles has been a challenge for regenerative video. How did you guys approach that? You mentioned in the paper that 3D data and multi-view data was important. Actually, I think the main point we want to make in the paper is the one that we talked about earlier. Like having a foundational video model actually gives us much more than just a model that you can generate nice looking clips or videos, right? It learns a representation of the world. And one aspect of that is that we tried to demonstrate in the paper, given a video model, which has seen a lot of
Starting point is 00:18:43 objects from different views, lots of different camera movements, it should be much more easy to turn that into a multi-view model. And that's kind of the main message. So we take the pre-trained video model, which has seen a lot of different videos, a lot of different camera movements, and we fine-tune that on very specialized multi-view orbits around 3D objects and turn the video model into a multi-view synthesis model. And that works pretty well. So one of the dominating approaches before that was that you would take like an image model, like stable diffusion, and turn that into a multi-view model. But yeah, we showed that it's actually helpful to incorporate this implicit 3D knowledge that is captured in all of the videos into the model, and then the model can learn
Starting point is 00:19:25 much quicker than if you start from the pure image model. So that's kind of the main message. But you're right, you can also try to use this explicit multi-view data in the video training or maybe even something that we do in the paper, train Loras explicitly on like different camera movements and then put this Loras back into the video model. So you get control over the camera for your very general video model, which is quite cool. Yeah, so this I found was one of the coolest pieces of the paper was incorporating Loras for fine grain control in the creation process. Could you maybe give us a quick overview of what Loras even are conceptually, intuitively, and what led you to intuition that Loras would be an important part of the architecture?
Starting point is 00:20:11 Loras are just like really lightweight adapters which fine-tuned onto an existing base model, which adapt the attention layers. and by that you can just like on a small really highly specialized data set, you can tune in a really, really lightweight way different properties into the model. And in this case, we just like tune different kinds of camera motion into our video model. So if we use a small dataset which only contains like zooms or pannings to the left or to the right, we can actually tune such a Laura as a small adapter to the potential layers of our model. to just like get exactly this behavior and this is a really awesome way
Starting point is 00:20:52 to just in a really lightweight way fine tune these foundational models and it has shown to be like really effective and accordingly it's like really highly appreciated in the community I would say to get these kind of easy fine tunes. Yeah and I think for image models it's like extremely popular there's so many different doras that people plug into these models
Starting point is 00:21:12 for video models our goal was just to demonstrate that this is something that's possible. It's just like at the, the beginning and there's much more that should be possible like very specialized kind of motions so I think there's a lot of creative possibilities that's actually worth exploring for a little bit one of the windows that you guys have into the future is by understanding where the research is going you get to time travel and kind of get a glimpse into the future of creativity and so having seen how effective lauras are at least at a few set of tasks like motion control right so in the paper you
Starting point is 00:21:46 propose using Laura's for camera control, banning, zooming, et cetera. The history of video creation has usually required creators to have a ton of different knobs and dials in their software that they use, right? Whether it's an Adobe After Effects or some other professional software, you literally have hundreds of dials and buttons that you can use to control and edit these videos. And conceptually, should people think about Laura's as mapping to these controls in the future will a director or creator of videos basically be relying on hundreds of different lauras to express the control they want over the video or do you think fundamentally lauras will hit some scaling sort of limit and that's the wrong analogy to use how should creators think
Starting point is 00:22:31 about these new tools that you've given them yeah i think you actually said it writes maintaining like a library of hundreds of loras is maybe not like the most scalable approach Actually, if you look at the model that we put out now, it's just like taking an image and animating that, right? Then we can do some stuff like with these lowers, but what you actually want, I think, is given the image and some text prompts do exactly what I describe in a text prompt. There's already some work that explores that. But yeah, giving like more control over what happens in the video, be it through lores, but maybe through a text prompt or through like spatial motion guidance, like in runway's motion brush. There are different ways of doing that, but you definitely want more control over this whole creation process. And then I think you're at the stage where you can really start to generate personalized individual content.
Starting point is 00:23:23 Like especially probably for video creation, we want something like with the image models, you want like very fast synthesis. Because then this will become more like, I don't know, sometimes I think about this as like a video game, right? You type your prompt and you immediately see what happens given your input view. And I think this might be a super nice user experience, actually. So we want this additional control, and we want fast rendering, fast sampling, fast synthesis. You know, you said earlier that you're hoping that the community explores more things. Now that you've actually put the model out there, they're going to be developers and creators who listen to this podcast. What would you like them to explore first and most intensely?
Starting point is 00:24:01 Well, I think just trying out the model, rendering some awesome stuff, of course. Also further exploring maybe the representation we built. in the paper we mentioned that we trained this model on a whole lot of data and this has just seen really, really much motion and the representation is really fruitful. We showed that by our 3D fine tuning. By the way, this was completely surprising for me, seeing that model after 1,000, 2,000 iterations,
Starting point is 00:24:28 like already getting 3D reasoning. This was really, really nice. So as we saw that, it will be extremely interesting to see other such approaches. The model is open source. People can try it and give it another couple of weeks and then we will see what happens. But I'm excited for it. You know, my personal favorite for what people did on day one
Starting point is 00:24:47 was obviously animating memes. I'm sure you guys have seen all that. That was really funny. What are your guys' favorite creations so far that you've seen? Anything that jumps to mind? I think the one I have always forget the name of that meme where the man is looking after another woman.
Starting point is 00:25:04 The man looking behind, right? The man looking behind, yeah, that one. We'll try to put a visual of it up. Yeah, exactly. I think it just visualizes very nice additional experience that a video model can provide, right? People are used to like 2D memes, but then I can actually try to animate this and see what happens. Also, if you think about famous artworks or something, just bringing them to life is a really, really nice property. And it's now enabled.
Starting point is 00:25:29 Everyone can just like poke around a bit of Mona Lisa and see what she's looking from the side. Oh, that's cool. I haven't explored that one. but you're saying prompting the model with an image of a notable art piece. Just like making Van Gogh's starry night, make the stars shine and glimmer. And I think it's really cool. The world is pretty lucky that you guys have gifted the model to the developer and open source ecosystem. It's already such an incredible sort of step, leap forward, right?
Starting point is 00:25:55 In what you can do with images and with video? What do you think are the two or three biggest sort of open challenges that you guys want to prioritize next that are still limitations? video generation. I think a really important thing is to get these models to generate longer videos, to process longer videos in general, not only generate them, also see them. Because I think eventually processing longer videos is key to understanding what we talked about earlier, fundamental aspects of this world, better physical properties. And so this is a really important part to enable these models to generate longer content, more coherent content, also with other kinds of motion.
Starting point is 00:26:38 And what Robin already said, I think, making them fast. We'll just like unlock so much more exploration. So this is really a nice thing. And we're actually working on it. Yeah, and there are simple things like thinking about like multimodality, adding an audio track to your generated video that is in sync with the action that is rendered. I think there is a lot of stuff to explore. So you're talking, Andrea, I said earlier about the infrastructure challenges.
Starting point is 00:27:03 If you had a magic wand, what infrastructure improvements do you wish the industry could solve for you? I mean, you could ask for more GPUs, more CPUs per GPU, and this would solve much of the data loading issues. Also, not only GPU memory is always good, but also CPU memory. But I think, like, hitting these limits is just some form of natural way. You always want to try to improve your efficiency. you always want to try to train faster and at some point you will face a bottleneck, a limit and you have to come up with a nice algorithmic way maybe
Starting point is 00:27:41 or with another way of overcoming this. For instance, for many years, data loading was not a big thing because GPUs were too slow. But now we have extremely nice accelerators with the newest H-100. It's insane how fast these GPUs are actually running, not how fast you can train models on those. And then you will just like hit the next bottleneck It's actually good to see this, that we hit limits, we have to overcome this,
Starting point is 00:28:06 and then you improve, and this is how you learn, and this is how you can make things much more efficient in the end. Yeah, it's actually, if you only rely on more compute, it's a bit boring. I think, like, compute constraints can also drive innovation, right? So, for example, the latent diffusion framework, we developed it at the university because we had, like, single GPUs where we train on, and that kind of naturally, leads to some kind of innovation. And in this case,
Starting point is 00:28:33 this is something that everyone uses right now. This is actually crazy to see. Dolly 3 uses a model that, like the auto encoder that was trained on a single GPU. This is, I think, how intelligence also arises. If you have a constraint environment, you have to come up with a smarter way of doing things. And this is how, without any limitations,
Starting point is 00:28:55 there wouldn't be those nice solutions for many problems we have right now. Yeah, no constraints, no creativity. Right. Exactly. I do think one of the underappreciated parts of your guys' group ever since your university days has been just how compute efficient a lot of your research has been. I certainly have talked to so many university level researchers, grad students, postdocs, who saw that research that you guys put out a year and a half ago with stable diffusion and felt really inspired because university and academic environments are somewhat compute constrained. And so I think even though now you have access to tons of compute,
Starting point is 00:29:31 the sort of self-imposed compute constraints, it makes me very happy to hear that those constraints are something you guys think are a feature, not a bug. I will probably keep the open-source ecosystem pretty vibrant. But in addition, you're often sort of racing and responding to other labs as well in the field. Some of these labs are much better funded than you, are bigger than you.
Starting point is 00:29:50 So how do you think about prioritizing your research pipelines and your timelines? And how would you say that's different than labs that are largely academic. That's a very good point. I think actually this whole competition, it also drives the field of AI. It's probably very important
Starting point is 00:30:05 to not get distracted, but it's too much. But of course, since one and a half years, everyone is doing something with diffusion. It can actually be quite fun to work in this competitive environment. Everyone here enjoys doing that. It's quite fun that we have like this lab here in Germany. Actually, we compete with Open AI, Google, other research labs across the world.
Starting point is 00:30:25 It's intense, definitely. but it's a lot of fun. I think we're not a big lab, but I think we're really like all having kind of the same spirit and we're feeling like we're working on something which in the end gives not only us something. For us, it's also really cool, but also we can give something back to the community
Starting point is 00:30:44 to other researchers, which might have not the resources we have. What I love about the lab and the group you guys have put together is the philosophy of the rising tide lifts all votes, right? because you guys publish your research for the world to use. And I thought one of the coolest things about the Dali three paper was the citations list, which included your work. They were sort of thanking you for the work that the Stable Diffusion Group had put out. I think your work ends up benefiting all kinds of labs across the industry.
Starting point is 00:31:15 And so while the competition can be intense, it's also one of the best, most inspiring examples of an industry helping each other out. Unfortunately, I'm not sure the default direction of the industry is, collaborative, right? A few years ago, if you guys remember, everyone would openly share their research with each other. Luckily, for creators and developers, you guys are still bearing that torch, and I hope you continue doing that. And it comes up all the time in conversations with researchers that many of the labs we just talked about that they're very grateful for the research you guys do. So I hope you keep doing that. Yeah, me too. I just chose that it is super
Starting point is 00:31:48 important to have this kind of contribution to open and accessible models and everyone in our team is super motivated to contribute to that. So I couldn't imagine doing anything else right now. If you liked this episode, if you made it this far, help us grow the show. Share with a friend, or if you're feeling really ambitious, you can leave us a review at rate thispodcast.com slash A16c. You know, candidly, producing a podcast can sometimes feel like you're just talking into a void. And so if you did like this episode, if you liked any of our episodes, please like
Starting point is 00:32:24 us know. We'll see you next time.
