a16z Podcast - Text to Video: The Next Leap in AI Generation
Episode Date: December 20, 2023

General Partner Anjney Midha explores the cutting-edge world of text-to-video AI with AI researchers Andreas Blattmann and Robin Rombach. Released in November, Stable Video Diffusion is their latest open-source generative video model, overcoming challenges in size and dynamic representation. In this episode, Robin and Andreas share why translating text to video is complex, the key role of datasets, current applications, and the future of video editing.

Topics Covered:
00:00 - Text to Video: The Next Leap in AI Generation
02:41 - The Stable Diffusion backstory
04:25 - Diffusion vs autoregressive models
06:09 - The benefits of single step sampling
09:15 - Why generative video?
11:19 - Understanding physics through AI video
12:20 - The challenge of creating generative video
15:36 - Dataset selection and training
17:50 - Structural consistency and 3D objects
19:50 - Incorporating LoRAs
21:24 - How should creators think about these tools?
23:46 - Open challenges in video generation
25:42 - Infrastructure challenges and future research

Resources:
Find Robin on Twitter: https://twitter.com/robrombach
Find Andreas on Twitter: https://twitter.com/andi_blatt
Find Anjney on Twitter: https://twitter.com/anjneymidha

Stay Updated:
Find a16z on Twitter: https://twitter.com/a16z
Find a16z on LinkedIn: https://www.linkedin.com/company/a16z
Subscribe on your favorite podcast app: https://a16z.simplecast.com/
Follow our host: https://twitter.com/stephsmithio

Please note that the content here is for informational purposes only; should NOT be taken as legal, business, tax, or investment advice or be used to evaluate any investment or security; and is not directed at any investors or potential investors in any a16z fund. a16z and its affiliates may maintain investments in the companies discussed. For more details please see a16z.com/disclosures.
Transcript
When I first sampled this model, I was actually shocked that it works so well.
The pure improvements in performance in text understanding of these models
is it possible to derive something like a physical law from such a model?
I think a really important part of this is the fact that these models have been accessible to everyone.
It learns a representation of the world.
Today, many people are familiar with text-to-text and text-to-image AI models.
Think ChatGPT or Midjourney.
But what about text to video?
Well, several companies are working to make that a reality.
But for many reasons, it's a lot harder.
For one, their size.
Just think, you'll often find text files in the kilobytes.
Images? Maybe a few megabytes.
But it's not uncommon for high-quality video to be in the gigabytes.
Plus, video requires a much more dynamic representation of the world
that incorporates the physics of movement, 3D objects, and more.
Imagine the hand challenge in text-to-image, but in this case, it's hands, squared.
But this is not stopping the researchers behind Stable Video Diffusion, which was released on November 21st as a state-of-the-art open-source generative video model.
And today, you'll get to hear directly from two of the technical researchers behind that model, Andreas Blattmann and Robin Rombach.
Robin, by the way, is also the co-inventor of Stable Diffusion, one of the most popular open-source text-to-image models. So in today's episode, together with a16z general partner Anjney Midha,
you'll get to hear firsthand what makes text-to-video so much harder, the challenges like
selecting the right datasets that enable realistic representations of the world, applications where
this technology is actually already being put to use, plus what the video editor of the future
might look like, and how constraints continue to spur innovation and ultimately keep
this field moving. And if you like this episode, our infrastructure team is coming out with
a lot more AI content in the new year, but in the meantime, you can go to A16Z.com
slash AI for our previous coverage.
All right, the first voice you'll hear is Anjney, then Robin, then Andreas.
Enjoy.
As a reminder, the content here is for informational purposes only, should not be taken as
legal, business, tax, or investment advice, or be used to evaluate any investment or security
and is not directed at any investors or potential investors in any a16z fund.
Please note that A16Z and its affiliates may also maintain investments in the companies discussed in this podcast.
For more details, including a link to our investments, please see a16z.com slash disclosures.
This is a conversation I've been super excited about for a while.
Maybe we can start with just a brief overview of your team, your research lab, and for listeners who are unfamiliar,
or maybe spend just a couple minutes talking about what stable diffusion is and what stable
video diffusion is.
Absolutely.
Thank you for having us.
What is stable diffusion?
Stable Diffusion is a text-to-image model, a generative model, which means you type in a text
prompt and it generates an image based on that.
In particular, stable diffusion is, as the name suggests, a diffusion model.
Diffusion models are a type of generative model, which has been super successful recently
for image generation.
and it's based on a technique that we developed while we were still at the university.
So me and Andreas and Patrick and Dominic, all in the same team now at Stability.
We are a multimodal company and our specialty is to produce and publish models,
try to make them as accessible as possible.
That includes publishing weights and making foundation models for all kinds of modalities,
not only images, but also video available and enabling research on top of that.
So we have seen that stable diffusion was super successful, I would say much more successful
than we initially anticipated.
And there are like hundreds, if not thousands of papers that are building on top of that.
We in particular, our group is focused on visual media.
So that is images, that is videos, and Stable Video Diffusion, which you just introduced.
It's kind of the next iteration.
It's our first step into the video domain, where we published a model that can take in an image
and turn that into a short video clip.
Maybe we could spend a couple minutes
on a brief overview of diffusion models.
That might be helpful.
How do diffusion models differ from other types
of generative models and techniques
like autoregressive models,
if you could just give us a little bit of context
before we dive in?
Diffusion models are really the go-to models right now
for visual media images and videos.
They're kind of different from autoregressive models
because they don't represent data as a sequence of tokens,
which we know from autoregressive models.
And since images and videos are composed as a pixel grid,
this is really a beneficial property.
Also, they favor perceptually important details,
which is inherently baked into these models
because their learning objectives are tuned to favor these important aspects of images
as we perceive it as humans.
And that is what we actually want, right?
But they also have some commonalities with autoregressive models.
Both autoregressive models as well as diffusion models are iterative in their nature,
but as opposed to autoregressive models,
which iteratively generate token by token,
or word by word for language,
these models gradually transform noise to data in small steps.
One point to add on the difference, maybe: in diffusion models, you initially train the model on something like a thousand different noise levels between data and pure noise.
But the interesting thing is that at sampling time, you can actually use less steps.
You can use like 50 steps.
We published a distillation work a week ago that actually shows you can go as low as one sampling step, which is, I would say, a big advantage of these diffusion models.
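For readers who want to see the shape of this in code, here is a minimal, illustrative sketch of a diffusion sampling loop. The `denoiser` interface and the simple linear noise schedule are assumptions made for illustration, not the team's actual implementation; the point is that the number of denoising steps is a free knob at inference time.

```python
import torch

@torch.no_grad()
def sample(denoiser, shape, num_steps=50, device="cpu"):
    """Illustrative diffusion sampling loop.

    `denoiser(x, sigma)` is a hypothetical model call that predicts the clean
    sample from a noisy input at noise level sigma. The model is trained
    across many noise levels, but at sampling time we choose how many steps
    to take: 50, 20, or with distillation even 1."""
    # Noise levels from high to low (a simple linear schedule for illustration).
    sigmas = torch.linspace(1.0, 0.0, num_steps + 1, device=device)
    x = torch.randn(shape, device=device) * sigmas[0]  # start from pure noise
    for i in range(num_steps):
        sigma, sigma_next = sigmas[i], sigmas[i + 1]
        x0_pred = denoiser(x, sigma)        # model's estimate of the clean data
        d = (x - x0_pred) / sigma           # direction pointing toward the data
        x = x + d * (sigma_next - sigma)    # take one small Euler step
    return x
```

With `num_steps=50` you get the classic iterative behavior; distillation aims to make a single call to the denoiser good enough on its own.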
For folks who may not be familiar with why a single-step sampling breakthrough is important,
could you say a little bit about what benefits that leads to for creators or users of the model?
Oh, yeah, absolutely.
I think the most intuitive thing is that you actually see what happens while you type in your text prompt.
So think of like this text image model.
You type in your prompt.
One and a half years back, you had to wait for like a few seconds, maybe even up to a minute.
Now you see what happens.
And the quality is even better than what we had like with the first iteration of stable diffusion.
So super exciting actually to see that kind of trajectory, these kind of developments.
Like when I first sampled this model, I was actually shocked that it works so well.
To keep pulling on that thread for a bit, if we rewind the clock back to a year and a half ago,
which is when you guys first put out stable diffusion, between then and now, what has surprised you most about
image models that you didn't expect?
The pure improvements in performance, in text understanding of these models, in spatial compositionality of what these models can do. Just by typing in a single prompt, you can describe a scene really, really fine-grained, and it gives you a highly detailed visual instantiation of it.
The developments have been huge. We published SDXL in June, and even then it was a huge improvement in visual quality and prompt following.
Also other models which we see right now, like most recently DALL-E 3, are still a huge improvement. But also, as Robin said, there have been a lot of different samplers proposed to make these models faster and faster and faster, and right now we're getting really close to 50-step performance with even one step. This is a huge improvement,
and I think a really
important part of this is the fact
that these models have been accessible to
everyone. So open-sourcing a foundation model like Stable Diffusion initially led to a whole lot of research on these models, which was, in retrospect, extremely important to do.
I think otherwise we wouldn't have seen the improvements we saw until now.
Even before that, I was surprised that text-to-image with diffusion models worked so well, just before we published the model.
Like when I first saw this myself, we had this latent diffusion approach
that we developed at the university.
I mean, we got a machine with like 80-gigabyte A100s just after we put it on arXiv, and then, yeah, immediately we started working on: hey, we want to have this text-to-image model, but not train it on one GPU. Let's use our little cluster with 80-gigabyte A100s. We trained this latent diffusion model at 256 by 256 pixels.
It was the first time that we had to deal with large scale data loading and these kind of things.
And then using this model, combining it with classifier-free guidance, which is a sampling
technique that further improves sample quality at basically no cost.
I was like really surprised that we could do this on our own and achieve a pretty good model, I would say.
And then, like, two days later, OpenAI published DALL-E 2, and all the hype was gone,
but it was a pretty nice experience.
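Since classifier-free guidance came up: conceptually, it just runs the model twice per step, once with and once without the text condition, and extrapolates between the two predictions. A minimal, illustrative sketch follows; the `denoiser` call and embedding arguments are hypothetical stand-ins, not Stable Diffusion's actual API.

```python
def cfg_prediction(denoiser, x, sigma, text_emb, null_emb, guidance_scale=7.5):
    """Classifier-free guidance: push the conditional prediction further away
    from the unconditional one. Higher guidance_scale means stronger prompt
    adherence. Costs one extra forward pass per step, with no extra training."""
    uncond = denoiser(x, sigma, null_emb)   # prediction without the prompt
    cond = denoiser(x, sigma, text_emb)     # prediction with the prompt
    return uncond + guidance_scale * (cond - uncond)
```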
You know, something you mentioned, Andreas, is that the fact that you guys chose to release
stable diffusion as an open source model resulted in this crazy ecosystem
exploding around your research, which is just something that doesn't happen as quickly with models that aren't open source.
And so in the last year and a half,
one of the things that's been really fun, at least for me to watch,
is all the really surprising things that developers and creators have done
with the base model that you guys put out.
You've provided folks a set of Lego blocks
that they can mix and match in different ways,
things like ControlNet that give people more controllability,
allowing your community to build their own front end.
And out of all of that, I'm sure, came a ton of requests.
As you guys were prioritizing all those asks that came in from the world in the community,
why was stable video the thing that you guys decided to prioritize?
Video is an awesome kind of data because to solve that task, to solve video generation,
a model needs to learn much about like physical properties of the world or the physical
foundations of the world.
There is so much without knowing about, for instance, 3D scenes, you cannot generate a camera pan around an object or you cannot make an object move.
If a person turns around, the model needs to hallucinate how this person looks from behind, right?
So to know so much about the world by just including that additional temporal dimension, this is what fascinated me most on working on videos.
It's also really next level of computational demands because you have an additional dimensionality, which makes everything much harder, I think.
And yeah, I think we like challenges.
That's why we probably focused on doing that.
Yeah, something that's not well known about you guys is that, by background, originally, I believe you're physicists.
Yeah, I'm a physicist, but I haven't done much physics in a while, unfortunately.
I'm originally a mechanical engineer, but that is also really related to physics,
and I was always inspired by physics and really fascinated by it.
Well, both of your backgrounds academically were spent studying the physical world.
And I just think it's poetic that your primary interest in generative modeling came from trying to understand the physical world at some deeper level.
And it seems like that seems to have motivated at least some of the intuition and the research around your approach to stable video.
Yeah, absolutely. I fully agree.
And we're just scratching the surface with the kind of video models that we have right now.
Having something like what we are seeing in language modeling, but trained on pixels, on videos, will probably give super interesting downstream behavior, not only generating videos, but also an understanding of the world.
Like, is it possible to derive something like a physical law from such a model?
I don't know.
Or such a model is also always predictive.
So you can start with an image or with a sequence of images and try to predict what happens next, of course.
And then I think like also coupling this with other modalities such as language will maybe provide
a way to ground these models more in the physical world.
I think that's a good segue into what is the main focus of today's conversation,
which is generative video. To folks who are early users of stable diffusion, stable video was
a much awaited sort of natural progression from the original model. Just take us back a little bit
to the original sort of conception of the project. How long have you guys been working on video
modeling? I would say roughly half a year. And like for this model that we just put out,
I think the main challenge was that we actually had to scale the dataset and the data
loading. So if you train a video model on a lot of GPUs, you suddenly run into problems that you didn't really have before. Loading high-resolution videos is just a difficult task if you do it at scale. Even just decoding videos is really hard. A data loader has to transform the bytes it loads into a suitable representation for the model, and to do so, you have to do a lot of computational work to turn it into a suitable input sample for the generative model.
And this is computationally really expensive.
And since we have such fast GPUs right now, it was really the CPUs that were, in the beginning, just too slow.
Building an efficient data pipeline for video was really a challenge, which we, I think,
solved really well, which probably took most of the time we spent on building and scaling
these video models.
And actually, there's like interesting bugs that you can encounter during training.
So we had one where you have your data and then you add noise to that data that the model tries to remove, right?
And if you do that on a video, you add noise to each frame of the video, and then we had
a bug where we added, like, different amounts of noise to different frames in the video,
which just complicates the learning task unnecessarily.
Things like this, it's just like one line of code that can go wrong.
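As a concrete illustration of that kind of bug, here is a hedged sketch of the one-line difference between sharing one noise level across a clip's frames and accidentally drawing a different level per frame. The tensor layout and the `sigma_sampler` helper are hypothetical; this is not their training code.

```python
import torch

def add_noise(video, sigma_sampler):
    """video: tensor of shape (batch, frames, channels, height, width).

    Correct behavior: sample one noise level per clip and broadcast it over
    all frames, so every frame of a video is corrupted equally."""
    b, f, c, h, w = video.shape
    sigma = sigma_sampler((b, 1, 1, 1, 1))      # one level per video clip
    # Buggy one-liner: a different noise level for every frame of the clip,
    # which needlessly complicates the learning task:
    # sigma = sigma_sampler((b, f, 1, 1, 1))
    return video + sigma * torch.randn_like(video)

# Example usage with a simple uniform noise-level sampler:
# noisy = add_noise(video, lambda shape: torch.rand(shape))
```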
What was the biggest difference between the image model research and your video work?
Because noise sampling and noise reduction, these are sort of diffusion techniques that are shared across images and video, but it would be helpful to understand what was unique to the video challenge.
First of all, the pure dimensionality of videos.
I mentioned that before with this like additional dimension. This introduces, of course,
a higher GPU or memory consumption. And this was really a challenge. So for diffusion models,
it's really important to have a high batch size, because you can approximate the gradient, which drives the learning, much better if the batch size is higher. Especially for diffusion models, it's a really important thing to have a really high batch size, but then you have to increase your number of GPUs, which again introduces new challenges in terms of scaling, in terms of redundancy in your training pipeline. If something breaks somewhere in one GPU, it will just take down the entire training.
And the more GPUs you add to your cluster, and the more GPUs you use to train, the higher the probability that somewhere there's, say, a hardware failure, which also happens.
This additional dimensionality just introduces these new scaling challenges, which are really, really interesting to overcome, I would say.
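One common way to reach the large effective batch sizes he describes when per-GPU memory is tight, sketched here purely for illustration and not necessarily what their training stack does, is gradient accumulation on top of data parallelism:

```python
def train_step(model, optimizer, loss_fn, micro_batches):
    """Accumulate gradients over several micro-batches so the optimizer sees
    one large effective batch. Combined with data parallelism across N GPUs,
    the effective batch size is micro_batch_size * len(micro_batches) * N."""
    optimizer.zero_grad()
    accum_steps = len(micro_batches)
    for batch in micro_batches:
        loss = loss_fn(model, batch) / accum_steps  # scale so gradients average
        loss.backward()                             # gradients accumulate in .grad
    optimizer.step()
```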
Well, that's very helpful. I think one of the most valuable things that your guys' research has done for the industry is that you often share in very excruciating detail some of the infrastructure challenges that came with training.
And since scaling models at the magnitude that you guys are is a relatively new infrastructure challenge, I think it's very, very helpful for other researchers to be able to hear the sort of nuts and bolts that you had to figure out, right, to get these models out. Then there's a whole other set of data-related challenges
that aren't about the data pipeline per se, but are about the representation of the data, the dataset curation, the dataset mixture. Could you guys just talk a little bit about how you approached
picking your data set for this release and what was your intuition and what kinds of data
were most important, how you wanted to filter it, and ultimately what ended up being your most
important learnings around the data set when it came to training stable video? We actually spent a lot
of time talking about this in the paper that we just put out. What we also define in this paper
is that we can divide this training process into three stages. And the first is that we actually
train an image model. So for training video models, it's usually just helpful to reuse
the structural, spatial understanding from image models; when there are powerful image models, we should reuse them for then training the video model. And then there are the next steps. So having an image model, like Stable Diffusion, for example, you have to get this additional knowledge about
like the temporal dimensionality and about motion, right? So for that, we train on the large
data set that we still have to curate a bit. So we don't want, let's say, optical characters.
We don't want like text in the video. We want nice object motion. We also want nice camera
motion. So we have to filter for that. And yeah, we do this in like two regimes. We train on
a lot of videos in the first stage, and in the second stage, we train on a more curated,
very high-quality, smaller data set to really refine the model.
And it's similar to image models, where you pretrain on a large dataset and then refine
on a high-quality dataset.
There was a paper recently that Meta put out that also describes just this process for
image models in detail.
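To make the filtering idea concrete, here is a rough, hypothetical sketch of a curation pass. The `detect_text` and `score_motion` helpers stand in for whatever OCR and motion detectors a real pipeline would use, and the thresholds are made up; the actual pipeline is described in their paper.

```python
def curate_videos(clips, detect_text, score_motion, min_motion=0.2, max_text=0.01):
    """Keep clips with meaningful motion and little or no burned-in text.

    `detect_text(clip)` is assumed to return the fraction of frames containing
    rendered text (e.g. via OCR); `score_motion(clip)` an average motion score
    (e.g. from optical flow magnitude)."""
    kept = []
    for clip in clips:
        if detect_text(clip) > max_text:
            continue  # drop clips dominated by on-screen text
        if score_motion(clip) < min_motion:
            continue  # drop near-static clips with no useful motion
        kept.append(clip)
    return kept
```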
One of the largest open questions in video generation for a while has been structural consistency,
right, of 3D objects.
When the camera is panning around a person or a car or any subject, making sure that it stays and looks like the same subject from various angles has been a challenge for generative video.
How did you guys approach that?
You mentioned in the paper that 3D data and multi-view data was important.
Actually, I think the main point we want to make in the paper is the one that we talked about
earlier.
Like having a foundational video model actually gives us much more than just a model with which you can
generate nice looking clips or videos, right? It learns a representation of the world. And one aspect
of that is that we tried to demonstrate in the paper, given a video model, which has seen a lot of
objects from different views, lots of different camera movements, it should be much easier to turn
that into a multi-view model. And that's kind of the main message. So we take the pre-trained video
model, which has seen a lot of different videos, a lot of different camera movements, and we fine-tune
that on very specialized multi-view orbits around 3D objects and turn the video model into a
multi-view synthesis model. And that works pretty well. So one of the dominating approaches
before that was that you would take like an image model, like stable diffusion, and turn that
into a multi-view model. But yeah, we showed that it's actually helpful to incorporate this
implicit 3D knowledge that is captured in all of the videos into the model, and then the model can learn
much quicker than if you start from the pure image model.
So that's kind of the main message.
But you're right, you can also try to use this explicit multi-view data in the video training
or maybe even something that we do in the paper: train LoRAs explicitly on different camera movements and then put these LoRAs back into the video model.
So you get control over the camera for your very general video model, which is quite cool.
Yeah, so this I found was one of the coolest pieces of the paper: incorporating LoRAs for fine-grained control in the creation process.
Could you maybe give us a quick overview of what LoRAs even are, conceptually, intuitively, and what led you to the intuition that LoRAs would be an important part of the architecture?
LoRAs are just really lightweight adapters which are fine-tuned onto an existing base model and which adapt the attention layers. And by that, on a small, really highly specialized data set, you can tune different properties into the model in a really, really lightweight way. And in this case, we just tune different kinds of camera motion into our video model. So if we use a small dataset which only contains zooms or pannings to the left or to the right, we can actually tune such a LoRA as a small adapter to the attention layers of our model
to just like get exactly this behavior
and this is a really awesome way
to just in a really lightweight way
fine tune these foundational models
and it has shown to be like really effective
and accordingly it's like really highly appreciated
in the community I would say
to get these kind of easy fine tunes.
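For readers who want the mechanics: a LoRA adds a small, trainable low-rank update on top of a frozen weight, typically in the attention projections. A minimal, illustrative PyTorch sketch follows; it is not the actual adapter code used for the camera-motion LoRAs.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Wraps a frozen linear layer (e.g. an attention projection) and adds a
    trainable low-rank update: y = W x + (alpha / r) * B A x.
    Only A and B are trained, so the adapter stays very lightweight."""
    def __init__(self, base: nn.Linear, rank: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False                 # freeze the base weights
        self.lora_a = nn.Linear(base.in_features, rank, bias=False)
        self.lora_b = nn.Linear(rank, base.out_features, bias=False)
        nn.init.zeros_(self.lora_b.weight)          # adapter starts as a no-op
        self.scale = alpha / rank

    def forward(self, x):
        return self.base(x) + self.scale * self.lora_b(self.lora_a(x))
```

Because only the small A and B matrices are trained, you can keep separate adapters for, say, a zoom or a pan to the left, and swap them into the same base model.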
Yeah, and I think for image models it's extremely popular; there are so many different LoRAs that people plug into these models. For video models, our goal was just to demonstrate
that this is something that's possible.
It's just at the beginning, and there's much more that should be possible, like very specialized kinds of motions. So I think there's a lot of creative possibilities that's actually worth exploring for a little bit.
One of the windows that you guys have into the future is by understanding where the research is
going, you get to time travel and kind of get a glimpse into the future of creativity. And so having seen how effective LoRAs are, at least at a few sets of tasks like motion control, right? So in the paper you propose using LoRAs for camera control, panning, zooming, et cetera. The history of video creation has usually required creators to have a ton of different knobs and dials in their software that they use, right? Whether it's Adobe After Effects or some other professional software, you literally have hundreds of dials and buttons that you can use to control and edit these videos. And conceptually, should people think about LoRAs as mapping to these controls? In the future, will a director or creator of videos basically be relying on hundreds of different LoRAs to express the control they want over the video, or do you think fundamentally LoRAs will hit some sort of scaling limit and that's the wrong analogy to use? How should creators think about these new tools that you've given them?
Yeah, I think you actually said it right. Maintaining a library of hundreds of LoRAs is maybe not the most scalable approach.
Actually, if you look at the model that we put out now, it's just like taking an image and animating that, right?
Then we can do some stuff with these LoRAs, but what you actually want, I think, is, given the image and some text prompt, to do exactly what I describe in the text prompt.
There's already some work that explores that.
But yeah, giving more control over what happens in the video, be it through LoRAs, or maybe through a text prompt or through spatial motion guidance, like Runway's Motion Brush.
There are different ways of doing that, but you definitely want more control over this whole creation process.
And then I think you're at the stage where you can really start to generate personalized individual content.
Like especially probably for video creation, we want something like with the image models, you want like very fast synthesis.
Because then this will become more like, I don't know, sometimes I think about this as like a video game, right?
You type your prompt and you immediately see what happens given your input view.
And I think this might be a super nice user experience, actually.
So we want this additional control, and we want fast rendering, fast sampling, fast synthesis.
You know, you said earlier that you're hoping that the community explores more things.
Now that you've actually put the model out there, they're going to be developers and creators who listen to this podcast.
What would you like them to explore first and most intensely?
Well, I think just trying out the model, rendering some awesome stuff, of course.
Also further exploring maybe the representation we built.
In the paper, we mentioned that we trained this model on a whole lot of data, and it has just seen so much motion,
and the representation is really fruitful.
We showed that by our 3D fine tuning.
By the way, this was completely surprising for me,
seeing that model after 1,000, 2,000 iterations,
like already getting 3D reasoning.
This was really, really nice.
So as we saw that, it will be extremely interesting to see other such approaches.
The model is open source.
People can try it and give it another couple of weeks
and then we will see what happens.
But I'm excited for it.
You know, my personal favorite for what people did on day one
was obviously animating memes.
I'm sure you guys have seen all that.
That was really funny.
What are your guys' favorite creations so far
that you've seen?
Anything that jumps to mind?
I think the one, I always forget the name of that meme,
where the man is looking after another woman.
The man looking behind, right?
The man looking behind, yeah, that one.
We'll try to put a visual of it up.
Yeah, exactly.
I think it just visualizes very nicely the additional experience that a video model can provide, right?
People are used to like 2D memes, but then I can actually try to animate this and see what happens.
Also, if you think about famous artworks or something, just bringing them to life is a really, really nice property.
And it's now enabled.
Everyone can just poke around a bit with the Mona Lisa and see how she looks from the side.
Oh, that's cool.
I haven't explored that one.
but you're saying prompting the model with an image of a notable art piece.
Just like taking Van Gogh's Starry Night and making the stars shine and glimmer.
And I think it's really cool.
The world is pretty lucky that you guys have gifted the model to the developer and open source ecosystem.
It's already such an incredible sort of step, leap forward, right?
In what you can do with images and with video?
What do you think are the two or three biggest sort of open challenges that you guys want to prioritize next, that are still limitations in video generation?
I think a really important thing is to get these models to generate longer
videos, to process longer videos in general, not only generate them, also see them. Because
I think eventually processing longer videos is key to understanding what we talked about earlier,
fundamental aspects of this world, better physical properties. And so this is a really important
part to enable these models to generate longer content, more coherent content, also with
other kinds of motion.
And what Robin already said, I think: making them fast will just unlock so much more exploration.
So this is really a nice thing.
And we're actually working on it.
Yeah, and there are simple things like thinking about like multimodality,
adding an audio track to your generated video that is in sync with the action that is rendered.
I think there is a lot of stuff to explore.
So you were talking, Andreas, as I said, earlier about the infrastructure challenges.
If you had a magic wand, what infrastructure improvements do you wish the industry could solve for you?
I mean, you could ask for more GPUs, more CPUs per GPU, and this would solve much of the data loading issues.
Also, memory is always good, not only GPU memory but also CPU memory.
But I think, like, hitting these limits is just natural in some form.
You always want to try to improve your efficiency.
You always want to try to train faster,
and at some point you will face a bottleneck, a limit
and you have to come up with a nice algorithmic way maybe
or with another way of overcoming this.
For instance, for many years, data loading was not a big thing
because GPUs were too slow.
But now we have extremely nice accelerators with the newest H100s. It's insane how fast these GPUs are actually running, and how fast you can train models on those. And then you will just hit the next bottleneck.
It's actually good to see this, that we hit limits, we have to overcome this,
and then you improve, and this is how you learn, and this is how you can make things much
more efficient in the end.
Yeah, it's actually, if you only rely on more compute, it's a bit boring.
I think, like, compute constraints can also drive innovation, right?
So, for example, the latent diffusion framework, we developed it at the university
because we had, like, single GPUs to train on, and that kind of naturally leads to some kind of innovation.
And in this case,
this is something that everyone uses right now.
This is actually crazy to see.
DALL-E 3 uses a model, the autoencoder, that was trained on a single GPU.
This is, I think, how intelligence also arises.
If you have a constrained environment,
you have to come up with a smarter way of doing things.
And without any limitations, there wouldn't be those nice solutions for the many problems we have right now.
Yeah, no constraints, no creativity.
Right. Exactly. I do think one of the underappreciated parts of your guys' group ever since your
university days has been just how compute efficient a lot of your research has been. I certainly
have talked to so many university level researchers, grad students, postdocs, who saw that research
that you guys put out a year and a half ago with stable diffusion and felt really inspired
because university and academic environments are somewhat compute constrained. And so I think
even though now you have access to tons of compute,
the sort of self-imposed compute constraints,
it makes me very happy to hear that those constraints
are something you guys think are a feature, not a bug.
That will probably keep the open-source ecosystem pretty vibrant.
But in addition, you're often sort of racing
and responding to other labs as well in the field.
Some of these labs are much better funded than you,
are bigger than you.
So how do you think about prioritizing your research pipelines
and your timelines?
And how would you say that's different than labs
that are largely academic?
That's a very good point.
I think actually this whole competition,
it also drives the field of AI.
It's probably very important
to not get distracted by it too much.
But of course, for the last one and a half years, everyone has been doing something with diffusion.
It can actually be quite fun to work in this competitive environment.
Everyone here enjoys doing that.
It's quite fun that we have like this lab here in Germany.
Actually, we compete with OpenAI,
Google, other research labs across the world.
It's intense, definitely.
but it's a lot of fun.
I think we're not a big lab,
but I think we all really have kind of the same spirit, and we feel like we're working on something which, in the end, gives something not only to us.
For us, it's also really cool,
but also we can give something back to the community
and to other researchers who might not have the resources we have.
What I love about the lab and the group you guys have put together
is the philosophy that a rising tide lifts all boats, right?
because you guys publish your research for the world to use.
And I thought one of the coolest things about the DALL-E 3 paper was the citations list,
which included your work.
They were sort of thanking you for the work that the Stable Diffusion Group had put out.
I think your work ends up benefiting all kinds of labs across the industry.
And so while the competition can be intense,
it's also one of the best, most inspiring examples of an industry helping each other out.
Unfortunately, I'm not sure the default direction of the industry is collaborative, right? A few years ago, if you guys remember, everyone would openly share
their research with each other. Luckily, for creators and developers, you guys are still bearing
that torch, and I hope you continue doing that. And it comes up all the time in conversations
with researchers at many of the labs we just talked about that they're very grateful for the
research you guys do. So I hope you keep doing that. Yeah, me too. I also think that it is super
important to have this kind of contribution to open and accessible models and everyone in
our team is super motivated to contribute to that.
So I couldn't imagine doing anything else right now.
If you liked this episode, if you made it this far, help us grow the show.
Share with a friend, or if you're feeling really ambitious, you can leave us a review at ratethispodcast.com slash a16z.
You know, candidly, producing a podcast can sometimes feel like you're just talking into a void.
And so if you did like this episode, if you liked any of our episodes, please let us know. We'll see you next time.