No Priors: Artificial Intelligence | Technology | Startups - The Future of AI Artistry with Suhail Doshi from Playground AI
Episode Date: April 18, 2024
Multimodal models are making it possible to create AI art and augment creativity across artistic mediums. This week on No Priors, Sarah and Elad talk with Suhail Doshi, the founder of Playground AI, an image generator and editor. Playground AI has been open-sourcing foundation diffusion models, most recently releasing Playground v2.5. In this episode, Suhail talks with Sarah and Elad about how the integration of language and vision models enhances multimodal capabilities, how the Playground team thought about creating a user-friendly interface to make AI-generated content more accessible, and the future of AI-powered image generation and editing. Sign up for new podcasts every week. Email feedback to show@no-priors.com Follow us on Twitter: @NoPriorsPod | @Saranormous | @EladGil | @Suhail
Show Notes: (0:00) Introduction (0:52) Focusing on image generation (3:01) Differentiating from other AI creative tools (5:58) Training a Stable Diffusion model (8:31) Long term vision for Playground AI (15:00) Evolution of AI architecture (17:21) Capabilities of multimodal models (22:30) Parallels between audio AI tools and image generation
Transcript
Hi, listeners, and welcome to another episode of No Priors.
Today, we're talking to Suhail Doshi, the founder of Playground AI, an image generator and editor.
They've been open-sourcing foundation diffusion models, most recently Playground v2.5.
We're so excited to have Suhail on to talk about building this model in conjunction with the Playground community and the future of AI pixel generation.
Welcome, Suhail.
Thanks for having me.
So this is your third company.
You started Mixpanel, then Mighty, and now you're working on Playground.
How did you decide this was the next thing?
I think back in April of 2022, just to place that time, it was around when GPT-3 had come out and then DALL-E 2 came out. And I was actually working on my second company, Mighty. At that time, I was trying to figure out how to do something with AI inside of a browser address bar. But when I saw DALL-E 2 come out, it was this very strange, eye-opening moment, where I think a lot of people didn't think we'd be able to do weird, interesting art things so soon. And then soon after that, Stable Diffusion came out around June or July of that same year, and I got early access, maybe a couple of weeks early, to SD 1.4, and it just kind of blew my mind what people could do with that. And I just thought it seemed odd that all of this was being done in a Google Colab notebook. Shouldn't there be a UI that makes it really easy, that sort of thing?
From the start, were you just thinking we will open source, we will train our own models
from scratch? Did you think about other modalities?
Yeah, I mean, there have been a lot of people who thought I should do something in music, because music has been a huge hobby of mine for six years or so; I produce music. But I just couldn't wrap my brain around what useful thing I would end up making for people, although now there are a lot of very interesting, cool, useful things in music. And then it seemed like a lot of people were very focused on language. And I had really enjoyed, I'd already worked with, lots of creative tools. Like when I was in high school, I used to make logos, or I would make music, or whatever. So I was excited that I could finally find something that was a combination of creativity and tooling. Images have really amazing built-in distribution; people want to share those kinds of things. So it ended up being this perfect thing that I was excited to work on.
How do you think the overall landscape for competition is different in language versus images versus music? Like, how did you think about the ways you guys would want to build advantage and stand out?
I think with language, I don't know how many language companies there are. You guys would probably know better than me, but it seems like there are over 20, and maybe five or eight of them have a billion dollars' worth of funding. I also didn't want to work on something if there were already extremely passionate people working really hard at that thing, people that I really respected who were working on that thing. And at the time, with images, there was Midjourney, there was OpenAI doing some DALL-E stuff, and then you saw Stable Diffusion. But for some of these companies, it didn't seem like there was going to be a long-standing, concerted effort to keep making the models better. It was sort of unclear who was doing this as a fun demo versus who was doing this as something they would spend and invest tons of their time in. And once I had kind of figured out to what extent OpenAI was going to invest in it, and that the folks at Stability AI seemed to be focused on like seven different kinds of things, I just thought, hmm, there are just not enough people who want to do this one thing and do it really, really great. So I think for me, it was just about: were there enough capable people that wanted to do this?
Can you talk a little bit about the specific direction you decided to take with Playground as well?
I know you thought really deeply about some of the applications or use cases for it.
So I was just curious if you could share a bit more about that.
Yeah, I think one thing that has been surprising, and it hasn't changed too much actually since maybe June, July, August of 2022, is how a lot of people think about text-to-image. Right now it's not even really text-to-image; it's text-to-art.
Sorry, what's the difference?
The difference is that these models haven't quite reached the potential of what their utility could be. Right now, for the most part, we formulate a prompt, which is really just a caption of what the image is, and then it diffuses into an image, a set of pixels, but a lot of those pixels are primarily used for art. What we haven't done is anything beyond that. We haven't really done something like editing, for example. Why can't we take an image that you already have and insert something into it with the correct lighting and so on? Why can't we stylize an existing thing? Why is there not a blend of real and synthetic imagery in a single image? That could then be used for a lot more things than just pure art. So right now it's a lot of people just making art, and sometimes that reduces its practicality or its utility.
Yeah, that makes sense.
I think one of the things that you focused on as well that I thought was really interesting is you built your own models, right? You train your own models, and a lot of people in this space just take Stable Diffusion and fine-tune it or take other approaches like that. And you just launched v2.5, and the model is performing incredibly well. It creates really beautiful imagery and it's super high quality. I'm curious to hear a little bit more about how you went about training your model, hiring a team specifically for that purpose, and how you thought about it and approached it.
Yeah, it turns out that, you know, a lot of strong engineers' first thought is that you just take a model architecture, you find a lot of data, you fund yourself with enough compute, and you throw these things into a mixture of sorts, and out comes something like DALL-E 2 or DALL-E 3. It turns out that it's just way more complex than that, and way more complex than I even imagined. I had a sense that it was more complicated than that, but it's more complicated than even that.
So I think there are a couple of things that we did. One of the things that we were really focused on with that model was that we wanted to see how far we could push the architecture of something that already existed. This was mostly a test. It was a test to see how far we could get as a research team before the next model change. And so we wanted to take a recipe that we knew worked already, which was Stable Diffusion XL's architecture, which is a U-Net, right, and CLIP, and the same VAE that Robin Rombach trained, all that stuff. And then we said, okay, what if we try to get something that's at least better than SDXL, better than the open-source model? And we weren't really sure by how much. So our only goal was to just be better and try to deliver the number-one state-of-the-art open-source model that we could release.
And so we kind of learned two things. One is that when we looked at some of the images from something like SDXL, we noticed that there was this average brightness. It was really confusing. It didn't quite have the right kind of color and contrast. And in fact, I was so surprised about the average brightness when comparing it to the images of our model that I thought it was a bug during eval. I was literally looking at the images and saying, these cannot be the right images. And my team was like, hey, I think you're actually just getting used to the images of the new model. And so we employed this thing called the EDM formulation, which samples the noise slightly differently. It's a really clever kind of math trick, and there's a paper that you could read on it. But it's surprising how this one little, very clever trick can produce images that have incredibly great color and contrast. The blacks are really vibrant, with a bunch of different colors, and this average brightness kind of goes away. So that's one thing.
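For listeners curious what the EDM formulation roughly looks like in code, here is a minimal sketch of the core idea from the Karras et al. EDM paper: draw per-sample noise levels from a log-normal distribution instead of a fixed discrete schedule. This is only an illustration, using the paper's default P_mean and P_std values, not Playground's actual training code.

```python
import torch

def sample_edm_sigmas(batch_size: int, p_mean: float = -1.2, p_std: float = 1.2) -> torch.Tensor:
    """Draw per-sample noise levels sigma from a log-normal distribution,
    the EDM-style alternative to a fixed discrete noise schedule."""
    log_sigma = torch.randn(batch_size) * p_std + p_mean
    return log_sigma.exp()

def noisy_latents(clean_latents: torch.Tensor, sigma: torch.Tensor) -> torch.Tensor:
    """Corrupt latents with Gaussian noise scaled by sigma (broadcast over C, H, W)."""
    noise = torch.randn_like(clean_latents)
    return clean_latents + noise * sigma.view(-1, 1, 1, 1)
```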
You know, that's a really interesting example of optimizing for one aspect of creating aesthetically pleasing imagery, and there are a few other aspects like that. So I'm just curious, how much do you have to hand-tune different parameters, versus it being something that you just get as you train or post-train a model?
Yeah, I mean, there are so many different dimensions to these models. One is just its understanding of knowledge, but then for aesthetics it's really tricky. I honestly think the field itself is so nascent that every month there's a new trick, a new thing that we all develop or find out. There's an element of some of that being a lot of different tricks. For example, there's this new trick that hasn't been well employed or well exploited yet, from Tero Karras: he does this thing called power EMA, which basically helps training converge really fast. So that's one trick. And then there's this EDM trick. And there's this thing called offset noise. So there are a lot of tricks for things like color and contrast. There's even a trick called DPO, which I think works in the language-model world and also the image world. So there are lots of tricks that sometimes get you 10 or 20 percent, sometimes 2x, improvements. But I think the number-one trick is really that last phase, a supervised fine-tune where you're finding really great curated data. And it's hard to say how much of that is a trick, because it's actually just a lot of meticulous work. So I think there's a combination of some of these things being tricks and techniques, and then there's this other thing that's just really hard and meticulous; there has to be deep care. And with images, maybe more so than language, there has to be taste and judgment.
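As an aside for readers, the offset-noise trick mentioned above is small enough to show directly. The sketch below is illustrative rather than Playground's code; the 0.1 strength is just a commonly cited value.

```python
import torch

def offset_noise(latents: torch.Tensor, strength: float = 0.1) -> torch.Tensor:
    """Standard Gaussian training noise plus a small per-(sample, channel) constant,
    which lets the model learn images far brighter or darker than the dataset mean."""
    b, c, _, _ = latents.shape
    noise = torch.randn_like(latents)
    return noise + strength * torch.randn(b, c, 1, 1, device=latents.device)
```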
Yeah. How do you think about that from the perspective of the evals you do? Because not that many people have amazing aesthetic taste, right? And so I'm a little bit curious how you end up determining what good taste is, or is it just user feedback, you know, thumbs up, thumbs down? How do you think about that?
One thing that I've noticed is that every time we do an eval, we try to make our evals better than the predecessor eval. And one thing I always notice with each successive run is that I find out much later, after the eval, that the model has all these gaps. An example of a gap that we recently had: we did well on our eval, but one area where I thought we did poorly, that I wish we had done better on, was photorealism. Sometimes it would make faces look like they hadn't slept for three days or something. So I think that most evals in the industry are relatively flawed. A lot of them are benchmarks on things that are maybe valuable for the purposes of marketing but are not necessarily well correlated with what users care about. A simple example with large language models: there's a reason why they're probably good at homework. It's because a lot of the evals are related to things that could be homework, like solving an LSAT or a bio test or a math test. Some of these evals just don't have the necessary coverage. So I think with things like judgment and taste, my feeling is that overall the evals need to get way stronger. One thing that we tend to do is really look at a lot of images across a lot of grids, and we're really exacting about what could be off. But you have to look at thousands of images across lots of different grids, across different checkpoints, to find and pick release candidates. I still think that our own evals are not sufficiently strong, and they could be better at world knowledge, whether that's the ability to reproduce a celebrity, if that's what you want, or paintings (sometimes paintings are difficult), or 3D, or illustrations, or logos. Overall, I think coverage is pretty tricky.
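To make the grid-review workflow concrete, here is a rough sketch of tiling the same prompts across several checkpoints for side-by-side inspection. The file layout and names are hypothetical, not Playground's actual eval tooling.

```python
from pathlib import Path
from PIL import Image

def build_review_grid(checkpoint_dirs: list[Path], prompt_ids: list[str],
                      tile: int = 256) -> Image.Image:
    """Rows = prompts, columns = checkpoints. Expects <checkpoint_dir>/<prompt_id>.png files."""
    grid = Image.new("RGB", (tile * len(checkpoint_dirs), tile * len(prompt_ids)))
    for row, prompt_id in enumerate(prompt_ids):
        for col, ckpt_dir in enumerate(checkpoint_dirs):
            img = Image.open(ckpt_dir / f"{prompt_id}.png").resize((tile, tile))
            grid.paste(img, (col * tile, row * tile))
    return grid
```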
So one of the things that you guys do is have voting schemes or user studies within the product itself. I don't know if it's grids, but you're asking users to express preferences, more so than I think perhaps other research efforts are. Can you talk generally about your data curation strategy, whether there's some sort of overall framework, or whether community is a big piece of it?
Generally, we try to keep it very simple, because we know that users are there to make images. They're not there to help us label images, or annotate things, or tell us everything about their preferences. So we have a very sophisticated process for how we curate images and how we collect data from these users, to help us rank and make sure we're choosing the right sorts of things that we want to curate. These things might seem very simple when you encounter them, but beneath that is something very, very complex. It's a little tough to go into it too deeply, because it does feel like a bit of a secret sauce, I suppose.
Yeah. Well, at least I'll feel good that my guess as to what's interesting is right.
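As an illustration of how simple in-product preference votes can be turned into a model ranking, here is a minimal sketch that aggregates pairwise votes into win rates. The data format and model names are hypothetical; Playground's actual pipeline is not public.

```python
from collections import defaultdict

def win_rates(votes: list[tuple[str, str]]) -> dict[str, float]:
    """votes is a list of (winner, loser) pairs from side-by-side comparisons;
    returns each model's share of the comparisons it appeared in that it won."""
    wins: dict[str, int] = defaultdict(int)
    games: dict[str, int] = defaultdict(int)
    for winner, loser in votes:
        wins[winner] += 1
        games[winner] += 1
        games[loser] += 1
    return {model: wins[model] / games[model] for model in games}

# Example: three votes between two hypothetical checkpoints.
print(win_rates([("candidate-a", "candidate-b"),
                 ("candidate-a", "candidate-b"),
                 ("candidate-b", "candidate-a")]))
```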
Can you just characterize where Playground does really well today, where you stand out, and what sort of use cases you're focused on winning?
We're probably number two, I suspect, at text-to-art at the moment, just because we're training these models from scratch and we're closing the gap as rapidly as we can around all the various use cases. But I think that we'll probably diverge from some of the other companies, in part because we're going to care a lot more about editing. People just have a lot of images on their phone, or they want to take some image that they love, whether that's made as art or something that they found, and tweak it a little bit. It's a little annoying that you make this image and then you can't really change too much about it. You can't change the likeness of it; maybe there's a dog, or your face, or something, character-consistency issues. It feels a little like a loot box right now, and because it's so much of a loot box, it feels like too much effort to get something that you really, really want. So where we're navigating is: how can we help you take an image that you love, maybe your logo, or incorporate something like your logo, or put it in some sort of situation that you would prefer? Text synthesis is something that we want to do, for example. So those are some areas we want to head towards, where there's higher utility, and less of: you make an image and you just post it to Instagram or something like that.
Where do you want to take the company and the product over the next few years?
Like, what is the long-term vision of what you're doing?
If people out there are working on scaling text, we're basically trying to focus on scaling pixels, and the first area that we started on is just images. The reason we're working on images instead of, say, video or 3D is that one issue with 3D is that it tends to be better to work on 3D if you're making the content, like you're making Pixar movies; the tools in 3D tend to not make as much money. And the other thing with video is that video is just extraordinarily computationally expensive to do inference or even training on. And a lot of the video models pre-train with a billion images first anyway, to have a rich semantic understanding of pixels. We just think that images are maybe the most obvious place to start, because, A, the utility is quite low right now, and, B, it's actually somewhat efficient computationally to do.
So long term, I think we're trying to make a large vision model. There's not really a word for it, I guess. We have LLMs, but I'm not really sure what the word is for vision or pixels if you're trying to make a multitask vision model. The goal would be to cover three areas with a large vision model: to be able to create things, edit things, and then understand things. Understanding would be like GPT-4V, or if you're using something open source, like CogVLM; there are all these amazing vision-language models happening. And editing and creating are things that we've kind of talked about. But it would be really amazing at some point, if you made this really amazing large vision model, that it could do things like not just create art but maybe help some kind of robot traverse some sort of path or maze. And then there are things in the middle, like maybe you have a video camera or a surveillance system, and it's able to understand what's going on in that. But I think right now it's really focused on graphics.
And then how do you think about the underlying architecture for what you're doing?
Because, you know, traditionally a lot of the models have been diffusion model based,
and then increasingly, you know, you see people now starting to use transformer-based architectures
for some aspects of image gen and things like that.
How do you think about where all this is heading from an architectural perspective, and what sort of models will exist in the next year or two?
My kind of controversial take, perhaps, is this: there's this thing called DiT, which people believe Sora is allegedly based on, and then there are variants of DiT. There's this thing called, I think, MMDiT, which Stable Diffusion 3 is supposed to be based on, by that research team at Stability AI. And my overall feeling is that transformers are definitely the right direction. But I don't think we're going to get enough utility if we're not somehow trying to figure out a way to combine the great, amazing knowledge of a language model with something like DiT, which is trained completely from some kind of video caption or image caption to an image. There's not enough interpretable knowledge there, I suppose; you're not able to interpret anything about the input, which a language model is really great at, but then there are these models that are just trained on captions and emit images. And it's kind of unclear how we might marry these two things. So it sure would be nice if somehow we could combine them. So I think the architecture is most likely going to change. I don't think that DiT is the right architecture, but transformers, certainly.
And just for people who are listening, DiT stands for diffusion transformer, in case people are wondering.
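For readers who want a picture of what "marrying" language-model knowledge with a diffusion transformer could look like, here is a minimal, hypothetical sketch of a DiT-style block in which image tokens cross-attend to frozen language-model embeddings. It is not Playground's architecture, Sora, or MMDiT, just an illustration of the general idea.

```python
import torch
import torch.nn as nn

class TextConditionedDiTBlock(nn.Module):
    """Illustrative transformer block for image tokens, conditioned on
    precomputed language-model embeddings via cross-attention."""
    def __init__(self, dim: int, text_dim: int, heads: int = 8):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(dim, heads, kdim=text_dim, vdim=text_dim,
                                                batch_first=True)
        self.mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
        self.norm1, self.norm2, self.norm3 = nn.LayerNorm(dim), nn.LayerNorm(dim), nn.LayerNorm(dim)

    def forward(self, image_tokens: torch.Tensor, text_embeddings: torch.Tensor) -> torch.Tensor:
        # Self-attention over noised image tokens.
        x = image_tokens
        h = self.norm1(x)
        x = x + self.self_attn(h, h, h)[0]
        # Cross-attention injects knowledge from the (frozen) language model's embeddings.
        x = x + self.cross_attn(self.norm2(x), text_embeddings, text_embeddings)[0]
        # Position-wise MLP.
        x = x + self.mlp(self.norm3(x))
        return x
```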
One belief held by some of the large labs focused mostly on language today is that, in the end, we end up with one truly multimodal general model, right? That is, we don't end up with a separate language model and video model and audio model and image model; it's any modality in, gigabrain knowledge, reasoning, long context, and any modality out. Do you believe in that worldview, or how do you see it differently?
I definitely think the model is going to be multimodal. And in fact, that's kind of what I mean about some of these models that are strictly trained through a diffusion transformer. A diffusion transformer that's only taking caption-image inputs just completely lacks some knowledge. And then conversely, if you look at just the language models, we know that language has a much lower dimensionality than, say, an image, which has all these pixels that tell us about lighting or physics or spatial relationships or size and shapes. For example, if you were to take a glass and shatter it on the floor, and then I asked you to describe it and I described it, we would come up with completely different descriptions if Elad had to go and draw it, right? So we know that pixels have an enormously high information density compared to language. And language is really just, between me and you, a compressed way that you and I can converse with each other at somewhat higher bandwidth; we have an abstract view of what those words mean.
So I think there has to be something of both: language is really great because it's compressed information, and vision is really great because it's so information-rich, but it's been hard to annotate until recently. It's only because vision-language models exist that it's now suddenly a lot easier to label or annotate or understand what's going on in an image. So I think these two things are very likely going to be married. The only question to me is: does language hit a ceiling? Language has this wonderful trait where you can use it to control things, which is pretty cool because of its low dimensionality. But my question would be, I wonder if language has a lower ceiling than, say, vision, because it's very easy to get lots of pixel data, and that pixel data is very, very high density.
Very easy to get additional pixel data, beyond the data already collected from the internet that's gone into these models.
Yeah, I mean, there's an assumption there, maybe. One assumption I tend to question is whether the internet is sufficient. The internet is very big, but maybe there's some kind of mode collapse even with internet data. Whereas with vision, you can at least make a robot that just travels down the street and keeps taking pictures of everything. You can get essentially infinite training data with vision, but it might be trickier to filter and clean internet data, especially as more synthetic data ends up on the internet.
One other area that I know you spend a lot of time on is music. You know, you make your own music and produce it. And there have been a number of different applications, Riffusion, et cetera, that have come up on the music side.
I was just curious how you've been paying attention to that,
what you think of it, and, you know, where you think that whole space evolves to.
Yeah, I love audio. That would be the other thing I would go work on if it weren't for Playground. Partly, I didn't work on music because the whole music industry is only about $26 billion, so it was a little hard for me to figure out how big a music thing could be. But I definitely think audio is going to be enormous. Things like ElevenLabs are very interesting. Anyway, I've been trying to find ways to use it as a user, because that gives me a stronger sense of where things are going. And one thing that I've been waiting on for many years: instrumentals in music are actually very easy to get or to make. There's a wide variety of quality, of course, but generally the instrumentals in a song, whether it's a song from Taylor Swift or whoever, or a rap song, those beats or instrumentals are fairly easy to make. What's hard is to get lyrics and vocals. And that's always been a difficulty of mine: how do I find a singer, and then how do I get them to write lyrics and then sing it? That's a much more scarce resource in the music world.
And so for the first time, with something like Suno AI, it was really cool, because it's the first time I heard them be able to make a rap song where the rapper has good flow. Flow is just the swing of the lyrics against a beat. Or you hear actually really good lyrics that feel very emotional, have the right breathiness, and don't sound like it's all Auto-Tune, I guess. So I have this little workflow where I make a song in Suno, and then I use a different AI tool, it's AI tools all the way down, I guess, to split the stems and just grab the vocal, but throw away the instrumental. And then I get to make a song with my own instrumental and that vocal. Anyway, I put some songs on my Twitter where I basically tried to do this, and I can get to a higher-quality song, I guess, because I make the instrumental. There are still some weird errors in the songs, but that's been a really cool way to use AI, in my opinion.
Suhail, thanks so much for sharing everything you're working on at Playground with us.
Find us on Twitter at @NoPriorsPod. Subscribe to our YouTube channel if you want to see our faces, and follow the show on Apple Podcasts, Spotify, or wherever you listen. That way you get a new episode every week. And sign up for emails or find transcripts for every episode at no-priors.com.