No Priors: Artificial Intelligence | Technology | Startups - The Future of AI Artistry with Suhail Doshi from Playground AI
Episode Date: April 18, 2024
Multimodal models are making it possible to create AI art and augment creativity across artistic mediums. This week on No Priors, Sarah and Elad talk with Suhail Doshi, the founder of Playground AI, an image generator and editor. Playground AI has been open-sourcing foundation diffusion models, most recently releasing Playground v2.5. In this episode, Suhail talks with Sarah and Elad about how the integration of language and vision models enhances multimodal capabilities, how the Playground team thought about creating a user-friendly interface to make AI-generated content more accessible, and the future of AI-powered image generation and editing. Sign up for new podcasts every week. Email feedback to show@no-priors.com Follow us on Twitter: @NoPriorsPod | @Saranormous | @EladGil | @Suhail
Show Notes: (0:00) Introduction (0:52) Focusing on image generation (3:01) Differentiating from other AI creative tools (5:58) Training a Stable Diffusion model (8:31) Long term vision for Playground AI (15:00) Evolution of AI architecture (17:21) Capabilities of multimodal models (22:30) Parallels between audio AI tools and image generation
Transcript
Hi, listeners, and welcome to another episode of No Priors.
Today, we're talking to Suhail Doshi, the founder of Playground AI, an image generator and editor.
They've been open-sourcing foundation diffusion models, most recently Playground v2.5.
We're so excited to have Suhail on to talk about building this model in conjunction with the Playground community and the future of AI pixel generation.
Welcome, Suhail.
Thanks for having me.
So this is your third company.
You started Mixpanel, then Mighty, and now you're working on Playground.
How did you decide this was the next thing?
I think back in April of 2022, just to place that time, it was around when GPT-3 had come out and then DALL-E 2 came out. And I was actually working on my second company, Mighty. At that time, I was trying to figure out how to do something with AI inside of a browser address bar. But when I saw DALL-E 2 come out, it was this very strange, eye-opening moment, where I think a lot of people didn't think we'd be able to do weird, interesting art things so soon. And then soon after that, Stable Diffusion came out around June or July of that same year, and I got early access, maybe a couple of weeks early, to SD 1.4, and it just kind of blew my mind what people could do with that. And I just thought it seemed odd that all of this was being done in a Google Colab notebook. Shouldn't there be a UI that makes it really easy, that sort of thing?
From the start, were you just thinking we will open source, we will train our own models
from scratch? Did you think about other modalities?
Yeah, I mean, there have been a lot of people who thought I should do something in music, because music has been a huge hobby of mine for six years or so; I produce music. But I just couldn't wrap my brain around what useful thing I would end up making for people, although now there are a lot of very interesting, cool, useful things in music. And then it seemed like a lot of people were very focused on language. And I had really enjoyed, I'd already worked with, lots of creative tools. Like when I was in high school, I used to make logos, or I would make music, or whatever. So I was excited that I could finally find something that was a combination of creativity and tooling. Images have really amazing built-in distribution; people want to share those kinds of things. So it ended up being this perfect thing that I was excited to work on.
How do you think the overall landscape for competition is different in language versus images versus music? Like, how did you think about the ways you guys would want to build advantage and stand out?
I think with language, I don't know how many language companies there are. You guys would probably know better than me, but it seems like there are over 20, and maybe five or eight of them have a billion dollars' worth of funding. I also didn't want to work on something if there were already extremely passionate people working really hard at that thing, people that I really respected who were working on that thing. And at the time, with images, there was Midjourney, there was OpenAI doing some DALL-E stuff, and then you saw Stable Diffusion. But for some of these companies, it didn't seem like there was going to be a long-standing, concerted effort to keep making the models better. It was sort of unclear who was doing this as a fun demo versus who was doing this as something they would spend and invest tons of their time in. And once I had kind of figured out to what extent OpenAI was going to invest in it, and that the folks at Stability AI seemed to be focused on like seven different kinds of things, I just thought, hmm, there are just not enough people who want to do this one thing and do it really, really great. So I think for me, it was just about: were there enough capable people that wanted to do this?
Can you talk a little bit about the specific direction you decided to take with Playground as well?
I know you thought really deeply about some of the applications or use cases for it.
So I was just curious if you could share a bit more about that.
Yeah, I think one thing that has been surprising, and it hasn't changed too much actually since maybe June, July, August of 2022, is how a lot of people think about text-to-image. Right now it's not even really text-to-image; it's text-to-art.
Sorry, what's the difference?
The difference is that these models haven't quite reached the potential of what their utility could be. Right now, for the most part, we formulate a prompt, which is really just a caption of what the image is, and then it diffuses into an image, a set of pixels, but a lot of those pixels are primarily used for art. What we haven't done is anything beyond that. We haven't really done something like editing, for example. Why can't we take an image that you already have and insert something into it with the correct lighting and so on? Why can't we stylize an existing thing? Why is there not a blend of real and synthetic imagery in a single image? That could then be used for a lot more things than just pure art. So right now it's a lot of people just making art, and sometimes that reduces its practicality or its utility.
Yeah, that makes sense.
I think one of the things that you focused on as well that I thought was really interesting is you built your own models, right? You train your own models, and a lot of people in this space just take Stable Diffusion and fine-tune it or take other approaches like that. And you just launched v2.5, and the model is performing incredibly well. It creates really beautiful imagery and it's super high quality. I'm curious to hear a little bit more about how you went about training your model, hiring a team specifically for that purpose, and how you thought about it and approached it.
Yeah, it turns out that, you know, a lot of strong engineers' first thought is that you just take a model architecture, you find a lot of data, you fund yourself with enough compute, and you throw these things into a mixture of sorts, and out comes something like DALL-E 2 or DALL-E 3. It turns out that it's just way more complex than that, and way more complex than I even imagined. I had a sense that it was more complicated than that, but it's more complicated than even that.
So I think there are a couple of things that we did. One of the things that we were really focused on with that model was that we wanted to see how far we could push the architecture of something that already existed. This was mostly a test. It was a test to see how far we could get as a research team before the next model change. And so we wanted to take a recipe that we knew worked already, which was Stable Diffusion XL's architecture, which is a U-Net, right, and CLIP, and the same VAE that Robin Rombach trained, all that stuff. And then we said, okay, what if we try to get something that's at least better than SDXL, better than the open-source model? And we weren't really sure by how much. So our only goal was to just be better and try to deliver the number-one state-of-the-art open-source model that we could release.
And so we kind of learned two things. One is that when we looked at some of the images from something like SDXL, we noticed that there was this average brightness. It was really confusing. It didn't quite have the right kind of color and contrast. And in fact, I was so surprised about the average brightness when comparing it to the images of our model that I thought it was a bug during eval. I was literally looking at the images and saying, these cannot be the right images. And my team was like, hey, I think you're actually just getting used to the images of the new model. And so we employed this thing called the EDM formulation, which samples the noise slightly differently. It's a really clever kind of math trick, and there's a paper that you could read on it. But it's surprising how this one little, very clever trick can produce images that have incredibly great color and contrast. The blacks are really vibrant, with a bunch of different colors, and this average brightness kind of goes away. So that's one thing.
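For listeners curious what the EDM formulation roughly looks like in code, here is a minimal sketch of the core idea from the Karras et al. EDM paper: draw per-sample noise levels from a log-normal distribution instead of a fixed discrete schedule. This is only an illustration, using the paper's default P_mean and P_std values, not Playground's actual training code.

```python
import torch

def sample_edm_sigmas(batch_size: int, p_mean: float = -1.2, p_std: float = 1.2) -> torch.Tensor:
    """Draw per-sample noise levels sigma from a log-normal distribution,
    the EDM-style alternative to a fixed discrete noise schedule."""
    log_sigma = torch.randn(batch_size) * p_std + p_mean
    return log_sigma.exp()

def noisy_latents(clean_latents: torch.Tensor, sigma: torch.Tensor) -> torch.Tensor:
    """Corrupt latents with Gaussian noise scaled by sigma (broadcast over C, H, W)."""
    noise = torch.randn_like(clean_latents)
    return clean_latents + noise * sigma.view(-1, 1, 1, 1)
```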
You know, that's a really interesting example of optimizing for one aspect of creating aesthetically pleasing imagery, and there are a few other aspects like that. So I'm just curious, how much do you have to hand-tune different parameters, versus it being something that you just get as you train or post-train a model?
Yeah, I mean, there are so many different dimensions to these models. One is just its understanding of knowledge, but then for aesthetics it's really tricky. I honestly think the field itself is so nascent that every month there's a new trick, a new thing that we all develop or find out. There's an element of some of that being a lot of different tricks. For example, there's this new trick that hasn't been well employed or well exploited yet, from Tero Karras: he does this thing called power EMA, which basically helps training converge really fast. So that's one trick. And then there's this EDM trick. And there's this thing called offset noise. So there are a lot of tricks for things like color and contrast. There's even a trick called DPO, which I think works in the language-model world and also the image world. So there are lots of tricks that sometimes get you 10 or 20 percent, sometimes 2x, improvements. But I think the number-one trick is really that last phase, a supervised fine-tune where you're finding really great curated data. And it's hard to say how much of that is a trick, because it's actually just a lot of meticulous work. So I think there's a combination of some of these things being tricks and techniques, and then there's this other thing that's just really hard and meticulous; there has to be deep care. And with images, maybe more so than language, there has to be taste and judgment.
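As an aside for readers, the offset-noise trick mentioned above is small enough to show directly. The sketch below is illustrative rather than Playground's code; the 0.1 strength is just a commonly cited value.

```python
import torch

def offset_noise(latents: torch.Tensor, strength: float = 0.1) -> torch.Tensor:
    """Standard Gaussian training noise plus a small per-(sample, channel) constant,
    which lets the model learn images far brighter or darker than the dataset mean."""
    b, c, _, _ = latents.shape
    noise = torch.randn_like(latents)
    return noise + strength * torch.randn(b, c, 1, 1, device=latents.device)
```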
Yeah. How do you think about that from the perspective of the evals you do? Because not that many people have amazing aesthetic taste, right? And so I'm a little bit curious how you end up determining what good taste is, or is it just user feedback, you know, thumbs up, thumbs down? How do you think about that?
One thing that I've noticed is that every time we do an eval, we try to make our evals better than the predecessor eval. And one thing I always notice with each successive run is that I find out much later, after the eval, that the model has all these gaps. An example of a gap that we recently had: we did well on our eval, but one area where I thought we did poorly, that I wish we had done better on, was photorealism. Sometimes it would make faces look like they hadn't slept for three days or something. So I think that most evals in the industry are relatively flawed. A lot of them are benchmarks on things that are maybe valuable for the purposes of marketing but are not necessarily well correlated with what users care about. A simple example with large language models: there's a reason why they're probably good at homework. It's because a lot of the evals are related to things that could be homework, like solving an LSAT or a bio test or a math test. Some of these evals just don't have the necessary coverage. So I think with things like judgment and taste, my feeling is that overall the evals need to get way stronger. One thing that we tend to do is really look at a lot of images across a lot of grids, and we're really exacting about what could be off. But you have to look at thousands of images across lots of different grids, across different checkpoints, to find and pick release candidates. I still think that our own evals are not sufficiently strong, and they could be better at world knowledge, whether that's the ability to reproduce a celebrity, if that's what you want, or paintings (sometimes paintings are difficult), or 3D, or illustrations, or logos. Overall, I think coverage is pretty tricky.
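To make the grid-review workflow concrete, here is a rough sketch of tiling the same prompts across several checkpoints for side-by-side inspection. The file layout and names are hypothetical, not Playground's actual eval tooling.

```python
from pathlib import Path
from PIL import Image

def build_review_grid(checkpoint_dirs: list[Path], prompt_ids: list[str],
                      tile: int = 256) -> Image.Image:
    """Rows = prompts, columns = checkpoints. Expects <checkpoint_dir>/<prompt_id>.png files."""
    grid = Image.new("RGB", (tile * len(checkpoint_dirs), tile * len(prompt_ids)))
    for row, prompt_id in enumerate(prompt_ids):
        for col, ckpt_dir in enumerate(checkpoint_dirs):
            img = Image.open(ckpt_dir / f"{prompt_id}.png").resize((tile, tile))
            grid.paste(img, (col * tile, row * tile))
    return grid
```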
So one of the things that you guys do is have voting schemes or user studies within the product itself. I don't know if it's grids, but you're asking users to express preferences, more so than I think perhaps other research efforts are. Can you talk generally about your data curation strategy, whether there's some sort of overall framework, or whether community is a big piece of it?
Generally, we try to keep it very simple, because we know that users are there to make images. They're not there to help us label images, or annotate things, or tell us everything about their preferences. So we have a very sophisticated process for how we curate images and how we collect data from these users, to help us rank and make sure we're choosing the right sorts of things that we want to curate. These things might seem very simple when you encounter them, but beneath that is something very, very complex. It's a little tough to go into it too deeply, because it does feel like a bit of a secret sauce, I suppose.
Yeah. Well, at least I'll feel good that my guess as to what's interesting is right.
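As an illustration of how simple in-product preference votes can be turned into a model ranking, here is a minimal sketch that aggregates pairwise votes into win rates. The data format and model names are hypothetical; Playground's actual pipeline is not public.

```python
from collections import defaultdict

def win_rates(votes: list[tuple[str, str]]) -> dict[str, float]:
    """votes is a list of (winner, loser) pairs from side-by-side comparisons;
    returns each model's share of the comparisons it appeared in that it won."""
    wins: dict[str, int] = defaultdict(int)
    games: dict[str, int] = defaultdict(int)
    for winner, loser in votes:
        wins[winner] += 1
        games[winner] += 1
        games[loser] += 1
    return {model: wins[model] / games[model] for model in games}

# Example: three votes between two hypothetical checkpoints.
print(win_rates([("candidate-a", "candidate-b"),
                 ("candidate-a", "candidate-b"),
                 ("candidate-b", "candidate-a")]))
```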
Can you just characterize where Playground does really well today, where you stand out, and what sort of use cases you're focused on winning?
We're probably number two, I suspect, at text-to-art at the moment, just because we're training these models from scratch and we're closing the gap as rapidly as we can around all the various use cases. But I think that we'll probably diverge from some of the other companies, in part because we're going to care a lot more about editing. People just have a lot of images on their phone, or they want to take some image that they love, whether that's made as art or something that they found, and tweak it a little bit. It's a little annoying that you make this image and then you can't really change too much about it. You can't change the likeness of it; maybe there's a dog, or your face, or something, character-consistency issues. It feels a little like a loot box right now, and because it's so much of a loot box, it feels like too much effort to get something that you really, really want. So where we're navigating is: how can we help you take an image that you love, maybe your logo, or incorporate something like your logo, or put it in some sort of situation that you would prefer? Text synthesis is something that we want to do, for example. So those are some areas we want to head towards, where there's higher utility, and less of: you make an image and you just post it to Instagram or something like that.
Where do you want to take the company and the product over the next few years?
Like, what is the long-term vision of what you're doing?
If people out there are working on scaling text, we're basically trying to focus on scaling pixels, and the first area that we started on is just images. The reason we're working on images instead of, say, video or 3D is that one issue with 3D is that it tends to be better to work on 3D if you're making the content, like you're making Pixar movies; the tools in 3D tend to not make as much money. And the other thing with video is that video is just extraordinarily computationally expensive to do inference or even training on. And a lot of the video models pre-train with a billion images first anyway, to have a rich semantic understanding of pixels. We just think that images are maybe the most obvious place to start, because, A, the utility is quite low right now, and, B, it's actually somewhat efficient computationally to do.
So long term, I think we're trying to make a large vision model. There's not really a word for it, I guess. We have LLMs, but I'm not really sure what the word is for vision or pixels if you're trying to make a multitask vision model. The goal would be to cover three areas with a large vision model: to be able to create things, edit things, and then understand things. Understanding would be like GPT-4V, or if you're using something open source, like CogVLM; there are all these amazing vision-language models happening. And editing and creating are things that we've kind of talked about. But it would be really amazing at some point, if you made this really amazing large vision model, that it could do things like not just create art but maybe help some kind of robot traverse some sort of path or maze. And then there are things in the middle, like maybe you have a video camera or a surveillance system, and it's able to understand what's going on in that. But I think right now it's really focused on graphics.
And then how do you think about the underlying architecture for what you're doing?
Because, you know, traditionally a lot of the models have been diffusion model based,
and then increasingly, you know, you see people now starting to use transformer-based architectures
for some aspects of image gen and things like that.
How do you think about where all this is heading from an architectural perspective, and what sort of models will exist in the next year or two?
My kind of controversial take, perhaps, is this: there's this thing called DiT, which people believe Sora is allegedly based on, and then there are variants of DiT. There's this thing called, I think, MMDiT, which Stable Diffusion 3 is supposed to be based on, by that research team at Stability AI. And my overall feeling is that transformers are definitely the right direction. But I don't think we're going to get enough utility if we're not somehow trying to figure out a way to combine the great, amazing knowledge of a language model with something like DiT, which is trained completely from some kind of video caption or image caption to an image. There's not enough interpretable knowledge there, I suppose; you're not able to interpret anything about the input, which a language model is really great at, but then there are these models that are just trained on captions and emit images. And it's kind of unclear how we might marry these two things. So it sure would be nice if somehow we could combine them. So I think the architecture is most likely going to change. I don't think that DiT is the right architecture, but transformers, certainly.
And just for people who are listening, DiT stands for diffusion transformer, in case people are wondering.
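For readers who want a picture of what "marrying" language-model knowledge with a diffusion transformer could look like, here is a minimal, hypothetical sketch of a DiT-style block in which image tokens cross-attend to frozen language-model embeddings. It is not Playground's architecture, Sora, or MMDiT, just an illustration of the general idea.

```python
import torch
import torch.nn as nn

class TextConditionedDiTBlock(nn.Module):
    """Illustrative transformer block for image tokens, conditioned on
    precomputed language-model embeddings via cross-attention."""
    def __init__(self, dim: int, text_dim: int, heads: int = 8):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(dim, heads, kdim=text_dim, vdim=text_dim,
                                                batch_first=True)
        self.mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
        self.norm1, self.norm2, self.norm3 = nn.LayerNorm(dim), nn.LayerNorm(dim), nn.LayerNorm(dim)

    def forward(self, image_tokens: torch.Tensor, text_embeddings: torch.Tensor) -> torch.Tensor:
        # Self-attention over noised image tokens.
        x = image_tokens
        h = self.norm1(x)
        x = x + self.self_attn(h, h, h)[0]
        # Cross-attention injects knowledge from the (frozen) language model's embeddings.
        x = x + self.cross_attn(self.norm2(x), text_embeddings, text_embeddings)[0]
        # Position-wise MLP.
        x = x + self.mlp(self.norm3(x))
        return x
```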
One belief held by some of the large labs focused mostly on language today is that, in the end, we end up with one truly multimodal general model, right? That is, we don't end up with a separate language model and video model and audio model and image model; it's any modality in, gigabrain knowledge, reasoning, long context, and any modality out. Do you believe in that worldview, or how do you see it differently?
I definitely think the model is going to be multimodal. And in fact, that's kind of what I mean about some of these models that are strictly trained through a diffusion transformer. A diffusion transformer that's only taking caption-image inputs just completely lacks some knowledge. And then conversely, if you look at just the language models, we know that language has a much lower dimensionality than, say, an image, which has all these pixels that tell us about lighting or physics or spatial relationships or size and shapes. For example, if you were to take a glass and shatter it on the floor, and then I asked you to describe it and I described it, we would come up with completely different descriptions if Elad had to go and draw it, right? So we know that pixels have an enormously high information density compared to language. And language is really just, between me and you, a compressed way that you and I can converse with each other at somewhat higher bandwidth; we have an abstract view of what those words mean.
So I think there has to be something of both: language is really great because it's compressed information, and vision is really great because it's so information-rich, but it's been hard to annotate until recently. It's only because vision-language models exist that it's now suddenly a lot easier to label or annotate or understand what's going on in an image. So I think these two things are very likely going to be married. The only question to me is: does language hit a ceiling? Language has this wonderful trait where you can use it to control things, which is pretty cool because of its low dimensionality. But my question would be, I wonder if language has a lower ceiling than, say, vision, because it's very easy to get lots of pixel data, and that pixel data is very, very high density.
Very easy to get additional pixel data, beyond the data already collected from the internet that's gone into these models.
Yeah, I mean, there's an assumption there, maybe. One assumption I tend to question is whether the internet is sufficient. The internet is very big, but maybe there's some kind of mode collapse even with internet data. Whereas with vision, you can at least make a robot that just travels down the street and keeps taking pictures of everything. You can get essentially infinite training data with vision, but it might be trickier to filter and clean internet data, especially as more synthetic data ends up on the internet.
One other area that I know you spend a lot of time on is music. You know, you make your own music and produce it. And there have been a number of different applications, Riffusion, et cetera, that have come up on the music side.
I was just curious how you've been paying attention to that,
what you think of it, and, you know, where you think that whole space evolves to.
Yeah, I love audio. That would be the other thing I would go work on if it weren't for Playground. Partly, I didn't work on music because the whole music industry is only about $26 billion, so it was a little hard for me to figure out how big a music thing could be. But I definitely think audio is going to be enormous. Things like ElevenLabs are very interesting. Anyway, I've been trying to find ways to use it as a user, because that gives me a stronger sense of where things are going. And one thing that I've been waiting on for many years: instrumentals in music are actually very easy to get or to make. There's a wide variety of quality, of course, but generally the instrumentals in a song, whether it's a song from Taylor Swift or whoever, or a rap song, those beats or instrumentals are fairly easy to make. What's hard is to get lyrics and vocals. And that's always been a difficulty of mine: how do I find a singer, and then how do I get them to write lyrics and then sing it? That's a much more scarce resource in the music world.
And so for the first time, with something like Suno AI, it was really cool, because it's the first time I heard them be able to make a rap song where the rapper has good flow. Flow is just the swing of the lyrics against a beat. Or you hear actually really good lyrics that feel very emotional, have the right breathiness, and don't sound like it's all Auto-Tune, I guess. So I have this little workflow where I make a song in Suno, and then I use a different AI tool, it's AI tools all the way down, I guess, to split the stems and just grab the vocal, but throw away the instrumental. And then I get to make a song with my own instrumental and that vocal. Anyway, I put some songs on my Twitter where I basically tried to do this, and I can get to a higher-quality song, I guess, because I make the instrumental. There are still some weird errors in the songs, but that's been a really cool way to use AI, in my opinion.
Suhail, thanks so much for sharing everything you're working on at Playground with us.
Find us on Twitter at @NoPriorsPod. Subscribe to our YouTube channel if you want to see our faces, and follow the show on Apple Podcasts, Spotify, or wherever you listen. That way you get a new episode every week. And sign up for emails or find transcripts for every episode at no-priors.com.