No Priors: Artificial Intelligence | Technology | Startups - The Timeline for Realistic 4-D: Devi Parikh from Meta on Research Hurdles for Generative AI in Video and Multimodality

Episode Date: July 20, 2023

Video dominates modern media consumption, but video creation is still expensive and difficult. AI-generated and edited video is a holy grail of democratized creative expression. This week on No Priors, Sarah Guo and Elad Gil sit down with Devi Parikh. She is a Research Director in Generative AI at Meta and an Associate Professor in the School of Interactive Computing at Georgia Tech. Her work focuses on multimodality and AI for images, audio and video. Recently, she worked on Make-A-Video 3D, also called MAV3D, which creates animations from text prompts. She is also a talented AI and analog artist herself. Elad, Sarah and Devi talk about what's exciting in computer vision, what's blocking researchers from fully immersive generative 4-D, and AI controllability. No Priors is now on YouTube! Subscribe to the channel on YouTube and like this episode.

Show Links:
Devi Parikh - Google Scholar
Text-To-4D Dynamic Scene Generation, named MAV3D (Make-A-Video3D): Full Research Paper
Website with examples of image-to-4D generation
Devi's Substack

Sign up for new podcasts every week. Email feedback to show@no-priors.com
Follow us on Twitter: @NoPriorsPod | @Saranormous | @EladGil | @DeviParikh

Show Notes:
(0:00:06) - Democratizing Creative Expression With AI-Generated Video
(0:08:31) - Challenges in Video Generation Research
(0:15:57) - Challenges and Implications of Video Processing
(0:20:43) - Control and Multi-Modal Inputs in Video
(0:25:50) - Audio's Role in Visual Content
(0:39:00) - Don't Self-Select & Devi's tips for young researchers

Transcript
Starting point is 00:00:00 Text prompts are democratizing creative expression, and the holy grail is AI-generated and edited video. Elad Gil and I sit down with Devi Parikh. She's a research director in generative AI at Meta, a leading researcher in multimodality and AI for images, audio, and video. And she's an associate professor in the School of Interactive Computing at Georgia Tech. Recently, she worked on Make-A-Video 3D, which creates animations from text prompts. She's also a talented artist herself. Devi, welcome to No Priors.
Starting point is 00:00:35 Thank you. Thank you for having me. Let's start with your background and how you got started in Computer Vision. I've heard you say you choose projects based on what brings you joy. Is that how you got into AI research? Kind of, kind of, yeah.
Starting point is 00:00:49 So my background is that I grew up in India and then I moved to the U.S. after high school. And I went to a small school called Rowan University in southern New Jersey for my undergrad. And that is where I first got exposed to what at the time was being called pattern recognition. We weren't even calling it machine learning. And I got exposed to some research projects. There was a professor there who kind of showed some interest in me, thought I might have potential to contribute meaningfully to research projects. And that's how I got exposed. And I really, really enjoyed what I was doing there,
Starting point is 00:01:22 decided to go to grad school, to Carnegie Mellon. I knew I was enjoying it, but I wasn't sure if I wanted to do a PhD. So at first, I wanted to just kind of get a master's degree with a thesis, so that I could do some research. But the year that I applied, the ECE department at CMU decided that there wasn't going to be a master's track with a thesis: either you just take courses or you go for a PhD. And so they kind of slotted me onto the PhD track, which I wasn't so sure of. But my advisor there was reasonably confident that I was going to enjoy it and was going to want to keep going. So yeah, that's how I got started in this space. At first, I was doing projects that didn't have a visual element to them. How did you pick a thesis project? So at first,
Starting point is 00:02:06 I was working on projects that didn't have too much of a visual element to them. But when I got to CMU, my advisor's lab was working in image processing and computer vision. And I always thought that it was pretty cool that everybody gets to kind of look at the outputs of their algorithms and see what they're doing, whereas if it's kind of non-visual, then, yeah, you see these metrics, but you don't really have a sense for what's happening, if it's working, if it's not. And so that's how I got interested in computer vision, and that then defined the topic of my thesis over the course of my PhD. So you have been working in machine learning long enough that, as you said, it was called
Starting point is 00:02:43 pattern recognition, and you've worked across a bunch of different modalities. How does that change your research path? Because things like diffusion models and GANs and large transformers, none of that existed when you were first starting. And I think you have managed to sort of translate or transition your interest in a way that keeps you on the cutting edge. How has that happened? Yes, I think, I mean, you can always kind of look back and try and find patterns.
Starting point is 00:03:13 Like when you're actually doing it, you don't necessarily have a grand strategy of anything in mind. But when I look back, I think one common theme that, like, led to me transitioning across topics a little bit, was that I was interested in seeing how we can get humans to interact with machines in more meaningful ways. And so kind of, even my transition from kind of non-visual to visual modalities in hindsight, I feel like was essentially that. I felt like you can't interact with these systems too much if it's sort of these abstract modalities that you're looking at. And then when I was working in computer vision, I wanted to
Starting point is 00:03:47 find ways for humans to be able to interact with these systems more. So I started looking at kind of these attributes and adjectives, like, oh, something is furry or something is shiny, and using that as a mode of communication between humans and machines, both for humans to teach machines new concepts and for machines to be more interpretable by explaining why they're making the decisions that they're making. And that slowly led me more into natural language processing, where instead of just these adjectives and attributes, I was looking at more natural language as a way of interacting. So a lot of my work in visual question answering, where you're answering questions about images, image captioning,
Starting point is 00:04:21 was coming from there. And then over time, I sort of started thinking of ways to go even deeper in this interaction, other ways where AI tools can enhance creative expression for people, give them more tools for expressing themselves. And that's how I got interested in AI for creativity. And I was dabbling in kind of a few fairly random projects a few years ago. Kind of my bread and butter research was still multimodal vision and language, but then I enjoyed what I was doing with AI for creativity, and a couple of years ago
Starting point is 00:04:55 that took sort of a little bit more of a serious turn, where I made it more of my full-fledged research agenda. And that's how I started working more seriously on generative modeling, including transformer-based approaches and diffusion models, for images, for video, for 3D video, things of that sort. So you became a professor, and you also, you know, now work in industry at Meta. What brought you there? So I've been at Meta for about seven years now. And this had started when I was transitioning from Virginia Tech to Georgia Tech. So I was an assistant professor at Virginia Tech, and I was getting started at Georgia Tech.
Starting point is 00:05:30 And in that transition, I decided to spend a year at FAIR. At the time, it was called Facebook AI Research; now it's Fundamental AI Research at Meta. And I knew colleagues there; some of them had been at Microsoft Research. Before that, I had interned at MSR. I had spent summers at MSR, even as a faculty member. And so I had a lot of colleagues who I knew, and so I thought it would be fun to kind of spend a year, collaborate with them, get to know what FAIR is like. So that's what that was.
Starting point is 00:05:59 It was supposed to be a one-year stint, and then I was going to go back to Georgia Tech and kind of continue with my academic position. But in that one year, I enjoyed it enough, and I think FAIR enjoyed having me around enough, that we tried to figure out, is there a way to keep this going for longer? And so for many years after that, for five years or so, I was splitting my time. Every fall, I would go back to Georgia Tech in Atlanta and spend the fall semester there to teach. And then the rest of the year, I would be in Menlo Park at Facebook, now Meta.
Starting point is 00:06:29 And you transitioned from Fundamental AI Research to a new sort of generative AI group. Can you talk about why that's interesting to Meta or sort of what kind of things you're working on now? Yeah, yeah. I mean, yeah, that's a very exciting space right now. There's a lot happening both within Meta and outside, as I'm sure many of the people listening to this are aware of. But yeah, so the new organization was created a few months ago, so not a long time ago. And it's looking at things like large language models, image generation, video generation, generating 3D content, audio, music, yeah, all sorts of modalities that you might think of.
Starting point is 00:07:09 And why is it interesting? I mean, like right now, if you think about all the content, there's so much content that we consume in all modalities and all sorts of surfaces. And it makes a lot of sense to ask that instead of, maybe not instead of, but in addition to all of this consumption, can more of us be creating more of this content, right?
Starting point is 00:07:32 And so almost everything that you think of images, video, you can ask this question, like for any situation, you're searching for something, trying to find something, it's relevant to ask, well, could I just create what it is that I have in my head? And so when you think of it that way, you can see how it can touch a lot of different things across a variety of products and surfaces. So, yeah. Yeah, that makes a ton of sense. I think we'll come back and ask you some questions about images and audio and a few other things since you've done so much interesting work across so many different areas. But maybe we can start a little bit with video generation. In part, you know,
Starting point is 00:08:06 due to the fact that you had a really interesting recent project called Make-A-Video. And in that approach, users can generate video with a text prompt. So you could type in, imagine a corgi playing with a ball, and it would generate a short video of a corgi playing with a ball. Could you tell us a bit more about that project, how it started, and also how it works and what's the basis for the technology there? Yeah, yeah. So Make-A-Video, it started because, I mean, this was a couple of years ago
Starting point is 00:08:33 where this was before Dali 2, by the way. So this was like after Dali 1 had happened. feel like a lot of people don't even remember Dali 1 anymore. Like, people don't even talk about that. It's fun to go check out what those images look like. And that had blown our minds at the time. But now when you go back, you're like, wait, like, that's not even interesting. But anyway, so we had seen a lot of progress in image generation.
Starting point is 00:08:55 And so it seemed like the next kind of entirely open question where we hadn't seen much work at all was to see what can we do with video generation. And so that was kind of the inspiration behind that. And for Make a Video, the approach specifically, the thinking was we have these image generation models. By this time, we had seen a lot of progress with diffusion-based models from a variety of different institutions. And so the idea was, is there a way of leveraging all the progress that's happened with images in a very direct way to sort of make video generation possible? And so that led to this intuition that what if we try and use images? and associated text as a way of learning what the world looks like and how we, how people talk
Starting point is 00:09:42 about the visual content, and then separate that out from trying to learn how things in the world move. So separate out appearance and language and that correspondence from motion of how things, of how things move. And so that is what led to make a video. And so there are sort of multiple advantages of thinking of it that way. One is there's less. for the model to learn because you're directly bringing in everything that you already know about images to start with. The second is all of the diversity that we have in our image data sets that image models already know all sorts of fantastical depictions of like dragons and unicorns and things like that, but you may not have as much video data easily available. All of that
Starting point is 00:10:27 is inherited. So all the diversity of the visual concepts can come in through images, even if your video data set doesn't have all of that. And the third benefit, maybe the biggest one is that because of the separation, you don't need video and associated text as paired data. You have images and text as paired data, and then you just have one labeled video to learn motion from. So these were kind of three things that we thought were quite interesting in how we approached make a video.
Starting point is 00:10:54 And so concretely the way it works is that when you initialize the model, you're starting off with image generation, sort of parameters that have already been learned. So before you do any training for me, make a video, we set it up so that it can generate a few frames that are not temporally coherent. So there is going to be independent images. Like the corgi playing with the ball is just going to be independent images of corgi playing with blue balls, but they're not going to be temporally coherent.
Starting point is 00:11:22 And then what the network is trying to do as it goes through the learning process is to make these images temporally coherent. So at the end of training, it is generating a video rather than just unrelated images. And that's where the videos come in as training data. That's a great explanation. One question, just like if we use an example of something that is not going to be in your video training set, right? So I want a flying corgi, for example, right? How should I think of this in terms of like interpreted motion?
Starting point is 00:11:57 Yeah. So one way of thinking of it could be that you may not have seen a flying corgi, but you've probably seen flying airplanes or flying. birds and other things that fly in images and in video. And so from images, you will have text associated with it. So you will have a sense for what things tend to look like when someone is saying, oh, this is X flying or Y flying. And then in videos, you will have seen the motion of what stuff looks like when it flies. And in images, you will have seen what corgis look like.
Starting point is 00:12:25 And so it's hard to kind of know for sure what it is that these models learn. Sort of interpretability is not a strength of many of these deep, large, architectures, but that could be one intuitive explanation for how the model is managing to figure out what a flying corgi might look like. Well, what are some of the major forward-looking aspects of this sort of project and research? I think there's a ton to do in the context of video generation. Like, if you look at make a video, it was very exciting. It was sort of first of its kind capabilities at the time.
Starting point is 00:12:57 But it's still, it's a four-second video. It's essentially an animated image, right? It's kind of the same scene, the same set of objects that are moving around in reasonable ways, but you're not seeing objects appear, objects disappear, you're not seeing objects reappear, you're not seeing scene transitions. None of this is in there. And so if you look at, if you think about the complexity of videos that you just regularly come across on various surfaces, this is far from that. And so there is a ton to be done in terms of making these videos longer, more complex, having memory so that if an object reappears,
Starting point is 00:13:33 it's actually consistent. It doesn't now look entirely different. Things of that sort of being able to tell more complex stories through videos, all of this is entirely open. I know these things are always extremely hard to predict. But if you look forward a year or two, what do you think the state of the art will be in terms of length of video, complexity of the scenes that you can animate, things like that?
Starting point is 00:13:54 Yeah, that is hard to say. And to be honest, I've actually been surprised that we haven't seen more of this already. Like, make a video was, I think, what, nine months or so ago, maybe approaching one year. And it's not like, even from other institutions, it's not like we're seeing amazingly longer videos or significantly higher resolution or much more complexity. We're still kind of in this videos equals animated images. And, yeah, maybe the resolution is a little bit bigger. Quality is a little bit higher. But it's not like we've made significant breakthroughs, unlike, for example, what we've seen with images.
Starting point is 00:14:28 So I do, that has given me a sense that maybe this is harder than what we might think and sort of our usual curves of like with language models or image models. We're like, oh, just six more ones and there's going to be something else that's an entirely different step change over this. I think that might be harder in video and I wonder if there is something that we're kind of fundamentally missing in terms of how we approach video generation. So it's not quite answering what you asked me, but I do think that it might be a little bit slower than what we might have guessed just based on progress and other modalities.
Starting point is 00:15:01 What do you think is the main either challenge or bottleneck that you think has slowed progress in this field or not slowed it? I mean, obviously there's been, you know, a lot of people are working very hard on these problems. But to your point, it seems like sometimes you have these fundamental breakthroughs and sometimes it's like it's an architecture like transformer-based models versus traditional NLP. And sometimes it's, you know, iterating on a lot of other things that already exist in the pre-existing approaches and just sort of solving specific. engineering or technical challenges. If you were to sort of list out the bottlenecks to this, what do you think they're likely to be? Yeah, I think there's a few different things. One is
Starting point is 00:15:36 videos are just sort of, from an infrastructure perspective, harder to work with, right? They're just sort of larger, more storage and sort of more expensive to process and more expensive to generate and all of that. So there's just that hydration cycle that is much slower with video than it would be with other modalities. So that is one. The second is, I don't, I don't think we've still figured out the right representations for video. There is a lot of redundancy in video, one frame to the next frame. There's not a whole lot that changes. We still kind of approach them fairly independently as sort of independent images.
Starting point is 00:16:12 Even if you're generating it, it's kind of one after the other, or even if you're generating in parallel and then making it finer grain. So I think maybe that could be something that helps with a breakthrough that if we really figure out how to represent videos efficiently. And the third is this hierarchical architecture that if you want longer videos, there's just so many pixels that you're trying to generate, right? It's a very, very high dimensional signal compared to anything else that we're doing.
Starting point is 00:16:42 And so just thinking through how do we even approach that, what sort of hierarchical representation makes sense, especially if you want these scene transitions, if you want to have this consistency, which may be a form of memory, figuring those architectural pieces out, I think maybe another piece. of this puzzle. And then finally, data, right? Data is kind of goal in anything that we're
Starting point is 00:17:01 trying to do. And I don't know if as a community week, we've quite built the muscle of thinking through data, sort of massaging the data appropriately, and all of that in the context of video. We have that muscle quite a bit with language, quite a bit with images, but with video we're perhaps not quite there yet. But what would be the ideal training set for video in that case or what's lacking from the existing appretches? Yeah, I think what's lacking may not be so much the data source itself, although that is certainly a challenge as it is with other modalities, but I think it might also be the data recipes that do we want to start with training with sort of these very short videos where
Starting point is 00:17:44 not much is happening, the scene isn't really changing, but then that also tends to limit the motion. There's just not much happening, and so you can end up with these kind of animated image-looking things. And on the other hand, you have sort of very complex video might be multiple minutes long with all sorts of scene transitions. And that's ideally what you want to shoot for. That's where you want to get. But if you're just kind of directly throw all of that into the network, it's unclear of the models who will be able to learn all of that complexity well. So I think thinking through some sort of a curriculum may be valuable here. And I don't think we've quite nailed that
Starting point is 00:18:15 recipe down. I feel like every generation of sort of technology shifts always runs into video is the hardest thing to do. And if you look at sort of just the first substantiation of the web, one of the reasons YouTube sold was the infrastructure point you made earlier, where just dealing with that huge amount of streaming and the costs associated with it and everything else, even in the prior generation of just, you know, can we host and stream this effectively? In part led to them, you know, getting sold to Google reasonably early in the life of a company. So it's interesting how video is always that much more complicated. Yeah, yeah. And same thing for computer vision, right? Like here we're talking about generation,
Starting point is 00:18:49 but even just understanding, with images, with image understanding, there was so much progress that was being made. And videos was always kind of not only trailing behind, but just sort of continued to be harder. Even sort of the rate of progress was slower, not just the absolute progress. And I think, yeah, I think we're seeing some of that for generative models as well. Devi, I know this isn't within your, like, core field,
Starting point is 00:19:12 but I'm sure you also pay attention. Like, how do you think advances in video may, like, impact robotics? So I think there, the video understanding piece is probably more relevant than the video generation piece. And video is, like if you think of embodied agents, right, they're sort of moving around and consuming visual content, which inherently is video, right? They're not looking at static images. And so I think that video understanding piece is very relevant there. What's also interesting in the context of embodied agents or sort of robotics, physical robots that are moving around is that it's not. not passive consumption of videos, right?
Starting point is 00:19:51 It's not like how you and I might be watching videos on YouTube or anything else. It's that the... I'm yelling at the screen. I'm not passive. It's that the next visual signal that the robot will see will be a consequence of the action that the robot had taken. So if it chose to move a certain way, that's going to change what the video looks like in the next few seconds. And so there's that interesting feedback loop there, but it knows what action it had taken.
Starting point is 00:20:19 And it sees how that changed the visual signal that it is now getting as input. And so that connection makes it adds a layer of interestiness to how it can process the video sort of in contrast with sort of regular computer vision disembodied tasks where we think of sort of a just streaming a video is just kind of happening and you're not controlling what you're seeing. You started by saying that human interaction was a big driving force in. in your research interests and, you know, going beyond, like, metrics as outputs and even language as inputs. How do you think about controllability in video and, like, how important
Starting point is 00:21:02 text prompting is to sort of the next generation of creation of creation. Yeah, I think that's, I think that's very important exactly to your point that if we want these generative models, not just for video, but for any modality to be tools for creative expression, then it needs to be generating content that corresponds to what someone wants to express. It has to bring somebody's voice to life, and that is not possible if there aren't good enough ways of controlling these models. And so text is one way. That's better than random samples.
Starting point is 00:21:37 That's one way in which I can say what I want. But right now, for the most part, you type in a text prompt, you get an image back, a video back, and either you take it or leave it, right? Like, if you like it, that's great. If not, you're just kind of try again. And maybe you tweak the prompt a little bit. You sort of try a whole bunch of these prompt engineering tricks and hope that you get lucky. But it's not really a very direct form of control.
Starting point is 00:22:01 And so I think of more control, at least in two different ways. One is to allow for prompts that are not just text, but are multimodal themselves. So for image generation, for example, instead of just text, it would be nice if I can kind of sketch out what I want to the composition of the scene to look like, and the model would be expected to kind of respect that. For video, instead of just text as input, maybe I can also provide an image as input so that I can tell the system that this is the kind of scene that I want. Maybe I can provide sort of a little audio clip as input to convey that this is the kind of audio or sound that I want associated method. Maybe I also bring in a short video clip and expect the model to sort of bring
Starting point is 00:22:41 in all of these different modalities in a reasonable way to generate a video. So that's one piece, where I can bring in more inputs as a way of more control. And the second piece is sort of the predictability part, that even if I bring in all of these modalities as input, if the model then goes off and kind of does its own thing with these inputs, maybe it's reasonable, but that's not what I'm looking for. What do I do? Like, do I just go back and try again?
Starting point is 00:23:07 It would be ideal if there's some way of having iterative editing mechanisms where whatever I get back, I have a way of communicating to the model, what it is that I want changed in what way, so that over iterations, I can get to the content that I intended in sort of a fairly reasonable way without having to sort of spend hours learning a new tool or something like that, right? So if that can be done in a very intuitive interface, I think that would be pretty awesome. Where do you think we will get to in terms of the frontier of like controls for video generation over the next couple years or five years?
Starting point is 00:23:40 I think control sort of tends to lag behind the core capability. like even with images, I feel like we first had to get to a point when these models can actually generate nice-looking images before we start worrying about whether it's really doing what I wanted it to do. And I feel like we're not quite there with video yet. So get random good first. Exactly, exactly, exactly. Like at least get random good first,
Starting point is 00:24:01 then maybe let me give it text, then let me give it these other prompts. So I do think we'll first probably see more progress in just the core capabilities of sort of text to video generation before we look at prompting. Although we are, and this is in the context of sort of me generating something from scratch, right, which is where I might want the citative control and things like that. A parallel scenario is where I already have a video and I'm trying to edit it an interesting way. I might want to stylize it and all of that.
Starting point is 00:24:30 I think we're already seeing that even in products with runway, for example, right? So I think that we'll probably see much more of, we're already seeing that and I think we'll see more of where you already have a video that you're starting with and then you're trying to edit it, which has similarities to, but is a little bit, is a little bit different in my mind compared to sort of generating something from scratch and wanting control over that. The other potential part of output for videos, obviously, is text to speech or some sort of voice or other ways to sort of accompany the video or animate it. What is your view in terms of the state of the art of tech to speech systems and how those are evolving? I think I haven't tracked
Starting point is 00:25:10 the text to speech quite as much. What I have tracked a little bit more closely is things like text to audio, where you might say that the sound of a car driving down the street. And what you expect is sort of a sound of a car driving down the street to be generated. And so there, the state of the art right now is sort of roughly sort of a few second to tens of seconds long audio. And I would say that roughly it probably works reasonably well. one in five times or so it's like because there aren't concrete metrics it's kind of hard to
Starting point is 00:25:46 articulate where state of the art is but hopefully this is this is helpful and i do think that audio added to visual content makes it much more expressive and much more delightful and i do think that it tends to be underinvested both for audio similarly for music i think it just makes the content much more expressive, much more delightful, but I feel like we don't do enough of that. Yeah, it's interesting too, because there are actually very large sound effect libraries out there. They're very well labeled as well in terms of what the exact sound effect is and the length and the components and all the rest. And so it's interesting that the state of the art hasn't quite caught up with, you know, what used to be a really interesting old business where
Starting point is 00:26:28 you generate an enormous amount of IP for different sound effects and then you just license them out. Yeah, yeah. Which it seems like eventually that industry is likely to go away. And even with audio, similar to what we were talking about with video, there is the, right, like the same kinds of challenges and dimensions exist that you want the piece to be longer. You may want compositionality, right? I might want to be able to say that, well, first, it's the car driving down the street, and then there is a sound of, I don't know, a baby crying and then something else.
Starting point is 00:26:54 And maybe I'm saying that two of these sounds are happening simultaneously, which is not, like, that's something that can happen in audio where you can have the super imposition. But in video is not something that would, where it's quite as, natural. And so all of that isn't stuff that these models can do very well right now. If I described a complex sequence of sounds, or if I try to talk about these different sounds simultaneously, these models can't do that very well. Where do you think we'll see the first application areas? Or what do you think are the first sort of use cases that we'll see immediately?
Starting point is 00:27:24 And then how does that evolve over time? Yeah. I think I'm not too much of a product person. So I feel like I don't know if I have the strongest intuitions there. But I think for kind of like I was touching on earlier, that a lot of these situations where we find ourselves searching for things to express ourselves, I think thinking of whether that can be generated so that it's a closer reflection of what you're trying to communicate as likely things that we'll see. And I know we're not talking about sort of LLMs and conversational agents and all of that too much, but I think AI agents is going to be a thing that we'll see a whole bunch of across many different surfaces. And then thinking about what media creation looks like in the context of AI agents is another dimension to this. Yeah, that makes a lot of sense. I mean, there's all sorts of obvious sort of near-term applications in terms of generating your own animated gifts or, to your point, midstream video editing or, you know, different types of shorter form animations or other things that you could do, marketing, et cetera.
Starting point is 00:28:26 And so, you know, it definitely feels like there's some near-term applications and some longer-term ones. And then the thing I always find interesting about these sorts of technologies is the spaces where they kind of emerge in a way that you don't quite expect but end up being a primary use case. You know, it's sort of the Uber version of the of the mobile revolution where you push a button in a stranger picks you up in a car and you're fine with it, right? And it feels like those sort of unexpected delightful experiences are going to be very exciting in terms of a lot of areas of this field. Yeah, yeah. And to your point that this technology is brand new, right? So it's not like there are existing product lines or ways of thinking about product that we can kind of directly plug into and kind of see, or did the metric go up, did the metric go down?
Starting point is 00:29:06 I think there's a lot of just kind of thinking about where do we anticipate people will be excited to use this. And as you said, I think there's a very good chance that there would be things that we don't necessarily foresee, but just kind of come up as very exciting spaces. I think an interesting cynicism has been that like there aren't that many artists out there or like people don't want to create imagery when when looking at some of these generative like much less video when looking at some of these generative technologies. But, you know, the, the recent history of social media would say that's like certainly not true, right? If you look at
Starting point is 00:29:46 Instagram democratizing photography or TikTok democratizing short form video creation by just like reducing the number of parameters of control, right? As you said, like, you know, sound makes video much richer, but it's also really hard to produce any one of these pieces. So you just take one control away and, like, you know, record with your phone and you get, you get something like TikTok. But I think it's really exciting, like the explosion and usage of things like mid-journey, right? Because the traction suggests there are an awful lot of people who are actually interested in generating high-quality imagery for a whole range of use cases, professional or not. Yeah, yeah. And I think there's people across.
Starting point is 00:30:27 the entire spectrum, right? On one hand, you can talk about artists who already had a voice, were already involved in sort of creating art. And then the other end of the spectrum are people who don't necessarily have the skills, may not have had the training, but are still interested in being able to express their voices a little bit more creatively than they would have otherwise. And so I do think that there is one question of whether or not artists want to be engaging
Starting point is 00:30:53 with this technology. And there is the other question of does it kind of just lift the tide for all of the rest of us to be able to be more expressive in what we can create and what we can communicate. And so I think both of those ends are relevant here. And with artists, there are artists who are, like, whose sort of brand is AI artists, where they are explicitly using AI as the tool of choice for expressing themselves and their entire practices around that, someone like Sophia Crespo or Scott Eaton and others.
Starting point is 00:31:25 So there's also, and this was before mid-jurney or anything like that, right? Like they've been doing this for years, even with like GANS, for example, that existed that were popular before diffusion models and all of that, yeah. You're an artist yourself, both, you know, digital, AI-driven, analog, some of it's behind you. Like, how does that impact, like, your view of this? I kind of always hesitate a little bit to call myself an artist. I feel like somebody else should be deciding whether I'm an artist or not, but then there's
Starting point is 00:31:54 whole community. We'll say you're an artist. By the way, we should mention some of your lovely macromay art is on the wall behind you as well. So I think it looks great. Thank you. Thank you. And yeah, so to be honest, I don't know if I, it's hard to kind of look back and get
Starting point is 00:32:11 a sense for did that play a certain role in it or not. I know for sure that it plays a role in just how excited I am about this technology, then anytime there's some new model out there, whether it's from the teams that I'm working with or if it's something external, I'm definitely very enthusiastic to want to try it out and see what it can do, what it can't do, and sort of tell people about it. And so just kind of my baseline level of excitement around this technology is high in part because of all these other interests that I have. I'm pretty sure that my emphasis on control is probably also coming from that, where I feel like I want to be using these tools to kind of have them do the thing that I want to do. and sort of text prompts are restrictive in that way. And, I mean, we talked about it in the context of control,
Starting point is 00:32:56 that if you can bring in multiple modalities as input, that definitely gives you more control. But it also means that there is more space to be creative, right? I can now pick interesting images or interesting videos or interesting pieces of audio and pair that up with like this really interesting text prompt and just kind of see what happens. Like if I put all of this in, you don't know what the model is necessarily going to do. And so it's also just more not.
Starting point is 00:33:18 norms to play with as you're trying to interact with these, that, yeah, there's just more space to be creative if there's more ways of more norms to control these models. Yeah, I was talking to Alex Isra, who's an L.A. based artist, and he's not a technical guy, but an amazing artist. And he was describing this new video project he wants to do that involves use of AI. And I was very inspired by, like, how specific the vision
Starting point is 00:33:48 was and like thinking through the implementation a little bit for somebody who doesn't come from the technical field. And I imagine there would be a whole crop of people who look at the capabilities as as another tool for expression. Yeah. Yeah. And so and there are some people who have a very specific vision and they just want the tool to kind of help them get there. And then there are others whose process involves sort of bringing the model along where the unpredictability and sort of not necessarily knowing what this model is going to generate is a part of their process and is a part of the final piece that they create. Some view them, view these models very much as tools,
Starting point is 00:34:23 and then others tend to view them as more of a collaborator in this process of creating. And it's always interesting to see what end of the spectrum, different people lie on. Okay, so as we're nearing the end of our time together, we want to run through a few rapid-fire questions, if that's okay. Maybe I'll start with, one, just given your breath in the field. Is there an area of...
Starting point is 00:34:47 image, audio, video generation, understanding, control that you feel like is just under explored for people looking for research problems? Yeah, so one is the control piece that we already talked about quite a bit. And I think the other is multi-modality, like bringing all of these modalities together. Right now we have models that can generate text. We have models that can generate images, models that can generate video. But there's no reason these all need to be independent. You can envision systems that are sort of ingesting all of these modalities. understanding all of it and generating all of these modalities. And I haven't, I'm starting to see some work in that direction,
Starting point is 00:35:24 but I haven't seen a whole lot of it that goes across many different modalities. You just got back from CVPR and presented there. Can you mention both what you were talking about and then sort of the project or work that most inspired you there? Yeah. So I was at CVPR. I was on a few different panels and I was giving some talks. So one of it was on vision language and communication.
Starting point is 00:35:47 creativity at the main conference. And so, yeah, that was kind of what I was representing there. In terms of something exciting that I saw there, not necessarily a paper, but there was a workshop there called scholars and big models, where the topic of discussion was, as these models are getting larger and larger, making a lot of progress in what way can sort of academics or labs that don't have sort of as many compute resources, what should their strategy be? how should they be approaching these things?
Starting point is 00:36:19 And that I thought was a really nice discussion. In general, I tend to enjoy venues that talk about kind of the meta, like the human aspects of the work that we do. We have a lot of technical conversations, but we don't tend to talk about these other components. And so that workshop is something that I enjoyed quite a bit. Is there a prediction you'd make about the impact of all these technologies on social media since you're working at the intersection?
Starting point is 00:36:43 Yeah, there's going to be more of it. That's one prediction that I can very confidently make that we are going to see all of these tools sort of show up where millions and billions of people can be using these in various forms. And I'm excited about that. Like I said, I think it kind of enhances creative expression and just sort of communication. And we're going to have these entirely new ways of interacting with each other. And even the social entities on these networks might change, right? Like when we talk about AI agents and you think about them being sort of part of the social
Starting point is 00:37:19 graph, what that does and how that changes, how we connect with each other, all of that is fascinating. And I can't, I can't wait to see how that involves. It actually feels a little bit under-discussed in terms of how much impact it's already had in some ways, even if it sometimes is transitory, like a lensa, where, you know, my understanding is tens of millions of people at least who ended up using that product. There's character in terms of the engagement there. One could argue majority for certain types of art you share with friends, et cetera. And so I feel like there's already these social expression modalities, bot-based interactions, et cetera, that are already impacting aspects of social or
Starting point is 00:37:56 communicative media in ways that people don't really recognize it as such. So to your point, it's very exciting to see where all this is heading. Yeah, yeah. Devin, you also write quite a bit online in terms of sort of giving back to the community and advice for researchers or young people newer to the field. You talk about time management and other topics. What's one one piece of wisdom you'd offer to people in terms of productivity and joy and AI research? Yeah, so time management is something that I'm quite excited about. And so yeah, if I have a blog post on that, it's sort of main philosophy is that you should be writing everything that you want to do down on your calendar. So it's kind of the point is that it should be on your calendar. It shouldn't be a
Starting point is 00:38:39 do-do list. And the reason it should be on your calendar is it forces you to think through how much time everything is going to take. If it's just a list, you have no idea how long it's going to take. And that's not a good way to plan your time out. So that's kind of the main thesis of it that I hadn't anticipated, but it resonated a whole lot with many people, which was kind of surprising when I put it out there. So yeah, if anyone's interested, you should check that out. In terms of advice outside of what I've written, one advice that has stuck with me over the years is don't self-select that if you want something, go for it. If you want a job, apply for it. If you want a fellowship, for any students who might be listening, just apply
Starting point is 00:39:16 for it. You want an internship, just apply for it. And yeah, like, don't assume, don't question, oh, am I good enough? Am I not? It's on the world to say no to you. If you are not a good fit, the world will tell you that. And so, yeah, there's nothing to lose by just kind of giving it a shot. So don't self-select. great note to end on. Devi, thank you so much for joining us on No Pryors. Thank you. Thank you for having me. Thanks for the time.
