No Priors: Artificial Intelligence | Technology | Startups - Teaching AI to Understand the Physical World, with Dr. Fei-Fei Li of World Labs

Starting point is 00:00:00 Hi, listeners, and welcome back to No Priors. Today's guest is Dr. Fei-Fei Li, a pioneer in computer vision and deep learning. She created ImageNet, the groundbreaking data set that helped spark the deep learning revolution. Fei-Fei is a Stanford professor and the co-director of the Stanford Institute for Human-Centered AI. She's also led AI at Google Cloud, advised international policy makers, and recently co-founded World Labs, a company dedicated to developing spatially intelligent AI. Fei-Fei, thank you for joining us today. Well, thanks for inviting me.

Starting point is 00:00:35 This is going to be fun. So you have made extraordinary contributions to science and policy over the past two decades. I'll start with the biggest question. Like, why start a company now? Because in my heart, I want to build. I see this as such a critical and fun and exciting moment to build some extraordinary technology that everybody can use. And I believe so much in spatial intelligence and the kind of 3D world models that can empower so many people as well as so many use cases.

Starting point is 00:01:09 And I think that's just, it's going to be really exciting. And I can do that with an extraordinarily brilliant group of young technologists. I want to come back to the people you're working with because I know some of your co-founders and was, you know, trying to convince them desperately to start a company a while back. And then they were like, oh, no, we have a bigger mission now with Fei-Fei. What is spatial intelligence? Can you define it for a broader audience? Spatial intelligence to me is the ability to understand, reason about, interact with, and generate 3D worlds. Because our world fundamentally, no matter how you say we can project it, fundamentally

Starting point is 00:01:57 is 3D. And it's 3D because physically it's 3D. And digitally, if there is a true 3D representation, then we can make a lot of things happen more easily, whether it's designing or creation or navigation or simulation or the experiencing of AR, VR, all this to me is part of spatial intelligence. And again, I think it's, what really excites me is humans have spatial intelligence. We are, it's part of our core intelligent capabilities. Animals have spatial intelligence. The entire journey of evolution also is deeply intertwined with the evolution of spatial intelligence. So it's so fundamental.

Starting point is 00:02:49 Without spatial intelligence, AI would be incomplete. How does that translate into what you're doing with your company? Or is there anything you can share in terms of what that means relative to what you're building? Yeah. So we're cracking one of the hardest problems in AI, which is actually making world models that are fundamentally 3D. Because once you can crack that problem, you can unlock a lot of spatial intelligence problems. So we are the first company we know of that is solving this, the 3D generation foundation model problem.

Starting point is 00:03:27 I have many questions, but since you are, you know, describing this first as 3D's criticality to just sort of understanding the world, does that imply you feel that the world models that World Labs will create, or others in academia or in companies will create, will someday be, like, you know, realistically accurate, like represent physics and understand the world, so that we can do many more things with them? Yeah, it should. It should be realistically accurate or plausible. So you can create a fantastical world, but it should be plausible because the geometry and the physics of it need to be plausible. And that is fundamental to spatial intelligence. Does that imply you have a

Starting point is 00:04:20 particular point of view from, like, a neuroscience perspective of, you know, how fundamental visual, I mean, you've always been a leader in computer vision, right, but on how important visual intelligence is versus, let's say, large language models and textual intelligence? I actually do. I think from a neural and cognitive science point of view that spatial intelligence is a really hard problem that evolution has had to solve for animals. And what's really interesting is I think animals have solved it to an extent, but not fully solved it. It's one of the hardest problems because what is the problem that an animal has to solve?

Starting point is 00:05:05 Animals have to evolve the capability of collecting light in something, which we call eyes mostly. And then with that collection of light, they have to reconstruct a 3D world in their mind somehow so that they can navigate and they can do things. And of course, they can interact. For humans, we're the most capable animal in terms of manipulation. We can do a lot of things. And all this is spatial intelligence. To me, that's just rooted in our intelligence.

Starting point is 00:05:44 What is interesting is it's not a fully solved problem, even in animals. For example, for humans, right, if I ask you to close your eyes right now and draw out or build a 3D model of the environment around you, it's not that easy. We don't have that much capability to generate extremely complicated 3D models until we get trained. You know, there are some of us, whether they're architects or designers or just people with a lot of training and a lot of talent. And that's a hard thing to do. And imagine you do it at your fingertips much more easily, with much more fluid interactivity and editability. That would just be a whole different world for people, no pun intended.

Starting point is 00:06:43 Are there other big areas like spatial intelligence that you feel haven't been as developed as they could be from a model perspective, or other sort of missing gaps that you think, in general, as we build this sort of AI future, we should focus on over time or people should build out? I was just wondering, in addition to sort of 3D and world generation and other big problems like that, because it feels like there are a few big things that we've solved for over time and other things we're working on. We've sort of solved language. I would say language is solved to a huge extent. And 3D to me is as critical and difficult as language. So what else is not solved? I mean, the entire space of emotional intelligence is something that I don't even know how to begin to solve.

Starting point is 00:07:31 I know a lot of people who haven't solved it. That's when AGI is achieved. Yeah, so that's another one. And I can tell you the training data for that is not going to come from Silicon Valley people. Don't underestimate Silicon Valley. I'll put myself in this bucket, but I think we probably need a broader set of people. Yeah, no, that I agree. But these are the three big buckets.

Starting point is 00:08:03 To be honest, I don't know, what do you think, Elad, in this area? I think it depends a lot on what you count in each model. So I agree with your framework in terms of those three. And then certain things like, you know, the spatial intelligence, I'm assuming, also delves into different types of physics simulation and simulations of the world. And, you know, those are big areas that I think a lot of people aren't working on that I think are really interesting or important. And there's sort of the macro and the microscale of that. The microscale eventually becomes material sciences and other very different types of things from what you're talking

Starting point is 00:08:33 about, where it's more molecular modeling or, yeah. Right. And also some of it goes outside the current definition of AI, though I do think they'll be empowered by AI. Of course, there's robotics, but robotics is very much a system integration problem as much as a, you know, even if you look at animals, it's not just the compute in the brain per se, right? Yeah, a lot of these things seem much more distributed in terms of spatial intelligence relative to specific systems the animals have. And in some cases, it's to your point, not as centralized as one would think. So it's very interesting to start thinking in terms of those models of more distributed intelligence across an organism versus the CNS.

Starting point is 00:09:12 But yeah, I think it's very interesting stuff. You've also done work in this field, Fei-Fei, of robotics and, like, physical intelligence. I think of the data hierarchy for, you know, robotics foundation models and actuation as, you know, people want to, of course, use video, right? Because that is what is available to us. There's a big question on, like, simulation and how much you can get from that today. Perhaps people do not see the future of, like, the quality and the physics that are going to be available to us.

Starting point is 00:09:42 And then there's, you know, close to embodied, like different forms of teleop and then, like, embodied data collection. Is that the hierarchy you have in your mind, or do you think people underestimate simulation and world models for the future? Yeah, great question. First of all, like you said, I do work in robotics, especially in my lab at Stanford. I have no doubt that humanity will move into an age where we cohabit with robots. And also the word robot does not mean a humanoid per se.

Starting point is 00:10:14 Robots take all kinds of forms and shapes. Actually, a few years ago, my lab wrote a really fun paper about morphological intelligence, where the morphology of an agent actually can change by optimizing for the tasks they're trying to achieve. So we should be a little more imaginative than just humanoid. Having said that, on how to train robots, you mentioned this whole data hierarchy. Some people call it data pyramids or data cakes or whatever. I agree. I think it's going to be a hybrid of many different forms of data. I also think simulation is underrated.

Starting point is 00:10:58 Actually, it's not underrated by a lot of experts and people in the field. If you look at a lot of robotics companies, they are working on simulated and synthetic data. I also think we have to be aware that, unlike language models or even spatial intelligence foundation models, robotics is a highly multimodal system. What is truly underappreciated, in my opinion, is haptics. There's so much there, especially if we want to do manipulation, not just navigation. I think haptics data and the ability to really integrate haptics into vision and perception and spatial data is absolutely critical. One thing that you said

Starting point is 00:11:53 that I thought was really interesting is how many different morphological forms a robot may adopt or adapt. And there's sort of two counterarguments people make in terms of the potential future. One argument is that from a supply chain perspective and managing builds and scale of manufacturing, you're going to have many fewer form factors. And the other argument is the economic value of specialization is very high. And therefore, there'll be, you know, thousands and thousands of different form factors as we move to sort of a robotic future. Do you have a point of view on sort of where we're likely to land between those two viewpoints? I think we're going to gradient-descend to an optimization of productivity and efficiency.

Starting point is 00:12:34 My hypothesis is that the requirements of different tasks are so vast that having very few forms or sticking with one form is energy inefficient. And a lot of tasks can be done and should be done by much more energy-efficient form factors. Just an extreme and trivial example: if we put robots underwater, they should not be in the shape of humans. They'd better be in the shape of fish, right? Just think about energy efficiency. And the same with flying.

Starting point is 00:13:15 I don't think human form is, our airplanes are becoming more and more robots. And so I do think there's going to be diversity. Robotics is one potential application for the future. You're a scientist first, but also, you know, you did the Twitter board and are involved in startups. What are the near-term commercial applications that you can imagine for generating 3D worlds? I believe creativity is a vastly exciting area where humans can be superpowered by AI and by spatial intelligence. And here I draw an analogy with software engineering.

Starting point is 00:13:59 If you look at today's success of LLMs in software engineering, including applications like Cursor and Windsurf and all that, what you see is a lot of collaboration between AI and humans, and then the collaboration comes in different levels of skill sets and all that. And I think creativity will be similar: whether we're talking about designers, 3D artists, VFX artists, or even marketing talents and game developers, there's so much need in designing and creating 3D space.

Starting point is 00:14:41 And this is fundamentally such a hard problem, even for the trained, skilled people, that having a collaborator will be extremely fun if we do it right. And so I see creativity as an area that is really exciting. I also do think that a lot of what we're waiting for for the metaverse or XR, AR, VR is content creation. I understand hardware itself needs to continue to evolve, but I also think, in software, we're looking for content creation, and that lends itself so naturally to 3D modeling and 3D or generative spatial models. And that's another interesting area to look into.

Starting point is 00:15:33 Do you have a strong point of view on whether or not world models are, like, an interesting answer to scalable RL for, like, more generalizable agents? I actually do think this. Like I said, AI is not complete without spatial intelligence because humans interact in 3D worlds. And in the digital world, we need all kinds of interaction. You know, take design as an example. When we are thinking about design, there's so much we are optimizing for in our mind's eye, whether it's beauty or efficiency or optimization or whatever it is.

Starting point is 00:16:17 And that lends itself pretty naturally to RL settings. What are the biggest challenges in, I guess, trying to go down this path of, you know, designing and training world models? I imagine one is, like, you worked on images, you worked on video, but we have images and we have video and we don't have lots of, you know, 3D worlds in a format like, I assume, you're building. Yeah, data is absolutely a challenge. You're totally right about that, you know, to create world models, 3D foundation models. We require more and more sophisticated data engineering, data acquisition, data processing and

Starting point is 00:16:57 data synthesis. So I am envious of my NLP LLM colleagues that the data is so abundant on the internet and we don't necessarily have that luxury. So that's definitely one challenge. Another challenge is that 3D, this is kind of ironic, right? Every one of us uses 3D every day, like in so many settings. Basically, you open your eyes and the whole life that you experience is 3D. Even when we type on the computer or stare at a screen all the time, yet it's still not as easy a form factor to deliver in the hands of people compared to language. Language is just so easy. And it's also a very active form. It's not a passive consumption of viewing.

Starting point is 00:17:57 Nobody wakes up and says, I'm just going to sit here and watch 3D, you know. So that creates challenges for productization and how to do it in the right way. Were you ever, like, a Second Life player? I'm not a gamer, but my kids love Minecraft. I was going to ask you if there was, like, a world that you want to experience or imagine? That's a great question, Sarah. You know, I would love to see worlds. I

Starting point is 00:18:29 love seeing worlds I don't see. For example, like zooming in and in and into, like, microscopic worlds or, you know, going to the inside of an engine, you know, knowing how the actual engine is. Of course I know theoretically how it works, but seeing it with my own eyes, experiencing it. Or even, you might laugh at this, I want to be inside a dishwasher and just experience what that is. All this can be done in a virtual way if we manage to create, you know, world models of anything. Okay. I think Elad and I both want to talk a little bit about your past career and maybe some insights for anyone doing research or trying to, you know, have an impact within AI. Right before this, I asked Andrej Karpathy what I should ask you. And he said, you know, Fei-Fei is really magic about ambition and thinking about data.

Starting point is 00:19:29 You should ask her about her PhD and the creation of the Caltech 101 data set with Pietro because it's instructive. So I have to ask you about that. You know, first of all I have to say it's always really the greatest thing when your student is more well-known and achieving so much more than you can. It makes me so proud. So very proud of Andrej. I'm surprised he remembers my PhD work. So yes, it's true.

Starting point is 00:20:02 It's, well, gosh, it goes back to 2003-ish, and the world was just barely scratching the surface of the Internet, and data was not much of a thing. But doing computer vision, my PhD work was really trying to get object recognition to work. That's the problem of calling out cats and dogs and microwaves and chairs and all that when you're presented with a picture.

Starting point is 00:20:29 And we were beginning to hypothesize that data matters, but we had no idea. There was no scaling law. We had no idea, you know, how far data can go. All we wanted is, if we have a machine learning algorithm, whether it's a neural network or a Bayes net, which at that time was very popular, or a support vector machine, we need some data to train. And there was no data to train. And as a PhD student, you want to, you know, graduate.

Starting point is 00:20:58 And Pietro is like, well, Fei-Fei, curate a data set. And, you know, I was thinking, yeah, I do need to curate a data set because every data set out there is so tiny. I'm just not convinced. And Pietro and I were just talking, you know, is there 15 different things or 30 different things? And then God forbid, the PhD advisor said the three-digit number, 100. And I was like, you know, that's a lot of work. But deep in my heart, I know he's right from a mathematical point

Starting point is 00:21:32 of view: in pushing the model to generalize, we need enough data at least. So, you know, I did write about this process in my book, The Worlds I See, that I stumbled upon a dictionary somehow, and it really was for my own English study. The dictionary, I think it's the Webster Dictionary, if I'm not wrong, just kind of randomly has a visual depiction of some words. I don't even know what rule they follow, to be honest. Some are flowers, some are bicycles, some are dogs. I was like, okay, this is actually, you can call it a cheat or a tool. I grabbed 101 of those words. And that really made my PhD advisor kind of chuckle because he's like,

Starting point is 00:22:23 ah, yeah, you just want to do one more than I asked for to, you know, dare me. So that's what I did. And I got to say that I still remember I downloaded or, you know, tried, you know, from Google. And Google was so new at that point. And Google image search was so terrible at that point, you know, compared to today. And I had to do so much cleaning. At some point, I got so desperate. I just asked my mom to do the image cleaning because I wrote a little interface on the computer.

Starting point is 00:22:59 She doesn't know computers, but at least she knows click, click. So she helped me to do some of that. I mean, you've had one of the most storied careers in AI. And to your point, many of your students have similarly gone on to do really great things across the field, across industry, across the world. What are two or three moments that you think of when you think back on your career to date? And obviously there's still a lot of career to come. But I'm just sort of curious. I mean, obviously there's a lot of things that you did in terms of sort of image and visual recognition related systems.

Starting point is 00:23:30 But I'm just sort of curious, like, when you think of the last 20 years, what stands out the most, just given everything that you've done. Oh, thank you for asking that question. Of course, ImageNet is one of those. ImageNet consists of multiple moments, from the early struggles and being told I would not get tenure, to actually realizing Amazon Mechanical Turk comes to the rescue, to the moment of AlexNet winning. And also, a couple of years ago, I was at an event in Toronto with Geoff Hinton. And he said publicly, like, how that was so defining. And he was almost a little bit apologetic that ImageNet was not

Starting point is 00:24:16 as recognized as neural networks. So that journey is very validating. And for scientists, the validation is not about recognition or awards. It's that you made a difference. Like that conjecture that no one believed in, that hypothesis that no one believed in, we were able to make it happen. So that's one thread.

Starting point is 00:24:39 Just to make sure for any, like, you know, people from the business world that are not familiar with it, ImageNet was a large-scale, is a large-scale data set with millions of labeled images across thousands of categories, not just 101, right? 15 million labeled images. 15 million labeled images. Thank you, Fei-Fei. That, you know, led to amazing breakthroughs in deep learning,

Starting point is 00:25:01 in particular, AlexNet, and lots of progress in the field of computer vision. Yeah, it drove a lot of machine learning forward. And I actually remember in 2016 or 2017, I used to show a slide, which was the history of AI, or, you know, back then it was CNNs and RNNs, and GANs were, you know, kind of just going. And I had ImageNet and AlexNet as, like, one of the seminal moments of, you know, this very small number of events that really define AI progress. And obviously now we have Transformers as part of that and maybe diffusion models or something.

Starting point is 00:25:29 But it was such a big breakthrough. Yeah, thank you. Another moment I'm very proud of was actually Andrej and also Justin Johnson and their dissertations. It's, in my opinion, the first time that language and images converged by captioning and writing stories of the visual world. It was significant for me for two reasons. One is that I literally thought, I kid you not, at the end of my PhD, I thought if I can live to 100 years old, that was the problem we might be able to solve, which is storytelling of pictures. So I entered my, my

Starting point is 00:26:15 my career, like my first year, um, assistant professor thinking, okay, I'm going to do image that to solve object recognition. And then I'm going to spend the rest of my entire career solving this problem of storytelling. And then by the time Andre and then a little later, Justin Johnson entered my lab, that was around 2013, 20, 14, the beginning of deep learning. And then suddenly the combination of sequential model at that point is LSTM. It's not transformer models, but LSTM and CNN just had this lasted open, the image captioning work. And Andrea and my work were the first together with Googles that was out of the door. And that was really, to me, it made me so proud, I almost had, it was made me so proud, I almost had a crisis, which is like, what am I going to do for the rest of my 70 years or 65 years?

Starting point is 00:27:25 So that was really exciting, how fast the field has, you know, evolved. Can I ask you one more question about this, just because you have, you know, made this amazing progress very efficiently, right? Like you and I have offline talked before about how you feel it's really important for there to be, you know, moonshots and creativity in AI research beyond, like, very large funded corporate labs, let's say. And, you know, you pointed to several moments that come from, like, creativity and research and academia. What advice do you have for people about whether or not there's still opportunity for that or, you know, it's all just $10 billion training runs from here? My singular advice, and I still say that in my company, in my lab, is be fearless.

Starting point is 00:28:20 I think scientists and technologists and entrepreneurs have to be fearless. You know, eventually you have to figure out, do you need $10 billion runs? And then you come to Sarah and ask for funding. Probably a lot, but both, yeah. Or you have to figure out, you know, I don't know, data. Sometimes fearless is this very interesting position where you're somewhat delusional and crazy, but somewhat just rationally bold.

Starting point is 00:28:56 And it kind of is in between, because if you're too rational, it's not courageous enough. You're not identifying problems. that are big enough. But if you're completely crazy, then I don't know. There's many things that can go wrong. So be fearless, be courageous. I, to me, that is, you know, even as old as I am,

Starting point is 00:29:24 that's how I feel I started my startup, World Labs: I want to be fearless and solve this problem of spatial intelligence. As part of problem solving, you've worked with some of the best AI researchers in the world over time and best engineers. How do you think about that in the context of your company? Like, what sorts of people are you trying to hire? Are there open roles currently? And obviously, it's an amazing team. I'm just curious, like, what sorts of folks you want to add and how you're thinking about that over time? Yes, we have open roles and we would love to hire the best engineers as well as product thinkers at this point

Starting point is 00:29:59 for our company. So if you're an engineer or AI researcher or product talent out there, passionate about joining the most talented team and solving this problem, please join us. So who do we hire? First of all, we really do hire for diversity of thinking. And this is where, you know, you call us an AI company, but if you look under the hood, we've got computer graphics experts, we've got computer vision experts, we've got data experts, we've got, you know, generative AI experts, we've got machine learning infra experts. We've got optimization. So it's actually really important to hire a diverse group of really talented people

Starting point is 00:30:47 because a problem as hard as spatial intelligence is not a homogeneous problem. Like, it takes talents of all kinds of backgrounds to solve it. And then I also just, you know, like, I look for fearlessness. Like, you know, we all have... How do you do that? Like, how do you identify if somebody has fearlessness in their background

Starting point is 00:31:08 or in their thinking processes? It's in their background. You talk to them. You can sense someone is fearless. You know, you can sense what drives them. You know, you can sense the questions they ask. If they start asking you a lot of things about, I don't know,

Starting point is 00:31:31 how to get this done. And I mean, of course you have to ask those questions because you want to get it done. But if you sense that it comes from the point of view of being scared of solving that, then that's not fearlessness. But those fearless people, they are creative, they're ambitious. They're not afraid of the uncertainty or the unknown. And I really love that.

Starting point is 00:32:05 Well, I think Elad and I, you know, we try to make a business of doing business with fearless people and hopefully those that are technically creative. One last broader question for you, because I think an important part of your work has been also thinking how to bring more people into AI, you know, co-directing the Stanford Institute for Human-Centered Artificial Intelligence. What is your most, like if you picture, you know, not to use a

Starting point is 00:32:36 pun on the book, but if you picture the world, like, several years out from your last set of predictions, what's your most optimistic view of what human-centered AI looks like? Yeah, thanks for asking. In fact, that is another point of my career I feel very proud of: the founding of the Human-Centered AI Institute, HAI, and also the continued movement towards that way of thinking. I think I want to build a world where AI collaborates with and superpowers people. I still believe our world, our human world, needs to be human-centered, you know, where love, relationship, just prosperity across, you know, all communities. These are really important justice

Starting point is 00:33:22 and these are really important values. And I don't think any piece of machinery, whether it's AI or airplane or biotech, should take those away. But with those critical values in mind, having AI to superpower us is really, really important because there's so many unsolved problems. One application area I had worked on is healthcare, for example, at Stanford, right? If you look at healthcare, from drug discovery to cure diseases, to diagnosis that can reach all people in the world, to treatment that can be accessible to all people in the world, to the whole healthcare delivery, how to make aging better, how to take care of chronic diseases, how to deal with mental health, in all of this, we do not have an issue of excessive humans or anything. We're lacking help. You know, we are lacking scientific discovery. We're lacking diagnosis. We're lacking precision

Starting point is 00:34:33 medicine. We're lacking safer and more effective ways of healthcare delivery and aging help and all that. And that's what I believe. I think AI is a tool to help people. Yeah, I think Elad and I are collectively invested in a series of companies that I hope will be useful here, from Abridge to OpenEvidence to Latent. But as you said, there's a huge spectrum of problems. And honestly, I've been less optimistic about the adoption of, you know, generally technology in healthcare for the last 15 years. But it does feel like this time it's different. And actually, it's just massively net good here. Yeah, I actually started a digital health company before this. And my hope is finally a lot of the things that people have been

Starting point is 00:35:17 talking about for decades will come to fruition, and it seems like AI is a great delivery mechanism for that. Totally. Totally. Well, thank you so much, Fei-Fei. It's fantastic. This has been inspiring and great to hear a little bit more about World Labs as well. Thank you.

Starting point is 00:35:32 Thank you, Elad. Thank you, Sarah. Find us on Twitter at @NoPriorsPod. Subscribe to our YouTube channel if you want to see our faces, follow the show on Apple Podcasts, Spotify, or wherever you listen. That way you get a new episode every week. And sign up for emails or find transcripts for every episode at no-priors.com.

