No Priors: Artificial Intelligence | Technology | Startups - The Robotics Revolution, with Physical Intelligence’s Cofounder Chelsea Finn
Episode Date: March 20, 2025

This week on No Priors, Elad speaks with Chelsea Finn, cofounder of Physical Intelligence and currently Associate Professor at Stanford, leading the Intelligence through Learning and Interaction Lab. ...They dive into how robots learn, the challenges of training AI models for the physical world, and the importance of diverse data in reaching generalizable intelligence. Chelsea explains the evolving landscape of open-source vs. closed-source robotics and where AI models are likely to have the biggest impact first. They also compare the development of robotics to self-driving cars, explore the future of humanoid and non-humanoid robots, and discuss what’s still missing for AI to function effectively in the real world. If you’re curious about the next phase of AI beyond the digital space, this episode is a must-listen.

Sign up for new podcasts every week. Email feedback to show@no-priors.com

Follow us on Twitter: @NoPriorsPod | @Saranormous | @EladGil | @ChelseaFinn

Show Notes:
0:00 Introduction
0:31 Chelsea’s background in robotics
3:10 Physical Intelligence
5:13 Defining their approach and model architecture
7:39 Reaching generalizability and diversifying robot data
9:46 Open source vs. closed source
12:32 Where will PI’s models integrate first?
14:34 Humanoid as a form factor
16:28 Embodied intelligence
17:36 Key turning points in robotics progress
20:05 Hierarchical interactive robot and decision-making
22:21 Choosing data inputs
26:25 Self-driving vs. robotics market
28:37 Advice to robotics founders
29:24 Observational data and data generation
31:57 Future robotic forms
Transcript
Hi, listeners. Welcome to No Priors. This week we're speaking to Chelsea Finn,
co-founder of Physical Intelligence, a company bringing general-purpose AI into the physical world.
Chelsea co-founded Physical Intelligence alongside a team of leading researchers and minds in the field.
She's an associate professor of computer science and electrical engineering at Stanford University,
and prior to that, she worked at Google Brain and was at Berkeley.
Chelsea's research focuses on how AI systems can acquire general purpose skills through interactions
with the world. So Chelsea, thank you so much for joining us today on No Priors. Yeah, thanks for having
me. You've done a lot of really important, storied work in robotics, between your work at Google,
at Stanford, et cetera. So I would just love to hear a little bit firsthand about your background in terms
of your path in the world of robotics, what drew you to it initially and some of the work that
you've done. Yeah, it's been a long road. At the beginning, I was really excited about the impact
that robotics could have in the world, but at the same time, I was also really fascinated by
this problem of developing perception and intelligence in machines. And robots embody all of that.
And also, sometimes there's some cool math that you can do as well that keeps your brain active and makes you think.
And so I think all of that is really fun about working in the field. I started working more seriously in robotics more than 10 years ago at this point at the start of my PhD at Berkeley.
and we were working on neural network control, trying to train neural networks that map from image pixels directly to motor torques on a robot arm. At the time, this was not very popular, and we've come a long way; it's a lot more accepted in robotics now and also just generally something that a lot of people are excited about.
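To make that concrete, here is a minimal sketch of a pixels-to-torques policy of the kind described, assuming PyTorch, purely illustrative layer sizes, and a 7-joint arm; it is not the actual architecture from that early work.

```python
# A minimal sketch (not the actual architecture) of a visuomotor policy that maps
# raw image pixels directly to motor torques for a robot arm. Sizes are illustrative.
import torch
import torch.nn as nn

class PixelsToTorquesPolicy(nn.Module):
    def __init__(self, num_joints: int = 7):
        super().__init__()
        # Small convolutional encoder over RGB images.
        self.encoder = nn.Sequential(
            nn.Conv2d(3, 32, kernel_size=5, stride=2), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=3, stride=2), nn.ReLU(),
            nn.Conv2d(64, 64, kernel_size=3, stride=2), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        )
        # MLP head producing one torque command per joint.
        self.head = nn.Sequential(
            nn.Linear(64, 128), nn.ReLU(),
            nn.Linear(128, num_joints),
        )

    def forward(self, image: torch.Tensor) -> torch.Tensor:
        # image: (batch, 3, H, W) in [0, 1]; returns (batch, num_joints) torques.
        return self.head(self.encoder(image))

# Example: one 128x128 camera frame in, seven joint torques out.
torques = PixelsToTorquesPolicy()(torch.rand(1, 3, 128, 128))
```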
Since that beginning point, it was very clear to me that we could train robots to do pretty cool things, but that getting the robot to do one of those things in many scenarios with many objects was a major, major challenge.
So 10 years ago, we were training robots to screw a cap onto a bottle and use a spatula
to lift an object into a bowl and kind of do a tight insertion or hang up like a hanger on
a clothes rack.
And so pretty cool stuff.
But actually getting the robot to do that in many environments with many objects, that's
where a big part of the challenge comes in.
And I've been thinking about ways to make broader data sets, train on those broader data sets, and also different approaches for learning, whether it be reinforcement learning, video prediction, imitation learning, all those things. And so, yeah, I spent a year at Google Brain in between my PhD and joining Stanford, became a professor at Stanford, started a lab there, did a lot of work along all these lines, and then recently started Physical Intelligence, almost a year ago at this point. So I've been on leave from Stanford for that. And it's been really exciting to be able to try to execute on the vision that we co-founders collectively have and do it with a lot of resources and so forth. And I'm also still advising students at Stanford as
well. That's really cool. And I guess you started Physical Intelligence with four other co-founders and an incredibly impressive team. Could you tell us a little bit more about what Physical Intelligence is working on and the approach that you're taking? Because I think it's a pretty unique slant on the whole field and approach. Yeah. So we're trying to build a big neural network model that could ultimately
control any robot to do anything in any scenario. And like a big part of our vision is that
in the past, robotics has focused on trying to go deep on one application, developing a robot to do one thing, and then ultimately getting kind of stuck in that one application. It's really hard to solve one thing and then try to get out of that and broaden. And instead, we're really
in it for the long term to try to address this broader problem of physical intelligence in
the real world. We're thinking a lot about generalization, generalists, and unlike other robotics
companies, we think that being able to leverage all of the possible data is very important.
And this comes down to actually not just leveraging data from one robot, but from any robot platform
that might have six joints or seven joints or two arms or one arm. We've seen a lot of evidence that you could actually transfer a lot of rich information across these different embodiments, and that allows you to use that data. And also, if you iterate on your robot platform, you don't
have to throw all your data away. I have faced a lot of pain in the past where we got a new version
of the robot and then your policy doesn't work. And it's a really painful process to try to get
back to where you were on the previous robot iteration. So yeah, we're trying to build generalist robots and essentially develop foundation models that will power the next generation of robots in the real world.
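One common way to pool data across embodiments with different action dimensions, shown here purely as an illustrative sketch and not necessarily Physical Intelligence's method, is to pad every action into a shared maximum-dimension vector with a validity mask.

```python
# An assumed, illustrative scheme for pooling data from robots with different numbers
# of joints: pad each action vector to a shared maximum dimension and keep a mask
# saying which entries are real.
import numpy as np

MAX_ACTION_DIM = 14  # e.g. enough for a two-arm robot with seven joints per arm

def to_shared_action_space(action: np.ndarray) -> tuple[np.ndarray, np.ndarray]:
    """Pad a robot-specific action (6, 7, 14, ... dims) into a shared format."""
    padded = np.zeros(MAX_ACTION_DIM, dtype=np.float32)
    mask = np.zeros(MAX_ACTION_DIM, dtype=np.float32)
    padded[: action.shape[0]] = action
    mask[: action.shape[0]] = 1.0
    return padded, mask

# A 6-joint arm and a 7-joint arm now produce actions the same model can consume.
six_dof, six_mask = to_shared_action_space(np.random.randn(6))
seven_dof, seven_mask = to_shared_action_space(np.random.randn(7))
```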
That's really cool because, I mean, I guess there's a lot of sort of parallels to the large
language model world where, you know, really a mixture of deep learning, the transformer architecture, and scale has really proven out that you can get real generalizability and different forms of transfer between different areas.
Could you tell us a little bit more about the architecture you're taking or the approach
or, you know, how you're thinking about the basis for the foundation model that you're developing?
At the beginning, when we were just getting off the ground, we were trying to scale data collection. And a big part of that is, unlike in language, we don't have Wikipedia or an internet of robot motions. We're really excited about scaling data on real robots in the real world. This kind of real data is what has fueled machine learning advances in the past, and a big part of that is we actually need to collect that data. And that looks like teleoperating robots in the physical world. We're also exploring other ways
of scaling data as well, but the kind of bread and butter is scaling real robot data.
We released something in late October where we showed some of our initial efforts around
scaling data and how we can learn very complex tasks of folding laundry, cleaning tables,
constructing a cardboard box.
Now, where we are in that journey is really thinking a lot about language interaction
and generalization to different environments.
So what we showed in October was the robot in one environment, and it had data in that
environment.
We were able to see some amount of generalization, so it was able to fold shirts that it had never seen before, fold shorts that it had never seen before, but the degree of generalization was very
limited.
And you also couldn't interact with it in any way.
You couldn't prompt it and tell it what you want it to do beyond kind of fairly basic things
that it saw in the training data.
And so being able to handle lots of different prompts and lots of different environments is a big focus right now. And in terms of the architecture,
we're using transformers. And we are using pre-trained models, pre-trained vision language models.
And that allows you to leverage all of the rich information in the internet. We had a research
result a couple years ago where we showed that if you leverage vision language models, then you
could actually get the robot to do tasks that require concepts that were never in the robot's training data but were on the internet. One famous example is that you can ask the robot to pass the Coke can to Taylor Swift, or to a picture of Taylor Swift, and the robot has never seen Taylor Swift in person, but the internet has lots of images of Taylor Swift in it. And you can leverage all of the information in that data, and then the weights of the pre-trained model, to kind of transfer that to the robot. So we're not starting from scratch, and that helps a lot as well. So that's a little
bit about the approach, happy to dive deeper as well. That's really amazing. And then what do you
think is the main basis then for really getting to generalizability? Is it scaling data further? Is it scaling compute? Is it a combination of the two? Is it other forms of post-training? I'm just sort of
curious, like, as you think through the common pieces that people look at now, I'm sort of curious
what you think needs to get filled in. Obviously, on the, again, the more language model world,
people are spending a lot of time on reasoning modules and other things like that as well. So
I'm curious, like, what are the components that you feel are missing right now? Yeah, so I think
the number one thing, and this is kind of the boring thing, is just getting more diverse robot
data. So for that release that we had in late October last year, we collected data in three buildings, technically.
The internet, for example, and everything that has fueled language models and vision models
is way, way more diverse than that, because the internet is pictures that are taken by lots of people
and text written by lots of different people.
And so just trying to collect data in many more diverse places and with many more objects,
many more tasks.
So scaling the diversity of the data, not just the quantity of the data is very important.
And that's a big thing that we're focusing on right now, actually bringing our robots into lots of different places and collecting data in them.
As a side product of that, we also learn what it takes to actually get your robot to be operational
and functional in lots of different places.
And that is a really nice byproduct because if you actually want to get robots to work in
the real world, you need to be able to do that.
So that's the number one thing.
But then we're also exploring other things, leveraging videos of people.
Again, leveraging data from the web, leveraging pre-trained models, thinking about reasoning,
although more basic forms of reasoning, in order to, for example, put a dirty shirt into a hamper.
If you can recognize where the shirt is and where the hamper is and what you need to do to accomplish
that task, that's useful, or if you want to make a sandwich, and the user has a particular
request in mind, you should reason through that request. If they're allergic to pickles, you
probably shouldn't put pickles on the sandwich, things like that. So there's some basic things
around there, although the number one thing is just more diverse robot data. And then I think a lot of the approach you've taken to date has really been an emphasis on releasing open-source models and packages for robotics. Do you think that's the long-term path? Do you think it's open core? Do you think it's eventually proprietary models? Or how do you think about that in the context of the industry? Because it feels like there's a few different robotics companies now, each taking a different approach in terms of either hardware only, excuse me, hardware plus software, where they're focused on a specific hardware footprint. There's software only, and there's closed source versus open source if you're just doing the software. So I'm sort of curious where in that spectrum Physical Intelligence lies.
Definitely. So we've actually been quite open. Not only have we open-sourced some of the weights and released details and technical papers, we've actually also been working with hardware companies and giving designs of robots to hardware companies. And some people, when I tell them that, are sometimes really shocked, like, what about the IP? What about, I don't know, confidentiality and stuff like that? And we've actually made a very
intentional choice around this. There's a couple of reasons for it. One is that we think that the
field, it's really just the beginning. And these models will be so, so much better. And the robots
should be so, so much better in a year, in three years. And we want to support the development of the research. And we want to support the community and support the robots, so that when we hopefully develop the technology of these generalist models,
the world will be more ready for it.
We'll have better, like, more robust robots that are able to leverage those models,
people who have the expertise and understand what it requires to use those models.
And then the other thing is also, like, we have a really fantastic team of researchers
and engineers and really, really fantastic researchers and engineers want to work at companies
that are open, especially researchers, where they can get kind of credit for their work and
share their ideas, talk about their ideas.
And we think that having the best researchers and engineers will be necessary for solving this
problem.
The last thing that I'll mention is that I think the biggest risk with this bet is that it won't
work.
Like, I'm not really worried about competitors.
I'm more worried that no one will solve the problem.
Oh, interesting.
And why do you worry about that?
I think robotics is, it's very hard.
And there have been many, many failures in the past. And unlike when you're recognizing an object in an image, there's very little
tolerance for error. You can miss a grasp on an object, or, like, the difference between making contact and not making contact with an object is so small, and it has a massive impact on the outcome of whether the robot can actually successfully manipulate the
object. And I mean, that's just one example. There's challenges on the data side of collecting
data. Well, just anything involving hardware is hard as well.
I guess we have a number of examples now of robots in the physical world.
You know, everything from autopilot on a jet on through to some forms of pick and pack
or other types of robots in distribution centers.
And there's obviously the different robots involved with manufacturing, particularly in automotive.
So there's been a handful of more constrained environments where people have been using them
in different ways.
Where do you think the impact of these models will first show up?
Because to your point, there are certain things where you have very low tolerance for error. And there's a lot of fields where actually it's okay, or maybe you can constrain
the problem sufficiently relative to the capabilities of the model that it works fine. Where do you
think physical intelligence will have the nearest term impact or in general the field of robotics
and these new approaches will substantiate themselves? Yeah, as a company, we're really focused
on the long-term problem and not on any one particular application, because of the failure modes that can come up when you focus on one application. I don't know where the first applications will be. I think one thing that's actually challenging is that typically in machine learning, in a lot of the successful applications, like recommender systems, language models, image detection, a lot of the consumers of the model outputs are actually humans who can check it, and the humans are good at the thing. A lot of the very natural applications of robots are actually the robot doing something autonomously on its own, where it's not a human consuming the commanded arm position, for example, and then checking it and validating it and so forth.
And so I think we need to think about new ways of having some kind of tolerance for mistakes
or scenarios where that's fine or scenarios where humans and robots can work together.
That's, I think, one big challenge that will come up when trying to actually deploy these.
And some of the language interaction work that we've been doing is actually motivated by this challenge
where we think it's really important for humans to be able to provide input for how they want
the robot to behave and what they want the robot to do, how they want the robot to help
in a particular scenario. That makes sense. I guess the other form of generalizability, to some
extent, at least in our current world, is the human form, right? And so some people are specifically
focused on humanoid robots like Tesla and others under the assumption that the world is designed for
people, and therefore it's the perfect form factor to coexist with people. And then other people
have taken very different approaches in terms of saying, well, I need something that's more specialized
for the home in certain ways or for factories or manufacturing or you name it. What is your view on
kind of humanoid versus not?
On one hand, I think humanoids are really cool, and I have one in my lab at Stanford.
On the other hand, I think that they're a little overrated.
And one way to practically look at it is I think that we're generally fairly bottlenecked
on data right now.
And some people argue that with humanoids, you can maybe collect data more easily because
it matches the human form factor.
And so maybe it'd be easier to mimic humans.
And I've actually heard people make those arguments.
But if you've ever actually tried to teleoperate a humanoid, it's actually a lot harder to teleoperate than a static manipulator or a mobile manipulator with wheels.
Optimizing for being able to collect data, I think is very important because if we can get to the point where we have more data than we could ever want, then it just comes down to research and compute and evaluations.
And so that's one of the things we're optimizing for. We're using cheap robots, robots that we can very easily develop teleoperation interfaces for, in which you can do teleoperation very quickly and collect diverse data, collect lots of data.
Yeah, it's funny, there was that viral fake Kim Kardashian video of her going shopping with a robot following her around carrying all of her shopping bags. When I saw that, I really wanted a humanoid robot to follow me around; that would be really funny to do. So I'm hopeful that someday I can use your software to cause a robot to follow me around to do things. So
exciting future. How do you think about the embodied model of development versus not on some of these things? That's another sort of, I think, set of trade-offs that some people are making or deciding between. A lot of the AI community is very focused on just, like,
language models, vision language models and so forth. And there's like a ton of hype around
like reasoning and stuff like that. Oh, let's create like the most intelligent thing.
I feel like actually people underestimate how much intelligence goes into motor control.
Many, many years of evolution is what led to us being able to use our hands the way that we do.
And there are many animals that can't do it, even though they had so many years of evolution.
And so I think that there's actually so much complexity and intelligence that goes into being able to do something as basic as like make a bowl of cereal or pour a glass of water.
And yeah, so in some ways I think that embodied intelligence or physical intelligence is actually very core to intelligence and maybe kind of underrated compared to some of the less embodied models.
One of the papers that I really loved over the last couple years in robotics was your ALOHA paper.
And I thought it was a very clever approach.
What is some of the research over the last two or three years that you think has really caused
this flurry of activity?
Because I feel like there's been a number of people now starting companies in this area because
a lot of people feel like now is the time to do it.
And I'm a little bit curious what research you feel was the basis for that shift in people thinking this was a good place to work.
At least for us, there were a few things that felt like turning points, where it felt like the field was moving a lot faster compared to where it was before.
One was the SayCan work, where we found that you can plan with language models
as kind of the high-level part and then kind of plug that in with a low-level model
to get a model to do long horizon tasks.
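The high-level-planner-plus-low-level-policy pattern described here might look roughly like the following sketch, where both model calls are stand-in stubs rather than real APIs or any published system's actual code.

```python
# A rough sketch of the hierarchy: a high-level model breaks an open-ended request
# into short language steps, and a low-level policy turns each step plus the current
# camera image into a short chunk of motor commands. Both functions are placeholders.
from typing import List

def high_level_plan(user_prompt: str) -> List[str]:
    # Stand-in for a language/vision-language model that reasons over the prompt,
    # e.g. "make me a vegetarian sandwich, no pickles" -> ordered steps.
    return ["pick up the bread", "pick up the lettuce", "place lettuce on bread"]

def low_level_policy(step: str, image) -> List[float]:
    # Stand-in for a learned policy that outputs the next short chunk of motor
    # commands conditioned on the current step and camera image.
    return [0.0] * 14

def run_task(user_prompt: str, get_camera_image) -> None:
    for step in high_level_plan(user_prompt):
        # A real system would re-plan and accept corrections between chunks.
        actions = low_level_policy(step, get_camera_image())
        print(step, "->", len(actions), "motor commands")

run_task("make me a vegetarian sandwich, no pickles", get_camera_image=lambda: None)
```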
One was the RT-2 work, which showed that you could do the Taylor Swift example that I mentioned earlier and be able to plug in
kind of a lot of the web data and get better generalization on robots.
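As a rough illustration of that pattern, a pre-trained vision-language backbone with a small action head trained on robot data, here is a hedged sketch in which the backbone is a placeholder module, not RT-2 or any real pre-trained model.

```python
# A hedged sketch of the general pattern: start from a pre-trained vision-language
# backbone and add a small action head that outputs robot actions. The backbone is a
# placeholder standing in for a real pre-trained VLM.
import torch
import torch.nn as nn

class PlaceholderVLMBackbone(nn.Module):
    """Stands in for a pre-trained model that embeds an (image, text) pair."""
    def __init__(self, embed_dim: int = 512):
        super().__init__()
        self.image_proj = nn.Sequential(nn.Flatten(), nn.LazyLinear(embed_dim))
        self.text_proj = nn.Embedding(1000, embed_dim)  # toy token embedding

    def forward(self, image: torch.Tensor, token_ids: torch.Tensor) -> torch.Tensor:
        return self.image_proj(image) + self.text_proj(token_ids).mean(dim=1)

class VLMActionPolicy(nn.Module):
    def __init__(self, backbone: nn.Module, action_dim: int = 14):
        super().__init__()
        self.backbone = backbone                       # ideally loaded with pre-trained weights
        self.action_head = nn.Linear(512, action_dim)  # small head trained on robot data

    def forward(self, image, token_ids):
        return self.action_head(self.backbone(image, token_ids))

policy = VLMActionPolicy(PlaceholderVLMBackbone())
actions = policy(torch.rand(1, 3, 64, 64), torch.randint(0, 1000, (1, 8)))
```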
A third was our RT-X work, where we actually were able to train models across robot embodiments and, significantly, we basically took all the robot data that different
research labs had.
It was a huge effort to aggregate that into a common format and train on it.
And when we trained on that, we actually found that we could take a checkpoint, send that model checkpoint to another lab halfway across the country,
and the grad student at that lab could run the checkpoint on their robot,
and it would actually more often than not do better than the model that they had specifically iterated
on themselves in their own lab.
And that was like another big sign that, like, this stuff is actually starting to work
and that you can get benefit by pooling data across different robots.
And then also, like you mentioned, I think the ALOHA work and later the Mobile ALOHA work showed that you can teleoperate and train models to do pretty complicated dexterous manipulation tasks. We also had a follow-up paper with the shoelace tying. That was a fun project because
someone said that they would retire if they saw a robot tie shoe laces. So did they retire? They did not
retire. Oh, that's awful. We need to force them into retirement. Whoever that person is, we need to
follow up on that. Yeah. So I think those are a few examples. And so yeah, I think we've seen a ton of
progress in the field. Also, it seems like after we started Pi, that was also kind of a signal to others that if the experts are really willing to bet on this, then maybe something will happen. So one thing that you all came out with today from Pi was what you
call a hierarchical interactive robot, or Hi Robot. Can you tell us a little bit more about that?
So this is a really fun project. There's two things that we're trying to look at here. One is that
if you need to do like a longer horizon task, meaning a task that might take minutes to do,
then if you just train a single policy to like output actions based on images,
like if you're trying to make a sandwich and you train a policy that's just outputting the next motor command,
that might not do as well as something that's actually kind of thinking through the steps to accomplish that task.
That was kind of the first component.
That's where the hierarchy comes in.
And the second component is a lot of the times when we train robot policies, we're just saying, like, we'll take our data, we'll annotate it and say, like, this is picking up the sponge.
This is putting the bowl in the bin.
This segment is, I don't know, folding the shirt.
And then you get a policy that can, like, follow those basic commands of, like, fold the shirt or pick up the cup, those sorts of things.
But at the end of the day, we don't want robots just to be able to do that.
We want them to be able to interact with us where we can say, like, oh, I'm a vegetarian.
Can you make me a sandwich?
Oh, and I'm allergic to pickles, so, like, maybe don't include those. And maybe you'd also be able to interject in the middle and say, like, oh, hold off on the tomatoes or something.
It's actually kind of a big gap between something that can just follow, like, an instruction, like, pick up the cup,
and something that's able to handle those kinds of prompts and those situated corrections and so forth.
And so we developed a system that basically has one model that takes as input the prompt and kind of reasons through it, and is able to output the next step that the robot should follow; for example, it might tell it that the next thing will be to pick up the tomato. And then a lower-level model takes as input "pick up the tomato" and outputs the sequence of motor commands for the next half second or so. That's the gist of it. It was a lot of fun because we actually got the robot
to make a vegetarian sandwich or a ham and cheese sandwich or whatever. We also did a grocery
shopping example and a table cleaning example. And I was excited about it, first because it was just cool to see the robot be able to respond to different prompts and do these challenging tasks, and second, because it actually seems like the right approach for solving the
problem. On the technical capability side, one thing I was wondering about a little bit was
if I look at the world of self-driving, there's a few different approaches that are being taken.
And one of the approaches, the more kind of Waymo-centric one, is really incorporating a variety of other types of sensors besides just vision, with LIDAR and a few other things as ways to augment the self-driving capabilities of a vehicle.
Where do you think we are in terms of the sensors that we use in the context of robots?
Is there anything missing?
Is there anything we should add?
Or there are types of inputs or feedback that we need to incorporate that haven't been
incorporated yet?
So we've gotten very far just with vision, with RGB images even.
And we typically will have one or multiple external kind of what we call base cameras
that are looking at the scene.
And also cameras mounted to each of the wrists of the robot. We can get very, very far with that. I would love it if we could give
our robot skin. Unfortunately, a lot of the tactile sensors that are out there are either far less
robust than skin, far more expensive, or very, very low resolution. So there's a lot of kind of challenges
on the hardware side there. And we found that actually mounting RGB cameras to the wrists ends up being very, very helpful and probably gives you a lot of the same information
that tactile sensors can give you.
Because when I think about the set of sensors that are incorporated into a person, obviously
to your point, there's the tactile sensors effectively, right?
And then there's heat sensors.
There's actually a variety of things that are incorporated that people usually don't
really think about much.
Absolutely.
And I'm just sort of curious, like, how many of those are actually necessary in the context
of robotics versus not?
What are some of the things we should think about?
Like, just if we extrapolate off of humans or animals or other, you know.
It's a great question.
I mean, for the sandwich making, you could argue that you'd want the robot to be able
to taste the sandwich to know if it's good or not.
Or smell it at least, you know.
Yeah, I've made a lot of arguments for smell to Sergey in the past, because there's a lot of nice things about smell, although we've never actually attempted it before.
Yeah.
In some ways, the redundancy is nice.
With audio, for example, as a human, if you hear something that's unexpected, it can actually kind of alert you to something. In many cases, it might actually be very, very redundant with your other sensors, because you might be able to actually see something fall, for example, and that redundancy can lead to robustness. For us, it's currently not a priority to look into these sensors, because we think that the bottleneck right now is elsewhere: it's on the data front, it's on kind of the architectures and so forth.
The other thing that I'll mention is that, actually, our policies right now do not have any memory; they only look at the current image frame. They can't remember even
half a second prior. And so I would much rather add memory to our models before we add other sensors.
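One simple way to give a policy short-term memory, shown here only as an assumed illustration rather than Physical Intelligence's plan, is to stack the last few camera frames into the observation.

```python
# An illustrative frame-stacking buffer: the policy sees the last few frames together
# instead of only the current one. History length and shapes are arbitrary choices.
from collections import deque
import numpy as np

class FrameHistory:
    def __init__(self, history_len: int = 4):
        self.frames = deque(maxlen=history_len)

    def push(self, frame: np.ndarray) -> np.ndarray:
        """Add the newest frame and return the stacked observation."""
        self.frames.append(frame)
        # Repeat the oldest frame until the buffer is full (e.g. at episode start).
        while len(self.frames) < self.frames.maxlen:
            self.frames.appendleft(self.frames[0])
        return np.concatenate(list(self.frames), axis=-1)

# Each camera frame is (H, W, 3); the policy now sees (H, W, 3 * history_len).
history = FrameHistory()
stacked = history.push(np.zeros((128, 128, 3), dtype=np.float32))
```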
We can have commercially viable robots for a number of applications without other sensors.
What do you think is a time frame on that?
I have no idea. Yeah. There are some parts of robotics that make it easier than self-driving and some parts that make it harder. On one hand, it's harder because it's just a much higher dimensional space. Even our static robots have 14 dimensions, seven for each arm. You need to be
more precise in many scenarios than driving. We also don't have as much data right off the bat.
On the other hand, with driving, I feel like you kind of need to solve the entire distribution
to have anything that's viable. You have to be able to handle an intersection at any time of day
or with any kind of possible pedestrian scenario or other cars and all that.
Whereas in robotics, I think that there's lots of commercial use cases
where you don't have to handle this whole huge distribution.
And you also don't have as much of a safety risk as well.
That makes me optimistic.
And I think that also like all the results in self-driving have been very encouraging,
especially like the number of Waymos that I see in San Francisco.
Yeah, it's been very impressive to watch them scale up usage.
What I think I found striking about the self-driving world is, you know, there were two dozen startups started roughly, I don't know, 10 to 15 years ago around self-driving. And the industry has largely consolidated, at least in the U.S., and obviously the China market's a bit different, but it's consolidated into Waymo and Tesla, which effectively were two incumbents, right? Google, and Tesla was an automaker. And then there's maybe one or two startups that either SPAC'd and went public or are still kind of working in that area, and then most have kind of fallen off, right?
And the set of players that existed at that starting moment 10, 15 years ago
was kind of the same players that ended up actually winning, right?
There hasn't been a lot of dynamism in the industry other than just consolidation.
Do you think that the main robotics players are the companies that exist today?
And do you think there's any sort of incumbency bias that's likely?
A year ago, like, it would be completely different.
And I think that we've had so many new players recently.
I think the fact that self-driving played out the way it did suggests that it might have been a bit too early 10 years ago, and I think that arguably it was. I think deep learning has come a long, long way since then, and so I think that's also part of it. And I think the same with robotics: if you had asked 10 years ago, or even five years ago, honestly, I think it would have been too early. I think the technology wasn't there yet. We might still be too early, for all we know. I mean, it's a very hard problem, and how hard self-driving has been is, I think, a testament to how hard it is to build intelligence in the physical world. In terms of major players, there's a lot of things
that I've really liked about the startup environment and a lot of things that were very
hard to do when I was at Google. And Google is an amazing place in many, many ways. But like,
as one example, taking a robot off campus was like almost a non-starter just for code security
reasons. And if you want to collect diverse data, taking robots off campus is valuable. You can move a lot faster when you're a smaller company, when you don't have kind of restrictions, red tape, that sort of thing. The really big companies, they have a ton of
capital so they can last longer, but I also think that they're going to move slower too.
If you were to give advice to somebody thinking about starting a robotics company today,
what would you suggest they do or where would you point them in terms of what to focus on?
I think the main advice that I would give someone trying to start a company would be to try to learn
as much as possible, quickly. And I think that actually trying to deploy quickly and learn and iterate quickly, that's probably the main advice: actually get the robots out there and learn from that. I'm also not sure if I'm the best person to be giving startup advice, because I've only been an entrepreneur myself for 11 months. But, yeah, that's probably my advice. That's cool. I mean, you're running an incredibly exciting startup, so I think you have full ability to suggest stuff to people in that area, for sure.
One thing I've heard a number of different groups doing is really using observational data of people
as part of the training set.
So that could be YouTube videos.
It could be things that they're recording specifically for the purpose.
How do you think about that in the context of training robotic models?
I think that data can have a lot of value, but I think that by itself, it won't get you
very far.
And I think that there's actually some really nice analogies you can make where, for example,
if you watch an Olympic swimmer race, even if you had their strength, just their practice at moving their own muscles to accomplish what they're accomplishing is essential for being able to do it.
Or if you're trying to learn how to hit a tennis ball well, you won't be able to learn it
by kind of watching the pros.
Now, maybe these examples seem a little bit contrived because they're talking about like experts.
The reason why I make those analogies is that we humans are experts at motor
control, low-level motor control already for a variety of things that our robots are not.
And I think the robots actually need experience from their own body in order to learn.
And so I think that it's really promising to be able to leverage that form of data,
especially to expand on the robot's own experience, but it's really going to be essential to
actually have the data from the robot itself too.
In some of those cases, is that just general data that you're generating around that robot,
or would you actually have it mimic certain activities or how do you think about the data
generation, because you mentioned a little bit about the transfer and generalizability.
It's interesting to ask, well, what is generalizable or not and what types of data are
and aren't and things like that?
I mean, when we collect data, it's kind of like puppeteering, like the original ALOHA work. And then you can record both the actual motor commands and the sensors, like the camera images. And so that is the experience for the robot.
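A recording loop for such teleoperated demonstrations might look roughly like this sketch, where the Robot and Camera classes are placeholders rather than a real driver API or Physical Intelligence's actual pipeline.

```python
# An illustrative teleop recording loop: at each timestep, log the camera images
# alongside the motor command the human operator produced through the puppeteering
# interface. Classes, names, and rates are placeholders.
import time

class Camera:
    def read(self):
        return b"jpeg bytes"          # placeholder for an RGB frame

class Robot:
    def last_commanded_joints(self):
        return [0.0] * 14             # placeholder for the teleop leader's command

def record_demonstration(robot, cameras, num_steps=100, hz=50.0):
    episode = []
    for _ in range(num_steps):
        episode.append({
            "images": {name: cam.read() for name, cam in cameras.items()},
            "action": robot.last_commanded_joints(),
            "timestamp": time.time(),
        })
        time.sleep(1.0 / hz)
    return episode   # one trajectory of (observation, action) pairs for training

demo = record_demonstration(Robot(), {"base": Camera(), "left_wrist": Camera()})
```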
And then I also think that autonomous experience will play a huge role, just like we've seen in language models after you get an initial language model.
If you can use reinforcement learning to have the language model bootstrap on its own experience, that's extremely valuable.
Yeah.
And then in terms of what's generalizable versus not, I think it all comes down to the breadth of the distribution.
It's really hard to quantify or measure how broad the robot's own experience is, and there's no way to categorize the breadth of the tasks,
like how different one task is from another,
how different one kitchen is from another, that sort of thing.
But we can at least get a rough idea for that breadth
by looking at things like the number of buildings or the number of scenes,
those sorts of things.
And then I guess we talked a lot about humanoid robots and other sort of formats.
If you think ahead in terms of the form factors that are likely to exist in N-years
as this sort of robotic future comes into play, do you think there's sort of one singular form
or are there a handful? Is it a rich ecosystem, just like in biology? How do you think about
what's going to come out of all this? I don't know exactly, but I think that my bet would be on
something where there's actually a really wide range of different robot platforms. I think
Sergey, my co-founder, likes to call it a Cambrian explosion of different robot hardware types and so forth, once we actually have the technology, the intelligence, that can power all those different robots.
And I think it's kind of similar to, like, we have all these different devices in our kitchen,
for example, that can do all these different things for us, rather than just one device that cooks the whole meal for us. And so I think we can envision a world where there's one kind of robot arm that does things in the kitchen, that has some hardware that's optimized for that and maybe also optimized to be cheap for that particular use case, and another piece of hardware that's kind of designed for folding clothes or something like that, dishwashing, those sorts of things.
This is all speculation, of course. But I think a world like that is, yeah, I think different from what a lot of people think about.
In the book The Diamond Age, there's sort of this view of like matter pipes going into homes
and you have these 3D printers that make everything for you.
And in one case, you're like downloading schematics and then you 3D.
print the thing, and then people who are kind of bootlegging some of this stuff end up with
almost evolutionarily based processes to build hardware and then select against certain functionality
is the mechanism by which to optimize things. Do you think a future like that, is it all likely,
or do you think it's more just, hey, you make the foundation model really good, you have a couple
form factors, and you know, you don't need that much specialization if you have enough
generalizability in the actual underlying intelligence? I think a world like that is very possible.
And I think that you can make a cheaper piece of hardware if you are optimizing for a particular use case, and maybe it would also be a lot faster and so forth.
Yeah, obviously very hard to predict.
Yeah, it's super hard to predict because one of the arguments for a smaller number of hardware platforms is just supply chain, right?
It's just going to be cheaper at scale to manufacture all the subcomponents, and therefore you're going to collapse down to fewer things, because unless there's a dramatic cost advantage, those fewer things will be more easily scalable, reproducible, cheap to make, et cetera, right, if you look at sort of general hardware approaches. So it's an interesting question in terms of that trade-off between those two tensions.
Yeah, although maybe we'll have robots in the supply chain that can manufacture any customizable
device that you want. It's robots all the way down. So that's our future.
Well, thanks so much for joining me today. It was a super interesting conversation and we
covered a wide variety of things. So I really appreciate your time. Yeah, this is fun.
Find us on Twitter at NoPriorsPod. Subscribe to our YouTube channel if you want to
see our faces, follow the show on Apple Podcasts, Spotify, or wherever you listen. That way you get a new
episode every week. And sign up for emails or find transcripts for every episode at no-priors.com.