Y Combinator Startup Podcast - Chelsea Finn: Building Robots That Can Do Anything

Starting point is 00:00:00 Hi everyone. I'm really excited to talk about developing general purpose robots and how we might actually like truly develop and bring intelligence into the physical world. So to start off, I'd like to talk about this problem which is that if you want to truly solve a robotics application, you essentially need to build an entire company around that application. You need to build a different company for logistics, for wet lab automation, for robots in kitchens, for surgical robots and so on. And this is really, really hard to do because that company needs to make new hardware, develop custom software, design unique movement primitives for that application, handle edge cases, and so on. You have to do all of that from scratch if you want to solve a robotics problem. And as a result, a lot of robotics companies haven't been very successful

Starting point is 00:00:52 in actually bringing robots into the physical world successfully in our daily lives. I co-founded a company called Physical Intelligence that's trying to solve this problem. And in particular, we're trying to develop a general purpose model that can enable any robot to do any task in any environment. And we think that this sort of generalist model may work better and be easier to use than purpose-built models, just like we've seen in the development of foundation models for language and other applications. For example, if you want to build a coding assistant, you don't nowadays develop something specifically for coding, but you develop and you build on models that were trained on large amounts of data, not just on code. And essentially, this is the problem of trying to develop these sorts of foundation models and bring this sort of intelligence into the physical world rather than the digital world where they largely are today. So how do we do this?

Starting point is 00:01:49 In this talk, I'd like to talk about how we go about doing this. And if we were to take a lesson from language models, we know that language models have taught us the importance of scale. And so one possible conclusion would be that perhaps scale is the most important ingredient for developing these models. And if you were to say this conclusion is true, then you might look to certain data sources for large-scale data. So for example, we might look at data from industrial automation.

Starting point is 00:02:21 And you get tons and tons of data of robots doing tasks over and over again like this, but this sort of data isn't going to allow robots to go into disaster zones or to make a sandwich or to bag groceries. And so this massive scale doesn't have the diversity of behaviors that we need in order to solve this general problem. Alternatively, maybe we look at data from YouTube, which is also a massive data source and has many videos of humans doing tasks that can be useful for training robots. But at the same time, we don't learn how to write by watching other people write, and we don't become expert tennis players by watching Wimbledon. And so even though there's a massive scale of data here, it's very challenging to use,

Starting point is 00:03:01 and there's also a gap between the embodiment of robots and humans. And lastly, we might look at data from simulation. You can also get a massive scale of data here, but this data lacks realism and also has a gap from reality. And so I think the lesson here is that scale is necessary for developing these models that can generalize in open world conditions, but it's subordinate to actually solving the problem. So you need scale, but it's not sufficient for the entire problem.

Starting point is 00:03:29 And so at Physical Intelligence, this is an example of a data episode that we've collected. This is in honor of our first anniversary, which was a few months ago. Here you can see a teleoperator, a person who's operating some leader arms to control the robot to light a match and light a candle with the match. And with this sort of data, we can train robots to do a variety of different tasks.

Starting point is 00:03:56 And so what I'd like to talk about is some of our recent results at trying to develop sort of physical intelligence with large-scale real robot data. I should mention this is large-scale by today's robot standards, and arguably a minuscule amount of data compared to the sorts of robot data that we should have in the years to come. And so in particular, we'll be looking at whether robots can do a variety of dexterous long-horizon tasks,

Starting point is 00:04:19 whether robots can succeed in places they've never been, whether robots can respond to open-ended prompts and interjections. And even if you're not excited about robotics, I think that the lessons that we've learned from trying to address these problems are applicable outside of the physical world. So can we develop robots that can complete dexterous long-horizon tasks?

Starting point is 00:04:39 And in particular, in this first part, I'd like to talk about how we trained a Pi-0 foundation model to do this task, which is to unload a dryer and fold laundry. And to date, I think this is the most impressive thing that I've seen a robot do in the physical world. It's really hard. This is an incredibly difficult problem.

Starting point is 00:05:05 You can see that it's not perfect. Here it's making some misgrasps, making some mistakes. But it's really, really hard because you have to deal with the variability in the clothes and the way in which they might be positioned and crumpled and be able to handle all those sorts of things. And as you're doing this task, which takes about 10 minutes for the robot, there's many opportunities to fail, to fail catastrophically. For example, dropping things on the ground, which is hard to recover from.

Starting point is 00:05:30 And you have to be able to recover from even small mistakes. I was personally actually working quite a bit on this laundry folding robot, along with Michael and Saraj, of course, supported and with contributions from the whole physical intelligence team. So how do you even approach this sort of problem? This is a really, really hard thing for a robot to do. And what we did is we started simple. We started with, can a robot fold a single-size, single-brand shirt?

Starting point is 00:06:00 And can a robot dynamically flatten one shirt, again, single-brand, single-sized? And if you start simple, this makes the problem quite a bit easier. We collected some data with teleoperation and trained a policy with imitation learning. And our model had around 100 million parameters, mapping from images, from the robot's cameras, to target joint positions on the robot arms. And we do this sort of control at 50 hertz on the robot. And we founded the company in mid-March of 2024. And a couple months later, after we had set everything up,

Starting point is 00:06:32 we were able to get a policy that could fairly reliably fold a single-size, single-brand shirt. You can see that I'm testing the policy right here. And we also wanted to test some dynamic motions, because you need to be able to match the control frequency accurately in order to do these sorts of dynamic motions. And so these were some of our very initial tests at addressing this sort of laundry folding problem.
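To make the control setup described above concrete, here is a minimal Python sketch of a fixed-rate loop in which a policy maps a camera image to target joint positions at 50 Hz. The `dummy_policy` function, the 14-joint layout, and the `None` image placeholder are assumptions for illustration only, not the actual Physical Intelligence code.

```python
import time

CONTROL_HZ = 50                       # control frequency mentioned in the talk
CONTROL_PERIOD_S = 1.0 / CONTROL_HZ   # 0.02 s per control step
NUM_JOINTS = 14                       # assumption: two arms with 7 joints each


def dummy_policy(image):
    """Hypothetical stand-in for the learned policy: image -> target joint positions."""
    return [0.0] * NUM_JOINTS


def run_control_loop(policy, num_steps, sleep=time.sleep, clock=time.monotonic):
    """Call the policy once per control period and return the commanded targets."""
    commands = []
    next_deadline = clock()
    for _ in range(num_steps):
        image = None  # placeholder for a real camera frame
        commands.append(policy(image))
        # Schedule against absolute deadlines so timing errors do not accumulate.
        next_deadline += CONTROL_PERIOD_S
        remaining = next_deadline - clock()
        if remaining > 0:
            sleep(remaining)
    return commands


commands = run_control_loop(dummy_policy, num_steps=5)
```

Scheduling against absolute deadlines rather than sleeping a fixed 20 ms after each step is one simple way to keep the loop close to the intended rate, which matters for the dynamic motions mentioned above.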

Starting point is 00:06:53 Then from there, we wanted to make the problem incrementally harder. And so we, instead of starting from the shirt flat on the table, we started in a crumpled position like these. And it turns out that this actually makes it a lot harder. And so here are some videos of some of our initial attempts at trying to train the robot to fold these shirts. And the robot struggles. The robot does some things that kind of look

Starting point is 00:07:20 somewhat sensible, but generally isn't able to make progress on the task. With many tests, we frequently were getting 0% success rate in our tests of this system and really struggling to make progress. So really, this introduces the challenge of handling the sorts of variability in the ways in which shirts might be crumpled on the table. We had some initial signs of life in late June of last year. And so in this case, the robot was able to kind of make progress on flattening the shirt and was also then able to fold the shirt decently well from that initial state.

Starting point is 00:07:54 Still not perfect. And as you can see, it takes quite a while to do this. So this is a video that was sped up 8X. So not something that you might have the patience for a robot to do. So with some initial signs of life but also a very low success rate, we started to transition to a slightly harder version of the task where the laundry starts in a laundry basket. We also introduced variable-sized shirts and shorts into the mix. And again, the robot really struggled.

Starting point is 00:08:22 So in many of our tests, we're getting 0% success rate across the board. We're really struggling to actually get the robots to learn how to do these tasks. At this point, we were trying to consider a lot of different things. We thought that maybe the robot needs memory, needs history in some way. Maybe we need to just train our models for longer. Maybe we should be doing control in end-effector space rather than in joint space of the robot. Maybe our encoders, we knew that there were calibration issues, and maybe we need that

Starting point is 00:08:48 calibration to be more consistent. Maybe we need to condition the model on more information about the data. Maybe we need hierarchy because this is a pretty long horizon task and it needs to break it down into different subtasks. Maybe we need higher resolution images. Maybe we need to introduce kind of interventions in data collection. A lot of these things we also tried. We had around two to three months of failure where nothing was really working at addressing this task. But then at some point we actually had a bit of a breakthrough, which was that we found one thing that really seemed to make a difference in the robot's ability to do the task. And this was actually to take some inspiration from the world of language modeling to actually

Starting point is 00:09:25 instead of just training a policy on all of our data, we pre-train on all the data and then fine-tune on a curated, consistent, high-quality set of demonstration data. When we did this, we found that the robot was actually able to make progress and a lot more reliably fold articles of clothing. And so I think that this video was the first video where the robot was able to fold five items in a row and stack them. I went home very excited this day. This was in September of 2024, so multiple months after our initial tests.
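The two-stage recipe described above can be sketched as a data split: pre-train on everything, then fine-tune on a curated subset. The talk does not say how the curated set was chosen, so the `quality` score, the `strategy` label, and the 0.8 threshold below are purely hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Episode:
    task: str
    quality: float   # hypothetical score, e.g. assigned by a human reviewer
    strategy: str    # hypothetical label for how the demonstrator did the task


def split_for_recipe(episodes, min_quality=0.8, strategy="standard"):
    """Return (pretrain_set, posttrain_set).

    Pre-training uses every episode. Post-training keeps only episodes that
    are both high quality and consistent in strategy.
    """
    pretrain_set = list(episodes)
    posttrain_set = [
        ep for ep in episodes
        if ep.quality >= min_quality and ep.strategy == strategy
    ]
    return pretrain_set, posttrain_set


episodes = [
    Episode("fold_shirt", 0.90, "standard"),
    Episode("fold_shirt", 0.50, "standard"),   # low quality: pre-training only
    Episode("fold_shirt", 0.95, "unusual"),    # inconsistent: pre-training only
    Episode("fold_shorts", 0.85, "standard"),
]
pretrain_set, posttrain_set = split_for_recipe(episodes)
print(len(pretrain_set), len(posttrain_set))   # prints: 4 2
```

The point of the split is that nothing is thrown away: the messy episodes still contribute during pre-training, and only the final fine-tuning stage is restricted to the consistent ones.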

Starting point is 00:09:59 Now, this is far from perfect. It takes 20 minutes to fold five items of clothes. And at the same time, though, it kind of suggested that this sort of recipe was able to unlock the capability in the robot to actually fold these articles of clothing. So you can see these sorts of failures here. In this case, it attempted to fold the blue shirt around seven times before eventually actually figuring out how to do that.

Starting point is 00:10:24 There's also other failure modes as well. So here's an example where the robot pushes the stack to the corner of the table and decides to kind of fiddle with it a bit, and then eventually slides it off the table. And then it proceeds as if nothing happened, and it's going to continue to fold. We continue to iterate on this recipe. We selected and worked on our curation strategy

Starting point is 00:10:44 for curating a higher quality set of demonstration data. We got it from 20 minutes down to 12 minutes for these five items. This is kind of how we were evaluating how good our robot system was. It still makes mistakes. The fold quality still varies, but it's still significantly better than our previous curation recipe. Now, at this point, we were still largely pre-training and fine-tuning only on laundry data, and we weren't leveraging kind of pre-trained models in the community. And there were some folks working at Physical Intelligence that were working on developing a pre-trained model, trained on all of the robot data, and we then started

Starting point is 00:11:21 to try to introduce these models into our recipe. And so we took an open source vision language model, a 3 billion parameter model called PaliGemma. Previous videos were all with models of like 100 to 300 million parameters that we were iterating on. This model takes as input images from the robot, also a language command, and then has a head, a diffusion head that's going to attend to all the internal values of the vision language model and, with the joint angles, predict a chunk of 50 actions into the future, so about one second of action steps.
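The arithmetic behind "about one second" follows from the 50-action chunk and the 50 Hz control rate mentioned earlier. A small sketch, where the function names and the idea of executing only part of each chunk before re-querying the model are illustrative assumptions:

```python
def chunk_duration_s(chunk_len, control_hz):
    """Seconds of motion covered by one predicted chunk of actions."""
    return chunk_len / control_hz


def model_calls_needed(task_duration_s, control_hz, steps_executed_per_call):
    """Model calls needed if each call's first `steps_executed_per_call` actions are run."""
    total_steps = round(task_duration_s * control_hz)
    return -(-total_steps // steps_executed_per_call)  # ceiling division


print(chunk_duration_s(50, 50))          # 1.0
print(model_calls_needed(600, 50, 50))   # 600
print(model_calls_needed(600, 50, 25))   # 1200
```

For a task of about 10 minutes, like the laundry task, executing each full chunk would mean roughly 600 model calls, and re-planning halfway through each chunk would double that.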

Starting point is 00:11:57 And we're using a flow matching variant of diffusion to actually output these actions and output continuous actions. So we took this pre-trained model and instead of pre-training only on laundry, we pre-trained on all of the robot data that we had collected. And then we just fine-tuned it with the same exact post-training recipe that we had developed without using the vision language models. When we did this, we actually saw the robot continue to actually get better when we just plugged in that new pre-trained model.
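As a rough illustration of how a flow matching model produces continuous actions, the sketch below starts from Gaussian noise and integrates a velocity field with Euler steps. In a trained model the velocity would come from the network, conditioned on images and language. Here an oracle that already knows the target action stands in for it, and the step count and 3-dimensional action are arbitrary assumptions.

```python
import random


def ideal_velocity(x_t, t, target):
    """Velocity that carries x_t to `target` along a straight path by time t = 1."""
    return [(tgt - x) / (1.0 - t) for x, tgt in zip(x_t, target)]


def sample_action(velocity_fn, dim, num_steps=10, seed=0):
    """Integrate dx/dt = velocity_fn(x, t) from t = 0 (noise) to t = 1 (action)."""
    rng = random.Random(seed)
    x = [rng.gauss(0.0, 1.0) for _ in range(dim)]   # start from pure noise
    dt = 1.0 / num_steps
    for k in range(num_steps):
        t = k * dt                                   # t stays strictly below 1
        v = velocity_fn(x, t)
        x = [xi + dt * vi for xi, vi in zip(x, v)]   # one Euler step
    return x


target = [0.1, -0.3, 0.5]   # hypothetical target joint positions
action = sample_action(lambda x, t: ideal_velocity(x, t, target), dim=3)
```

With the oracle velocity the sampler lands on the target up to floating point error. A learned velocity field would only approximate this, and the number of integration steps trades accuracy against inference time.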

Starting point is 00:12:27 And so in the left video, it's able to do five items in nine minutes, which was faster than the 12 minutes we had before. In the right videos, we were testing with some novel clothing items and found that it was also quite efficient at folding multiple items in a row. And we also saw as a result, there's also more consistent fold quality by using this model that was about 10 times larger and had seen more robot data as input. To look at a few highlights of this, here's a pair of shorts that the robot hasn't seen before. And this is kind of a tricky scenario where to flatten it, it actually kind of needs to reach under the kind of the bottom of the shorts.

Starting point is 00:13:02 And it's able to do that. It's able to kind of figure out that it should reach under the left part of the shorts in order to eventually flatten it. And then once it actually successfully flattens it, it's able to fold it successfully. It also has to do something similar at times to fold shirts. So in this case, it needs to actually kind of fold the shirt over on itself, which actually puts it in a more crumpled state, arguably, but allows it to find the corners of the shirt

Starting point is 00:13:27 and then go ahead and fold it. And then like I mentioned, it also is able to handle unseen clothing items. So here's an example of a shirt with a V-neck that it is able to fold, even though the post-training data set didn't have, well, this shirt was completely held out, and the post-training data set didn't have any V-necks as input in the data set. It's also able to fold shirts with buttons,

Starting point is 00:13:50 so it has some degree of generalization to different clothing items. And then lastly, because this policy is a neural network, and it's kind of taking as input the current image, it's able to handle interruptions. So here, Michael is continuing to mess with the robot, and the robot figures out that it should put the shirt away, while it's trying to fold the other shirt.

Starting point is 00:14:13 In this case, Michael's going to continue messing with the robot. So Michael unfolds one side. And the robot reacts. Michael goes in again. And the robot makes some mistakes here, but it's able to recover. Michael messes it up again. So those are some results of what the robot's

Starting point is 00:14:35 able to do. Now, I talked about this pre-training and post-training recipe being really important. We can actually quantitatively measure that and actually make sure that this is actually what's leading to improvement. So we compared this pre-training and post-training recipe to not using any pre-training and only training

Starting point is 00:14:51 on the curated data set versus no post-training where you're training on all of the data rather than fine-tuning on the curated data set. And we evaluated these models in terms of their progress on the task, where you make partial progress for getting it out of the bin, which is the easiest part, and then further progress for flattening, folding, and stacking the items.
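A progress-based score like the one described above could be computed as the fraction of ordered stages completed. The stage names follow the talk (out of the bin, flattened, folded, stacked), but the equal weighting and the rule that progress stops at the first missed stage are assumptions.

```python
STAGES = ["out_of_bin", "flattened", "folded", "stacked"]


def progress_score(completed):
    """Fraction of stages completed in order, stopping at the first missing stage."""
    done = 0
    for stage in STAGES:
        if stage not in completed:
            break
        done += 1
    return done / len(STAGES)


print(progress_score({"out_of_bin"}))                         # 0.25
print(progress_score({"out_of_bin", "flattened", "folded"}))  # 0.75
print(progress_score(set(STAGES)))                            # 1.0
```

A partial-credit score like this separates a model that only gets items out of the bin from one that reliably flattens and folds, which a binary success rate would hide.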

Starting point is 00:15:09 And we see that the pre-training and post-training recipe is able to get far higher performance than omitting pre-training and omitting post-training. And notably, omitting pre-training and post-training is basically able to get it out of the bin, and make very little progress after that. Whereas when we combine pre-training and curated post-training, we get far higher performance where

Starting point is 00:15:29 it's able to reliably flatten and fold objects. And then the last thing that I'll mention on this note is that nothing in this recipe is specific to laundry. And so we took the same recipe and fine-tuned on other tasks. So here, the task is to kind of clean up a table. And the robot was also able to successfully do this task, despite the fact that we primarily were iterating a lot on laundry,

Starting point is 00:15:55 but is able to also apply this recipe to this task. It also is able to scoop coffee beans into a coffee grinder. This task is pretty hard. It has to construct the bottom part of a cardboard box, which requires quite a bit of dexterity. And then lastly, autonomously lighting a candle with a match, again with this kind of same pre-training and post-training recipe. And so this is pointing at this kind of the benefit of foundation models that I alluded to before,

Starting point is 00:16:27 which is that to do these different tasks, you don't have to start completely from scratch. You can actually leverage pre-training across multiple robots and across multiple tasks. And then we're also able to apply that same recipe to robots at other companies. This is a robot that I've actually never seen in person before. They collected data. They sent the data to us. We fine-tuned our model on their data. We actually didn't even know exactly how the model is being controlled,

Starting point is 00:16:53 exactly the representation of their actions. But by fine-tuning the model on this new robot, the model is able to control the robot in order to make a cup of coffee in this case. So some takeaways for this part. We were able to independently develop post-training and pre-training and decouple the problem and then eventually get the best of both. We found that training on all the data doesn't work

Starting point is 00:17:17 for complex tasks. And this sort of pre-training and post-training on curated data leads to far better performance. And then we broke up this really hard problem of folding laundry by gradually starting with folding single shirts and going to more and more complex versions of the task. Now, there's a number of limitations here.

Starting point is 00:17:35 And one limitation I'd like to point out is that these robots inevitably, in this case, were trained in the environments that they were tested. And so this means that in principle, you can use these methods to collect a lot of data in one environment and then deploy them in one environment. But ultimately, there's going to be things that change about an environment and scenarios

Starting point is 00:17:54 where we would want to actually apply these robots to environments that they've never been in before. And so how can robots actually succeed in places that they've never been? The lesson we've learned from machine learning in other places is that we should collect diverse data. And so we started by collecting data of tidying bedrooms and kitchens in many different environments.

Starting point is 00:18:14 And here's an example, kind of a sample of that data. And we collected robot data in homes across San Francisco here, and also collected data in diverse mock kitchens and mock bedrooms. And in total, we had more than 100 unique rooms represented in the data set. That ended up being part of a bigger pre-training mixture. So we trained on this diverse mobile manipulation data, including the low-level action prediction, as well as predicting high-level subtask commands for how to complete the task.

Starting point is 00:18:44 But we also trained on previously collected static manipulation data that was also fairly diverse, static manipulation data that we had collected in our office and in labs, as well as web data and high-level instructional data. And I should point out here that the mobile manipulation data of tidying bedrooms and kitchens only accounted for 2.4% of the overall pre-training mix. And so the lesson here is that you're basically able to spin up a new task and actually an entirely new robot, the rest of the mixture didn't have any mobile manipulation data

Starting point is 00:19:15 with this particular mobile manipulator in it without redoing all of the data collection. We're able to build upon everything that had been done before. And this is kind of this same story of foundation models being able to make it easier to spin up a new problem, a new application without starting from scratch. Now, this wasn't completely easy. We had a couple challenges.
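To give a feel for how small a 2.4% share of the pre-training mix is, the sketch below draws training examples from a weighted mixture and counts how many come from the mobile manipulation data. The talk gives no shares for the other sources (static manipulation, web data, instructional data), so they are lumped into one hypothetical `everything_else` bucket.

```python
import random

# Only the 2.4% figure comes from the talk; the remainder is lumped together.
MIXTURE = {"mobile_manipulation_homes": 0.024, "everything_else": 0.976}


def sample_sources(n, seed=0):
    """Draw n data sources according to the mixture weights."""
    rng = random.Random(seed)
    names = list(MIXTURE)
    weights = [MIXTURE[name] for name in names]
    return rng.choices(names, weights=weights, k=n)


batch = sample_sources(10_000)
share = batch.count("mobile_manipulation_homes") / len(batch)
print(f"mobile manipulation share in this draw: {share:.3f}")
```

In expectation, only about 24 of every 1,000 examples come from the new robot and task, with the rest of the mixture supplying the broader foundation.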

Starting point is 00:19:37 One of the challenges that we ran into is that naively, these models can ignore language instructions. So we actually, in this case, asked it to pick up the cutting board, and it chose to pick up the plate instead. Now we're again asking it to pick up the cutting board. And instead, the robot had a mind of its own, decided to pick up the plate, and then we tell it to put the plate in the sink.

Starting point is 00:19:57 And eventually it decides that, well, after kind of moving away from the cutting board, it eventually decided that it would actually pick up the cutting board. And so in the early development of our model, we found that it often ignored language. And to solve this, we thought about how vision language models actually follow language well, and so maybe there's a way to preserve the inherent abilities of the pre-trained models when addressing this task. And so with this Pi-0 architecture, this action head that's using diffusion is randomly initialized, and this ends up actually deteriorating the pre-trained knowledge that's present in the vision language model.

Starting point is 00:20:36 And we found that if we can prevent this deterioration, we might be able to get better language following. And so the recipe that we came up with was actually, in some ways, fairly similar, but instead we're going to be predicting tokenized actions. And then when we have the diffusion head, we'll be stopping the gradient from the randomly initialized diffusion head to prevent it from deteriorating the language following abilities of the VLM backbone. And we found that this first led to faster training because the tokenized actions are a more direct

Starting point is 00:21:05 supervision signal. And second, it also followed language far better, an 80% follow rate rather than a 20% follow rate, which suggests that we're able to preserve the kind of pre-training in the vision language model backbone. So we put those pieces together. We took that recipe and trained it, pre-trained it on all of our data, including the mobile manipulation data. We fine-tuned it on mobile manipulation data in a variety of environments. And then we tested the model in places it's never been in before. So we rented three Airbnbs that we had never been to before.
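The effect of stopping the gradient from the action head can be shown with a toy model and hand-written gradients. Everything here is a simplified stand-in: a single scalar `w` plays the role of the VLM backbone, a scalar `v` plays the role of the randomly initialized diffusion head, and squared errors replace the real losses.

```python
def gradients(w, v, x, y_token, y_action, stop_gradient):
    """Return (dL/dw, dL/dv) for a toy two-loss model.

    Backbone:    h = w * x             (stands in for the VLM backbone)
    Token loss:  (h - y_token) ** 2    (stands in for tokenized-action prediction)
    Action head: a = v * h             (stands in for the diffusion head)
    Head loss:   (a - y_action) ** 2
    """
    h = w * x
    head_error = v * h - y_action
    grad_w = 2.0 * (h - y_token) * x   # backbone always learns from the token loss
    grad_v = 2.0 * head_error * h      # the head always learns from its own loss
    if not stop_gradient:
        # Without the stop-gradient, the head's loss also updates the backbone.
        grad_w += 2.0 * head_error * v * x
    return grad_w, grad_v


with_stop = gradients(1.0, 5.0, 2.0, 1.0, 0.0, stop_gradient=True)
without_stop = gradients(1.0, 5.0, 2.0, 1.0, 0.0, stop_gradient=False)
print(with_stop[0], without_stop[0])   # prints: 4.0 204.0
```

With the stop-gradient in place, the backbone's update no longer depends on the head's weights at all, so a badly initialized head cannot drag the backbone away from what it learned in pre-training.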

Starting point is 00:21:37 We put the robot in those homes, in this case in the kitchen, and I asked it to close the cabinet. I asked it to put away the dishes. It's also never seen these dishes or these forks, these objects. And the robot's able to succeed, even though it's never been here before. There's different countertops, different furniture, different objects, and so forth. Lastly, I asked it to clean up the spill, and the robot is able to wipe down the spill and eventually put the sponge into the sink.

Starting point is 00:22:16 It's also able to do this for bedroom. So Laura asked it, in this case, just clean the bedroom. And it puts articles of clothing in. It throws away the trash and then is able to tidy the bed by putting the pillow at the top of the bed and tidying the blanket or the comforter of the bed.

Starting point is 00:22:53 So quantitatively, I talked about how there's only 2.7% or something of the mixture. So how much does that other data actually help? Could we actually just train on that kind of 2.7%? And we find that these kind of bars on the right, which are excluding data from static robots in labs and environments and so forth, reduces performance significantly. So the performance goes down to less than 60%

Starting point is 00:23:17 when you exclude that data when evaluated in novel homes compared to if you use the full pre-training mixture, it has more than 20% higher performance. Lastly, we also looked at, is the diversity of data helpful? Is it important? And so we increase the amount of data from these environments to test this. It's always good to like, you can kind of do vibe evals,

Starting point is 00:23:38 but it's really helpful to actually measure how well these things work. And so this is what this is measuring. And we find that if we actually increase the amount of homes, the amount of locations that are represented in the data, the performance increases, which is great. And it actually gets the same level of performance as if we train on data from that target environment.

Starting point is 00:24:00 And so it means we're actually mostly closing the generalization gap and suggests that the bottlenecks at this point for this sort of task lie not in collecting more diverse data, but in actually getting higher reliability and higher performance. Now, I should also mention that there's failure modes, like this success rate was around 80%. There's lots of room for improvement.

Starting point is 00:24:20 Here are a couple examples of those failure modes. So here it's told to put the items in the drawer. It is able to put it in the drawer, but the item isn't fully in the drawer at the end, and it decides that it's done and kind of moves on to the next thing. Here, the robot needs to put the clothes in the laundry basket. It drives over the shirt, and then it gets stuck, and is not able to lift it up.

Starting point is 00:24:41 Here we asked it to put the dishes in the sink, and it successfully is able to put a number of the dishes in the sink, but it struggles to pick up the cutting board in this particular case, because it's very thin, and it's flush against the surface of the countertop. And in the last case, probably my favorite case, it's told to put the spatula into a drawer, and it decides that the oven looks a lot like a drawer. And so it opens the oven and tries to put it in there.

Starting point is 00:25:10 And beyond this, there's also challenges with regard to speed, partial observability, long-term planning. And so lots of work to do still. So the takeaway here is that with diverse data, robots can follow a variety of instructions in environments that the robot has never been in before, which is a big step up from a lot of robotic scenarios where they're trained in the scenarios that they are being tested. Now the last kind of bit I'd like to talk about is this model has a fairly limited instruction set. It can only follow kind of a certain set of commands. And if we think about how other forms of

Starting point is 00:25:47 AI technology have been deployed, people really like to customize and actually tell the robot what they want, or tell the system what they want from these kinds of models. And so just like we prompt language models, can we allow robots to respond to open-ended prompts and open-ended interjections? So to do this, we're actually leveraging hierarchical vision language action models. So we're going to have a high-level policy break down the prompt into intermediate verbal responses and intermediate atomic language commands. So the high level prompt might be,

Starting point is 00:26:22 can you make me a sandwich? And this high level policy will break it down into the subtask of pick up one slice of bread. This will be passed to a low level model that actually executes and predicts target joint angles to fulfill the low level command of picking up one slice of bread. Now, on its own, this isn't going to be able to follow all sorts of prompts.
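The hierarchy described above can be sketched as two functions: a high-level policy that turns the user's prompt into the next atomic language command, and a low-level policy that turns that command plus an image into target joint angles. Both are hypothetical stubs here. A lookup table stands in for the learned high-level model, and each command is assumed to finish after a single low-level call.

```python
# Hypothetical plan table standing in for a learned high-level model.
PLANS = {
    "can you make me a sandwich?": [
        "pick up one slice of bread",
        "put the bread on the cutting board",
    ],
}


def high_level_policy(prompt, num_done):
    """Return the next atomic language command, or None when the plan is finished."""
    plan = PLANS.get(prompt.strip().lower(), [])
    return plan[num_done] if num_done < len(plan) else None


def low_level_policy(image, command, num_joints=14):
    """Stub for the low-level model: (image, command) -> target joint angles."""
    return [0.0] * num_joints


def run(prompt, max_steps=10):
    """Alternate between the two levels and return the commands that were executed."""
    executed = []
    while len(executed) < max_steps:
        command = high_level_policy(prompt, len(executed))
        if command is None:
            break
        image = None  # placeholder for a camera frame
        low_level_policy(image, command)  # targets would be sent to the robot here
        executed.append(command)
    return executed


print(run("Can you make me a sandwich?"))
```

Keeping the interface between the levels as plain language is what lets the high-level policy be trained or swapped separately from the low-level controller.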

Starting point is 00:26:44 And it's actually fairly tricky to handle open-ended language because it's going to be challenging to collect a large number of human robot interactions with the real robot in the loop. And this is also going to be fairly hard to scale. And so what we did is we kind of took all of our existing robot data and we can actually generate synthetic data for the existing robot data. And particularly, we can use language models to relabel and generate hypothetical human prompts for the scenarios that the robots are in. And so what this looks like is we'll take data that says,

Starting point is 00:27:17 here's a kind of a video, and then the next skill is to pick up a KitKat, because that's what the robot does next in terms of just like basic low-level annotation. And then for this scenario where the robot is about to pick up the KitKat, we can ask a vision language model, what is a hypothetical prompt that a human might have asked that led to this particular scenario and the robot actually choosing to pick up a KitKat. And then we can train our high-level policy

Starting point is 00:27:44 with various human interactions that might have led to those different situations. And as a result of this, we're able to actually allow robots to follow a variety of different prompts. So on the left, we ask, hi robot, can you make me a ham and cheese sandwich? The robot says, sure, I'll start with the bread and add ham and cheese next. And it's able to break down this task into the various subtasks of picking up a slice of bread, putting it on the cutting board, picking up a slice of cheese, putting it on the bread, picking up some ham, and so on and so forth. It can also follow more complicated prompts like,

Starting point is 00:28:16 Hi, Robot, can you make me a vegan sandwich? I don't like pickles, though. And in this case, it's able to break it down and decide that it's going to add lettuce and tomatoes to the sandwich and not add pickles, not add cheese, not add meat as well. In addition to prompts, we're also able to train the robot to handle different interjections. Actually, here's a case where we have a different kind of prompt.

Starting point is 00:28:39 So on the left, we train the robot to clean tables, so put trash away and put dishes into the bin. And on the right, we asked the robot to clean up only the trash, but not the dishes. And the robot is able to understand what that means and connect that to its low-level actions and only put away the trash and complete when the trash is all put away. And then lastly, it's able to handle interjections

Starting point is 00:29:00 and situated corrections. So in this case, the robot is kind of getting items for a user. The user interjects and says, get me something sweet that's not in the basket, right after it had put a Kit Kat into the basket. And the robot says, sure, let me get you some Skittles, and reasons through kind of basic reasoning about how to fulfill the user's request. And is able to respond to those kinds of corrections

Starting point is 00:29:24 situated in the world that the robot is in. Now, you might also wonder, like, maybe some existing foundation models could serve as a high-level planner for robots and do this sort of high-level reasoning without actually training a separate model. And so we also evaluated that. And we found that their performance, shown in blue, at following instructions and making progress on the task was

Starting point is 00:29:44 substantially lower than the performance of our system, which is shown in green. And in general, we found that these frontier models generally struggle with visual understanding as it pertains to robotics, which makes sense because in general, these models aren't kind of really targeting many physical applications and have very little data in the physical world. Okay, so to start to wrap up, and then we'll have some time for questions, we talked a bit about how robots can do a variety of dexterous long-horizon tasks with pre-training and post-training, how robots can succeed in places that they've never been, and how they can respond to open-ended prompts and interjections by leveraging synthetic data from language models on top of the robot data that we had collected. Now, with some closing notes, we've seen a few different scenarios in this talk where general-purpose robots might be more successful than specialist robots.

Starting point is 00:30:35 That's because, rather than starting from scratch for every single application, we can essentially build upon a much broader foundation for physical intelligence in the real world. We also saw that large-scale data in the real world is really helpful for developing these things. And I think that it is necessary, but not sufficient, for physical intelligence.
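The synthetic-prompt step described earlier in the talk (asking a vision language model what a human might have said to lead to an annotated skill) can be sketched roughly as follows. This is a minimal illustration, not the speaker's actual pipeline: the `Segment` and `HighLevelExample` schemas, the query wording, and the `fake_vlm` stand-in are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Segment:
    """One annotated chunk of robot data (hypothetical schema)."""
    image_id: str  # reference to the camera frame at the start of the segment
    skill: str     # low-level annotation, e.g. "pick up the KitKat"


@dataclass
class HighLevelExample:
    """Training example for a high-level policy: (image, user prompt) -> skill."""
    image_id: str
    user_prompt: str
    skill: str


def build_query(segment: Segment) -> str:
    """Text sent to the vision language model alongside the segment's image."""
    return (
        f"The robot is about to perform the skill: '{segment.skill}'. "
        "What might a human have asked that led to this situation?"
    )


def augment(
    segments: List[Segment], vlm: Callable[[str, str], str]
) -> List[HighLevelExample]:
    """Label every segment with a synthetic user prompt produced by `vlm`."""
    return [
        HighLevelExample(s.image_id, vlm(s.image_id, build_query(s)), s.skill)
        for s in segments
    ]


def fake_vlm(image_id: str, query: str) -> str:
    """Stand-in for a real model call (assumption, for the demo only)."""
    if "KitKat" in query:
        return "Can you get me something sweet?"
    return "Can you help me out?"


examples = augment([Segment("frame_0001", "pick up the KitKat")], fake_vlm)
print(examples[0])
```

Passing the model in as a plain callable keeps the sketch independent of any particular vision language model API; a real system would replace `fake_vlm` with an actual model call that also receives the image.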

Starting point is 00:30:56 And there's a lot of challenges. We need more research to be done ourselves and through open source contributions before robots, I think, will be truly ready to tackle the open world. I'd also like to mention that at Physical Intelligence, we're hiring for a number of roles. If you're excited about some of the things that we talked about,

Starting point is 00:31:13 you can see a list of the open roles on the PI website as well. Awesome. Happy to take some questions. Let's start on the left. Hi, Chelsea. So first, I want to say thank you for all your work on robot learning. It's all really impressive. And so mainly I have two questions,

Starting point is 00:31:40 especially regarding the post-training part you mentioned. So the first thing is you mentioned that in post-training, the most important part is to have high-quality action data. So I'm wondering what the components of that would be. And then the second question is, what part do you think RL will play in post-training? Yeah, absolutely. So I think that the different components of it, a lot of it comes down to consistency of the data and the strategy being followed, and whether the data completes the task efficiently and with a reliable strategy. And then on the second question, I think that reinforcement learning can play a very large role

Starting point is 00:32:20 in post-training. I think that online data from the robots, which reinforcement learning allows you to use, can allow robots to have a much higher success rate and also be faster than if they're just trained with imitation learning. Yeah. Thank you. Hi. Thank you so much for your talk.

Starting point is 00:32:39 So your work is really fascinating and there is no doubt that it will have a lot of impact in the future. But can I ask you at this stage, how do you find the funding? Because honestly, I can't imagine how hard it can be to convince people to invest in robots that fold clothes

Starting point is 00:32:59 and deal with the dishes. Yeah, so it's a good question. I think that, well, I guess first I'll mention that we aren't just focused on applications in the home. We really want to solve this broader problem of physical intelligence. And we've been starting with those applications because they're ones that are kind of easy to make progress on. But we've also been doing tasks like inserting an Ethernet cable, which I put in the talk,

Starting point is 00:33:23 as well as constructing a cardboard box. And generally, I think that this sort of problem has a ton of potential for like making impact in all sorts of realms, not just in domestic tasks, but all sorts of realms as well. And even in domestic tasks, I think there's a huge, huge market for this kind of technology. We ourselves haven't had a lot of challenge with fundraising, and I think that a lot of robotics companies recently have also done a great job and found that there's actually a lot of excitement around this sort of technology because I think things are actually starting to work. I started working on this technology more than 10 years ago at this

Starting point is 00:33:57 point, and things really weren't working then. And so, yeah, I think that there's a lot of excitement that this is starting to mature and actually be ready for the real world. I think that there's a lot more work to do. But generally, it seems like there's a lot of people excited about this technology and eager to actually put funds behind it. Okay, thank you so much. Yeah. Hi.

Starting point is 00:34:17 Thank you so much. I have two questions, like one more broad and one more technical. So the technical one like is VLAs, in my opinion, like at least to my understanding, are a framework that is a bit separate from world modeling. And I wonder how the two of them will interplay among each other and whether like you have actually plans to somehow like use them together. As I see right now like VLAs as more of policies that could actually benefit a lot from world modeling.

Starting point is 00:34:52 And from a broader perspective, I wonder like which kind of infrastructure layers could be the most useful to work on, such as like explainability, traceability, or safety in general to deploy such models like in the real world? Yeah, great questions. So on the first point, there's actually fairly natural ways to incorporate world model objectives

Starting point is 00:35:18 into vision language action models. And we've done some work where instead of only predicting the next action, you predict some intermediate sub-goal image, like what should happen in the future in order to accomplish the task, and then predict an action from there. And we've seen some kind of signs of life, and that seems to be quite promising. So I think there's ways to merge the two paradigms.
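The two-stage idea just described (first predict an intermediate sub-goal, then predict an action conditioned on it) can be illustrated with a toy grid example. This is only a sketch under loud assumptions: the real system predicts sub-goal images with learned models, whereas here positions on a grid stand in for images, and `predict_subgoal`, `predict_action`, and `rollout` are hypothetical hand-written functions, not learned components.

```python
from typing import List, Tuple

Position = Tuple[int, int]


def predict_subgoal(current: Position, goal: Position, horizon: int = 3) -> Position:
    """Stand-in for a sub-goal predictor: a point at most `horizon` cells
    closer to the goal along each axis (a position instead of an image)."""
    def step_toward(c: int, g: int) -> int:
        delta = max(-horizon, min(horizon, g - c))
        return c + delta
    return (step_toward(current[0], goal[0]), step_toward(current[1], goal[1]))


def predict_action(current: Position, subgoal: Position) -> Position:
    """Stand-in for the action head: one unit step per axis toward the sub-goal."""
    def sign(v: int) -> int:
        return (v > 0) - (v < 0)
    return (sign(subgoal[0] - current[0]), sign(subgoal[1] - current[1]))


def rollout(start: Position, goal: Position, max_steps: int = 50) -> List[Position]:
    """Alternate sub-goal prediction and action prediction until the goal is reached."""
    state = start
    trajectory = [state]
    for _ in range(max_steps):
        if state == goal:
            break
        subgoal = predict_subgoal(state, goal)   # "what should happen next"
        action = predict_action(state, subgoal)  # action conditioned on the sub-goal
        state = (state[0] + action[0], state[1] + action[1])
        trajectory.append(state)
    return trajectory


path = rollout((0, 0), (5, 2))
print(path)
```

The point of the structure is that the action predictor never sees the final goal directly; it only sees the current state and the predicted sub-goal, which mirrors the "predict a sub-goal, then predict an action from there" ordering.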

Starting point is 00:35:39 At the same time, I think there's a lot of challenges that come up with world modeling with regard to, basically, the data that you put into it not necessarily being kind of reflective of the ways in which you're going to use it. You might train it on demonstration data of successfully completing the task and then try to actually use it to evaluate actions that are not optimally completing the task. And then the world model will hallucinate a video of completing the task successfully, even if the actions that you provided as input didn't,

Starting point is 00:36:08 weren't actually going to successfully lead to a good outcome. So there's challenges there to overcome. And so it's not like, yeah, there's various challenges, but there's also ways to integrate it into the VLA paradigm. And then, can you remind me your second question? What are the infrastructure layers? Like, which ones are the best to work on in the short term to bring like the most improvements, let's say.

Starting point is 00:36:34 To actually run these models on robots, we have like a real-time system that needs to actually be hitting a certain frequency to actually like execute actions successfully. And if you have lag in that system and so forth, it introduces all sorts of challenges. And so thinking about fast inference and infrastructure that's actually going to be on the robot is a big part of what our software team does. And then also thinking about like large-scale machine learning infrastructure,

Starting point is 00:36:59 training large models, ingesting large amounts of data. The data that we have is different from a lot of typical data sets because it's very multimodal in nature. It's kind of videos, actions, language segments, and various other components as well. So yeah, some interesting infrastructure problems, I think, both on the robot side and on the model training side. Thank you so much.
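The real-time constraint mentioned in this answer (the system has to hit a certain frequency, and lag causes problems) can be sketched as a fixed-rate control loop that records which steps overran their period. This is a minimal illustration, not the speaker's software stack: `run_control_loop`, the 50 Hz rate, the simulated latencies, and the `FakeClock` helper are all assumptions chosen so the example runs deterministically.

```python
import time
from typing import Callable, List


def run_control_loop(
    policy: Callable[[int], float],
    hz: float,
    steps: int,
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> List[int]:
    """Call `policy` once per control period and return the step indices
    whose inference overran the period (missed deadlines)."""
    period = 1.0 / hz
    missed: List[int] = []
    next_deadline = clock() + period
    for step in range(steps):
        policy(step)                      # inference plus sending the action
        now = clock()
        if now > next_deadline:
            missed.append(step)           # lag: the action arrived late
            next_deadline = now + period  # resynchronise rather than pile up lateness
        else:
            sleep(next_deadline - now)    # wait out the rest of the period
            next_deadline += period
    return missed


class FakeClock:
    """Deterministic clock so the demo does not depend on wall time."""
    def __init__(self) -> None:
        self.t = 0.0

    def now(self) -> float:
        return self.t

    def sleep(self, dt: float) -> None:
        self.t += dt


fake = FakeClock()
latencies = [0.005, 0.030, 0.005]  # simulated inference time per step, in seconds


def slow_policy(step: int) -> float:
    fake.t += latencies[step]      # pretend inference took this long
    return 0.0


missed = run_control_loop(slow_policy, hz=50.0, steps=3, clock=fake.now, sleep=fake.sleep)
print(missed)
```

With a 20 ms period, only the second step (30 ms of simulated inference) misses its deadline. Injecting the clock and sleep functions is a design choice for testability; on a robot the defaults `time.monotonic` and `time.sleep` would be used, or more likely a dedicated real-time scheduler.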

Starting point is 00:37:24 Hi, I'm Frederick, and I have got a question about model sizes in general. So I think what we're seeing right now is that in general, larger model sizes lead to better accuracy, for example, also in your experiments. Or it's also what OpenAI and Anthropic and others are doing right now with their LLMs.

Starting point is 00:37:42 However, there's also the approach of using a quite small model and then outsourcing the world knowledge into a database of some sort with which the model can interact. What is your take on that? Do you think that's like a valid approach, or do you think encapsulating all the world knowledge inside the model is better or works better?

Starting point is 00:38:01 Yeah, it's an interesting question. So my experience, working on like retrieval-based systems, is that it actually is a little bit tricky to, well, first figure out what should be offloaded versus actually done by the model. And second, sometimes the model will ignore the retrieved content and try to generate something itself. And it actually seems to be quite tricky

Starting point is 00:38:22 to get that technically to work exactly the way you want it. I think it's probably going to depend on the application and the use case in terms of, like, whether that might make sense. But in my experience, it ends up being quite tricky to figure out what the division of labor is. And even the model part of it will need to have some degree of intelligence in order to, like, actually make use of the retrieved information and so forth. So I think it's a really fascinating research problem, but it also needs, like, a lot of research

Starting point is 00:38:52 to make that work successfully. Thank you. Yeah. Hi, Chelsea. My name is Charu Thomas. First off, really appreciate the talk. It was really fascinating and have been a big fan of your work since meta learning. When you think about how software and hardware are going to continue to evolve, what are the biggest opportunities for builders today for your vision of physical intelligence? I mean, I think that, yeah, there's lots of different, like, opportunities to make things work a lot better and a lot of, like, open questions. I think kind of like I was mentioning before, thinking about better ways of having infrastructure on, like, kind of the robot side. I think that there isn't a lot of, like, there's some open source code for that sort of thing, but there's a lot of opportunities to make robot infrastructure better. And not a lot of people, I think, are working on that aspect of the problem.

Starting point is 00:39:48 There are also lots of opportunities. I guess one of the things I love about AI and computer science as a whole is there's a really big open source community and I think that there's a ton of opportunity to actually do open source work and contribute to a broader community that's trying to like collect

Starting point is 00:40:04 data, open source models, fix bugs on those models, fine tune those models, figure out new recipes for fine tuning those models. So yeah, all sorts of questions also like on the research side, especially in the open source realm. Yeah. Thank you. Hi, Chelsea. I also, just like everyone else, am a big fan of all your work, so thank you for putting that all out.

Starting point is 00:40:24 I've been reading through a lot of your group's work recently and particularly enjoyed reading Saraj's PhD thesis. It taught me a lot about scaling real-world robotics with data. And a question I have is, how do you think synthetic data will sort of scale for robotics in the future? As we've seen with LMs, we've moved away from, well, not moved away from pre-training, but moved away from human-collected data into more creating synthetic data and a lot of filtering and a lot of self-grading. So how do you think using generative synthetic data for creating environments or reward models will impact robotics?

Starting point is 00:41:02 Yeah, I have many thoughts on this topic. I think that at the end of the day, there's going to be no replacement for real data. And so large amounts of real robot data is going to be a necessary component of any system that's going to work in a generalizable way. So we're going to need that. At the same time, I do think that there's a role for tools like simulation and synthetic data to play, especially on the evaluation side. As you, for example, are generalizing to many environments, it's very tricky to actually evaluate how well that model generalizes not just in one new environment, but in 10

Starting point is 00:41:34 new environments. Because then you actually need to bring the robot to those 10 environments or construct 10 environments, whereas in simulation, that gets a lot easier. So I think I'm really excited about kind of simulation and synthetic data for that use case. I should also mention that I think that the analog of synthetic data in language models is actually not necessarily simulation in robotics, but closer to something like reinforcement learning. I think that a lot of synthetic data is generated by the model that's actually trying to do the task and then trying to kind of reason through different ways of doing the task. And I think that the analogy there is a robot that's trying to attempt the task and learn from its own attempts and get better from its own attempts.

Starting point is 00:42:10 And that sort of online data from the model, I think, will also play a really critical role in post-training and is something that we're working on quite a bit. And so, yeah, that I think is like really important and really helpful. Thank you. Cool. I think we have time for one more question. Sorry, we won't be able to get to everyone. Hi, it's super cool to see you as an MIT EECS alum now working on really cool robotics stuff

Starting point is 00:42:32 and talking to us about robotics and entrepreneurship. But I've been wondering how robotics research that involves hardware components plays out differently in academia versus industry. And are there typically more resources, fewer constraints, or broader applications in one setting over the other? And what kind of people or goals do you think might be better suited for each path? Yeah, it's an interesting question.

Starting point is 00:42:54 I still love both kind of startup and academic environments and industry environments. I think they all have various pros and cons. Certainly, I think that generally academic environments aren't quite as well resourced in terms of data collection throughput, eval throughput, and compute as like startups and industry labs.

Starting point is 00:43:14 But at the same time, I think that there's a lot of problems that you can solve without large amounts of resources that we need to figure out on the algorithm side. So I think that there's a lot of really interesting work to be done there. And then in industry and in startups, I think, actually trying to do some of the research on these big models, scaling up data, seeing what things happen at large scales

Starting point is 00:43:37 is really great to do there. Yeah, I think that there's a place for both. I also think that the gap isn't as large as often people make it seem. And oftentimes people in industry environments kind of wish they had more compute. Like you kind of always wish that you had more resources. And sometimes when you have a lot of resources, you don't actually think as carefully and as critically about what runs you're going to be doing and so forth.

Starting point is 00:44:00 And you end up being sometimes more wasteful of compute than if you were kind of more compute constrained. So there's also actually downsides to having more resources in my experience. I'm really sorry. Can I just ask you one quick question on architecture? I know that the scaling laws have worked well for transformer-based architectures. And I was thinking, do you currently see limits in VLM-based architectures, which are kind of made for like text tokens, because they don't have like modules for physical awareness?

Starting point is 00:44:33 Yeah, and how do you deal with that? Yeah, so we tokenize the actions. And so I'd encourage you to take a look at the FAST tokenizer paper that we put out as kind of a way to accomplish that. And yeah, we should wrap up there. Thanks, everyone. And yeah, I hope you enjoy the event.
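To make "we tokenize the actions" concrete, here is a minimal sketch of the simplest possible scheme: uniform binning of each continuous action dimension into a discrete token. To be clear, this is not the tokenizer from the paper the speaker points to, which is more involved; the function names, the [-1, 1] action range, and the 256-bin vocabulary are all assumptions chosen for illustration.

```python
from typing import List


def tokenize(
    action: List[float], low: float = -1.0, high: float = 1.0, bins: int = 256
) -> List[int]:
    """Map each continuous action dimension to an integer token in [0, bins - 1]."""
    tokens = []
    for value in action:
        clipped = min(max(value, low), high)
        # Scale [low, high] onto bin indices; the top edge falls into the last bin.
        index = int((clipped - low) / (high - low) * bins)
        tokens.append(min(index, bins - 1))
    return tokens


def detokenize(
    tokens: List[int], low: float = -1.0, high: float = 1.0, bins: int = 256
) -> List[float]:
    """Map tokens back to the centre of their bin (lossy inverse of `tokenize`)."""
    width = (high - low) / bins
    return [low + (t + 0.5) * width for t in tokens]


action = [0.0, -1.0, 1.0, 0.5]  # hypothetical normalised joint commands
tokens = tokenize(action)
recovered = detokenize(tokens)
print(tokens)
print(recovered)
```

Once actions are integers in a fixed vocabulary, a model built for text tokens can predict them the same way it predicts words. The cost is quantisation error, which here is at most half a bin width per dimension.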
