Y Combinator Startup Podcast - Chelsea Finn: Building Robots That Can Do Anything
Episode Date: July 22, 2025
Chelsea Finn on June 17th, 2025 at AI Startup School in San Francisco.
From MIT through her PhD at Berkeley, where she pioneered meta-learning methods, and Google Brain, Chelsea Finn has built her career around teaching machines how to learn. Now an Assistant Professor at Stanford and co-founder of Physical Intelligence, she’s using that foundation to bring learning-driven robotics into messy, real-world environments rather than confined lab setups.
In this talk, Chelsea traces the evolution of her team’s work, from early experiments on robotic grasping and vision to today’s ambitious efforts at folding laundry, tidying kitchens, and generalizing across tasks, all without hand-crafted code. Instead, they used scalable foundation models and massive datasets, teaching robots physical common sense as they learn by doing. She shares stories of the rocky setbacks, the surprises hidden in data, and the moment it all clicked: robots equipped with generalizable physical intelligence can indeed adapt and assist in the unpredictable world around us.
Transcript
Hi everyone. I'm really excited to talk about developing general purpose robots and how we might
actually like truly develop and bring intelligence into the physical world. So to start off, I'd like to talk about this problem
which is that if you want to truly solve a robotics application, you essentially need to build an entire company around that application.
You need to build a different company for logistics, for wet lab automation, for robots in kitchens, for
surgical robots and so on. And this is really, really hard to do because that company needs to
make new hardware, develop custom software, design unique movement primitives for that application,
handle edge cases, and so on. You have to do all of that from scratch if you want to solve
a robotics problem. And as a result, a lot of robotics companies haven't been very successful
in actually bringing robots into the physical world successfully in our daily lives. I co-founded
a company called Physical Intelligence that's trying to solve this problem.
And in particular, we're trying to develop a general purpose model that can enable any robot to do any task in any environment.
And we think that this sort of generalist model may work better and be easier to use than purpose-built models,
just like we've seen in the development of foundation models for language and other applications.
For example, if you want to build a coding assistant, you don't nowadays develop something specifically for coding, but you develop and you build on models that were trained on large amounts of data, not just on code.
And essentially, this is the problem of trying to develop these sorts of foundation models and bring this sort of intelligence into the physical world rather than the digital world where they largely are today.
So how do we do this?
In this talk, I'd like to talk about how we go about doing this.
And if we were to take a lesson from language models,
we know that language models have taught us the importance of scale.
And so one possible conclusion would be that perhaps scale
is the most important ingredient for developing these models.
And if you were to say this conclusion is true,
then you might look to certain data sources for large-scale data.
So for example, we might look at data from industrial automation.
And you get tons and tons of data of robots doing tasks over and over again like this,
but the sort of data isn't going to allow robots to go into disaster zones or to make a sandwich or to bag groceries.
And so this massive scale doesn't have the diversity of behaviors that we need in order to solve this general problem.
Alternatively, maybe we look at data from YouTube, which is also a massive data source with many videos of humans doing tasks
that can be useful for training robots.
But at the same time, we don't learn how to write by watching other people write,
and we don't become expert tennis players by watching Wimbledon.
And so even though there's a massive scale of data here, it's very challenging to use,
and there's also a gap between the embodiment of robots and humans.
And lastly, we might look at data from simulation.
You can also get a massive scale of data here,
but this data lacks realism and also has a gap from reality.
And so I think the lesson here is that scale is necessary
for developing these models that can generalize in open world conditions,
but it's subordinate to actually solving the problem.
So you need scale, but it's not sufficient for the entire problem.
And so at Physical Intelligence,
this is an example of a data episode that we've collected.
This is in honor of our first anniversary, which was a few months ago.
Here you can see a teleoperator, a person
who's operating some leader arms to control the robot
to light a match and light a candle with the match.
And with this sort of data, we can train robots
to do a variety of different tasks.
And so what I'd like to talk about is some of our recent results
at trying to develop sort of physical intelligence
with large-scale real robot data.
I should mention this is large-scale by today's robot standards,
and arguably a minuscule amount of data compared to the sorts of robot data
that we should have in the years to come.
And so in particular, we'll be looking at whether robots
can do a variety of dexterous long horizon tasks,
whether robots can succeed in places they've never been,
whether robots can respond to open-ended prompts and interjections.
And even if you're not excited about robotics,
I think that the lessons that we've learned
from trying to address these problems
are applicable outside of the physical world.
So can we develop robots that can complete dexterous
long horizon tasks?
And in particular, in this first part,
I'd like to talk about how we trained a Pi-0 foundation model
to do this task,
which is to unload a dryer and fold laundry.
And to date, I think this is the most impressive thing
that I've seen a robot do in the physical world.
It's really hard.
This is an incredibly difficult problem.
You can see that it's not perfect.
Here it's making some misgrasps, making some mistakes.
But it's really, really hard because you have to deal with
the variability in the clothes and the way in which they might be positioned
and crumpled and be able to handle all those sorts of things.
And as you're doing this task, which takes about 10 minutes for the robot,
there's many opportunities to fail, to fail catastrophically.
For example, dropping things on the ground, which is hard to recover from.
And you have to be able to recover from even small mistakes.
I was personally actually working quite a bit on this laundry folding robot,
along with Michael and Saraj, of course, supported and with contributions
from the whole Physical Intelligence team.
So how do you even approach this sort of problem?
This is a really, really hard thing for a robot to do.
And what we did is we started simple.
We started with, can a robot fold a single-size, single-brand shirt?
And can a robot dynamically flatten one shirt, again, single-brand, single-sized?
And if you start simple, this makes the problem quite a bit easier.
We collected some data with teleoperation and trained a policy with imitation learning.
And our model had around 100 million parameters, mapping from images from the robot's cameras
to target joint positions on the robot arms.
And we do this sort of control at 50 hertz on the robot.
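To make the 50 hertz figure concrete, here is a minimal sketch of a fixed-rate control loop that calls a policy once per period. Everything named here (`read_cameras`, `dummy_policy`, `send_joint_targets`, and the 14-joint count) is a hypothetical stand-in for illustration, not the actual system described in the talk.

```python
import time

CONTROL_HZ = 50                 # control rate mentioned in the talk
PERIOD_S = 1.0 / CONTROL_HZ     # 0.02 s per control step
NUM_JOINTS = 14                 # assumption: two arms with 7 joints each

def read_cameras():
    """Hypothetical stand-in: return the latest camera images."""
    return {}

def dummy_policy(images):
    """Hypothetical stand-in for the learned policy (images -> joint targets)."""
    return [0.0] * NUM_JOINTS

def run_control_loop(policy, num_steps, send_joint_targets,
                     sleep=time.sleep, clock=time.monotonic):
    """Query the policy once per period and sleep off any leftover time."""
    next_deadline = clock()
    for _ in range(num_steps):
        targets = policy(read_cameras())
        send_joint_targets(targets)
        next_deadline += PERIOD_S
        remaining = next_deadline - clock()
        if remaining > 0:
            sleep(remaining)

# Record the commands in a list instead of sending them to hardware.
sent = []
run_control_loop(dummy_policy, num_steps=5, send_joint_targets=sent.append)
```

Scheduling against an absolute deadline, rather than sleeping a fixed 20 ms after each step, keeps the average rate steady even when the policy call takes a variable amount of time.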
And we founded the company in mid-March of 2024.
And a couple months later, after we had set everything up,
we were able to get a policy that could fairly reliably fold a single-size,
single-brand shirt.
You can see that I'm testing the policy right here.
And we also wanted to test some dynamic motions,
because you need to be able to match the control frequency accurately
in order to do these sorts of dynamic motions.
And so these were some of our very initial tests,
at addressing this sort of laundry folding problem.
Then from there, we wanted to make the problem incrementally harder.
And so we, instead of starting from the shirt flat on the table,
we started in a crumpled position like these.
And it turns out that this actually makes it a lot harder.
And so here are some videos of some of our initial attempts
at trying to train the robot to fold these shirts.
And the robot struggles.
The robot does some things that kind of look
somewhat sensible, but generally isn't able to make progress on the task.
With many tests, we frequently were getting 0% success rate in our tests of this system and
really struggling to make progress.
So really, this introduces the challenge of handling the sorts of variability in
the ways in which shirts might be crumpled on the table.
We had some initial signs of life in late June of last year.
And so in this case, the robot was able to kind of make progress on flattening the shirt,
and also then able to fold the shirt decently well from that initial state.
Still not perfect.
And as you can see, it takes quite a while to do this.
So this is a video that was sped up 8X.
So not something that you might have the patience for a robot to do.
So with some initial signs of life, though also a very low success rate, we started to transition
to a slightly harder version of the task where the laundry starts in a laundry basket.
We also introduced variable-sized shirts and shorts into the mix.
And again, the robot really struggled.
So in many of our tests, we're getting 0% success rate across the board.
We're really struggling to actually get the robots
to learn how to do these tasks.
At this point, we were trying to consider a lot of different things.
We thought that maybe the robot needs memory, needs history in some way.
Maybe we need to just train our models for longer.
Maybe we should be doing control in end-effector space rather than in joint space of the robot.
Maybe our encoders, we knew that there were calibration issues, and maybe we need that
calibration to be more consistent. Maybe we need to condition the model on more information about the data.
Maybe we need hierarchy because this is a pretty long horizon task and it needs to break it down
into different subtasks. Maybe we need higher resolution images. Maybe we need to introduce
kind of interventions in data collection. A lot of these things we also tried. We had around
two to three months of failure where nothing was really working at addressing this task. But then at
some point we actually had a bit of a breakthrough, which was that we found one thing that really
seemed to make a difference in the robot's ability to do the task.
And this was actually to take some inspiration from the world of language modeling to actually
instead of just training a policy on all of our data, we pre-train on all the data and then
fine tune on a curated, consistent, high-quality set of demonstration data.
When we did this, we found that the robot was actually able to make progress and a lot more reliably
fold articles of clothing.
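The two-stage recipe (pre-train on everything, then fine-tune on a curated, consistent subset) can be sketched with a deliberately tiny toy. This is not the real training setup: the one-parameter model `y = w * x`, the learning rate, and the idea that the uncurated data mixes two conflicting strategies are all assumptions made purely for illustration.

```python
def mse(w, data):
    """Mean squared error of the toy model y = w * x on (x, y) pairs."""
    return sum((w * x - y) ** 2 for x, y in data) / len(data)

def train(w, data, steps, lr=0.05):
    """Full-batch gradient descent on the mean squared error."""
    for _ in range(steps):
        grad = sum(2 * (w * x - y) * x for x, y in data) / len(data)
        w -= lr * grad
    return w

xs = [0.5, 1.0, 1.5, 2.0]
# Curated demonstrations: one consistent strategy (slope 1).
curated = [(x, 1.0 * x) for x in xs]
# All data: the curated demos plus a conflicting strategy (slope 3).
all_data = curated + [(x, 3.0 * x) for x in xs]

w_pre = train(0.0, all_data, steps=200)     # stage 1: pre-train on all data
w_post = train(w_pre, curated, steps=200)   # stage 2: fine-tune on curated data
```

In this toy, pre-training alone settles on an average of the two strategies, and the fine-tuning stage moves the model onto the single consistent one.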
And so I think that this video was the first video where
the robot was able to fold five items in a row and stack them.
I went home very excited this day.
This was in September of 2024, so multiple months after our initial tests.
Now, this is far from perfect.
It takes 20 minutes to fold five items of clothes.
And at the same time, though, it kind of suggested that this sort of recipe was able
to unlock the capability in the robot to actually fold these articles of clothing.
So you can see these sorts of failures here.
In this case, it attempted to fold the blue shirt
around seven times before eventually actually figuring out
how to do that.
There's also other failure modes as well.
So here's an example where the robot pushes the stack
to the corner of the table and decides to kind of fiddle with it a bit,
and then eventually slides it off the table.
And then it proceeds as if nothing happened,
and it's going to continue to fold.
We continue to iterate on this recipe.
We selected and worked on our curation strategy
for curating a higher quality set of
demonstration data. We got it from 20 minutes down to 12 minutes for these five items.
This is kind of how we were evaluating how good our robot system was. It still makes mistakes.
The fold quality still varies, but it's still significantly better than with our previous
curation recipe. Now, at this point, we were still training models largely by
pre-training and fine-tuning only on laundry data, and we weren't leveraging kind of pre-trained
models in the community. And there were some folks working at Physical Intelligence that were working
on developing a pre-trained model, trained on all of the robot data, and we then started
to try to introduce these models into our recipe.
And so we took an open source vision language model, a 3 billion parameter model called PaliGemma.
The previous videos were all with models of around 100 to 300 million parameters
that we were iterating on.
This model takes as input images from the robot, also a language command, and then has
a head, a diffusion head that's going to attend to all the internal values of the vision language
model and with the joint angles predict a chunk of 50 actions into the future, so about
one second of action steps.
And we're using a flow matching variant of diffusion to actually output these actions
and output continuous actions.
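As a rough illustration of how flow matching can produce a continuous action chunk, the sketch below uses the common straight-line formulation: training pairs come from interpolating between noise and the demonstrated actions, and sampling integrates a velocity field from noise toward actions. The exact formulation used in the model is not given in the talk, so treat this as a generic version. `ACTION_DIM` and the oracle velocity field that stands in for a trained network are assumptions.

```python
import random

CHUNK_LEN = 50      # actions per chunk, as stated in the talk
CONTROL_HZ = 50     # control rate stated earlier in the talk
ACTION_DIM = 14     # assumption: number of joint targets per action

def make_training_pair(action_chunk, noise, t):
    """Return (point on the straight noise-to-action path at time t, target velocity)."""
    x_t = [[(1 - t) * n + t * a for n, a in zip(nr, ar)]
           for nr, ar in zip(noise, action_chunk)]
    v_target = [[a - n for n, a in zip(nr, ar)]
                for nr, ar in zip(noise, action_chunk)]
    return x_t, v_target

def euler_sample(velocity_fn, noise, num_steps=10):
    """Integrate dx/dt = velocity_fn(x, t) from t=0 (noise) to t=1 (actions)."""
    x = [row[:] for row in noise]
    dt = 1.0 / num_steps
    for i in range(num_steps):
        v = velocity_fn(x, i * dt)
        x = [[xi + dt * vi for xi, vi in zip(xr, vr)] for xr, vr in zip(x, v)]
    return x

rng = random.Random(0)
action_chunk = [[rng.uniform(-1, 1) for _ in range(ACTION_DIM)] for _ in range(CHUNK_LEN)]
noise = [[rng.gauss(0, 1) for _ in range(ACTION_DIM)] for _ in range(CHUNK_LEN)]

x_half, v_target = make_training_pair(action_chunk, noise, t=0.5)

# Oracle velocity field standing in for a trained network's prediction.
oracle = lambda x, t: v_target
sampled = euler_sample(oracle, noise)

chunk_seconds = CHUNK_LEN / CONTROL_HZ   # 50 actions at 50 Hz is one second
```

With a perfect velocity prediction, a few Euler steps carry the noise sample onto the demonstrated chunk. A real network would only approximate that velocity, conditioned on images and the language command.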
So we took this pre-trained model and instead of pre-training only on laundry, we pre-trained
on all of the robot data that we had collected.
And then we just fine-tuned it with the same exact post-training recipe that we had developed
without using the vision language models.
When we did this, we actually saw the robot continue to actually get better when we just plugged
in that new pre-trained model.
And so in the left video, it's able to do five items in nine minutes, which was faster than
the 12 minutes we had before.
In the right videos, we were testing with some novel clothing items and found that it was also
quite efficient at folding multiple items in a row.
And as a result, we also saw more consistent fold quality by using this model that was about 10 times larger and had seen more robot data as input.
To look at a few highlights of this, here's a pair of shorts that the robot hasn't seen before.
And this is kind of a tricky scenario where, to flatten it,
it actually kind of needs to reach under the bottom of the shorts.
And it's able to do that.
It's able to kind of figure out that it should reach under the left part of the shorts in order to eventually flatten it.
And then once it actually successfully flattens it,
it's able to fold it successfully.
It also has to do something similar at times to fold shirts.
So in this case, it needs to actually kind of fold the shirt over on itself,
which actually puts it in a more crumpled state, arguably,
but allows it to find the corners of the shirt
and then go ahead and fold it.
And then like I mentioned, it also is able to handle unseen clothing items.
So here's an example of a shirt with a V-neck that it is able to fold,
even though this shirt was completely held out,
and the post-training data set didn't have any V-necks
in it.
It's also able to fold shirts with buttons,
so it has some degree of generalization
to different clothing items.
And then lastly, because this policy is a neural network,
and it's kind of taking as input the current image,
it's able to handle interruptions.
So here, Michael is continuing to mess with the robot,
and the robot figures out that it should put the shirt away,
while it's trying to fold the other shirt.
In this case, Michael's going to continue messing with the robot.
So Michael unfolds one side.
And the robot reacts.
Michael goes in again.
And the robot makes some mistakes here,
but it's able to recover.
Michael messes it up again.
So those are some results of what the robot's
able to do.
Now, I talked about this pre-training and post-training
recipe being really important.
We can actually quantitatively measure that
and actually
make sure that this is actually what's leading to improvement.
So we compared this pre-training and post-training recipe
to not using any pre-training and only training
on the curated data set versus no post-training
where you're training on all of the data
rather than fine-tuning on the curated data set.
And we evaluated these models in terms
of their progress on the task, where you make partial progress
for getting it out of the bin, which is the easiest part,
and then further progress for flattening,
folding, and stacking the items.
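A progress-based metric of this kind can be sketched as a score that gives partial credit for each stage completed in order. The stage names follow the talk, but the equal weighting and the rule that only an unbroken prefix of stages counts are assumptions, not the actual evaluation rubric.

```python
# Ordered stages from the talk; equal weights are an assumption.
STAGES = ["out_of_bin", "flattened", "folded", "stacked"]

def progress_score(stages_completed):
    """Fraction of stages completed, counting only an unbroken prefix of STAGES."""
    done = 0
    for stage in STAGES:
        if stage not in stages_completed:
            break
        done += 1
    return done / len(STAGES)

def mean_progress(episodes):
    """Average progress score over a list of evaluation episodes."""
    return sum(progress_score(e) for e in episodes) / len(episodes)

# Illustrative inputs only; these are not results from the talk.
example_episodes = [
    {"out_of_bin"},
    {"out_of_bin", "flattened", "folded", "stacked"},
]
example_mean = mean_progress(example_episodes)
```

Scoring partial progress this way separates a policy that only gets items out of the bin from one that also flattens and folds them, which a binary success rate would hide.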
And we see that the pre-training and post-training recipe
is able to get far higher performance
than omitting pre-training and omitting post-training.
And notably, when omitting pre-training or omitting post-training,
the robot is basically able to get it out of the bin
and make very little progress after that.
Whereas when we combine pre-training and curated post-training,
we get far higher performance where
it's able to reliably flatten and fold objects.
And then the last thing that I'll mention on this note
is that nothing in this recipe is specific to laundry.
And so we took the same recipe and fine-tuned
on other tasks.
So here, the task is to kind of clean up a table.
And the robot was also able to successfully do this task,
despite the fact that we primarily were iterating a lot on laundry.
We were able to also apply this recipe to this task.
It also is able to scoop coffee beans into a coffee grinder.
This task is pretty hard.
It has to construct the bottom part of a cardboard box,
which requires quite a bit of dexterity.
And then lastly, autonomously lighting a candle with a match,
again with this kind of same pre-training and post-training recipe.
And so this is pointing at this kind of the benefit of foundation models that I alluded to before,
which is that to do these different tasks, you don't have to start completely from scratch.
You can actually leverage pre-training across multiple robots and across multiple tasks.
And then we're also able to apply that same recipe to robots at other companies.
This is a robot that I've actually never seen in person before.
They collected data.
They sent the data to us.
We fine-tuned our model on their data.
We actually didn't even know exactly how the robot is being controlled,
exactly the representation of their actions.
But by fine-tuning the model on this new robot,
the model is able to control the robot in order to make a cup of coffee in this case.
So some takeaways for this part.
We were able to independently develop post-training and pre-training
and decouple the problem and then eventually
get the best of both.
We found that training on all the data doesn't work
for complex tasks.
And this sort of pre-training and post-training
on curated data leads to far better performance.
And then we broke up this really hard problem
of folding laundry by gradually starting
with folding single shirts and going
to more and more complex versions of the task.
Now, there's a number of limitations here.
And one limitation I'd like to point out
is that these robots inevitably, in this case,
were trained in the environments that they were tested.
And so this means that in principle,
you can use these methods to collect a lot of data in one environment
and then deploy them in one environment.
But ultimately, there's going to be things that
change about an environment and scenarios
where we would want to actually apply these robots
to environments that they've never been in before.
And so how can robots actually succeed in places
that they've never been?
The lesson we've learned from machine learning
in other places is that we should collect diverse data.
And so we started by collecting data
of tidying bedrooms and kitchens in many different environments.
And here's an example, kind of a sample of that data.
And we collected robot data in homes across San Francisco here,
and also collected data in diverse mock kitchens and mock bedrooms.
And in total, we had more than 100 unique rooms represented in the data set.
That ended up being part of a bigger pre-training mixture.
So we trained on this diverse mobile manipulation data,
including the low-level action prediction,
as well as predicting high-level subtask commands for how to complete the task.
But we also trained on previously collected static manipulation data that was also fairly diverse,
static manipulation data that we had collected in our office and in labs,
as well as web data and high-level instructional data.
And I should point out here that the mobile manipulation data of tidying bedrooms and kitchens
only accounted for 2.4% of the overall pre-training mix.
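To give a feel for what a 2.4% share of a training mixture means, here is a small sketch of weighted sampling over data sources. Only the 2.4% figure comes from the talk. Lumping every other source into one bucket, the source names, and the use of per-example sampling are assumptions for illustration.

```python
import random

MOBILE_FRACTION = 0.024   # share of the pre-training mix stated in the talk

# Assumption: all remaining sources (static manipulation, web data,
# high-level instructional data) are lumped into one bucket here.
MIXTURE = {
    "mobile_manipulation_homes": MOBILE_FRACTION,
    "all_other_sources": 1.0 - MOBILE_FRACTION,
}

def sample_sources(num_examples, seed=0):
    """Draw a data source for each training example according to MIXTURE."""
    rng = random.Random(seed)
    names = list(MIXTURE)
    weights = [MIXTURE[name] for name in names]
    return rng.choices(names, weights=weights, k=num_examples)

batch_sources = sample_sources(100_000)
mobile_share = batch_sources.count("mobile_manipulation_homes") / len(batch_sources)
```

At this ratio, roughly one training example in forty comes from the new mobile manipulator, and the rest comes from data collected earlier.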
And so the lesson here is that you're basically able to spin up
a new task and actually an entirely new robot,
the rest of the mixture didn't have any mobile manipulation data
with this particular mobile manipulator in it
without redoing all of the data collection.
We're able to build upon everything that had been done before.
And this kind of this kind of same story of foundation models
being able to make it easier to spin up a new problem,
a new application without starting from scratch.
Now, this wasn't completely easy.
We had a couple challenges.
One of the challenges that we ran into is that naively,
these models can ignore language instructions.
So we actually, in this case, asked it to pick up the cutting board,
and it chose to pick up the plate instead.
Now we're again asking it to pick up the cutting board.
And instead, the robot had a mind of its own,
decided to pick up the plate,
and then we tell it to put the plate in the sink.
And eventually it decides that, well, after kind of moving away from the cutting board,
it eventually decided that it would actually pick up the cutting board.
And so in the early development of our model,
we found that it often ignored language.
And to solve this, we thought about how vision language models actually follow language well,
and so maybe there's a way to preserve the inherent abilities of the pre-trained models when addressing this task.
And so what we did is with this Pi-0 architecture, this action head that's using diffusion is randomly initialized,
and this ends up actually deteriorating the pre-trained knowledge that's present in the vision language model.
And we found that if we can prevent this deterioration, we might
be able to get better language following.
And so the recipe that we came up with was actually, in some ways, fairly similar, but instead
we're going to be predicting tokenized actions.
And then when we have the diffusion head, we'll be stopping the gradient from the randomly
initialized diffusion head to prevent it from deteriorating the language following abilities
of the VLM backbone.
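The effect of stopping the gradient can be shown with a hand-differentiated toy: a single backbone weight feeds two heads, and only the tokenized-action head is allowed to update the backbone. The scalar model, the squared-error losses, and all names below are assumptions for illustration. In a real framework the same effect comes from an operation such as `jax.lax.stop_gradient` or `Tensor.detach()` in PyTorch.

```python
def forward_and_grads(w, a, b, x, y_tok, y_act, stop_gradient=True):
    """Toy model: backbone h = w * x, token head a * h, action head b * h.

    Returns gradients of the summed squared-error losses with respect to
    (w, a, b). With stop_gradient=True, the action-head loss treats h as a
    constant, so it cannot change the backbone weight w.
    """
    h = w * x                       # backbone feature
    tok_err = a * h - y_tok         # tokenized-action head error
    act_err = b * h - y_act         # diffusion-style action head error

    # Both heads always receive their own gradients.
    grad_a = 2 * tok_err * h
    grad_b = 2 * act_err * h

    # The backbone always receives the token-head gradient.
    grad_w = 2 * tok_err * a * x
    if not stop_gradient:
        # Without the stop, the freshly initialized head also pulls on w.
        grad_w += 2 * act_err * b * x
    return grad_w, grad_a, grad_b

# Token head already fits its target; the action head does not.
args = dict(w=1.0, a=1.0, b=0.5, x=2.0, y_tok=2.0, y_act=5.0)
with_stop = forward_and_grads(**args, stop_gradient=True)
without_stop = forward_and_grads(**args, stop_gradient=False)
```

In this example the backbone gradient is zero with the stop in place, while the action head still learns. Without the stop, the untrained head's error flows back and perturbs a backbone that was already doing its job.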
And we found that this first led to faster training because the tokenized actions are a more direct
supervision signal.
And second, it also followed language far better, an 80% follow rate rather than a 20% follow rate,
which suggests that we're able to preserve the kind of pre-training in the vision language model backbone.
So we put those pieces together.
We took that recipe and trained it, pre-trained it on all of our data, including the mobile manipulation data.
We fine-tuned it on mobile manipulation data in a variety of environments.
And then we tested the model in places it's never been in before.
So we rented three Airbnbs that we had never been to before.
We put the robot in those homes, in this case in the kitchen,
and I asked it to close the cabinet.
I asked it to put away the dishes.
It's also never seen these dishes or these forks, these objects.
And the robot's able to succeed, even though it's never been here before.
There's different countertops, different furniture, different objects, and so forth.
Lastly, I asked it to clean up the spill, and the robot is able to
wipe down the spill and eventually put the sponge into the sink.
It's also able to do this for bedrooms.
So Laura asked it, in this case, to just clean the bedroom.
And it puts articles of clothing in.
It throws away the trash and then is able to tidy the bed by putting the pillow at the top of the bed
and tidying the blanket or the comforter of the bed.
So quantitatively, I talked about how there's only 2.7% or something of the mixture.
So how much does that other data actually help?
Could we actually just train on that kind of 2.7%?
And we find that these kind of bars on the right,
which are excluding data from static robots in labs and environments and so forth,
reduces performance significantly.
So the performance goes down to less than 60%
when you exclude that data when evaluated in novel homes
compared to if you use the full pre-training mixture,
it has more than 20% higher performance.
Lastly, we also looked at, is the diversity of data helpful?
Is it important?
And so we increase the amount of data
from these environments to test this.
It's always good to like, you can kind of do vibe evals,
but it's really helpful to actually measure
how well these things work.
And so this is what this is measuring.
And we find that if we actually increase the amount of homes,
the amount of locations that are represented in the data,
the performance increases, which is great.
And it actually gets the same level of performance
as if we train on data from that target environment.
And so it means we're actually mostly closing
the generalization gap, which suggests that the bottlenecks at this point
for this sort of task lie not in collecting more diverse data,
but in actually getting higher reliability
and higher performance.
Now, I should also mention that there's failure modes,
like this success rate was around 80%.
There's lots of room for improvement.
Here are a couple examples of those failure modes.
So here it's told to put the items in the drawer.
It is able to put it in the drawer,
but the item isn't fully in the drawer at the end,
and it decides that it's done and kind of moves on to the next thing.
Here, the robot needs to put the clothes in the laundry basket.
It drives over the shirt, and then it gets stuck,
and is not able to lift it up.
Here we asked it to put the dishes in the sink,
and it successfully is able to put a number of the dishes in the sink,
but it struggles to pick up the cutting board in this particular case,
because it's very thin, and it's flush against the surface of the countertop.
And in the last case, probably my favorite case,
it's told to put the spatula into a drawer,
and it decides that the oven looks a lot like a drawer.
And so it opens the oven and tries to put it in there.
And beyond this, there's
also challenges with regard to speed, partial observability, long-term planning.
And so lots of work to do still.
So the takeaway here is that with diverse data, robots can follow a variety of instructions
in environments that the robot has never been in before, which is a big step up from a lot
of robotic scenarios where they're trained in the scenarios that they are being tested.
Now the last kind of bit I'd like to talk about is this model has a fairly limited instruction
set. It can only follow kind of a certain set of commands. And if we think about how other forms of
AI technology have been deployed, people really like to customize and actually tell the robot what they want,
or tell the system what they want from these kinds of models. And so just like we prompt language
models, can we allow robots to respond to open-ended prompts and open-ended interjections?
So to do this, and actually in the past work as well, we're leveraging hierarchical
vision language action models. So we're going to have a high-level policy
break down the prompt into intermediate verbal responses
and intermediate atomic language commands.
So the high level prompt might be,
can you make me a sandwich?
And this high level policy will break it down
into the subtask of pick up one slice of bread.
This will be passed to a low level model
that actually executes and predicts target joint angles
to fulfill the low level command of picking up one slice of bread.
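The division of labor between the two levels can be sketched as a simple loop: the high-level policy emits the next atomic language command, and the low-level policy turns that command into joint targets. Both policies below are hard-coded stand-ins, and the plan contents beyond the first command, the 14-joint count, and all function names are hypothetical. In the system described, both levels are learned models that also see camera images.

```python
NUM_JOINTS = 14   # assumption: number of target joint angles

# Hypothetical fixed plan; a learned high-level policy would generate these.
SANDWICH_PLAN = [
    "pick up one slice of bread",
    "put the bread on the cutting board",
]

def high_level_policy(prompt, steps_done):
    """Stand-in: return the next atomic command for the prompt, or None when done."""
    plan = SANDWICH_PLAN if "sandwich" in prompt.lower() else []
    return plan[steps_done] if steps_done < len(plan) else None

def low_level_policy(command, images):
    """Stand-in: map (command, images) to target joint angles."""
    return [0.0] * NUM_JOINTS

def run(prompt, max_steps=10):
    """Alternate between choosing a subtask and executing it."""
    executed = []
    for step in range(max_steps):
        command = high_level_policy(prompt, step)
        if command is None:
            break
        low_level_policy(command, images=None)
        executed.append(command)
    return executed

commands = run("Can you make me a sandwich?")
```

Keeping the interface between the levels as plain language means the low-level model only ever needs to understand short atomic commands, whatever form the user's prompt takes.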
Now, on its own, this low-level model
isn't going to be able to follow all sorts of prompts.
And it's actually fairly tricky to handle open-ended language because it's going to be challenging
to collect a large number of human robot interactions with the real robot in the loop.
And this is also going to be fairly hard to scale.
And so what we did is we kind of took all of our existing robot data and we can actually
generate synthetic data for the existing robot data.
And particularly, we can use language models to relabel and generate hypothetical
human prompts for the scenarios that the robots are in.
And so what this looks like is we'll take data that says,
here's a kind of a video, and then the next skill is to pick up a KitKat,
because that's what the robot does next in terms of just like basic low-level
annotation.
And then for this scenario where the robot is about to pick up the KitKat,
we can ask a vision language model, what is a hypothetical prompt that a human might have asked
that led to this particular scenario and to the robot actually choosing to pick up a KitKat.
And then we can train our high-level policy
on these synthetic prompts to basically augment the robot data
with various human interactions that might have led to those different situations.
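The relabeling step can be sketched as a pass over existing annotated episodes that attaches a generated user prompt to each one. The `fake_vlm_propose_prompt` function is a hypothetical stand-in for the vision language model call, and it ignores the video frames that a real model would look at. The episode fields are also assumed names.

```python
def fake_vlm_propose_prompt(frames, next_skill):
    """Hypothetical stand-in for asking a vision language model what a human
    might have said to lead to this skill. A real call would use the frames."""
    return f"Could you {next_skill} for me?"

def relabel(episodes, propose_prompt):
    """Attach a synthetic user prompt to each existing robot episode."""
    relabeled = []
    for episode in episodes:
        prompt = propose_prompt(episode["frames"], episode["next_skill"])
        relabeled.append({
            "frames": episode["frames"],
            "user_prompt": prompt,                 # synthetic, generated label
            "next_skill": episode["next_skill"],   # original low-level annotation
        })
    return relabeled

# Illustrative episode using the low-level annotation mentioned in the talk.
episodes = [{"frames": [], "next_skill": "pick up a KitKat"}]
training_examples = relabel(episodes, fake_vlm_propose_prompt)
```

Each resulting example pairs a plausible human request with the skill the robot actually performed next, which is the kind of supervision a high-level policy needs, and no new robot data has to be collected.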
And as a result of this, we're able to actually allow robots to follow a variety of different prompts.
So on the left, we ask, hi robot, can you make me a ham and cheese sandwich?
The robot says, sure, I'll start with the bread and add ham and cheese next.
And it's able to break down this task into the various subtasks of picking up a slice of bread,
putting on the cutting board, picking up a slice of cheese, putting it on the bread,
picking up some ham, and so on and so forth.
It can also follow more complicated prompts like,
Hi, Robot, can you make me a vegan sandwich?
I don't like pickles, though.
And in this case, it's able to break it down
and decide that it's going to add lettuce and tomatoes to the sandwich
and not add pickles, not add cheese, not add meat as well.
In addition to prompts, we're also able to train the robot
to handle different interjections.
Actually, here's a case where we have a different kind of prompt.
So on the left, we train the robot to clean tables,
so put trash away and put dishes into the bin.
And on the right, we asked the robot to clean up only the trash,
but not the dishes.
And the robot is able to understand what that means
and connect that to its low-level actions
and only put away the trash and complete when the trash is all put away.
And then lastly, it's able to handle interjections
and situated corrections.
So in this case, the robot is kind of getting items for a user.
The user interjects and says, get me something sweet
that's not in the basket, right after it had put a KitKat into the basket.
And the robot says, sure, let me get you some Skittles,
and goes through kind of basic reasoning of what
would fulfill the user's request.
And it is able to respond to those kinds of corrections
situated in the world that the robot is in.
Now, you might also wonder, like,
maybe some existing foundation models
could serve as a high-level planner for robots
and do this sort of high-level reasoning
without actually training a separate model.
And so we also evaluated that.
And we found that in blue, the performance following instructions and making progress on the task was
substantially lower than the performance of our system, which is shown in green.
And in general, we found that these frontier models generally struggle with visual understanding as it pertains to robotics,
which makes sense because in general, these models aren't kind of really targeting many physical applications and have very little data in the physical world.
Okay, so to start to wrap up, and then we'll have some time for questions,
we talked a bit about how robots can do a variety of dexterous long-horizon tasks with pre-training and post-training,
how robots can succeed in places that they've never been, and how they can respond to open-ended prompts and interjections
by leveraging synthetic data from language models on top of the robot data that we had collected.
Now, with some closing notes, we've seen a few different scenarios in this talk where general-purpose robots might be more successful than specialist robots,
because we can essentially, rather than start from scratch
for every single application,
actually build upon a much broader foundation
for physical intelligence in the real world.
We also saw that large-scale data in the real world
is really helpful for developing these things.
And we found that, and I think that is necessary,
but not sufficient for physical intelligence.
And there's a lot of challenges.
We need more research to be done ourselves
and through open source contributions before robots
I think will be truly ready
to tackle the open world.
I'd also like to mention that at Physical Intelligence,
we're hiring a number of roles.
If you're excited about some of the things that we talked about,
you can see a list of the open roles on the PI website as well.
Awesome.
Happy to take some questions.
Let's start on the left.
Hi, Chelsea.
So first, I want to say thank you for all your work on robot learning.
They're all really impressive.
And so mainly I have two questions on,
especially regarding the post-training part you mentioned.
So the first thing is you mentioned that in post-training, the most important part is to have high-quality action data.
So I'm wondering what the components of that would be.
And then the second question is, what part do you think RL will play in post-training?
Yeah, absolutely.
So I think that the different components of it, a lot of it comes down to consistency of the data and the strategy being followed,
and whether the data completes the task efficiently and with a reliable strategy.
And then on the second question, I think that reinforcement learning can play a very large role
in post-training.
I think that online data from the robots, which reinforcement learning allows you to use,
can allow robots to have a much higher success rate and also be faster than if they're
just trained with imitation learning.
Yeah.
Thank you.
Hi.
Thank you so much for your talk.
So your work is really fascinating
and there is no doubt that it will have
a lot of impact in the future.
But can I ask you at this stage,
how can you find the fundings?
Because honestly, I can't imagine
how hard it can be to convince people
to invest in robots that fold clothes
and deal with the dishes.
Yeah, so it's a good question.
I think that, well, I guess first I'll mention
that we aren't just focused on applications in the home.
We really want to solve this broader problem of physical intelligence.
And we've been starting with those applications because they're ones that are kind of easy
to make progress on.
But we've also been doing tasks like inserting an Ethernet cable, which I put in the talk,
as well as constructing a cardboard box.
And generally, I think that this sort of problem has a ton of potential for like making
impact in all sorts of realms, not just in domestic tasks, but all sorts of realms as well.
And even in domestic tasks, I think there's a huge,
huge market for this kind of technology. We ourselves haven't had a lot of challenge with fundraising,
and I think that a lot of robotics companies recently have also done a great job and found that
there's actually a lot of excitement around this sort of technology because I think things are
actually starting to work. I started working on this technology more than 10 years ago at this
point, and things really weren't working then. And so, yeah, I think that there's a lot of
excitement that it is starting to mature and actually be ready for the real world.
I think that there's a lot more work to do.
But generally, it seems like there's a lot of people excited about this technology
and eager to actually put funds behind it.
Okay, thank you so much.
Yeah.
Hi.
Thank you so much.
I have two questions, like one more broad and one more technical.
So the technical one like is VLAs, in my opinion, like at least to my understanding,
are a framework that is a bit separate from world modeling.
And I wonder how the two of them will interplay among each other
and whether like you have actually plans to somehow like use them together.
As I see right now like VLAs as more of policies
that could actually benefit a lot from world modeling.
And from a broader perspective,
I wonder like which kind of infrastructure layers could be the most
useful to work on, such as like explainability,
traceability, or safety in general to deploy such models
like in the real world?
Yeah, great questions.
So on the first point, there's actually
fairly natural ways to incorporate world model objectives
into vision language action models.
And we've done some work where instead of only
predicting the next action, you predict
some intermediate sub-goal image,
like what should happen in the future in order
to accomplish the task and then predict an action from there.
And we've seen some kind of signs of life that seem to be quite promising.
So I think there's ways to merge the two paradigms.
At the same time, I think there's a lot of challenges that come up with world modeling
with regard to the ways in which, basically the data that you put into it,
not necessarily being kind of reflective of the ways in which you're going to use it.
You might train it on demonstration data of successful data of completing the task
and then evaluate it on to try to actually use it to evaluate actions that are not
optimally completing the task.
And then the world model will hallucinate a video of completing the task successfully,
even if the actions that you provided as input didn't,
weren't actually going to successfully lead to a good outcome.
So there's challenges there to overcome.
And so it's not like, yeah, there's various challenges,
but there's also ways to integrate it into the VLA paradigm.
And then, can you remind me your second question?
What are the infrastructure layers?
Like, which would you suggest to work on in the short term
to bring like the most improvements, let's say.
To actually run these models on robots,
we have like a real-time system that needs to actually be hitting a certain frequency
to actually like execute actions successfully.
And if you have lag in that system and so forth,
it introduces all sorts of challenges.
And so thinking about fast inference and infrastructure for like that's actually going to be on the robot
is a big part of what our software team does.
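To make the frequency requirement concrete, here is a minimal sketch of the arithmetic behind it: a control loop running at a given rate has a fixed time budget per tick, and any inference call slower than that budget leaves the robot without a fresh action. The 50 Hz rate, the latency values, and the name `count_missed_deadlines` are assumptions for illustration, not figures from the talk.

```python
def count_missed_deadlines(inference_ms, control_hz):
    """Count inference calls that took longer than one control period."""
    period_ms = 1000.0 / control_hz  # time budget per control tick
    return sum(1 for latency in inference_ms if latency > period_ms)


# Assumed example: a 50 Hz loop gives a 20 ms budget per tick.
latencies_ms = [12.0, 18.0, 25.0, 19.0, 40.0]
missed = count_missed_deadlines(latencies_ms, control_hz=50.0)
print(f"{missed} of {len(latencies_ms)} ticks missed the deadline")
```

In a real system the latency list would come from timing the on-robot inference path rather than from fixed numbers.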
And then also thinking about like large-scale machine learning infrastructure,
training large models, ingesting large amounts of data.
The data that we have is different from a lot of typical data sets
because it's very multimodal in nature.
It's kind of videos, actions, language segments,
and various other components as well.
So yeah, some interesting infrastructure problems,
I think, both on the robot side and on the model training side.
Thank you so much.
Hi, I'm Frederick, and I have got a question about model sizes
in general.
So I think what we're seeing right now
is that in general, larger model sizes
lead to better accuracy, for example,
also in your experiments.
Or it's also what OpenAI and Anthropic
and others are doing right now with their LLMs.
However, there's also the approach of using a quite small model
and then outsourcing the world knowledge
into a database of some sort with which the model can interact.
What is your take on that?
Do you think that's like a valid approach,
or do you think encapsulating all the world
knowledge inside
the model is better or works better?
Yeah, it's an interesting question.
So in my experience, working on like retrieval-based systems,
is that it actually is a little bit tricky to,
well, first figure out what should be offloaded
versus actually done by the model.
And second, sometimes the model will ignore the retrieved content
and try to generate something itself.
And it actually seems to be quite tricky
to get that technically to work exactly the way you want it.
I think it's probably going to depend on the application
and the use case in terms of how best to, like, whether that might make sense.
But in my experience, it ends up being quite tricky to figure out what the division of labor
is.
And even the model part of it will need to have some degree of intelligence in order to, like,
actually make use of the retrieved information and so forth.
So I think it's a really fascinating research problem, but it also needs, like, a lot of research
to make that work successfully.
Thank you.
Yeah.
Hi, Chelsea. My name is Charu Thomas. First off, really appreciate the talk. It was really fascinating and have been a big fan of your work since meta learning. When you think about how software and hardware are going to continue to evolve, what are the biggest opportunities for builders today for your vision of physical intelligence?
I mean, I think that, yeah, there's lots of different, like, opportunities to make things work a lot better and a lot of, like, open questions.
I think kind of like I was mentioning before, thinking about better ways of having infrastructure on, like, kind of the robot side.
I think that there isn't a lot of, like, there's some open source code for that sort of thing, but there's a lot of opportunities to make robot infrastructure better.
And not a lot of people, I think, are working on that aspect of the problem.
Also lots of opportunities.
I guess one of the things I love about
AI and computer science
as a whole is there's a really big open
source community and I think that there's a ton of
opportunity to actually do open source
work and contribute to a
broader community that's trying to like collect
data, open source models, fix
bugs on those models, fine tune those
models, figure out new recipes for fine tuning those
models. So yeah, all sorts of questions
also like on the research side, especially
in the open source realm. Yeah.
Thank you.
Hi, Chelsea. I also, just like everyone else, am a big fan of all your work, so thank you for putting that all out.
I've been reading through a lot of your group's work recently and particularly enjoyed reading Suraj's PhD thesis.
It taught me a lot about scaling real-world robotics with data.
And a question I have is, how do you think synthetic data will sort of scale for robotics in the future?
As we've seen with LMs, we've moved away from sort of
not moved away from pre-training, but moved away from human collected data
into more creating synthetic data and a lot of filtering and a lot of self-grading.
So how do you think using generative synthetic data for creating environments or reward models
will impact robotics?
Yeah, I have many thoughts on this topic.
I think that at the end of the day, there's going to be no replacement for real data.
And so large amounts of real robot data is going to be a necessary component of any system
that's going to work in a generalizable way.
So we're going to need that. At the same time, I do think that there's tools for like
simulation and synthetic data, especially to potentially play a role on the evaluation side. It's very
tricky to actually, as you, for example, are generalizing to many environments. It's very tricky
to actually evaluate how well that model generalizes not just in one new environment, but in 10
new environments. Because then you actually need to bring the robot to those 10 environments
or construct 10 environments, whereas in simulation, that gets a lot easier. So I think I'm really
excited about kind of simulation and synthetic data for that use case.
I should also mention that I think that the analog of synthetic data in language models is actually not necessarily
simulation in robotics, but closer to something like reinforcement learning.
I think that a lot of synthetic data is generated by the model that's actually trying to do the task
and then trying to kind of reason through different ways of doing the task.
And I think that the analogy there is a robot that's trying to attempt the task and learn from its own attempts and get better from its own attempts.
And that sort of online data from the model, I think, will also play a really critical role in post-training
and something that we're working on quite a bit.
And so, yeah, that I think is like really important and really helpful.
Thank you.
Cool.
I think we have time for one more question.
Sorry, we won't be able to get to everyone.
Hi, it's super cool to see you as an MIT EECS alum now working on really cool robotics stuff
and talking to us about robotics and entrepreneurship.
But I've been wondering how robotics research that involves hardware components plays out
differently in academia versus industry.
And are there typically more resources, fewer constraints,
or broader applications in one setting over the other?
And what kind of people or goals do you think
might be better suited for each path?
Yeah, it's an interesting question.
I still love both kind of startup
and academic environments and industry environments.
I think they all have various pros and cons.
Certainly, I think that any,
I think that generally academic environments
aren't quite as well resourced in terms of data collection
throughput, eval throughput, and compute
as like startups and industry labs.
But at the same time, I think that there's a lot of problems
that you can solve without large amounts of resources
that we need to figure out on the algorithm side.
So I think that there's a lot of really interesting work
to be done there.
And then in industry and in startups, I think,
actually trying to do some of the research on these big models,
scaling up data, seeing what things happen at large scales
is really great to do there.
Yeah, I think that there's a place for both.
I also think that the gap isn't as large as
often people make it seem.
And oftentimes people in industry environments kind of wish they had more compute.
Like you kind of always wish that you had more resources.
And sometimes when you have a lot of resources,
you don't actually think as carefully and as critically about what runs you're going to be doing and so forth.
And you end up being sometimes more wasteful of compute than if you were kind of more compute
constrained.
So there's also actually downsides to having more resources in my experience.
I'm really sorry.
Can I just ask you one quick question on architecture?
I know that the scaling laws have worked well for transformer-based architectures.
And I was thinking, do you see currently limits in VLM-based architecture,
which are kind of made for like text tokens because they don't have like modules for physical awareness?
Yeah, and how do you deal with that?
Yeah, so we tokenize the actions.
And so I'd encourage you to take a look at the FAST tokenizer paper that we put out as kind of a way to accomplish that.
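As a rough illustration of what tokenizing actions can mean, here is a minimal sketch that maps each continuous action dimension to one of a fixed number of bins and back. This is plain uniform binning, a simpler scheme than the one in the paper mentioned above; the bin count, the action range of -1 to 1, and the names `encode` and `decode` are assumptions for illustration.

```python
NUM_BINS = 256  # assumed number of discrete tokens per action dimension


def encode(action, low=-1.0, high=1.0, num_bins=NUM_BINS):
    """Map each continuous action value to an integer bin index (a token)."""
    tokens = []
    for a in action:
        a = min(max(a, low), high)  # clip to the assumed action range
        frac = (a - low) / (high - low)
        tokens.append(min(int(frac * num_bins), num_bins - 1))
    return tokens


def decode(tokens, low=-1.0, high=1.0, num_bins=NUM_BINS):
    """Map each token back to the center of its bin."""
    width = (high - low) / num_bins
    return [low + (t + 0.5) * width for t in tokens]


action = [-1.0, 0.0, 0.5, 1.0]
tokens = encode(action)
recovered = decode(tokens)
print(tokens)
print(recovered)
```

With this scheme the round-trip error per dimension is at most half a bin width, so the bin count trades off vocabulary size against action precision.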
And yeah, we should wrap up there. Thanks, everyone. And yeah, I hope you enjoy the event.
