a16z Podcast - Tesla's Road Ahead: The Bitter Lesson in Robotics
Episode Date: October 24, 2024

What does Rich Sutton's "Bitter Lesson" reveal about the decisions Tesla is making in its pursuit of autonomy?

In this episode, we dive into Tesla's recent "We, Robot" event, where they unveiled bold plans for the unsupervised full self-driving Cybercab, Robovan, and Optimus, their humanoid robot, which Elon Musk predicts could become "the biggest product ever."

Joined by a16z partners Anjney Midha and Erin Price-Wright, we explore how these announcements reflect the evolving intersection of hardware and software. We'll unpack the layers of the autonomy stack, the sources of data powering it, and the challenges involved in making these technologies a reality.

Anjney, with his experience in computer vision and multiplayer tech at Ubiquity6, and Erin, an AI expert focused on the physical world, share their unique perspectives on how these advancements could extend far beyond the consumer market.

For more insights, check out Erin's articles linked below.

Resources:
Find Anj on Twitter: https://x.com/anjneymidha
Find Erin on Twitter: https://x.com/espricewright
Read Erin's article 'A Software-Driven Autonomy Stack Is Taking Shape': https://a16z.com/a-software-driven-autonomy-stack-is-taking-shape/
AI for the Physical World: https://a16z.com/ai-for-the-physical-world/

Stay Updated:
Let us know what you think: https://ratethispodcast.com/a16z
Find a16z on Twitter: https://twitter.com/a16z
Find a16z on LinkedIn: https://www.linkedin.com/company/a16z
Subscribe on your favorite podcast app: https://a16z.simplecast.com/
Follow our host: https://twitter.com/stephsmithio

Please note that the content here is for informational purposes only; should NOT be taken as legal, business, tax, or investment advice or be used to evaluate any investment or security; and is not directed at any investors or potential investors in any a16z fund. a16z and its affiliates may maintain investments in the companies discussed. For more details please see a16z.com/disclosures.
Transcript
This is the big race in robotics.
The smarter your brain, so to speak,
the less specialized your appendages have to be.
AI has pushed every single one of these
kind of to its limit and to a new state of the art.
The way they're solving precision
is instead of throwing more sensors on the car,
is to basically throw more data at the problem.
Data is absolutely eating the world.
What is good enough?
We used to have the Turing test,
which obviously we've blown past now.
His shorthand for it was like the AWS
of AI. He's got this idea of this distributed swarm of unutilized inference computers.
Whether that's an oil rig, whether that's a mine, whether that's a battlefield, there's so many
different use cases for a lot of this underlying technology that are really starting to see
the light of day. It's basically not an if but a when. An inevitability.
Earlier this month, Elon Musk and the team at Tesla held their We, Robot event, where they
unveiled their plans for the unsupervised full self-driving Cybercab and Robovan. Plus, Optimus,
their answer to consumer-grade humanoid robots, and also what Musk himself predicted would be,
quote, the biggest product ever of any kind. Now, of course, none of these products are on the
market yet, but several demos were on show at the event. Naturally, the response was mixed.
Supporters said we got a glimpse of the future, while critics said the details were missing.
But in today's episode, we're not here to debate that.
What we do want to talk about is what this indicates about the intersection of where hardware and software meet.
So what does Rich Sutton's 2019 blog post, The Bitter Lesson, tell us about the decisions that Tesla's making in autonomy?
And how realistic is the quoted $30,000 price range?
Also, what are the different layers of the autonomy stack?
And where do we get the data to power it?
And what does any of this look like when you exit the consumer sphere?
We cover all this and more with a16z partners Anjney Midha and Erin Price-Wright.
Anjney previously founded Ubiquity6, a pioneering computer vision and multiplayer technology company
that sat right at this intersection of hardware and software, and was eventually acquired by Discord.
Erin, on the other hand, invests on our American Dynamism team with a focus on AI for the physical world.
And if you'd like to dig even deeper here, Erin has penned several articles on the topic that we've linked in our show notes.
All right, let's get to it.
As a reminder, the content here is for informational purposes only, should not be taken as legal, business, tax, or investment advice, or be used to evaluate any investment or security, and is not directed at any investors or potential investors in any A16Z fund.
Please note that A16Z and its affiliates may also maintain investments in the companies discussed in this podcast.
For more details, including a link to our investments, please see a16z.com slash disclosures.
So last week, Tesla had their We, Robot event, and Musk announced the Cybercab, the Robovan, or as he liked to call it, the Robovin, and the Optimus.
You guys are so immersed in this hardware software world. I'd love to just get your initial reaction.
From my perspective, it wasn't that there was anything in particular that was super surprising, but what was exciting was just sort of a culmination of one thing that Elon Musk does really well and Tesla has done really
well, which is continue to pour love and energy and money and time into a dream and a vision
that's been going on for a really long time, like well past when most financial investors
and most people kind of lost the luster of self-driving cars after their initial craze in the
mid to late 2010s. And they've just continued to plod along and to continue to make developments
and now we're finally seeing like this glimpse of the future for the first time in a really
long time. I think that's right. I think it was very impressive, but unsurprising.
Yeah. So I think the two schools of thought when people watch the event was, one was absolutely
this whole, oh my God, this is such vaporware. He shared literally nothing on engineering details.
What the hell? Come on. Give us the meat on timelines and dates and prices. And then the opposing
view was like, holy shit, they're still going. They haven't given up on any of this autonomy stuff that
he's been talking about for years. And I'm absolutely more empathetic towards the latter view,
which is that I saw it as an homage to the bitter lesson.
It's a sort of amazing blog post that I'm going to do a terrible job of summarizing by
this great computer scientist Rich Sutton, which basically says that over the last 70 years or so
of computer science history, what we've learned is that general purpose methods basically beat
out any specific methods in artificial intelligence in particular.
Basically, the idea that if you're working on solving a task that requires intelligence,
you're usually better off leveraging Moore's law and more compute and more data than trying to
hand engineer a technique or a set of algorithms to solve a particular task. And broadly speaking,
that's been the big grand debate in self-driving and autonomy, I would say, for the last two decades,
right? It's the sort of general purpose, bitter lesson school versus the let's model self-driving
as a specific task. As a set of discrete decision-making algorithms unconnected to each other.
A system to solve, let's say, edge detection around stop signs, right? Where self-driving is a really hard
problem. And you could totally say, well, there are so many edge cases in the world
that we should map out each of those edge cases.
And I think it was an homage to the bitter lesson.
So that's what I was most excited about is he did share actually details
that their pipeline is basically an end-to-end deep learning approach.
Right, which is incredible and probably true only for the last,
my guess is 18 to 24 months.
Right.
Yeah.
Yeah.
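To make that contrast concrete, here is a minimal, hedged sketch of what an end-to-end approach looks like: one network maps raw camera frames directly to driving controls and improves with more data and compute, instead of chaining hand-written detectors and rules for each edge case. This is our own toy example, not Tesla's pipeline; the architecture and sizes are arbitrary assumptions.

```python
import torch
import torch.nn as nn

class EndToEndDrivingPolicy(nn.Module):
    """Pixels in, controls out; no hand-engineered stop-sign or lane rules."""
    def __init__(self):
        super().__init__()
        self.encoder = nn.Sequential(          # generic visual feature extractor
            nn.Conv2d(3, 32, kernel_size=5, stride=2), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=5, stride=2), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        )
        self.head = nn.Linear(64, 2)           # [steering, throttle]

    def forward(self, frames: torch.Tensor) -> torch.Tensor:
        return self.head(self.encoder(frames))

# Supervised training skeleton on (camera frame, logged human control) pairs:
# per the bitter lesson, more data and compute, not more rules, drives progress.
policy = EndToEndDrivingPolicy()
optimizer = torch.optim.Adam(policy.parameters(), lr=1e-4)
frames = torch.randn(8, 3, 128, 128)           # stand-in for camera frames
controls = torch.randn(8, 2)                   # stand-in for logged driving controls
loss = nn.functional.mse_loss(policy(frames), controls)
loss.backward()
optimizer.step()
```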
And I mean, in the bitter lesson,
he also talks about the fact that it's really appealing to do the opposite
because in the short term, you will get the benefit,
but the broader deep learning approach ends up winning out
in the long term. And a lot of people talk about Musk. Musk says it about himself that the timelines
sometimes are off, but he's basically banking on that premise in the long term. It's basically
not an if but a when. An inevitability. And I think the event was the first time that it really did feel
in an emotional sense for the average American consumer. I'm not talking about the super duper
tech literate people who wanted the details of the underlying models and their weights. But like for
the average American consumer, the first time that this version of the future felt like an inevitability.
And before we get into maybe the specifics around where else hardware and software are intersecting,
I'd love to just talk about that, that average person who's watching because you guys are meeting
with companies and investors and this has been going on for quite some time.
So I'm just curious if maybe you noticed anything under the hood or maybe the meta in that announcement or event
that maybe the average person watching is, you know, what are they seeing? They're saying things like,
oh, maybe it was human-controlled and not fully AI on-device, or other people are commenting on
the fact that these humanoids are shaped like humans. Like, why do we need that? On the topic of
humanoids, I think humanoids are a great choice of embodiment for a robot to really emotionally connect to
and speak to a human being watching because I can relate to a human form factor. Obviously, we've
found out that it was teleoperated, which, in my opinion, still doesn't take away from
like how cool and amazing it was. The human form factor is a way to connect what is happening
with robotics to a regular person who is like, okay, yes, I like see myself in that. This looks
like Star Wars or some other sci-fi movie. In reality, maybe this is like a controversial
opinion. I don't see the vast majority of economic impact over the next decade from robotics
coming from the humanoid form factor. But that doesn't take away from the power of the
symbol of having a humanoid make a drink at this event because it just like connects back to the
sort of science fiction promise of our childhoods getting sort of finally delivered.
The opening sequence, he started with like a sci-fi. I think it was a Blade Runner visual.
And he was like, we all love sci-fi and I want to be wearing that jacket that he's wearing in
the picture, but we don't want any of the other dystopian stuff. And so that definitely stuck out
to me is that he did not start the way he usually does. It's often a technical first sort of
story. But he started with a, here's a vision for where I think the world should go. So it was
much more Disney-esque in that. And it was quite poetic. I think they literally did it on the
Warner Brothers studio lot. And so they recreated a bunch of cities. And I think they had,
on site at the event, the Robovans taking people around these simulated cities.
There was a sort of theatricality to it all that stuck out to me, which I thought was
quite different. And I thought it was refreshing because the core problem with this branch of
AI, which is largely deep learning based and bitter lesson based, is that
it's an empirical field. Unlike, call it Moore's Law, which was predictive, where you basically
know if you double the number of transistors, you get this much more performance on the chip,
and it's just about pure execution. AI is much more empirical. You don't really know when the
model is going to get done training, and when it does get trained, whether it'll converge or not.
Or even what does converged mean, like, what is good enough? We used to have the Turing test,
which obviously we've blown past now. It's a feeling more than it is a set of discrete metrics that
you can really point to. Right. So it made a lot of sense
to me that he's trying to decouple this idea of progress from a specific timeline.
I see.
Because I just think we're setting ourselves up for every time you ask a deep learning researcher.
So when's that GPT-5 model going to show up?
It's like the most frustrating question ever, right?
Because they don't know.
We don't know.
And frankly, sometimes they show up earlier than scheduled and sometimes later.
And by the way, you can look at the stock market's reaction.
It's a prime example of how people have been so conditioned by, I would say, the Steve Jobsian, Apple-like
cadence, year on year, of like, here's your new iPhone. This is the incremental but predictable
forecasting that the tech industry keeps trying to reward. And I think what he's doing is pretty
refreshing, which is saying, look, here's a vision for where we want to go, but it's decoupled.
The second thing on the humanoid piece that I was quite impressed by is actually the quality of
the teleoperation. So everybody's talking about how, oh, this is fake. This is all smoke and mirrors.
It's just people. It's so hard. I was going to say, really hard problem. Why is no one talking about
that? Have you ever tried? I mean, unfortunately. It's so hard.
We were at a company two weeks ago, and they've got these teleop robots.
And the founder was demoing a mechanical arm that he was teleoperating with a game pad.
And he was folding clothes with it.
And I was like, oh, that looks simple.
He's like, here, try it.
It was one of the hardest manipulation things I've ever tried.
And by the way, we tried that with, you know, a VR headset with six-DoF motion controllers,
almost harder to do.
Teleoperating something, especially over the Internet, in a smooth fashion with precision, is incredibly hard.
And I don't think people appreciate the degree to which they've really solved that pipeline.
Yeah, I was actually really impressed by that. And, you know, I think that there's huge opportunity for teleop in sort of production applications that will have, like, massive economic benefit.
Right. Even before we have true robots running around managing themselves. Because if you think about it, there are all these really hard and really dangerous or hard-to-get-to jobs, or there are labor or credentialing constraints where it's a lot harder to hire people to do certain things in certain locations. And if we can imagine
in a future where the teleop that we saw last week at the event is something that's widely
available. That's incredible. Imagine not having to go and service a power line, but you can actually
teleop a robot to do that for you, but still have the level of sort of human training and
precision needed to make a really detailed and specific evaluation. The promise of that is really
cool, even before we get to robots. So that was really exciting. Yeah, it's like a stop along this
journey. And so if we talk about that journey, the arc of hardware and software coming together
in maybe a different way than we've seen in the past, just as an example. So Marc famously said
software is eating the world. That was in 2011. We're in 2024. And it does feel like the last
decade has been a lot of traditional software, not so much integrating with the physical world
around us. And so where would you place us in that trajectory? Because we're seeing it with
autonomous vehicles, but I got the sense that's not the only place where this is happening.
Yeah. This is where I spend 95% of my time in all of these industries that are just starting to see the glimmers of what autonomy and sort of software-driven hardware can bring. What's really interesting is just actually a dearth of skills of people who know how to deal with hardware and software together. You have a lot of people that went and got computer science degrees over the last decade and relatively speaking a lot fewer that went and got electrical engineering or mechanical engineering degrees. And we're starting to see the rise of, oh, shoot, we actually
need people who understand not just maybe how the software works in the cloud with Wi-Fi,
where you have unlimited access to compute and you can retry things as many times as you want
and you can ship code releases all day every day. But you actually have kind of a hardware
deployment where you have limited compute in an environment where you maybe can't rely on
Wi-Fi all the time, where you have to tie your software timelines to your hardware production
timelines. Like these are a really difficult set of challenges to solve, and right
now, there just isn't a lot of standardized tooling for developers on how to do that.
So it's interesting. We're starting to see portfolio companies of ours across really different
industries that are trying to use autonomy, whether it's oil and gas or water treatment or
HVAC or defense. They're like sharing random libraries that they wrote to connect to like particular
sensor types because there's not this like rich ecosystem of tooling that exists for the software
world. So we're really excited about what we're starting to see emerge in the space.
Even Elon said when he's talking about these two different products that he's unveiling,
right, Optimus, and then you have the Robovans or Cybercabs. And those seem like two
completely different things. But he even said in the announcement, he said everything we've
developed for our cars, the batteries, power electronics, advanced motors, gearboxes, AI inference
computer. It all applies to both, right? So you're seeing this overlap. That's super exciting. When I was
watching it, I was just nerding out because my last company
was a computer vision, 3D mapping, and localization company.
So I'd unfortunately spent too much of my life
calibrating LiDAR sensors to our computer vision sensors.
Because our whole thesis when I started back in 2017
was that you could do really precise positioning
just off of computer vision and that you didn't need fancy hardware
like LIDARs or depth sensors.
And to be honest, not a lot of people thought that we could pull it off.
And frankly, I think there were moments when I doubted that too.
And so it was just really fantastic to see that his bet
and the company's bet on computer vision
and a bunch of these sensor fusion techniques
that would not need specialized hardware
would ultimately be able to solve
a lot of the hard navigation problems,
which basically means that the way they're solving precision
is instead of throwing more sensors on the car
is to basically throw more data at the problem.
And so in that sense, data is absolutely eating the world.
And you asked where on the trajectory are we of software eating the world?
And I think we're definitely on an exponential
that has felt like a series of stacked sigmoids.
Often it feels like you're on a plateau,
but a series of plateaus totally makes up an exponential if you zoom out enough.
And earlier in the conversation, we talked about the bitter lesson.
A number of other teams in the autonomy space decided to tackle it as a hardware problem,
not a software problem, right?
Where they said, well, more LiDAR, more expensive LiDAR, more GPS, more GPS.
Right.
And Elon's like, you know, actually I want cheap cars that just have computer vision sensors.
And what he's going to do is take a bunch of the custom, really expensive sensors that many other
companies put on the car at inference time, and just use them at train time instead.
So Tesla does have a bunch of like really custom hardware that's not scalable that drives around the world in their parking lots and simulation environments and so on.
And then they distill the models they train on that custom hardware to a test time package.
And then they send that test time package to their retail cars, which just have computer vision sensors.
And the reality is that's a raw arbitrage, right, between sensor stack.
And it allows the hardware out in the world to be super cheap.
The result there is software is eating the sensor stack out in the world, which makes the cost of these cars so much cheaper that you can have a $30,000 fully autonomous car versus $100,000-plus cars that are fully loaded with these LiDAR sensors and so on.
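Here is a hedged sketch of that train-time versus test-time arbitrage as described above: a teacher model sees the expensive sensor suite during training, and a camera-only student is distilled to imitate it, so only the cheap sensor stack ships in the retail car. The models and feature sizes are placeholders of ours, not Tesla's implementation.

```python
import torch
import torch.nn as nn

# Teacher sees camera + LiDAR features; student sees camera features only.
teacher = nn.Sequential(nn.Linear(512 + 256, 128), nn.ReLU(), nn.Linear(128, 2))
student = nn.Sequential(nn.Linear(512, 128), nn.ReLU(), nn.Linear(128, 2))

camera_feats = torch.randn(8, 512)   # available at train time AND in retail cars
lidar_feats = torch.randn(8, 256)    # expensive sensor, available at train time only

with torch.no_grad():                # teacher provides the targets to imitate
    targets = teacher(torch.cat([camera_feats, lidar_feats], dim=-1))

optimizer = torch.optim.Adam(student.parameters(), lr=1e-4)
loss = nn.functional.mse_loss(student(camera_feats), targets)
loss.backward()
optimizer.step()                     # only `student` is deployed to the fleet
```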
But I think in order to have the intuition that you can even do that, you really actually have to understand hardware.
If you just understand software, and hardware is like a sort of a scary monster that lives over here, and maybe you have a special hardware team
that does it, it's going to be hard for you to have the confidence to say, no, we can do it this
way. I think you're totally right, which is that the superpower that Tesla has is his
ability to go full stack, right? Because a lot of other industries often segment out software
versus hardware like you're saying. And that means that people working on algorithms and the
autonomy part just treat hardware as like an abstraction, right? You throw over a spec, it's an API,
it's an interface that I program against and I have no idea what's going on.
You don't have to worry about the details.
You don't have to worry about it. It doesn't matter. Which, by the way, is super powerful. It's unlocked
this whole general purpose wave of models like ChatGPT and so on, right? Because it allows
people who specialize in software to not have to think about the hardware. It's also what's
driven sort of the software renaissance of the last 15 years. Absolutely. Decoupling, right? Composition
and abstraction is sort of the fundamental basis of the entire computing revolution. But I
think when you're like him and you're trying to bring a new device to market, kind of like
what Jobs did with the iPhone, by going full stack, you end up unlocking massive efficiencies
of cost. And I think what may have been lost in the sort of theatricality of this event
is the fact that he's able to deliver an autonomous device to retail consumers at a cost
profile through vertical integration that would just not be possible if it was just a software
team buying hardware from somebody else and building on top. Can we talk about those economics,
by the way, just attacking that head on? Both Optimus and Cybercab were quoted as being under
the 30K range. Is that really realistic? And then, tied into that,
to what you were saying, we see other autonomous vehicles, which are betting more on the LiDAR and the
sensors, which also have come down in price pretty substantially. My guess is Elon is backing in to
the cost based on what people are willing to pay, and he will do whatever it takes to get those
costs to line up. I mean, it's the same thing he did with SpaceX. He will operate within whatever
cost constraints he needs to operate within, even if the rest of the market or the research
community is telling him it's not possible. Obviously, like a 30K humanoid robot is way less than
what most production industrial robotic arms cost today, which I think are more in the 100K
range for the ones that are used in like the high-end factory. So if he can get it down to 30K,
that's really exciting. I also don't necessarily think you need even a 30K humanoid robot to
accomplish a wide swath of the automation tasks that would pretty radically transform the way
our economy functions today.
Yeah, I think Erin's right in that there's probably a top-down directive,
which is do whatever it takes to get into the cost footprint.
This car has to cost $30K.
Right.
But I think if you do a bottoms-up analysis,
I don't think you end up too far, because actually if you just break down the kind of BOM, the bill of materials,
on a Tesla Model 3, you're not dramatically far off from the sensor stack you need
to get to a $30,000 car, right?
This is the beauty of solving your hardware problems with software.
You don't need a $4,000 scanning LiDAR on the car.
So I think on the cyber cab, I feel much more confident that the cost footprint is going to fall in that range because it's, frankly, we kind of have at least an ancestor on the streets, right?
The thing that gets prices up is custom sensors because it's really expensive to build custom sensors in short production runs.
And so you either have scale of manufacturing like an Apple and you make a new CMOS sensor or a new face ID sensor and you get cost economies of scale because you're shipping more like 30 million devices in your first run.
or you just lean on commodity sensors from the past era
and you tackle most of your problems in software,
which is what he's doing.
And to that point, when he's betting on software,
another interesting thing that he announced
was really overspeccing these cars
to almost change the economics potentially
based on the fact that those cars could be used
for distributed computing.
To your point, Ange, if you put a bunch of really expensive sensors
on the car, you can't really distribute the load of that
in any other way than driving the car, right?
But if you actually have this computing layer, that's, again, in his case, he's saying he's planning to over spec, that actually can fundamentally change, like, what this asset is. And you kind of saw the same thing even with Tesla's today where he's talking about this distributed grid, right?
Where all of a sudden, these large batteries are being used not just for the individual asset. So do you have any thoughts on that idea or if we've seen that elsewhere?
He was a bit skimpy on details on that. But I think he did say that the AI5 chip is overspecced. It's probably going to be four to five times
more powerful than the HW4, which is their current chip,
it's going to draw four times more power,
which probably puts it in that like 800 watts or so range,
which for context, your average hair dryer is at about 1,800 watts.
I mean, it's hard to run power on the edge.
But I think what he said was something to the effect of like,
your car's not working 24 hours a day.
So if you're driving, call it eight hours a day in L.A. traffic.
God bless whoever's having to do that.
For real.
Hopefully they're using self-driving.
One would hope.
Actually, he opened his pitch with a story about driving to El Segundo, saying you can fall asleep and wake up on the other side.
But I think the t-shirt size he gave was about 100 gigawatts of unallocated inference compute just sitting out there in the wild.
And I think his shorthand for it was like the AWS of AI.
He's got this idea of this distributed swarm of unutilized inference computers.
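As a rough back-of-the-envelope check, using only the figures quoted in this conversation (our own arithmetic, not numbers from the event): if each car carries roughly an 800-watt inference computer, then 100 gigawatts of idle compute implies a fleet on the order of a hundred-plus million vehicles.

```python
fleet_power_w = 100e9   # the ~100 gigawatt "t-shirt size" quoted above
per_car_w = 800         # the assumed ~800 W draw of the AI5 chip mentioned earlier
print(f"{fleet_power_w / per_car_w:,.0f} vehicles")  # ≈ 125,000,000
```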
And it's a very sexy vision.
I really want to believe in it.
Ground us on, is this realistic?
Well, I know. I think it's realistic for workloads that we don't know yet in the following
sense, right? That the magic of AWS is that it's centralized. And it abstracts away a lot of
the complexity of hardware footprints for developers. And by centralizing almost all their
data centers in single locations with very cheap power and electricity and cooling, these
clouds are able to pass on very cheap inference costs to the developer. Now, what he's got to figure
out is how do you compensate for that in a decentralized fashion? And I think we have kind of
prototypes of this today, like there are these vast decentralized clouds, I think one is literally
called Vast, of people's unallocated gaming rigs. People have millions of NVIDIA 4090 gaming cards
sitting on their desks that aren't used. And historically, those have not yet turned into
great businesses or high-utilized networks because developers are super sensitive to two things,
cost and reliability. And by centralizing things, AWS is able to ensure very high uptime and
reliability, whereas somebody's GPU sitting on their...
Maybe available.
Maybe they're driving to El Segundo.
Right.
Right.
And there are just certain things, especially with AI models that are hard to do on
highly distributed compute where you actually need good interconnect and you need
things to be reasonably close to each other.
Maybe in his vision, there's a world where you have Optimus robots in every home.
And somehow your home Optimus robot can take advantage of additional compute or additional
inference with your Tesla car that's like sitting outside in your driveway.
who knows. Right. Okay, well, this event clearly was focused on different models that are
consumer-facing. So again, Cybercab, that's for someone using an autonomous vehicle. Optimus is
a humanoid robot probably in your home. But, Erin, you've actually been looking at the
hardware software intersection in a bunch of other spaces, right? And as you alluded to earlier,
maybe different applications with better economics at least today. I think long term, there's no market
bigger than the consumer market. So everyone having a robot in their home and a Tesla car in their
driveway that's also a robotaxi has huge economic value. But that's also a really long-term
vision. And there's just so much happening in autonomy that's taking advantage of the momentum
and the developments that companies like Tesla have put forward into the world over the last decade
that actually have the potential to have meaningful impact on our economy in the short term. I think
the biggest broad categories for me are largely the sort of dirty and unsexy industries that have
very high cost of human labor often because of safety or location access, whether that's an oil
rig out in the middle of Oklahoma somewhere that's three hours' drive from the nearest city,
whether that's a mine somewhere in rural Wyoming that freezes over for six months out of the
year so humans can't live there and mine, or whether that's a battlefield,
where we're starting to see autonomous vehicles go out and clear bombs and mines to protect human life.
There's so many different use cases for a lot of this underlying technology that are really starting to see the light of day.
So very excited about that.
And as we think about that opportunity, you've also talked about this software-driven autonomy stack.
So as you think about the stack, what are the layers?
Can you just break that down?
Yeah, sure.
So across whether it's a self-driving car or sort of an autonomous control system, we're seeing
the stack break down into pretty similar categories. So first is perception. You have to see the world
around you, know what's going on, be able to see if there's a trash can, be able to understand
if there's a horizon, if you're a boat. The second is something Anj knows really well, which is
localization and mapping. So, okay, what do I see? How do I find out where I am within that world
based on what I can see and what other sensors I can detect, whether it's GPS, which often isn't
available in battlefields or in warehouses, et cetera.
The third is planning and coordination.
So that's, okay, how do I take a large task and turn it into a series of smaller tasks, versus what is more of an instant reaction?
I don't have to really think about how to take a drink of water, but I might have to think
about how to make a glass of lemonade from scratch.
So how do I think about compute across those different types of
regimes, when something is more of an instinct versus when something has to be sort of taken down and processed into discrete operations? And then the last one is control. So that's, like, how does my brain talk to my hand? Like, how do I know what the nerve endings are doing in order to pick up this water bottle and take a drink out of it? And that's a really interesting kind of field that's existed for decades and decades. But for the first time, probably since the 70s, we're starting to see really interesting stuff happen in the space of controls around autonomy and robotics.
And I would say, like, all of these are pre-existing areas.
None of this is wildly new, but I think in the last two years, especially with everything
that's happening with deep learning video language models, broadly speaking, AI has pushed
every single one of these kind of to its limit and to a new state of the art.
And there just aren't tools that exist to tie all that together.
So every single robotics company, every single autonomous vehicle company is basically
like rebuilding this entire stack from scratch, which we see as
investors as a really interesting opportunity as the ecosystem evolves.
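For readers who want the four layers Erin describes pinned down, here is a minimal sketch of that stack expressed as code interfaces. The class and method names are our own illustrative assumptions, not any company's actual API: perception, localization and mapping, planning, and control, wired together in a single tick.

```python
from dataclasses import dataclass
from typing import Any, List, Protocol, Tuple

@dataclass
class WorldState:
    detections: List[Any]        # perception: objects, free space, the horizon
    pose: Tuple[float, ...]      # localization: where am I within the map?

@dataclass
class Command:
    steering: float              # control output sent to actuators
    throttle: float

class Perception(Protocol):
    def perceive(self, frame: Any) -> List[Any]: ...

class Localizer(Protocol):
    def localize(self, frame: Any, detections: List[Any]) -> Tuple[float, ...]: ...

class Planner(Protocol):
    def plan(self, state: WorldState, goal: Any) -> List[Any]: ...   # big task -> subtasks/waypoints

class Controller(Protocol):
    def act(self, state: WorldState, waypoint: Any) -> Command: ...  # "brain talks to hand"

def autonomy_tick(frame: Any, goal: Any, perception: Perception,
                  localizer: Localizer, planner: Planner,
                  controller: Controller) -> Command:
    """One pass through the stack: perceive, localize, plan, then control."""
    detections = perception.perceive(frame)
    pose = localizer.localize(frame, detections)
    state = WorldState(detections, pose)
    waypoints = planner.plan(state, goal)
    return controller.act(state, waypoints[0])
```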
And as you think about that ecosystem, people kind of say that as soon as you touch hardware,
you're literally working on hard mode compared to just a software-based business.
So what are the unique challenges, even with maybe that AI wave today that's pushing things ahead?
How would you break down what becomes so much harder?
I think Anj touched on this a little bit before, but the more you can commoditize the hardware
stack, the better.
So the most successful hardware companies are the ones that aren't necessarily inventing
a brand new sensor but are just taking stuff off the shelf and putting it together. But still,
like, tying everything together is really hard. Like, when you think about releasing a phone,
for example, Apple has a pretty fast shipping cadence, and they're still releasing a new phone
only once a year. So you have to essentially tie a lot of your software timelines to
hardware timelines in a way that doesn't exist when you can just sort of ship whenever you want
in a cloud. If you need a new sensor type, or you need a different kind of compute, you know,
construct, or you need something fundamentally different in the hardware, you're bound by those
timelines. You're bound by your manufacturer's availability. You're bound by how long it takes to
quality engineer and test a product. You're bound by supply chains. You're bound by figuring out
how these things have to integrate together. So the cycles are often just quite a lot slower.
And then the other thing is when you're interacting with the physical world, you get into
use cases that touch safety in a really different way than we think about with pure software alone.
And so you have to design things for a level of, like, hardiness and reliability that you don't
always have to think about with software by itself. If your ChatGPT is a little slow,
it's fine. You can just try again. But if you have an autonomous vehicle that's like driving a tank
on a battlefield autonomously and something doesn't work, like you're kind of screwed. So you have to
have a much higher level of rigor and testing and safety built into your products, which slows down
the software cycles. The holy grail is sort of general purpose intelligence for robotics,
which we still don't have. When you train a general model, you basically get the ability to
build hardware systems that don't have to be particularly customized. And that reduces hardware
iteration cycles dramatically because you can basically say, look, roughly speaking, these are the
four or five commodity sensors you need, the smarter your brain, so to speak, the less specialized
your appendages have to be. And I think what a number of really talented teams are trying to
solve today is, can you get models to generalize across embodiments, right? Can you train a model
that can work seamlessly on a humanoid form factor or a mechanical arm, a quadruped, whatever
it might be? And I'm quite bullish that it will happen. I think the primary challenge there
that teams are struggling with today is the lack of really high quality data, right? The big
unknown is just how much data, both in quantity and quality, do you really need to get models
to be able to reason about the physical world spatially in a way that abstracts across any
hardware? I'm completely convinced that once we unlock that, the applications are absolutely
enormous, right? Because it frees up hardware teams, like Erin was saying, from having to
couple their software cycles to hardware cycles. It decouples those two things. And I think that's
the holy grail. I think the victory of Tesla's autonomy team
there is having realized, eight years ago, the efficacy of what we call early fusion foundation
models, right, which is the idea that you take a bunch of sensors at training time and different
inputs of vision, depth, you take in video, audio, you take a bunch of different six-DoF sensors,
and you tokenize all of those, and you fuse them all at the point of training, and you build
an internal representation of the world for that model. In contrast, the LLM world does what's
called late fusion. You often start with a language model that's trained just on language data and
then you duct tape on these other modalities, right, like image and video and so on. And I think
the world has now started to realize that early fusion is the way forward. But of course, they
have an eight-year head start. And so I get really excited when I see teams either tackling the data
sort of challenge for general spatial reasoning or teams that are taking these early fusion
foundation model approaches to reasoning that then allow the most talented hardware teams to focus
really on what they know best.
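Here is a minimal sketch of the early-fusion idea Anj describes, under our own illustrative assumptions (the token counts, feature sizes, and projection layers are made up): each sensor modality is tokenized and all tokens go through one shared backbone from the start, so the model builds a joint internal representation, rather than bolting extra modalities onto a language-only model after the fact.

```python
import torch
import torch.nn as nn

d_model = 256
vision_proj = nn.Linear(768, d_model)  # camera patch features -> tokens
depth_proj = nn.Linear(64, d_model)    # depth / LiDAR features -> tokens
imu_proj = nn.Linear(6, d_model)       # six-DoF inertial readings -> tokens

backbone = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True),
    num_layers=4,
)

vision_tokens = vision_proj(torch.randn(2, 32, 768))  # (batch, tokens, features)
depth_tokens = depth_proj(torch.randn(2, 16, 64))
imu_tokens = imu_proj(torch.randn(2, 8, 6))

# Early fusion: concatenate all modality tokens BEFORE the backbone, so the
# model learns one joint internal representation of the world during training.
tokens = torch.cat([vision_tokens, depth_tokens, imu_tokens], dim=1)
world_representation = backbone(tokens)               # shape: (2, 56, 256)
```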
Where are these companies getting training data?
You mentioned Tesla, for example, yes, we've had cars on the road, tons of them with these
cameras and sensors.
I still think that one of the smartest things Elon did was turn on full self-driving
for everybody for like a month-long trial period last summer.
I have a Tesla and I turned it on for my free month, and it was like a life-changing experience
using it, and I obviously couldn't get rid of it.
And so now, not only do I now pay for full self-driving, but I also, I'm just, I'm giving him all my data.
So to me, that's really clever.
And so I'm curious if you talk about some of these other applications, do they have the number of devices, or in this case, cars for Tesla, capturing this data?
Or how else are we going to get this spatial data?
This is the big race in robotics right now.
I think there are several different approaches.
Some people are trying to use video data for training.
Some people are investing a lot in simulation and creating digital 3D worlds.
And then there's a mad rush for every kind of generated data that you could possibly have.
So whether that's robotic teleoperated data, whether that's robotic arms in offices,
most of these robotics companies have pretty big outposts where they're collecting data internally.
They're giving humanoids to their friends to have in their homes.
It's a yes-and scenario right now where everyone is just trying to get their hands on data,
literally however they can.
I think it's the Wild West, but if you're Tesla, then the secret weapon you have is
you've got your own factories, right?
So the Optimus team has a bunch of humanoids walking around the factories, constantly learning
about the factory environment, and that gives them this incredible self-fulfilling sort
of compounding loop.
And then, of course, he's got the Tesla fleet, like Erin was saying earlier with FSD.
I'm proud to have been a month one subscriber for it.
And I'm happy that I'm contributing to that training cycle because it makes my Model X smarter
next time around. So the challenge then is for companies that don't have their own full stack
sort of fully integrated environment, right, where they don't have deployments out in the field.
And to Aaron's point, you can either take the simulation route for that and say we're going to
create these sort of synthetic pipelines, or we're seeing this, yeah, huge buildout of
teleop fleets. Like with language models, you had people all around the world in countries
showing up and labeling data. You have teleop fleets of people piloting mechanical arms halfway
around the world. I think there's an interesting sort of third new
category of efforts we're tracking, which is crowdsourced coalitions, right? So an example of this
is the DeepMind team put out, maybe a year and a half ago, a robotics dataset called RT-X,
where they partnered with a bunch of academic labs and said, hey, you send us your data. We've got
compute and researchers. We'll train the model on your data and then send it back to you. And what's
happening is there's just different labs around the world who have different robots of different kinds.
Some are arms, some are quadrupeds, some are bipeds. And so instead of needing all of those to be
centralized in one place, there's a decoupling happening where some people
are saying, well, we'll specialize in providing the compute and the research talent,
and then you guys bring the data. And then it's a give-to-get model, right, which we saw in
some cases with the internet early on. And Vita is an example of this, where their research team
isn't stacking a bunch of robots in-house. So they're instead partnering with people like
pharma labs who have arms doing pipetting and wet lab experiments and saying, you send us the
data. We've got a bunch of GPUs. We've got some talented deep learning folks. We'll train the model,
send it back to you. And I think it's an interesting experiment. And there's reason to believe this
sort of give-to-get model might end up actually having the highest diversity
of data, but we're definitely in sort of full experimentation land right now. Yeah, and my guess is
we'll need all of it. So it sounds like data is a big gap, and it sounds like some builders are
working on that. But where would you guys like to see more builders focused in this like hardware
software arena, especially because I do think there are some consumer-facing areas where people
are drawn to. They see an event like this and they're like, oh, I want to work on that.
Yeah, I'm pretty excited about the long tail of really unsexy industries that have outsized
impact on our GDP and are often really critical industries, where people haven't really been
building for a while, things like energy, manufacturing, supply chain, defense. These industries
that really carry the U.S. economy, where we have underinvested from a technology perspective,
probably in the last several decades, are poised to be pretty transformed by this sort of hardware
software melding and autonomy. I'd love to see more people there. I'm very excited for all the
applications Erin talked about. And I think to unlock those, we really need a way to solve
this data bottleneck. So startups, builders who are figuring out really novel ways to collect
that data in the world, get it to researchers, make sense of it, curate it. I think that's
sort of a fundamental limiter on progress across all of these industries. We just need to sort of 10x
the rate of experimentation in that space. All right, that is all for today. If you did make it
this far, first of all, thank you. We put a lot of thought into each of these episodes, whether
it's the guests, the calendar Tetris, the cycles with our amazing editor, Tommy, until the music is just
right. So if you like what we put together, consider dropping us a line at ratethispodcast.com
slash a16z. And let us know what your favorite episode is. It'll make my day, and I'm sure Tommy's
too. We'll catch you on the flip side.