a16z Podcast - What Comes After ChatGPT? The Mother of ImageNet Predicts The Future

Episode Date: December 5, 2025

Fei-Fei Li is a Stanford professor, co-director of the Stanford Institute for Human-Centered Artificial Intelligence, and co-founder of World Labs. She created ImageNet, the dataset that sparked the deep learning revolution. Justin Johnson is her former PhD student, ex-professor at Michigan, ex-Meta researcher, and now co-founder of World Labs. Together, they just launched Marble, the first model that generates explorable 3D worlds from text or images. In this episode Fei-Fei and Justin explore why spatial intelligence is fundamentally different from language, what's missing from current world models (hint: physics), and the architectural insight that transformers are actually set models, not sequence models.

Resources:
Follow Fei-Fei on X: https://x.com/drfeifei
Follow Justin on X: https://x.com/jcjohnss
Follow Shawn on X: https://x.com/swyx
Follow Alessio on X: https://x.com/fanahova

Stay Updated:
If you enjoyed this episode, please be sure to like, subscribe, and share with your friends.
Follow a16z on X: https://x.com/a16z
Follow a16z on LinkedIn: https://www.linkedin.com/company/a16z
Follow the a16z Podcast on Spotify: https://open.spotify.com/show/5bC65RDvs3oxnLyqqvkUYX
Follow the a16z Podcast on Apple Podcasts: https://podcasts.apple.com/us/podcast/a16z-podcast/id842818711
Follow our host: https://twitter.com/eriktorenberg

Please note that the content here is for informational purposes only; should NOT be taken as legal, business, tax, or investment advice or be used to evaluate any investment or security; and is not directed at any investors or potential investors in any a16z fund. a16z and its affiliates may maintain investments in the companies discussed. For more details, please see http://a16z.com/disclosures.

Hosted by Simplecast, an AdsWizz company. See pcm.adswizz.com for information about our collection and use of personal data for advertising.

Transcript
Starting point is 00:00:00 I think the whole history of deep learning is in some sense the history of scaling up compute. When I graduated from grad school, I really thought the rest of my entire career would be towards solving that single problem, which is... A lot of AI as a field, as a discipline, is inspired by human intelligence. We thought we were the first people doing it. It turned out that Google was also simultaneously doing it. So Marble, like, basically one way of looking at it, it's the system, it's a generative model of 3D worlds.
Starting point is 00:00:30 So you can input things like text or image or multiple images, and it will generate for you a 3D world that kind of matches those inputs. So while Marble is simultaneously a world model that is building towards this vision of spatial intelligence, it was also very intentionally designed to be a thing that people could find useful today. And we're starting to see emerging use cases in gaming, in VFX, in film, where I think there's a lot of really interesting stuff that Marble can do today as a product,
and then also set a foundation for the grand world models that we want to build going into the future.
Starting point is 00:01:06 Fei-Fei Li is a Stanford professor, the co-director of the Stanford Institute for Human-Centered Artificial Intelligence, and co-founder of World Labs. She created ImageNet, the data set that sparked the deep learning revolution. Justin Johnson is her former PhD student, ex-professor at Michigan, ex-Meta researcher, and now co-founder of World Labs. Together, they just launched Marble, the first model that generates explorable 3D worlds from text or images. In this episode,
Starting point is 00:01:34 Fei-Fei and Justin explore why spatial intelligence is fundamentally different from language, what's missing from current world models (hint: physics), and the architectural insight that transformers are actually set models, not sequence models.
Hey, everyone, welcome to the Latent Space podcast.
Starting point is 00:01:52 This is Alessio, founder of Kernel Labs, and I'm joined by Swyx, editor of Latent Space. And we are so excited to be in the studio with Fei-Fei and Justin of World Labs. Welcome. We're excited, too. I know you said Marble. Yeah, thanks for having us. I think there's a lot of interest in world models, and you've done a little bit of publicity around spatial intelligence and all that. I guess maybe one part of the story that is a rare opportunity for you to tell is how
Starting point is 00:02:17 you two came together to start building World Labs. That's very easy because Justin was my former student. So Justin came to my, you know, in my, the other hat I wear is a professor of computer science at Stanford. Justin joined my lab when? Which year? 2012. Actually, the semester that I, the quarter that I joined your lab was the same quarter that AlexNet came out. Yeah. Yeah. So Justin is my first. Were you involved in the whole announcement drama? No, no, not at all. But I was sort of watching all the ImageNet excitement around AlexNet that quarter.
Starting point is 00:02:52 So he was one of my very best students. And then he went on to have a very successful early career as a professor at Michigan, the University of Michigan, Ann Arbor, and at Meta. And then when we, I think around, you know, more than two years ago, for sure, I think both independently, both of us had been looking at the development of the large models and thinking about what's beyond language models, and this idea of building world models, spatial intelligence, really was natural for us. So we started talking and decided that we should just put all the eggs in one basket and focus on solving this problem, and started World Labs together. Yeah, pretty much. I mean, like, after that, seeing that kind of ImageNet era during my PhD, I had the sense that the next sort of decade of computer vision
Starting point is 00:03:49 was going to be about getting AI out of the data center and out into the world. So a lot of my interests post-PhD kind of shifted into 3D vision, a little bit more into computer graphics, more into generative modeling. And I thought I was kind of drifting away from my advisor post-PhD, but then when we reunited a couple years later, it turned out she was thinking of very similar things. So if you think about AlexNet, the core pieces of it were obviously ImageNet, the move to GPUs, and neural networks. How do you think about the AlexNet equivalent model for world models?
Starting point is 00:04:22 In a way, it's an idea that has been out there, right?
There's been, you know, Yann LeCun is maybe the biggest proponent, most prominent of it. What have you seen in the last two years that you were like, hey, now it's the time to do this? And what are maybe the things fundamentally that you want to build as far as data and kind of like maybe different types of algorithms or approaches to compute to make world models really come to life? Yeah, I think one is just there is a lot more data and compute generally available. I think the whole history of deep learning is in some sense the history of scaling up compute. And if we think about it, AlexNet required this jump from CPUs to GPUs, but even from AlexNet to today, we're getting about a thousand times more performance per card than we had in AlexNet days.
Starting point is 00:05:02 And now it's common to train models not just on one GPU, but on hundreds or thousands or tens of thousands or even more. So the amount of compute that we can marshal today on a single model is about a million-fold more than we could have even at the start of my PhD. So I think language was one of the really interesting things that started to work quite well the last couple of years. But as we think about moving towards visual data and spatial data and world data, you just need to process a lot more. And I think that's going to be a good way to soak up this new compute that's coming online more and more. Does the model of having a public challenge still work, or should it be centralized
Starting point is 00:05:37 inside of a lab? I think open science is still important. You know, so AI, obviously, compared to the ImageNet, AlexNet time, has really evolved, right? That was such a niche computer science discipline. Now it's just like civilizational technology. But I'll give you an example, right. Recently, my Stanford lab just announced an open dataset and benchmark called Behavior, which is for benchmarking robotic learning in simulated environments. And that is a very clear effort in still keeping up this open science model
Starting point is 00:06:21 of doing things, especially in academia. But I think it's important to recognize the ecosystem is a mixture, right? I think a lot of the very focused work in industry, some of it is more seeing the daylight in the form of a product rather than an open challenge per se. Yeah, and that's just a matter of the, like, funding and the business model. Like, you have to see some ROI from it. I think it's just a matter of the diversity of the ecosystem, right? Even during the so-called AlexNet, ImageNet time, I mean, there were closed models,
Starting point is 00:07:02 there were proprietary models, there were open models, you know, or you think about iOS versus Android, right? They're different business models. I wouldn't say it's just a matter of funding per se. It's just how the market is. They're in a different place. Yeah. But do you feel like you could redo ImageNet today
Starting point is 00:07:22 with the commercial pressure that some of these labs have? I mean, to me, that's like the biggest question, right? It's like, what can you open versus what should you keep inside? Like, you know, if I put myself in your shoes, right? It's like, you raise a lot of money. You're building all of this. If you had the best dataset for this, what incentives do you really have to publish it? And it feels like the people at the labs are getting more and more pulled in, the PhD
Starting point is 00:07:46 programs are getting pulled earlier and earlier into these labs. So I'm curious if you think there's like an issue right now with like how much money is at stake and how much pressure it puts on, like, the more academic, open research space, or if you feel like that's not really a concern. I do have concerns, less about the pressure. It's more about the resourcing and the imbalanced resourcing of academia. This is a little bit of a different conversation from World Labs. You know, I have been in the past few years advocating for resourcing a healthy ecosystem, you know, as the founding director, co-director of Stanford's Institute for Human-Centered AI,
Starting point is 00:08:30 Stanford HAI. I've been, you know, working with policy makers about resourcing public sector and academic AI work, right? We worked with the first Trump administration on this bill called the National AI Research Resource, the NAIRR bill, which is scoping out a national AI compute cloud as well as a data repository. And I also think that open source, open data sets continue to be an important part of the ecosystem.
Like I said, right now,
Starting point is 00:09:04 in my Stanford lab, we are doing the open data set, open benchmark on robotic learning called Behavior, and many of my colleagues are still doing that. I think that's part of the ecosystem. I think what the industry is doing, what some startups are doing, running fast with models, creating products, is also a good thing. For example, when Justin was a PhD student with me, none of the computer vision programs worked that well, right? Right. We could write beautiful papers. Justin has... I mean, actually, even before grad school, like, I wanted to do computer vision, and I reached out to a team at Google and, like, wanted to, I'd potentially go and try to do computer vision, like, out of undergrad.
Starting point is 00:09:48 And they told me, like, what are you talking about? Like, you can't do that. Like, go do a PhD first and come back. What was the motivation that got you so interested in... Oh, because I had done some computer vision research during my undergrad with actually Fei-Fei's PhD advisor. With a lineage. Yeah, there's a lineage here. Yeah. So I had done some computer vision, even as an undergrad. And I thought it was really cool, and I wanted to keep doing it.
Starting point is 00:10:09 So then I was sort of faced with this sort of industry academia choice, even coming out of undergrad, that I think a lot of people in the research community are facing now. But to your question, I think like the role of academia, especially in AI, has shifted quite a lot in the last decade. And it's not a bad thing. In a sense, it's because the technology has grown and emerged. Right. Like five or ten years ago, you really could train state-of-the-art models in the lab, even with just a couple GPUs. But, you know, because that technology was so successful and scaled up so much, then you can't train state-of-the-art models with a couple of GPUs anymore.
Starting point is 00:10:42 And it's not a bad thing. It's a good thing. It means the technology actually worked. But that means the expectations around what we should be doing as academics shifts a little bit. And it shouldn't be about trying to train the biggest model and scaling up the biggest thing. It should be about trying wacky ideas and new ideas and crazy ideas, most of which won't work. And I think there's a lot to be done there. If anything, I'm worried that too many people in academia are hyper-focused on this notion of trying to pretend like we can train the biggest models or treating it as almost a vocational training program to then graduate and go to a big lab and be able to play with all the GPUs. I think there's just so much crazy stuff you can do around like new algorithms, new architectures, like new systems that, you know, there's a lot you can do as one person. And also just academia has a role to play in understanding the theoretical underpinning of these large models.
Starting point is 00:11:31 We still know so little about this. Or extend to the interdisciplinary, you know, you just called them wacky ideas. There's a lot of basic science ideas. There's a lot of blue sky problems. So I don't think the problem is open versus closed, productization versus open sourcing. I think the problem right now is that academia by itself is severely under-resourced, so that, you know, the researchers and the students do not have enough resources to try these ideas. Yeah.
Starting point is 00:12:08 Just for people to nerd out a bit, what's a wacky idea that comes to mind when you talk about wacky ideas? Oh, like, I had this idea that I kept pitching to my students at Michigan, which is that I really like hardware and I really like new kinds of hardware coming online. And in some sense, the emergence of the neural networks that we use today and transformers are really based around matrix multiplication, because matrix multiplication fits really well with GPUs. But if we think about how GPUs are going to scale, how hardware is likely to scale in the future, I don't think the current system that we have, like the GPU, like, hardware design is going to scale infinitely. And we start to see that even now, like, the unit of compute is not the single device anymore. It's this whole cluster of devices. So if you imagine...
Starting point is 00:12:49 A node. Yeah, it's a node or a whole cluster.
But the way we talk about neural networks is still as if they are a monolithic thing that could be coded, like, on one GPU in PyTorch. But then in practice they could be distributed over thousands of devices. So are there, like, just as
Starting point is 00:13:02 transformers are based around matmul, and matmul is sort of the primitive that works really well on GPUs, as you imagine hardware scaling out, are there other primitives that make more sense for large-scale distributed systems that we could build our neural networks on? And I think it's possible that there could be drastically different architectures that fit with the next generation
Starting point is 00:13:19 or, like, the hardware that's going to come 10 or 20 years down the line. And we could start imagining that today. It's really hard to make those kinds of bets because there's also the concept of the hardware lottery where, let's just say, you know, Nvidia has won and we should just, you know, scale that out to infinity and write software to patch up any gaps we have in the mix, right? I mean, yes and no. Like, if you look at the numbers, like even going from Hopper to Blackwell, like the performance per watt is about the same. Yes. They mostly make the number of transistors go up and they make the chip size go up and they make the power usage go up. But even from Hopper
Starting point is 00:13:51 to Blackwell, we're kind of already seeing like a scaling limit in terms of what is the performance per watt that we can get. So I think there is room to do something new. And I don't know exactly what it is. And I don't think you can get it done like in a three-month cycle as a startup. But I think that's the kind of idea that if you sit down and sit with for a couple years, like maybe you could come up with some breakthroughs. And I think that's the kind of long-range stuff that is a perfect match for academia. Coming back to a little bit of background and history, we have this sort of research note on the scene storytelling work that you did, or neural image captioning
Starting point is 00:14:22 that you did with Andre. And I just wanted to hear you guys tell that story about, you know, you were like sort of embarking on that for your PhD and Fei Fei, you were like having that reaction that you had. Yeah, so I think that line of work started between me and Andre and then Justin joined, right?
Starting point is 00:14:40 So Andre started his PhD. He and I were looking at what is beyond ImageNet object recognition. And at that time, you know, the convolutional neural network had proven some power in ImageNet tasks. So the ConvNet is a great way to represent images. In the meantime, I think in the language space,
Starting point is 00:15:05 an early sequential model called the LSTM was also being experimented with. So Andre and I were just talking about what has been a long-term dream of mine, which I thought would take 100 years to solve, which is telling the story of images. When I graduated from grad school, I really thought the rest of my entire career would be towards solving that single problem, which is given a picture or given a scene,
Starting point is 00:15:35 tell the story in natural language. But things evolved so fast. When Andre started, we were like, maybe by combining the representation of the convolutional neural network as well as the sequential language model, the LSTM, we might be able to learn through training to match captions with images. So that's when we started that line of work,
Starting point is 00:16:02 and it was 2014 or 2015. It was CVPR 2015, the captioning paper. So it was our first paper where Andre got it to work: you know, given an image, the image is represented with a ConvNet, the language model is the LSTM model, and then we combine it, and it's able to generate one sentence.
Starting point is 00:16:29 And that was one of the first times. It was pretty... I think I wrote about it in my book. We thought we were the first people doing it. It turned out that Google at that time was also simultaneously doing it, and a reporter, John Markoff from the New York Times, was breaking the Google story, but he by accident heard about us, and then he realized that we really independently got there together at the same time. So he wrote the story of
Starting point is 00:16:56 both the Google research as well as Andre and my research. But after that, I think Justin was already in the lab at that time. Yeah, yeah. I remember the group meeting where Andre was presenting some of those results and explaining this new thing called LSTMs and RNNs that I had never heard of before, and I thought, like, wow, this is really amazing stuff, I want to work on that. So then he had the paper at CVPR 2015 on the first image captioning results. Then after that, we started working together. First we did a paper actually just on language modeling. Yeah, back in 2015, ICLR 2015. Yeah, yeah. I should have stuck with language modeling, that turned out pretty lucrative in retrospect. But we did this language modeling paper together, me and
Starting point is 00:17:37 Andre in 2015, where it was like really cool. We trained these little RNN language models that could, you know, spit out a couple sentences at a time, and poke at them and try to understand what the neurons inside the neural network, inside these things, were doing. You guys were doing analysis on the different, like, memory and... Yeah, yeah, it was really cool. Even at that time, we had these results where you could, like, look inside the LSTM and say, like, oh, this thing is reading code. So one of the, like, one of the data sets that we trained on for this one was the Linux source code, right?
Starting point is 00:18:06 Because the whole, the whole thing is, you know, open source and you could just download this. So we trained an RNN on this data set, and then as the network is trying to predict the tokens there, then, you know, try to correlate the kinds of predictions that it's making with the kind of internal structures in the RNN. And there, we were able to find some correlations between, oh, like, this unit in this layer of the LSTM fires when there's an open paren and then, like, turns off when there's a close paren, and try to do some empirical stuff like that to figure it out. So that was pretty cool. And that was just, like, that was kind of like cutting out the CNN from this language modeling part and just looking at the language models in isolation. But then we wanted to extend the image captioning work. And remember at that time we even had a sense of space, because we felt like captioning does not capture different parts of the image.
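For readers who want the flavor of that analysis, here is a minimal sketch, not the original experiment: train a character-level LSTM on a code corpus, then check which hidden unit correlates with parenthesis nesting depth. The file path, model sizes, and the omitted training loop are assumptions.

```python
import torch
import torch.nn as nn

text = open("code_corpus.txt").read()        # hypothetical path to e.g. Linux source files
chars = sorted(set(text))
stoi = {c: i for i, c in enumerate(chars)}

class CharLSTM(nn.Module):
    def __init__(self, vocab, hidden=256):
        super().__init__()
        self.embed = nn.Embedding(vocab, 64)
        self.lstm = nn.LSTM(64, hidden, batch_first=True)
        self.head = nn.Linear(hidden, vocab)

    def forward(self, x):
        h, _ = self.lstm(self.embed(x))       # h: (batch, seq, hidden) hidden state per character
        return self.head(h), h

model = CharLSTM(len(chars))
# ... train with cross-entropy on next-character prediction (omitted) ...

# After training: correlate each hidden unit's activation with parenthesis depth.
snippet = text[:5000]
with torch.no_grad():
    _, h = model(torch.tensor([[stoi[c] for c in snippet]]))
acts = h[0]                                   # (seq, hidden)
depth, d = [], 0
for c in snippet:
    d += (c == "(") - (c == ")")
    depth.append(float(d))
depth = torch.tensor(depth)
corr = torch.stack([torch.corrcoef(torch.stack([acts[:, j], depth]))[0, 1]
                    for j in range(acts.shape[1])])
print("unit most correlated with paren depth:", corr.abs().argmax().item())
```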
Starting point is 00:18:52 So I was talking to Justin and Andre about, can you do what we ended up calling dense captioning, which is, you know, describe the scene in greater detail, especially different parts of the scene. So that's... Yeah. And so then we built a system. So it was me and Andre and Fei-Fei on a paper the following year at CVPR in 2016, where we built this system that did dense captioning. So you input a single image and then it would draw boxes around all the interesting stuff in the image and then write a short snippet about each of them.
Starting point is 00:19:21 It's like, oh, it's a green water bottle on the table. It's a person wearing a black shirt. And this was a really complicated neural network, because it was built on a lot of advancements that had been made in object detection around that time, which was a major topic in computer vision for a long time. And it was actually one joint neural network that was, you know, learning to look at individual images, because it actually had, like, three different representations inside this network. One was the representation of the whole image, to kind of get the gestalt of what's going on. Then it would propose individual regions that it wants to focus on, and then represent each region independently. And then once you look at the region, then you need to
Starting point is 00:19:55 spit out text for each region. So that was a pretty complicated neural network architecture. This was all pre-PyTorch. And does it do it in one pass? Yeah, yeah. So it was a single forward pass that did all of it. Not only was it doing it in one pass, you also optimized inference. You were doing it on a webcam, I remember. Yeah, yeah. Yeah, so I had built this like crazy real-time demo where I had the network running like on a server at Stanford and then a web front end that would stream from a webcam and then like send the image back to the server. The server would run the model and stream the predictions back. So I was just like walking around the lab with this laptop that would just like show people this like this network running real time. Identification and labeling as well.
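For readers, a rough structural sketch of that kind of single-pass dense-captioning network; this is a toy stand-in rather than the actual DenseCap code, and every layer size and name here is an assumption. It shows the three representations Justin mentions: a whole-image feature map, proposed regions, and per-region features fed to a recurrent captioning head, all in one forward pass.

```python
import torch
import torch.nn as nn

class ToyDenseCaptioner(nn.Module):
    def __init__(self, vocab_size=10000, feat_dim=512):
        super().__init__()
        self.backbone = nn.Conv2d(3, feat_dim, kernel_size=7, stride=16)  # stand-in for a real CNN
        self.box_head = nn.Conv2d(feat_dim, 4, kernel_size=1)             # box offsets per location
        self.score_head = nn.Conv2d(feat_dim, 1, kernel_size=1)           # "is there something here?"
        self.captioner = nn.LSTM(feat_dim, feat_dim, batch_first=True)
        self.word_head = nn.Linear(feat_dim, vocab_size)

    def forward(self, image, top_k=5, max_words=10):
        fmap = self.backbone(image)                      # 1) whole-image representation
        boxes = self.box_head(fmap).flatten(2)           # 2) proposed regions (one box per location)
        scores = self.score_head(fmap).flatten(2)
        keep = scores.topk(top_k, dim=-1).indices        #    focus on the top-scoring regions
        feats = fmap.flatten(2).transpose(1, 2)          # (batch, locations, feat_dim)
        region_feats = feats.gather(
            1, keep.transpose(1, 2).expand(-1, -1, feats.size(-1)))       # 3) per-region features
        # 4) caption each region: feed its feature at every step of a toy LSTM decoder
        steps = region_feats.unsqueeze(2).expand(-1, -1, max_words, -1)
        h, _ = self.captioner(steps.reshape(-1, max_words, feats.size(-1)))
        return boxes, scores, self.word_head(h)          # all of it in one forward pass

boxes, scores, word_logits = ToyDenseCaptioner()(torch.randn(1, 3, 224, 224))
```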
Starting point is 00:20:33 Yeah, yeah. It was pretty impressive, because most of my graduate students would be satisfied if they can publish the paper, right? They package the research, put it in a paper. But Justin went a step further. He's like, I want to do this real-time web demo. Well, actually, I don't know if I told you this story. But then we had, there was a conference that year in Santiago, ICCV. It was ICCV 15.
Starting point is 00:20:59 And then, like, I had a paper at that conference for something different. But I had my laptop. I was, like, walking around the conference with my laptop, showing everybody this, like, real-time captioning demo. And the model was running on a server in California, so it was like actually able to stream like all the way from California down to Santiago. Well, latency-wise?
Starting point is 00:21:14 It was terrible. It was like one FPS. But the fact that it worked at all was pretty amazing. I was going to briefly quip that, you know, maybe vision and language modeling are not that different. You know, DeepSeek-OCR recently
Starting point is 00:21:26 tried the crazy thing of, let's model text from pixels, and just, like, train on that, and it might be the future. I don't know if you guys have any takes on whether language is actually necessary at all. I just wrote a whole manifesto on spatial intelligence. This is my segue into this. Yes. I think they are different. I do think the architecture of these generative models will share
Starting point is 00:21:54 a lot of shareable components, but I think the deeply 3D, 4D spatial world has a level of structure that is fundamentally different from a purely generative signal that is one-dimensional. Yeah, I think there's something to be said for pixel maximalism, right? Like there's this notion that language is this different thing, but we see language with our eyes, and our eyes are just like, you know, basically pixels, right? Like we've got sort of biological pixels in the back of our eyes that are processing these things. And, you know, we see text and we think of it as this discrete thing,
Starting point is 00:22:31 but that really only exists in our minds. Like, the physical manifestations of text and language in our world are, you know, physical objects that are printed on things in the world, and we see it with our eyes. Well, you can also think about sound. But even sound, even sound you can translate into a visual.
Starting point is 00:22:46 Yeah, you get a spectrogram, which is a 2D signal. Right, and then like you actually lose something if you translate to these, like, purely tokenized representations that we use in LLMs, right? Like you lose the font, you lose the line breaks, you lose sort of the 2D arrangement on the page. And for a lot of cases, for a lot of things,
Starting point is 00:23:02 maybe that doesn't matter. But for some things it does. And I think pixels are this sort of more lossless representation of what's going on in the world. And in some ways, a more general representation that more matches what we humans see as we navigate the world. So, like, there's an efficiency argument to be made. Like, maybe it's not super efficient to, like, you know, render your text to an image
Starting point is 00:23:23 and then feed that to a vision model. That's exactly what DeepSeek did. It, like, kind of worked. I think this ties into the whole world model thing. Like, one of my favorite papers that I saw this year was about using inductive bias to probe world models. So it was a Harvard paper where they fed a lot of orbital patterns
Starting point is 00:23:41 into an LLM and then they asked the LLM to predict the orbit of a planet around the sun. And what the model generated looked good, but then if you asked it to draw the force vectors, it would be all wacky. You know, it wouldn't actually follow it. So how do you think about
Starting point is 00:23:57 what's embedded into the data that you get? And we can talk about maybe tokenizing for 3D world models. Like what are like the dimensions of information, there's the visual, but like how much of like the underlying hidden forces, so to speak, you need to extract out of this data and like what are some of the challenges there? Yeah, I think there's different ways you could approach that problem. One is like you could try to be explicit about it and say like, oh, I want to, you know, measure all the forces and feed
Starting point is 00:24:24 those as training data to your model, right? Then you could like sort of run a traditional physics simulation and, you know, then know all the forces in the scene and then use those as training data to train a model that's now going to hopefully predict those. Or you could hope that something emerges more latently, right? That you kind of train on something end to end and then on a more general problem and then hope that somewhere, something in the internals of the model, must learn to model something like physics in order to make the proper predictions. And those are kind of the two big paradigms that we have more generally.
Starting point is 00:24:53 But there's no indication that that latent modeling will get you to a causal law of space and dynamics, right? That's where today's deep learning and human intelligence actually start to bifurcate, because fundamentally the deep learning is still fitting patterns. There you sort of get philosophical and you say that
Starting point is 00:25:15 like, we're trying to fit patterns too, but maybe we're trying to fit a more broad array of patterns, like with a longer time horizon, a different reward function. But, like, basically the paper you mentioned is sort of, you know, that problem: it learns to fit the specific patterns of orbits, but then it doesn't actually generalize in the way that you'd like.
Starting point is 00:25:32 It doesn't have a sort of causal model of gravity. Right, because even in Marble, you know, I was trying it, and it generates these beautiful sceneries, and there's like arches in them. But does the model actually understand how, you know, the arch is actually, you know, resting on the center keystone and, like, you know, the actual physical structure of it? And the other question is, like, does it matter that it understands it, as long as it always renders something that would fit the physical model that we imagine? If you use the word understand the way you understand, I'm pretty sure the model doesn't understand it. The model is learning from the data, learning from the pattern. Yeah, does it matter, especially for the use cases? It's a good question, right? Like, for now, I don't think it matters, because it renders out what you need, assuming it's perfect.
Starting point is 00:26:25 Yeah, I mean, it depends on the use case. Like if the use case is, I want to generate sort of a backdrop for a virtual film production or something like that, all you need is something that looks plausible. And in that case, probably it doesn't matter. But if you're going to use this to, like, you know, if you're an architect and you're going to use this to design a building that you're then going to go build in the real world, then yeah, it does matter that you model the forces correctly, because you don't want the thing to break when you actually build it. But even there, right, like even if your model has the semantics in it, let's say, I still don't think the understanding of the signal or the output on the model's part and the understanding on the human's part are the same thing. But this gets, again, philosophical. Yeah, I mean, there's this trick with understanding, right? Like, these models are a very different kind of intelligence than human intelligence. And human
Starting point is 00:27:13 intelligence is interesting because, you know, I think that I understand things because I can introspect my own thought process to some extent. And then I believe that my thought process probably works similarly to other people's, so that when I observe someone else's behavior, I infer that their internal mental state is probably similar to my own internal mental state that I've observed, and therefore, since I know that I understand things, I assume that you understand something.
Starting point is 00:27:37 But these models are sort of like this alien form of intelligence, where they can do really interesting things, they can exhibit really interesting behavior, but whatever kind of internal, the equivalent of internal cognition or internal self-reflection that they have, if it exists at all, is totally different from what we do. It doesn't have the self-awareness.
Starting point is 00:27:53 Right. But what that means is that when we observe seemingly interesting or intelligent behavior out of these systems, we can't necessarily infer other things about them, because their model of the world and the way they think is so different from us. So would you need two different models
Starting point is 00:28:08 to do the visual one and the architectural generation, you think? Eventually, like, is there not anything fundamental about the approach that you've taken on the model building, and it's more about scaling the model and the capabilities of it? Or, like, is there something about being very visual that prohibits you from actually learning the physics behind this, so to speak, so that you could trust it to generate a CAD design that then is actually going to work in the real world?
Starting point is 00:28:36 I think this is a matter of scaling data and bettering the model. I don't think there's anything fundamental that separates these two. Yeah, I would like it to be one model. But I think, like, the big problem in deep learning in some sense is how do you get emergent capabilities beyond your training data. Are you going to get something that understands the forces while it wasn't trained to predict the forces, but it's going to learn them implicitly, internally? And I think a lot of what we've seen in other large models is that a lot of this emergent behavior does happen at scale. And will that transfer to other modalities and other use cases
Starting point is 00:29:06 and other tasks? I hope so, but that'll be a process that we need to play out over time and see. Is there a temptation to rely on physics engines that already exist out there, where basically the gaming industry has saved you a lot of this work, or do we have to reinvent things because of some fundamental mismatch? I think that's sort of like climbing the ladder of technology, right? Like, in some sense, the reason
Starting point is 00:29:29 that you want to build these things at all is because maybe traditional physics engines don't work in some situations. If a physics engine was perfect, we would have sort of no need to build models because the problem would have already been solved. So in some sense, the reason why we want to do this is because classical physics engines don't solve problems in the generality that we want.
Starting point is 00:29:46 But that doesn't mean we need to throw them away and start everything from scratch, right? We can use traditional physics engines to generate data that we then train our models on. And then you're sort of distilling the physics engine into the weights of the neural network that you're training. I think that's a lot of what, if you compare the work of other labs, people are speculating that, you know, Sora had a little bit of that. Genie 3 had a bit of that.
Starting point is 00:30:09 And Genie 3 is like explicitly like a video game. Like you have controls to walk around in. And I always think it's really funny how the things that we invent for fun actually do eventually make it into serious work. Yeah, the whole AI revolution was started by graphics chips, partially. Misusing the GPU, from generating a lot of triangles to generating a lot of everything else, basically. We touched on Marble a little bit. I think you guys chose Marble as, I kind of feel like, sort of a little bit of a coming-out-of-stealth moment, if you can call it that.
Starting point is 00:30:42 Maybe we can get a concise explanation from you on what people should take away, because everyone here can try Marble, but I don't think they might be able to link it to the differences between what your vision is versus other, I guess, generative worlds they may have seen from other labs. So Marble is a glimpse into our model, right? We are a spatial intelligence model company. We believe spatial intelligence is the next frontier. In order to make spatially intelligent models, the model has to be very powerful in terms of its ability to understand, reason, and generate worlds in a very multimodal fashion, as well as allow the level of interactivity that we eventually hope to be as complex as how humans can interact with the
Starting point is 00:31:31 world. So that's the grand vision of spatial intelligence as well as the kind of world models we see. Marble is the first glimpse into that. It's the first part of that journey. It's the first-in-class model in the world that generates 3D worlds at this level of fidelity that is in the hands of the public. It's the starting point, right? We actually wrote this tech blog. Justin spent a lot of time writing that tech blog. I don't know if you had time to browse it. I mean, Justin really broke it down into what are the inputs, the kind of multimodal inputs of Marble, what are the kinds of editability, which allow the user to be interactive with the model, and what are the kinds of outputs we can have. Yeah. So Marble, like basically one way of looking at it, it's the
Starting point is 00:32:23 system, it's a generative model of 3D worlds, right? So you can input things like text or image or multiple images, and it will generate for you a 3D world that kind of matches those inputs. And it's also interactive in the sense that you can interactively edit scenes. Like I could generate this scene and then say, I don't like the water bottle, make it blue instead, like take out the table, like change these microphones around. And then you can generate new worlds based on these interactive edits and export in a variety of formats. And with Marble, we were actually trying to do sort of two things simultaneously, and I think we managed to pull off the balance pretty well.
Starting point is 00:32:55 One is actually build a model that goes towards the grand vision of spatial intelligence. And models need to be able to understand lots of different kinds of inputs, need to be able to model worlds in a lot of situations, need to be able to model counterfactuals of how they could change over time. So we wanted to start to build models that have these capabilities, and Marble today does already have hints of all of these. But at the same time, we're a company, we're a business. We were really trying not to have this be a science project, but also build a product that would be useful to people in the real world today. So while Marble is simultaneously a world model that is building towards this vision of spatial intelligence, it was also very intentionally designed to be a thing that people could find useful today. And we're starting to see emerging use cases in gaming, in VFX, in film, where I think there's a lot of really interesting stuff that Marble can do today as a product, and then also set a foundation for the grand world models that we want to build going into the future.
Starting point is 00:33:50 Yeah, I noticed this one tool that was very interesting, so you can record your scene inside. Yes. Yes, it's very important. The ability to record means very precise control of camera placement. In order to have precise camera placement, it means you have to have a sense of 3D space. Otherwise, you don't know how to orient your camera and how to move your camera. So that is a natural consequence of this kind of model. And this is why this is just one of the examples.
Starting point is 00:34:21 Yeah, I find when I play with video generative models, I'm having to learn the language of being a director, because I have to know the moves, like pan, you know, like dolly out. But even there, you cannot say pan 63 degrees to the north, right? You just don't have that control. Whereas in Marble, you have precise control in terms of placing a camera. Yeah, I think that's one of the first things people need to understand. It's like you're not generating frame by frame, which is what a lot of the other models are doing. You know, people understand that an LLM generates one token.
Starting point is 00:34:56 What are, like, the atomic units? There's kind of like, you know, the meshes, there's like the splats, the voxels. There's a lot of pieces in a 3D world. What should be the mental model that people have of, like, your generations? Yeah, I think there's like what exists today and what could exist in the future. So what exists today is the model natively outputs splats. So Gaussian splats are these, like, you know, each one is a tiny, tiny particle that's semi-transparent, has a position and orientation in 3D space, and the scene is built up from a large number of these Gaussian splats.
Starting point is 00:35:24 And Gaussian splats are really cool because you can render them in real time really efficiently, so you can render on your iPhone, render everything. And that's how we get that sort of precise camera control because the splats can be rendered real time on pretty much any client-side device that we want. So for a lot of the scenes that we're generating today, that kind of atomic unit is that individual splat. But I don't think that's fundamental. I could imagine
Starting point is 00:35:45 other approaches in the future that would be interesting. So there are other approaches that even we've worked on at World Labs, like our recent RTFM model, that does generate frames one at a time. And there the atomic unit is generating frames one at a time as the user interacts with the system.
Starting point is 00:36:01 Or you could imagine other architectures in the future where the atomic unit is a token where that token now represents some chunk of the 3D world. And I think there's a lot of different architectures that we can experiment with here over time. I do want to press on, double click on this a little bit.
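For readers unfamiliar with the representation Justin describes, here is a minimal sketch of the kind of per-splat state a Gaussian-splat scene carries; the field names and shapes are illustrative assumptions, not Marble's actual export format.

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class GaussianSplat:
    position: np.ndarray   # (3,) center in world space
    rotation: np.ndarray   # (4,) quaternion orientation of the Gaussian
    scale: np.ndarray      # (3,) per-axis extent of the Gaussian
    color: np.ndarray      # (3,) RGB (real pipelines often use spherical harmonics)
    opacity: float         # semi-transparency, blended at render time

# A "world" is then just a large unordered collection of these particles,
# which is what makes real-time rendering on client devices practical.
scene = [
    GaussianSplat(
        position=np.random.randn(3),
        rotation=np.array([1.0, 0.0, 0.0, 0.0]),
        scale=np.abs(np.random.randn(3)) * 0.05,
        color=np.random.rand(3),
        opacity=0.8,
    )
    for _ in range(100_000)
]
```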
Starting point is 00:36:15 My version of what Alessio was going to say was like, what is the fundamental data structure of a world model? Because exactly, like you said, it's either a Gaussian splat or it's like the frame or what have you. You also, in the previous statements, focused a lot on the physics and the forces, which is something over time,
Starting point is 00:36:31 loosely. I don't see that in Marble. I presume it's not there yet. Maybe if there was like a Marble 2, you would have movement? Or is there a modification to Gaussian splats that makes sense? Or would it be something completely different? Yeah, I think there's a couple modifications that make sense. And there's actually a lot of interesting ways to integrate things here, which is another nice part of working in this space. And there's actually been a lot of research work on this. Like, when you talk about wacky ideas, like, there's actually been a lot of really interesting academic work on different ways to imbue physics. You can also do wacky ideas in the industry. All right, but then, it's like, Gaussian splats are themselves little particles. There have been a lot of approaches where you basically attach physical properties to those splats and say that each one has a mass, or like maybe you treat each one as being coupled with some kind of virtual spring to nearby neighbors. And now you can start to do sort of physics simulation on top of splats. So one kind of avenue for adding physics or dynamics or interaction to these things would be to, you know,
Starting point is 00:37:26 predict physical properties associated with each of your splat particles, and then simulate those downstream, either using classical physics or something learned. Or, you know, kind of the beauty of working in 3D is things compose and you can inject logic in different places. So one way is sort of like, we're generating a 3D scene, we're going to predict 3D properties of everything in the scene, then we use a classical physics engine to simulate the interaction.
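As a concrete illustration of that first avenue, here is a hedged sketch: attach mass and virtual springs to splat particles and step a classical simulation over their centers. The function name, constants, and the explicit-Euler integrator are illustrative choices, not anything World Labs has described.

```python
import numpy as np

def step_splat_physics(positions, velocities, masses, neighbors, rest_lengths,
                       k=50.0, gravity=np.array([0.0, -9.8, 0.0]), dt=1e-3):
    """One explicit-Euler step of a toy mass-spring system over splat centers.

    positions:    (N, 3) splat centers
    velocities:   (N, 3)
    masses:       (N,)
    neighbors:    list of (i, j) index pairs coupled by virtual springs
    rest_lengths: one rest length per neighbor pair
    """
    forces = masses[:, None] * gravity                    # gravity acts on every splat
    for (i, j), rest in zip(neighbors, rest_lengths):
        d = positions[j] - positions[i]
        length = np.linalg.norm(d) + 1e-8
        f = k * (length - rest) * (d / length)             # Hooke's law along the spring
        forces[i] += f
        forces[j] -= f
    velocities = velocities + dt * forces / masses[:, None]
    positions = positions + dt * velocities
    return positions, velocities
```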
Starting point is 00:37:48 Or you could do something where, like, as a result of a user action, the model is now going to regenerate the entire scene in splats or some other representation. And that could potentially be a lot more general, because then you're not bound to whatever sort of, you know, physical properties you know how to model already. But that's also a lot more computationally demanding because then you need to regenerate the whole scene
Starting point is 00:38:08 in response to user actions. But I think this is a really interesting area for future work, and for adding on to a potential Marble 2, as you say. Yeah, there's opportunity for dynamics, right? What's the state of, like, splat density, I guess? Like, can we render enough to have very high resolution when we zoom in? Are we limited by the amount that you can generate,
Starting point is 00:38:32 the amount that we can render? How are these going to get super high fidelity, so to speak? You have some limitations, but it depends on your target use case. So, like, one of the big constraints that we have on our scenes is we wanted things to render cleanly on mobile, and we wanted things to render cleanly in VR headsets. So those devices have a lot less compute than you have in a lot of other situations. And, like, if you want to get a splat file to render at high resolution,
Starting point is 00:38:56 at, like, 30 to 60 FPS on, like, an iPhone from four years ago, then you are a bit limited in, like, the number of splats that you can handle. But if you're allowed to, like, work on a recent, like even this year's iPhone or, like, a recent MacBook, or even if you have a local GPU, or if you don't need that 60 FPS, 1080p, then you can relax the constraints and get away with more splats, and that lets you get good high resolution in your scenes. One use case I was expecting but didn't hear from you was embodied use cases. You're just focusing on virtual for now?
Starting point is 00:39:29 If you go to the World Labs homepage, there is a particular page called Marble Labs. There we showcase different use cases, and we actually organize them into more visual effects use cases or gaming use cases as well as simulation use cases. And in that, we actually show this is a technology that can help a lot in robotic training, right? This goes back to what I was talking about earlier. Speaking of data starvation, robotic training really lacks data. You know, high-fidelity, real-world data is absolutely critical, but you're just not going to get a ton of that. Of course, the other extreme is just purely internet video data,
Starting point is 00:40:14 but then you lack a lot of the controllability that you want to train your embodied agents with. So simulation and synthetic data is actually a very important middle ground. For that, I've been working in this space for many years. One of the biggest pain points is, where do you get the synthetic simulated data? You have to curate assets and build these, compose these complex situations. And in robotics, you want a lot of different states. You want the embodied agent to interact in the synthetic environment. Marble actually has real potential for helping to generate these synthetic simulated worlds
Starting point is 00:40:58 for embodied agent training. Obviously, that's on the home page. It'll be there. I was like trying to make the link to, as you said, like you also have to build like a business model. The market for robotics obviously is very huge. Maybe you don't need that, or maybe we need to build up and solve the virtual worlds first before we go to embodied. And obviously, I think that's a big stepping stone.
Starting point is 00:41:20 That is to be decided. I do think that a... Because everyone else is going straight there. Right? Not everyone else, but there is an excitement, I would say. But, you know, I think the world is big enough to have different approaches. Yeah, approaches. Yeah, I mean, and we always view this as a pretty horizontal technology
Starting point is 00:41:42 that should be able to touch a lot of different industries over time. And, you know, Marble is a little bit more focused on creative industries for now. But I think the technology that powers it should be applicable to a lot of different things over time. And robotics is one that, you know, is maybe going to happen sooner than later. Also design, right? It's very adjacent to creative. Oh, yeah, definitely. Like, I think it's like the architecture stuff? Yes.
Starting point is 00:42:02 Yeah, I mean, I was joking online. I posted this video on Slack of, like, oh, who wants to use Marble to plan your next kitchen remodel? It actually works great for this already. Just, like, take two images of your kitchen, like, reconstruct it in Marble, and then use the editing features to see what that space would look like if you change the countertops or change the floors or change the cabinets. And this is something where, you know, we didn't necessarily build anything specific for this use case.
Starting point is 00:42:24 But because it's a powerful horizontal technology, you kind of get these emergent use cases that just fall out of the model. We have early beta users using an API key who are already building for the interior design use case. I just did my garage. I should have known about this. Next time you remodel, we can be of help. Kitchen is next, I'm sure.
Starting point is 00:42:49 Yeah. Yeah, I'm curious about the whole spatial intelligence space. I think we should dig more into that. One, how do you define it? And what are, like, the gaps between traditional intelligence that people might think about in LLMs, when, you know, Dario says we have a data center full of Einsteins? That's like traditional intelligence, it's not spatial intelligence. What is required to be spatially intelligent?
Starting point is 00:43:14 First of all, I don't understand that sentence, a data center full of Einsteins. I just don't understand that. It's not a deep... It's an analogy. Well, so a lot of AI as a field, as a discipline, is inspired by human intelligence, right? Because we are the most intelligent animal we know in the universe for now. And if you look at human intelligence, it's very multi-intelligent, right?
Starting point is 00:43:44 There is a psychologist, I think his name is Howard Gardner, who in the 1960s actually literally used the term multiple intelligences to describe human intelligence. And there is linguistic intelligence, there's spatial intelligence, there is logical intelligence, and emotional intelligence. So for me, when I think about spatial intelligence, I see it as complementary to language intelligence. So I personally would not say it's spatial versus traditional, because I don't know what tradition means, what does that mean? I do think spatial is complementary to linguistic. And how do we define spatial intelligence? It's the capability that allows you to reason, understand, move, and interact
Starting point is 00:44:34 in space. And I use this example of the deduction of DNA structure, right? And of course, I'm simplifying this story, but a lot of that had to do with the spatial reasoning of the molecules and the chemical bonds in a 3D space to eventually conjecture a double helix. And that ability that humans or Francis Crick and Watson had done, it is very, very hard to reduce that process into pure language. And that's a pinnacle of a civilizational moment. But every day, right, I'm here trying to grasp a mug. This whole process of seeing the mug, seeing the context where it is,
Starting point is 00:45:24 seeing my own hand, the opening of my hand that geometrically would match the mug, and touching the right affordance points, all this is deeply, deeply spatial. It's very hard. I'm trying to use language to narrate it. But on the other hand, that narrative language itself cannot get you to pick up a mug. Yeah, bandwidth constrained. Yes.
Starting point is 00:45:49 I did some math recently on, like, if you just spoke all day, every day, for 24 hours a day, how many tokens do you generate? At the average speaking rate of like 150 words per minute, it roughly rounds out to about 250,000 tokens per day. And, like, the world that you live in is so much higher bandwidth than that. Well, I think that is true, but if I think about Sir Isaac Newton, right, it's like you have things like gravity that at the time had not been formalized in language, that people inherently spatially understand, that things fall, right?
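For the curious, a quick back-of-envelope check of Justin's figure; the tokens-per-word ratio is an assumption (roughly 1.15 for English text), not something quoted in the episode.

```python
# Rough sanity check of the ~250k tokens/day estimate.
words_per_minute = 150
words_per_day = words_per_minute * 60 * 24     # 216,000 words if you spoke nonstop
tokens_per_day = words_per_day * 1.15          # ~248,000 tokens at ~1.15 tokens/word
print(f"{tokens_per_day:,.0f} tokens per day")  # roughly a quarter million
```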
Starting point is 00:46:23 But then it's helpful to formalize that in some way, or, like, you know, all these different rules where we use language to, like, really capture something that empirically and spatially you can also understand, but it's easier to, like, describe in a way. So I'm curious about, like, the interplay of spatial and linguistic intelligence, which is like, okay, you need to understand: some rules are easier to write in language for the spatial intelligence to then understand. But you cannot, you know, you cannot write, put your hand like this and put it down this amount. So I'm always curious about how they leverage each other. I mean, if anything, like the example of Newton, like, Newton only thinks to write down those laws because he's had a lot of embodied experience in the world.
Starting point is 00:47:05 Right, yeah, exactly. And actually, it's useful to distinguish between the theory building that you're mentioning versus like the embodied, like the daily experience of being embedded in the three dimensional world, right? So to me, spatial intelligence is sort of encapsulating that embodied experience of being there in 3D space, moving through it, seeing it, actioning it. And as Fei-Fei said, you can narrate those things, but it's a very lossy channel. It's just like the notion of, you know, being in the world and doing things in it is a very different modality from trying to describe it. But because we as humans are animals who have evolved interacting in space all the time, like, we don't even think that that's a hard thing, right? And then we sort of naturally leap to language and then theory building as mechanisms to abstract above that sort of native spatial understanding. And in some sense, LLMs have just like jumped all the way to those highest forms of abstracted reasoning, which is very interesting and very useful.
Starting point is 00:47:56 But spatial intelligence is almost like opening up that black box again and saying maybe we've lost something by going straight to that fully abstracted form of language and reasoning and communication. You know, it's funny as a vision scientist, right? I always find that vision is underappreciated because it's effortless for humans. You open your eyes as a baby, you start to see your world. We're somehow born with it. We're almost born with it.
Starting point is 00:48:21 But you have to put effort into learning language, including learning how to write, how to do grammar, how to express. And that makes it feel hard. Whereas something that nature has spent way more time actually optimizing, which is perception and spatial intelligence, is underappreciated by humans. Is there proof that we are born with it? You said almost born. So it sounds like we actually do learn after we're born.
Starting point is 00:48:48 When we are born, our visual acuity is less, and our perceptual ability does increase. But we are, most humans are born with the ability to see. And most humans are born with the ability to link perception with motor movements, right? I mean, the motor movement itself takes a while to refine. And then animals are incredible, right?
Starting point is 00:49:15 Like, I was just in Africa earlier this summer: these little animals, they're born, and within minutes they have to get going, otherwise, you know, the lions will get them. And in nature, you know, it took 540 million years to optimize perception and spatial intelligence, and for language, the most generous estimate of language development is probably half a million years.
Starting point is 00:49:41 Wow. That's longer than I would have guessed. I'm being very generous. Yeah. Yeah, no, I was sort of going through your book, and I was realizing that one of the interesting links to something that we covered on the podcast is language model benchmarks, and how the Winograd schemas actually put in all these sorts of physical impossibilities that require spatial intelligence, right?
Starting point is 00:50:03 Like, A is on top of B, therefore A can't fall through B; that's obvious to us, but to a language model, it could happen. I don't know. Maybe it's just part of the next-token prediction. And that's kind of what I mean about unwrapping the abstraction, right? Like, if your whole model of the world is just saying sequences of words after each other, it's really kind of hard to say why not. It's actually unfair, right? But then the reason it's obvious to us is because we are internally mapping it back to some three-dimensional representation of the world that we're familiar with.
Starting point is 00:50:30 The question is, I guess, how hard is it, you know, how long is it going to take us to distill, and I use the word distill, I don't know if you agree with that, from your world models into a language model? Because we do want our models to have spatial intelligence, right? And do we have to throw the language model out completely in order to do that? No. No, right? Yeah, I don't think so. I think they're multimodal. I mean, even our model, Marble, today takes language as an input. Right.
Starting point is 00:50:57 Right. So it's deeply multi-modal. And I think in many use cases, these models will work together. Maybe one day we'll have a universal model. I mean, even if you do, like, there's sort of a pragmatic thing where people use language and people want to interact with systems using language. Even pragmatically, it's useful to build systems and build products and build models that let people talk to them. So I don't see that going away.
Starting point is 00:51:20 I think there's a sort of intellectual curiosity of saying how, like, intellectually, how much could you build a model that only uses vision or only uses spatial intelligence. I don't know that that would be practically useful, but I think it'd be an interesting intellectual or academic exercise to see how far you could push that. I think, I mean, not to bring it back to physics, but I'm curious, like, if you had a highly precise world model and you didn't give it any notion of, like,
Starting point is 00:51:45 our current understanding of the standard model of physics, how much of it would it be able to come up with and recreate from scratch, and what level of language understanding would it need? Because we have so many notations that we kind of use, that we created. But maybe it would come up with a very different model of it and still be accurate. And I wonder how much we're kind of limited. You know how people say
Starting point is 00:52:07 humanoid robots always need to be shaped like humans because the world is built for humans? In a way, the way we built language constrains some of the outputs that we can get from these other modalities as well. So I'm super excited to follow your work. Yeah, I mean, there's another angle. You actually don't even need to be doing AI to answer that question. You could discover aliens and see what kind of physics they have. Right.
Starting point is 00:52:26 Right. And they might have a totally different one. That said, we are so far the smartest animal in the universe. But that is a really interesting question, right? Like, is our knowledge of the universe and our understanding of physics constrained in some way by our own cognition, or by the path dependence of our own technological evolution? And one way to sort of do an experiment, like, you almost want to do an experiment and say, if we were to rerun human civilization again,
Starting point is 00:52:50 would we come up with the same physics in the same order? And I don't think that's a very practical experiment. You know, one experiment I wonder if people could run: we have plenty of astrophysical data now on planetary and celestial body movements. Just feed the data into a model and see if Newtonian law emerges. My guess is it probably won't. That's my guess.
Starting point is 00:53:17 The abstraction level of Newtonian law is at a different level from what these LLMs represent. So I wouldn't be surprised that, given enough celestial movement data, an LLM would actually predict pretty accurate movement trajectories. Let's say I invent a planet orbiting a star; given enough data, my model would tell you, you know, on day one where it is, on day two where it is.
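As an aside, here is a minimal, purely illustrative sketch of that thought experiment: simulate some toy "celestial movement data" for a circular orbit and fit a data-driven next-step predictor that has no notion of force or mass. The simulation and model are assumptions for illustration, not anything the guests describe building.

```python
import numpy as np

# Toy "celestial movement data": a planet on a circular orbit.
dt = 0.01
t = np.arange(0.0, 20.0, dt)
orbit = np.stack([np.cos(t), np.sin(t)], axis=1)     # (N, 2) positions over time

# Purely data-driven predictor: next position from the previous two positions,
# fit by least squares. No gravity and no F = ma anywhere in the model.
X = np.hstack([orbit[1:-1], orbit[:-2]])             # features: pos[t], pos[t-1]
Y = orbit[2:]                                        # target:   pos[t+1]
W, *_ = np.linalg.lstsq(X, Y, rcond=None)

pred = X @ W
print("mean next-step error:", np.abs(pred - Y).mean())   # near zero: the trajectory is predictable
# The fitted weights extrapolate the orbit well, but nothing in them reads as
# an inverse-square law; that abstraction has to come from somewhere else.
```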
Starting point is 00:53:48 I wouldn't be surprised. But F equals ma, or, you know, action equals reaction, that's just a whole different abstraction level. That's beyond today's LLMs. Okay, what world model would you need to not end up with a geocentric model? Because if I'm training just on visual data, it makes sense that you think the sun rotates around the earth, right? But obviously, that's not the case. So how would it learn that?
Starting point is 00:54:18 Like, I'm curious about all these, you know, forces that we talk about. Sometimes maybe you don't need them, because as long as it looks right, it's right. But as you make the jump to trying to use these models to do more high-level tasks, how much can we rely on them? I think you need kind of a different learning paradigm, right? So, you know, there's a bit of conflation happening here, where we're saying, is it LLMs and language and symbols versus, you know, human theory building and human physics? And they're very different, because an LLM... like, the human objective function is to understand the world and thrive in your life, and the way that you do that is, you know, sometimes you observe data and then you think about it,
Starting point is 00:54:56 and then you try to do something in the world, and it doesn't match your expectations, and then you want to go and update your understanding of the world online. And people do this all the time, constantly. Like, whether it's, you know, I think my keys are downstairs, so I go downstairs and I look for them,
Starting point is 00:55:10 and I don't see them, and oh no, they're actually up in my bedroom. Because we're constantly interacting with the world, we're constantly having to build theories about what's happening in the world around us, and then falsify or add evidence to those theories. And I think that kind of process, writ large and scaled up, is what gives us F equals ma, Newtonian physics. And I think that's a little orthogonal to, you know, the modality of the model that we're training, whether it's
Starting point is 00:55:33 language or spatial. The way I put it is, this is almost more efficient learning, because you have a hypothesis of: here are the different possible worlds that are granted by my available data, and then you do experiments to eliminate the worlds that are not possible, and you resolve to the one that's right. To me, that's also how I have theory of mind, which is, I have a few hypotheses of what you're thinking and what you're thinking, and I try to create actions to resolve that, or check my intuition as to what you're thinking. And obviously, these models don't do any of this.
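For what it's worth, here is a deliberately toy sketch of that "eliminate the impossible worlds" loop, reusing the keys example from above; the locations and observations are made up for illustration.

```python
# Toy hypothesis elimination: keep the set of worlds consistent with observations.
possible_worlds = {"kitchen", "downstairs", "bedroom", "car"}   # where the keys might be

def observe(location, keys_seen):
    """Checking a location either pins down the answer or rules that world out."""
    global possible_worlds
    if keys_seen:
        possible_worlds = {location}
    else:
        possible_worlds.discard(location)

observe("downstairs", keys_seen=False)   # went downstairs, no keys
observe("kitchen", keys_seen=False)      # not in the kitchen either
print(possible_worlds)                   # {'bedroom', 'car'} remain consistent
```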
Starting point is 00:56:09 Theory of mind possibly also leads into emotional intelligence, which today's AI is really not touching at all, right? And we really need it. People are starting to depend on these things, probably too much, and that's a whole other topic of debate. I do have to ask, because a lot of people have sent this to us: how much do we have to get rid of? Is sequence-to-sequence modeling out the window?
Starting point is 00:56:36 Is attention out the window? How much are we re-questioning everything? I think you stick with stuff that works, right? So attention is still there. I think attention is still there. You don't need to fix things that aren't broken. And there are a lot of hard problems in the world to solve,
Starting point is 00:56:53 but let's focus on one at a time. I think it is pretty interesting to think about new architectures, or new paradigms, or drastically different ways to learn. But you don't need to throw away everything just because you're working
Starting point is 00:57:04 on new modalities. I think sequence to sequence... actually, in world models, I think we are going to see algorithms or architectures beyond sequence to sequence. Oh, but here, actually, I think there's a little bit of
Starting point is 00:57:18 technological confusion, and transformers already solved that for us, right? Like, transformers are actually not a model of sequences. A transformer is natively a model of sets, and that's very powerful. But a lot of the transformer grew out of earlier architectures based around recurrent neural networks, and RNNs definitely do have a built-in architectural assumption there: they do model one-dimensional sequences. But transformers are just models of sets, and they can model a lot of things; those sets could be, you know, 1D sequences, they could be other things as well. Do you literally mean set theory? Yeah, yeah. So a transformer is actually not a model of a sequence of tokens; a transformer is actually a model of a set of tokens. The only thing that injects the order into it, in the standard transformer architecture, the only thing that differentiates the order of the things
Starting point is 00:58:04 is the positional embedding that you give the tokens. So if you choose to give a sort of 1D positional embedding, that's the only mechanism the model has to know that it's a 1D sequence. But all the operators that happen inside a transformer block are either token-wise, right? You have an FFN, you have QKV projections, you have per-token normalization.
Starting point is 00:58:24 All of those happen independently per token. Then you have interactions between tokens through the attention mechanism. But that's also permutation equivariant. So if I permute my tokens, then the attention operator gives a permuted output in exactly the same way. So it's natively an architecture over sets of tokens. Literally a transform, yeah, in the mathematical sense.
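To make that concrete, here is a minimal sketch of a single attention head in NumPy (purely illustrative, not World Labs' code): without positional embeddings, permuting the input tokens just permutes the outputs in the same way, and adding positional embeddings is what breaks that symmetry and injects 1D order.

```python
import numpy as np

rng = np.random.default_rng(0)

def self_attention(x, Wq, Wk, Wv):
    """Single-head self-attention with no positional information."""
    q, k, v = x @ Wq, x @ Wk, x @ Wv
    scores = q @ k.T / np.sqrt(k.shape[-1])
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v

d = 8
tokens = rng.normal(size=(5, d))                    # a "set" of 5 tokens
Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))
perm = rng.permutation(5)

out = self_attention(tokens, Wq, Wk, Wv)
out_perm = self_attention(tokens[perm], Wq, Wk, Wv)
print(np.allclose(out[perm], out_perm))             # True: attention treats the tokens as a set

# Positional embeddings (indexed by position, not by token) break the symmetry:
pos = rng.normal(size=(5, d))
out_pos = self_attention(tokens + pos, Wq, Wk, Wv)
out_pos_perm = self_attention(tokens[perm] + pos, Wq, Wk, Wv)
print(np.allclose(out_pos[perm], out_pos_perm))     # False: order now matters
```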
Starting point is 00:58:43 I know we're out of time, but we just want to give you the floor for a call to action, whether on people who would enjoy working at World Labs and what kind of people should apply, what research people should be doing outside of World Labs that would be helpful to you, or anything else on your mind. I do think it's a very exciting time to be looking beyond just language models and thinking about the boundless possibilities of spatial intelligence. So we are actually hungry for talent, ranging from very deep researchers, right, thinking about problems like Justin just described, you know,
Starting point is 00:59:19 training large world models. We are hungry for engineers, good engineers building systems, you know, from training optimization to inference to product. And we're also hungry for good business and product thinkers, go-to-market and, you know, business talent. So we are hungry for talent. Especially now that we have exposed the model to the world through Marble, I think we have a great opportunity to work with an even bigger pool of talent
Starting point is 00:59:55 to solve both the model problem as well as deliver the best product to the world. Yeah, I think I'm also excited for people to try Marble and do a lot of cool stuff with it. I think it has a lot of really cool capabilities, a lot of really cool features that fit together really nicely. In the car coming here, Justin and I were saying, people have not totally discovered the... okay, it's only been 24 hours... have not totally discovered some of the advanced modes
Starting point is 01:00:21 of editing, right? Like, turn on the advanced mode and you can, like Justin said, change the color of the bottle, you know, change your floor and change the trees. Well, I actually tried to get there, but when it says create,
Starting point is 01:00:34 it just makes me create a completely different world You need to click on the advanced mode We can improve on our UIUX but remember to click Yeah we need to hire people We'll work on the product But one thing we got
Starting point is 01:00:47 It was clear from you guys are looking for It's also intellectual fearlessness Which is something that I think you guys hold as principle Yeah I mean we are literally The first people who are trying this Both on the model side as well as on the product side Thank you so much for joining us Thank you guys
Starting point is 01:01:05 Thanks for having us Yeah Thanks for listening to the A16Z podcast If you enjoyed the episode Let us know by leaving a review at rate thispodcast.com slash A16Z. We've got more great conversations coming your way. See you next time.
Starting point is 01:01:22 This information is for educational purposes only and is not a recommendation to buy, hold, or sell any investment or financial product. This podcast has been produced by a third party and may include paid promotional advertisements, other company references, and individuals unaffiliated with a16z. Such advertisements, companies, and individuals
Starting point is 01:01:38 are not endorsed by AH Capital Management LLC, A16Z, or any of its affiliates. Information is from sources deemed reliable on the date of publication, but A16Z does not guarantee its accuracy.
