Latent Space: The AI Engineer Podcast - Moonlake: Causal World Models should be Multimodal, Interactive, and Efficient — with Chris Manning and Fan-yun Sun

Episode Date: April 2, 2026

We’ve been on a bit of a mini World Models series over the last quarter: from introducing the topic with Yi Tay, to exploring Marble with World Labs’ Fei-Fei Li and Justin Johnson, to previewing World Models learned from massive gaming datasets with General Intuition’s Pim de Witte (who has now written down their approach to World Models with Not Boring), to discussing the Cosmos World Model with Andrew White of Edison Scientific on our new Science pod, to writing up our own theses on Adversarial World Models. Meanwhile Nvidia, Waymo and Tesla have published their own approaches, Google has released Genie 3, and Yann LeCun has raised $1B for AMI and published LeWorldModel.

Today’s guests have a radically different approach to World Modeling from every player we just mentioned. While Genie 3 is impressive, its many flaws demonstrate the issues with its approach: terrain clipping, noninteractivity (single player, no physics, no objects other than the player move), and a maximum of 60 seconds of immersion. Moonlake AI (inspired by the Dreamworks logo) is the diametric opposite: immediately multiplayer, incredibly interactive, indefinite lifetime, and capable of MANY different kinds of world models by simulating environments, predicting outcomes, and planning over long horizons. This is enabled by bootstrapping from game engines and training custom agents.

In Towards Efficient World Models, Chris Manning and Ian Goodfellow join Fan-Yun in explaining why their approach to efficiency with structure and causality instead of just blind scaling is sorely needed:

SOTA models still show physical or spatial understanding glitches, such as solid objects floating in mid-air or moving “inside” other solid objects.

If the goal is to plan for the next action, how often is a high-resolution pixel view necessary for modeling the world? Our bet is that there is a disproportionately large share of economically valuable tasks where such detail is not required.
After all, humans with a wide variety of sensory limitations have little difficulty doing almost everything in the world. Furthermore, for a large number of purposes, describing a scene or a situation in a few words of language (“the car’s tires squealed as it cornered sharply”) is sufficient for understanding and planning.

Experiments also show that humans only partially process visual input in a top-down, task-directed way, often making use of abstracted object-level modeling. In almost all cases, partial representations combined with semantic understanding are sufficient.

…

If the goal is to facilitate the understanding of causality in multimodal environments, then the world model, whether it is used in the virtual world or the physical world, must prioritize properties such as spatial and physical state consistency maintained over long time periods, and an ability to evolve the world that accurately reflects the consequences of actions.

That’s what Moonlake is building.

Game engines are the right starting abstraction for efficiently extracting causal relationships, and Moonlake is building the interfaces and community (including their new $30,000 Creator Cup) to kickstart the flywheel of actions-to-observations.

We were fortunate enough to attend their sessions at GDC 2026 (the Mecca of Game Devs), and were impressed by the huge variety and flexibility of the worlds people were building with Moonlake’s tools already!
Live videos on the pod.

Full Video Pod on YouTube!

Timestamps
00:00 Benchmarking Gets Hard
00:47 Meet Moonlake Founders
01:26 Why Build World Models
03:12 Structure Not Just Scale
05:37 Defining Action Conditioned Worlds
07:32 Abstraction Versus Bitter Lesson
14:39 Language Versus JEPA Debate
20:27 Reasoning Traces And Rendering Layer
37:00 Gameplay Over Graphics
38:02 Fiction Rules And World Tweaks
39:15 Code Engines Beat Learned Priors
41:10 Diffusion Scaling Limits
43:23 Symbolic Versus Diffusion Boundary
46:14 Platform Vision Beyond Games
50:24 Spatial Audio And Multimodal Latents
54:23 NLP Roots Hiring And Moonlake Name

Transcript
[00:00:00] Cold Open
[00:00:00] Chris Manning: I think this whole space is extremely difficult as things are emerging now. And I mean, it’s not only for world models, I think it’s for everything including text-based models, right? ’Cause in the early days it seemed very easy to have good benchmarks, ’cause we could do things like question answering benchmarks. [00:00:20] But these days so much of what people are wanting to do is nothing like that, right? You’re wanting to get some recommendations about which backpack would be best for you for your trip in Europe next month. It’s not so easy to come up with a benchmark, and it’s the same problem with these world models.
[00:00:41] Meet the Founders
[00:00:41] swyx: Okay. We’re back in the studio with Moonlake’s two leads. I guess there’s other founders as well, but: Sun and Chris Manning. Welcome to the studio.
[00:00:54] Fan-yun Sun: Thanks. Thanks, Chris. Thanks for having us.
[00:00:56] swyx: You guys have burst onto the scene with a really refreshing [00:01:00] new take on world models. [00:01:01] I just want to ask how the two of you came together. Chris, you’re a legend in NLP and just AI in general.
You’re his grad student, I guess?
[00:01:10] Fan-yun Sun: Actually my co-founder.
[00:01:11] swyx: Oh, yeah.
[00:01:12] Fan-yun Sun: I should give a lot of credit to my co-founder, Sharon. Yeah. She was actually working with Professor Fe Androgyn and then she ended up working with Ron and Chris Manning here. [00:01:22] And so I got connected to Chris initially through my co-founder.
[00:01:26] What is Moonlake?
[00:01:26] swyx: What is Moonlake? Actually, I’m also very curious about the name, but why go into world models?
[00:01:33] Fan-yun Sun: So I was working a lot with Nvidia Research during my PhD years on essentially generating interactive worlds to train reinforcement learning agents or embodied AI agents. [00:01:44] And then there’s two observations, one in academia and one in industry. In industry, folks at Nvidia are actually paying a lot of dollars to purchase these types of interactive worlds, whether it’s for the sake of evaluation or training the robots, or policies or models. And [00:02:00] then in academia, the same thing is happening. [00:02:02] And more specifically, when I was working with Nvidia on the synthetic data foundation model training project, we were generating a lot of this synthetic data and showing that, hey, this synthetic data is actually as useful as real-world data when it comes to multimodal pre-training. [00:02:16] But then, like I said, there’s a lot of dollars being paid out to external vendors or other folks to manually curate these types of data.
It was very clear to us that, okay, on our way to, let’s call it, embodied general intelligence, models need to learn the consequences behind their actions, which means that they need interactive data, and the demand for those types of data is growing exponentially. [00:02:38] But everybody’s sort of thinking about it from a pure, say, video generation perspective or something else. But we feel like the true opportunity is actually building reasoning models that can do these things like how humans do these things today. So that’s a little bit on the genesis of Moonlake, and I think the reason I got into world models was partly [00:02:59] a philosophical [00:03:00] take on the world, where I believe the simulation theory and stuff like that. But on the other hand, it’s really just like, oh, there’s an opportunity there that I feel like nobody’s doing the way I think it should be done.
[00:03:10] Structure, Not Scale: The Vision
[00:03:10] Chris Manning: I can say a little bit about that. [00:03:12] Yeah. So the overall goal is the pursuit of artificial intelligence, and most of my career has been doing that in the language space, and that’s been just extremely productive. As we all know, the story of the last few years, I don’t have to tell about how much we’ve achieved with large language models. But [00:03:31] although they have been extremely effective for ramping up language and general intelligence, it’s clearly not the whole world. There’s this multimodal world of vision, sound, taste that you’d like to be dealing with, more than just language. And then the question is how to do it. And despite a huge investment in the computer vision space, right, as a research field computer [00:04:00] vision has been for decades far, far larger than the language space, actually. [00:04:05] I think it’s fair to say that vision understanding sort of stalled out, right?
You got to object recognition and then progress just wasn’t being made, right? If you look at any of these vision language models, it’s the language that’s doing 90% of the work and the vision barely works. And so there’s really an interesting research question as to why that is, and at heart, the ideas behind Moonlake are an attempt to answer that, believing that there can be a really rich connection between a more symbolic layer of abstracted understanding of visual domains, which isn’t in the mainstream vision models, which are still trying to operate on the surface level of pixels.
[00:04:50] swyx: I think in one of your blog posts, you put it as structure, not scale. Is that a general thesis?
[00:04:57] Chris Manning: Yeah. Well, scale is good too.
[00:04:58] swyx: Yeah. Scale is good too.
[00:04:59] Chris Manning: [00:05:00] Lots of data is good as well, and scale, but nevertheless, you want the structure to be able to learn much more efficiently.
[00:05:07] swyx: Yeah. The other thing I really liked is you put out an example of what your kind of reasoning traces look like. [00:05:12] Right. Which you would, distill is the word that comes to mind. I don’t even think that’s a good description, but it would involve, for example, geometry, physics, affordances, symbolic logic, perceptual mappings, and what have you. But that is the kind of example that involves, let’s call it spatial reasoning, world model reasoning, as compared to normal LM reasoning. [00:05:35] Yeah.
[00:05:36] Defining World Models vs Video Generation
[00:05:36] Vibhu: But also taking a step back: how do you guys define world models? A lot of people see, okay, you can do diffusion, you can do video generation. But you guys put out quite a few blog posts. You put out an essay recently, we can even pull it up, about efficient world models.
You have a pretty structural definition here, but for the general audience that doesn’t super follow the space, right, [00:05:55] what’s the difference in what we see from a video generation model to [00:06:00] a world gen, a simulator? How do you kind of paint that?
[00:06:02] Chris Manning: Yeah, so I think this is actually a little bit subtle because people look at these amazing generative AI video models, Sora, Veo 3, Genie, one of these things, and they think, oh, this is amazing. [00:06:17] We’ve solved understanding the world because you can produce these generative AI videos. But the reality is that although the visuals do look fantastic, those visuals aren’t actually accompanied by an understanding of the 3D world, understanding how objects can move, what the consequences of different actions are, and that’s what’s really needed for spatial intelligence. [00:06:49] So I mean, a term we sometimes use is that you need action-conditioned world models. That you only actually have a world model if you can predict, [00:07:00] given some action is taken, what is going to change in the world because of it. And in particular, that becomes hard over longer time scales. So if you’re simply trying to [00:07:12] predict the next video frame, that’s not so difficult. But what you actually want to do is understand the likely consequences of actions minutes into the future. And to do that, you actually need much more of an abstracted semantic model of the world.
[00:07:32] The Bitter Lesson & Data Abstraction
[00:07:32] swyx: Yeah, the question comes where you want to have more structure than is available in just predicting the next token. [00:07:41] And typically, well, let’s call it the experience of the last five years has been that that is just washed away by scale, right? So what is the right middle ground here, where you don’t ignore the bitter lesson, but also you
can be more efficient than what we’re doing today?
[00:07:57] Chris Manning: One possibility [00:08:00] is, look, if we just collect masses and masses and masses and masses of video data, this problem will be solved. [00:08:11] Under certain assumptions that could be true, but there are sort of multiple avenues in which it could not be true. The first is that what’s really essential is understanding the consequences of actions, producing an action-conditioned world model. And if you are simply collecting observational video data, which is the easy stuff to collect when you’re sort of mining online videos, you don’t actually [00:08:41] know the actions that are being taken to see how the video is changing. And so if you are never directly collecting actions and you are having to try and infer them from what happened in the observed video, that’s not impossible, but it’s very [00:09:00] hard, and it’s not really established that you can get that to work at any scale yet. [00:09:05] And so there’s a lot of premium on collecting action-conditioned video data, which is part of why there’s been a lot of interest in using simulation, so that you can be collecting data where you do know the actions, which is in quite limited supply. But there’s also, in the limit of as much data as you could possibly have, [00:09:28] maybe the problem is eventually solvable, but even though we collect huge amounts of text data, it is always at a great level of abstraction, right? Language is a human-designed, abstracted representation where there’s meaning in each token and it’s representing an abstraction of the world, right? [00:09:51] As soon as you are describing someone as a professor, and as soon as you are saying that they’re condescending, right? These are very [00:10:00] abstracted descriptions of the world.
It’s not what you’re observing at the pixel level, and to get to that kind of degree of abstraction starting from pixels takes orders of magnitude of extra data and processing. [00:10:14] And so, although we absolutely want to get as much data as possible, use the bitter lesson, nevertheless, if there are ways in which you can work with five orders of magnitude less data than people working purely from pixels, you’re gonna be able to make a lot more progress, a lot more quickly. [00:10:34] And that’s the bet here. And so you could just say that’s only wanting to be able to do it more efficiently, do it more quickly, do it more cheaply. But I think it’s actually more than that. I think one should be making the analogy to how human beings work. At one level, yes, we have these high [00:11:00] resolution eyes and we can look and see a scene like a video, but all of the evidence from neuroscience and psychology is that most of what comes into people’s eyes is never processed. [00:11:13] Right. You are doing fairly fine-grained processing of exactly what you’re focusing on. But as soon as it’s away from that, of, yeah, there’s another guy over there, you’re sort of only processing top-down this very abstracted semantic description of the world around you. And so that’s what human beings are doing. [00:11:33] They’re working with semantic abstractions, and so I think it is just the right representation. ’Cause we also have other goals: we want to be able to do real-time worlds, so that means there’s a limit to how much processing you can do, and we want to do long-term planning and consistency. And again, that favors abstraction. [00:11:55] I mean, I guess there was actually a recent blog post that [00:12:00] came out from our friends at Physical Intelligence, and they were sort of heading in the same direction. They were saying...
[00:12:06] swyx: Oh, the pi model.
[00:12:07] Chris Manning: Yeah. Yeah.
To maintain a long-term memory of what’s happening in the world, so we can go longer term, we’re actually storing text of what has been happening in the world. [00:12:19] Right. It is not such a successful strategy to try to keep it all at a pixel level.
[00:12:24] Vibhu: And yeah, I mean, you can see it in video models, like that temporal consistency. We’re at a scale of training on all the video data we have, and we have it for maybe 30 seconds, a few minutes. That’s not the same as a game state played for half an hour. [00:12:37] Right. I thought you guys break it down pretty well. You have a blog post about building multimodal worlds with an agent. I dunno if you guys wanna talk about this. This is one of the things I read.
[00:12:48] swyx: Yeah, it’s the thing I talked about with the reasoning chain. Yeah.
[00:12:51] Vibhu: So there’s different phases to this. [00:12:53] It seems like it’s more of an agent, a scaffold, a very different approach than just typing in a prompt, where you don’t have the same consistency. [00:13:00] Also, for people that are listening, I would highly recommend reading it. It breaks down the problem in a different light, right? [00:13:06] So what do you need to consider when you’re talking about video, like world game models, right? What do you need to consider? What are the factors? What are the elements? What’s the state? So I don’t know if you guys have stuff to talk about for this one.
[00:13:19] Fan-yun Sun: Yeah. Actually, I wanted to add on a little bit [00:13:22] to our previous point, since we changed topics so quickly. I do feel like sometimes people confuse it, like, oh, they’re taking a method with abstraction, that means they don’t believe in the bitter lesson. That’s just false, right? We are believers in the bitter lesson.
But then I feel like the question that we always discuss is, what is the right abstraction level today? [00:13:42] The analogy I like to make is, let’s just say we can encode and decode, represent, all of images, videos, and audio in bytes. Then the most bitter-lesson approach is to train a next-byte prediction model as opposed to a next-token prediction model, where it’s just like, okay, it’s natively multimodal. But it’s like, yeah, [00:14:00] to Chris’s point, it’s the scale and compute you need to achieve that. [00:14:03] So that’s why we always come back to, okay, what is the most efficient way to do it? And reasoning models, to the point of this blog post, are a showcase of, hey, we’re actually just reasoning about the world and reasoning about the aspects of the world that matter for me to learn what I want to learn from this world model.
[00:14:21] swyx: Yeah, it’s like you’re improving the encoder of whatever you’re trying to model. And a better representation would just represent the important things in less space. Yeah. Which would just be more efficient.
[00:14:33] Fan-yun Sun: Yeah.
[00:14:34] swyx: So yeah, I fully agree that it is not antagonistic to the bitter lesson. [00:14:38] I do wanna mention one more thing. Are there any philosophical differences with the JEPA stuff that Yann is working on? I gotta go there. You’re imagining some latent abstraction. I’m like, okay, fine. Let’s talk about it, right? It’s an elephant in the room.
[00:14:52] Chris Manning: Yeah.
[00:14:53] JEPA & Philosophical Differences with LeCun
[00:14:53] Chris Manning: There are philosophical differences. Yann LeCun is a dear friend of mine, but [00:15:00] he has never appreciated the power of language in particular, or symbolic representations in general. Yann is a very visual thinker.
He always wants to claim that he thinks visually and there are no words, symbols, or math in his head. [00:15:21] Maybe that’s true of Yann. It’s certainly not the way I think. But at any rate, the world according to Yann is that the basic stuff of the world and of intelligence is visual, and language is just this low-bit-rate communication mechanism between humans, and it doesn’t have much other utility, and it’s far inferior to the high-bit-rate video that comes into your eyes. [00:15:53] And I think he’s fundamentally missing a number of important things [00:16:00] there. Think of this evolutionary argument looking at animals, right? The closest analogy is the thing with chimps, right? So chimpanzees have fairly similar brains to human beings. They have great vision systems, they have great memory systems. [00:16:18] They’ve got better short-term memory than we do. They can plan, they can build primitive tools. But humans are massively ahead in what we understand about the world, what we can plan, what we can build. And essentially what took off for us was that humans managed to develop language, and that gave a symbolic knowledge representation and reasoning level, which just enabled this sort of vaulting of what could be done with the intelligence in brains. [00:16:59] So the [00:17:00] philosopher Dan Dennett refers to language as a cognitive tool and argues that humans, unique among the creatures in the world, have managed to build their own cognitive tools, and language is the famous first example. But other things like mathematics and programming languages are also cognitive tools. [00:17:21] They give you an ability to think in abstractions, in extended causal reasoning chains. And that allows you to do much more. And we use that for spatial representation and intelligence and planning and gameplay as well.
So we believe, and this is underlying the specific technologies that Moonlake is making, that symbolic representations are powerful. [00:17:50] And you want to use that in your understanding of the visual world when you want a causal understanding, when you want to maintain long-term [00:18:00] consistency and prediction. And as I understand it, that’s just not in Yann LeCun’s worldview. So I think that’s the fundamental philosophical difference. Then there’s the specific model [00:18:11] he’s been advancing, JEPA. That’s a reasonable research bet as a direction to head for building out a model of the visual world. To my mind, it’s sort of one reasonable research bet. It’s not really established that it’s the best one that everyone should be following.
[00:18:32] swyx: At least developed at scale, at Meta. [00:18:34] But it’s not just vision, right? I mean, JEPA is just joint embedding prediction; it can be applied to anything really. And people have done it. The argument is that if there is a latent representation that is probably more suited to the task, then why not let machines do it for us instead of predefining it at all? [00:18:50] And isn’t something like a JEPA-shaped thing the right answer? And if not, why not?
[00:18:55] Chris Manning: So I think there’s a part of JEPA that’s right, which is [00:19:00] you do want to have a joint embedding that gives you a consistent model of the world. And Yann’s argument is you can never get that from autoregressive language models, ’cause they’re sort of left to right, churning out one token at a time. [00:19:22] I guess this is where we’re into the research arguments of the field. I’m not actually convinced that’s right. ’Cause although the token production is this autoregressive process that’s heading left to right, I guess it doesn’t have to be left to right.
But anyway, in sequence of tokens; we could have right-to-left Arabic. [00:19:40] But although that’s true, all of the weights of the model that are internal to the transformer, they are a joint model of the model’s understanding of the world. And so I think you can think of the weights of the model as a form of joint representation, [00:20:00] and therefore it is plausible to think that could be the basis of a world model, which avoids Yann’s objections.
[00:20:10] swyx: I think I follow, and obviously that would touch on what Moonlake eventually ends up doing as well. Right. Which is hard to tell because you put out the end results, but we don’t know the inputs that go into it. So that’s something that we have to figure out over time.
[00:20:25] Vibhu: Yeah. I mean, I guess this kind of breaks down some of the outputs. Do you wanna walk us through it?
[00:20:31] Reasoning Traces & Interactive Worlds
[00:20:31] Fan-yun Sun: Yeah. So this really just walks us through the reasoning traces. Let’s just say we wanna build a world; in this context, it’s really just a game demo that shows the variety of interactions that this world model can build. [00:20:45] And yeah, it’s really just the reasoning traces of, okay, it was prompted to create a bowling game; how did it achieve what you saw, that level of causality, interaction and consistency, right? So yeah, this is almost just like an example of [00:21:00] a reasoning trace.
[00:21:01] swyx: Very detailed.
[00:21:01] Fan-yun Sun: Yeah.
[00:21:01] Vibhu: Very, very detailed. [00:21:02] You don’t even realize it, right? When a video is generated, what happens when a ball strikes a pin, right? So first, there’s audio in that: audio triggers happen, the score increments, the world changes. Pins have to start dropping. There’s a timer that goes on.
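Editor’s sketch: the chain of consequences described here (one collision triggers audio, a score change, fallen pins, and a timer) can be written down as a tiny symbolic game state. This is only an illustration of the idea, not Moonlake’s implementation; `BowlingState`, `on_ball_hits_pins`, the field names, and the 3-second reset value are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class BowlingState:
    """Symbolic game state; every field name here is illustrative."""
    pins_standing: set[int] = field(default_factory=lambda: set(range(1, 11)))
    score: int = 0
    reset_timer_s: float = 0.0                      # countdown until pins reset
    audio_queue: list[str] = field(default_factory=list)


def on_ball_hits_pins(state: BowlingState, knocked: set[int]) -> BowlingState:
    """Apply the consequences of one action: a throw that reaches these pins."""
    fallen = knocked & state.pins_standing          # only standing pins can fall
    if fallen:
        state.audio_queue.append("pin_strike")      # audio trigger
        state.reset_timer_s = 3.0                   # timer starts (arbitrary value)
    state.pins_standing -= fallen                   # the world changes
    state.score += len(fallen)                      # score tracks fallen pins
    return state


state = BowlingState()
on_ball_hits_pins(state, {1, 2, 3})
on_ball_hits_pins(state, {3, 4})    # pin 3 is already down, so only pin 4 counts
print(state.score, sorted(state.pins_standing))     # prints: 4 [5, 6, 7, 8, 9, 10]
```

Because the state is explicit, the same throw always produces the same score and the same set of standing pins, which is the kind of consistency a frame-by-frame pixel predictor has to rediscover on every frame.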
It’s just very similar to how now we’re used to reasoning for language models. [00:21:20] There’s a whole state of what happens. So geometry, physics, all this stuff. And then yeah, there’s kind of that single prompt. So asset generation, all this stuff. It’s a nice view to see what’s going on.
[00:21:32] swyx: I think Sun is also too polite to point out that both Google’s Genie demos as well as World Labs’ Marble do not have interactive worlds.
[00:21:41] Fan-yun Sun: That’s the benefit of having a reasoning model, right? Because you can say, oh, maybe in this particular context, I want to learn how to bowl. And then you can say, okay, then what is important when it comes to learning how to bowl? Okay, maybe it’s like I need to understand the basics of physics, and I want to throw it over [00:22:00] them. [00:22:00] I wanna know that when it resets it’s a new game. So basically, you know to pick up the ball, you know that the ball’s gonna cause the pins to fall down. You know that what’s important in this particular bowling game is to score, and you know that the score corresponds to the number of pins that fell down. [00:22:19] So if it’s a model that sort of knows what a bowling game looks like, but doesn’t actually allow you to practice over and over again and to understand what it takes to actually get a high score, then it doesn’t actually allow you to learn what you set out to learn within the world model. [00:22:38] And I think this is really just one example showing the advantages of the approach that we’re taking over, let’s call it, the zeitgeist today, when people talk about world models.
[00:22:51] Chris Manning: Right.
So it sort of seems like the question to ask when there’s a world model is: [00:22:58] Can I not [00:23:00] only just wander around the world and look at the beautiful graphics, can I interact with the objects in the world and see the right consequences of actions?
[00:23:11] Vibhu: And you also understand what the consequences would be if you do something, right? So it’s not just like, okay, there’s one thing, if I pick it up, something will happen. [00:23:19] But there’s 50 options, and I can infer what would happen if I do any of them. Right. So it’s very different when you can actually see it, play around with it.
[00:23:28] Beyond Unity: Cognitive Tools for World Building
[00:23:31] swyx: There’s two cheeky elements of that. I mean, I guess the less ambitious one is, let’s really establish for listeners, why is this fundamentally different than writing Unity code, right? [00:23:40] Like just creating a model to translate a prompt into Unity code.
[00:23:44] Fan-yun Sun: So there is an underlying physics engine. Yeah. In that sense, there’s some overlap with Unity, but the way we think about it is that physics engines, tools, or code are cognitive tools, borrowing Chris’s term, right? Tools [00:24:00] that the model can employ as means to an end. [00:24:04] So today maybe you say, okay, in this particular context we care about physics, we care about the long-term causal consequences. Then yes, we employ a physics engine. And then maybe tomorrow we say, okay, we’re training, let’s just say, drones, where we only care about really fluid dynamics and the visual aspect of the world. [00:24:25] Then yeah, maybe the model actually doesn’t have to use a physics engine. Or maybe it employs other types of representation or physics engine to achieve the task.
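Editor’s sketch: one way to picture “tools as means to an end” is a registry of tools, each declaring what it provides, with the task’s needs deciding which ones get used. In the approach described here a reasoning model makes that choice; the hand-written rule below only stands in for it. The tool names, capability labels, and `select_tools` are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Tool:
    name: str
    provides: frozenset[str]    # capabilities this tool covers


# Hypothetical registry; none of these names come from the episode.
TOOLS = [
    Tool("rigid_body_physics_engine", frozenset({"collisions", "long_horizon_state"})),
    Tool("fluid_solver", frozenset({"fluid_dynamics"})),
    Tool("renderer", frozenset({"visuals"})),
]


def select_tools(needs: set[str], tools: list[Tool] = TOOLS) -> list[str]:
    """Pick each tool that covers a still-unmet need, in registry order."""
    chosen: list[str] = []
    unmet = set(needs)
    for tool in tools:
        if tool.provides & unmet:
            chosen.append(tool.name)
            unmet -= tool.provides
    if unmet:
        raise ValueError(f"no tool covers: {sorted(unmet)}")
    return chosen


# A bowling game cares about collisions and persistent state; a drone task does not.
bowling_tools = select_tools({"collisions", "long_horizon_state", "visuals"})
drone_tools = select_tools({"fluid_dynamics", "visuals"})
print(bowling_tools)
print(drone_tools)
```

The drone task never touches the rigid-body engine, which mirrors the point that the engine is one tool among several rather than the core of the system.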
So yes, writing code for Unity is sort of similar to a tool that our model can employ, but our goal is for a model to take a representation-conditioned reasoning [00:24:46] approach or process internally.
[00:24:48] swyx: Yeah. Using these things as just general tool calls. Right. Which I think is very interesting. The other, more ambitious one is some kind of recursive element where it becomes multiplayer, right? Here, there’s a single-player element, you’re not [00:25:00] modeling any other people involved. [00:25:01] And that is a whole other thing.
[00:25:04] Fan-yun Sun: But in fact, we can really do multiplayer. Oh yeah, okay. I haven’t seen any double situations. So you actually just prompt our model to say, hey, configure to multiplayer. Then it’ll do like this. You’ll be able to configure multiplayer,
[00:25:16] swyx: Great.
[00:25:17] Fan-yun Sun: a persistency database for you. [00:25:18] Easy. Yeah.
[00:25:19] Vibhu: So what are some of the current limitations in where we’re at? There’s one approach of, okay, scale up video predictors. Obviously there’s data issues. With approaches like this, is it data constraints? What are the next steps? Is it real time? So there’s one side of, write an agent to write Unity code, but okay, I want to be streaming a game in real time. [00:25:38] I want to have characters also being agents. Where do we kinda see this scaling up? Right?
[00:25:44] Fan-yun Sun: Yeah, there’s definitely a data constraint. The more data, the better this reasoning model can almost basically act as humans do, to operate a variety of tools and software to build whatever’s necessary. [00:25:57] And then there’s a sort [00:26:00] of fidelity constraint, which we’re actually solving with another model, which we can talk about later. But it’s not as easy to get to photorealism with the approach that we’re taking.
But we think there are better solutions to that, which we can dive into later.
[00:26:15] Vibhu: The one thing you note here is it’s a diffusion model, right? So there’s a few approaches, diffusion, Gaussian splatting. Yeah, so the Rie diffusion model, you guys wanna
[00:26:25] Fan-yun Sun: Yeah.
[00:26:25] Vibhu: introduce it?
[00:26:26] Fan-yun Sun: Yeah, totally.
[00:26:26] Rie: Neural Rendering & Skins for Worlds
[00:26:26] Fan-yun Sun: So within our world modeling framework, there are two models that we train, right? [00:26:31] There’s the multimodal reasoning model that we just talked about, which essentially handles mainly the causality, the persistency and logic determinism of the world. And then Rie is our bet on saying, okay, while that model can take care of all these things that we just talked about, its limitation compared to existing, say, video models is that it doesn’t have as high a pixel [00:27:00] fidelity right out of the gate, right? [00:27:02] And Rie is to say, hey, we can actually take whatever persistent representation we generate with our multimodal reasoning model and learn to restyle it into photorealistic styles or arbitrary styles you want. So this model is almost to say, hey, I’m going to respect the persistency and interactivity of the world that you created, but my only job is to make sure that its pixel distribution is close to what we want.
[00:27:29] Vibhu: Yeah.
[00:27:30] swyx: Great example right there. You kept the KL divergence.
[00:27:33] Fan-yun Sun: Oh. Where?
[00:27:34] swyx: No, no. I mean this is a classic, like, how you don’t stray too far from the source material; you kept the KL, which is, oh yeah, kind of cool.
Yeah.[00:27:43] Fan-yun Sun: Yeah.[00:27:44] swyx: I mean, and the[00:27:44] Chris Manning: difference is, and I mean Sun was pointing at this, sort of saying it’s in one way a more difficult path but a better path, that typically the diffusion models are producing the whole scene and it looks lovely, [00:28:00] but there isn’t spatial understanding behind it, which is what allows for the real-time graphics gameplay, the spatial intelligence, understanding the consequences of worlds, whereas this is taking a path where it is assuming an abstracted semantic model of the world’s state.[00:28:20] And then the diffusion model is being used on top of that to produce the high-quality graphics.[00:28:27] swyx: Is there an intended practical or business use for this, or is it like a demonstration of capabilities?[00:28:34] Fan-yun Sun: We actually believe that this is gonna be the next paradigm of rendering. So it’s gonna replace rasterizers, it’s gonna replace DLSS today, because it not only has these pixel priors learned from the world, such that you can literally play any game in photorealistic styles, which is a lot of people’s desire when they do GTA, right?[00:28:51] Like,[00:28:51] Vibhu: all the mods, all the people adding perfect lighting and all this.[00:28:54] swyx: So[00:28:54] Fan-yun Sun: skins[00:28:55] swyx: for worlds, let’s call it[00:28:56] Fan-yun Sun: skins, let’s call it skins for worlds.[00:28:58] Vibhu: It’s also like, you can call it skins, you can call it [00:29:00] customization. You can play it how you want, right?[00:29:01] Fan-yun Sun: Yeah, exactly. And I think another thing that we pointed out specifically in this blog is the programmability of it, right?[00:29:09] So what this means is that, historically, the render is always a derivative of the game state, right? You’re saying, oh, here’s the game state, I’m rendering out a frame. 
But here I’m saying actually this render can be part of the gameplay loop. I can say something along the lines of, upon getting 10[00:29:26] apples, my weapon of choice, my bullets are gonna turn into apples. And that’s possible because we can basically dynamically have certain game states trigger the preconditions to the render, such that the rendering is now part of the game loop too. One thing is to just say, okay, it’s the appearance.[00:29:47] But the second thing is also to say there are these novel interactions that are possible because this render now actually has priors of the world.[00:29:57] swyx: It is up to the artist to figure out what to do with it.[00:29:59] Fan-yun Sun: It [00:30:00] is up to the creators. Yes.[00:30:01] swyx: Yeah.[00:30:01] Fan-yun Sun: And I also think that’s actually another big argument that we’re making, and the reason that we’re taking the bet we’re making is that a lot of the time, whether it’s for embodied AI or gaming, you want a layer where humans can inject their intentions.[00:30:15] So, for example, in the context of gaming, it’s obviously my creative intent, but maybe in the context of embodied AI, it’s like, oh, I take this foundational policy and I want to fine-tune it to deploy in my house. So you want to have a layer where a human can say, oh, here’s the distribution of things I want to create to achieve my goal.[00:30:35] And I think 3D graphics as it is today is basically the layer for people to say, hey, what do I care about in this world? And it allows human intent to be expressed in these worlds much more explicitly and distributionally, as opposed to just saying, hey, I’m gonna generate, like, arbitrary[00:30:54] and it’s just prompts.[00:30:55] swyx: It’s one of those things where, I think, you’re going to build up a series of models, right? 
[00:31:00] This is just one of them, probably the highest-utility or heaviest-frequency one, I don’t know what to call this, where you can immediately drop this in on any game and you don’t need anything else[00:31:10] that you guys do. But I could see that the human intent is something that people are not even used to, because we’re so used to static worlds, or worlds that just don’t react. I don’t know, you’re kind of blowing my mind right now. I wonder if you’ve talked to people at GDC[00:31:27] and what they’re gonna do with it.[00:31:30] Fan-yun Sun: Yeah. The stance that we take on this front is that we’re not gonna be more creative than our users to ship[00:31:35] swyx: it out.[00:31:35] Fan-yun Sun: Yeah. But we wanna make sure that we’re building things in a way that really allows them to express their intent.[00:31:41] swyx: The thing that you said about, here’s the distribution that I want:[00:31:45] I think text may be too low-bandwidth to really demonstrate that, because I’m probably just gonna want to drop in a bunch of reference assets and then you can figure it out from[00:31:58] Vibhu: there. But you probably wanna do a mixture of [00:32:00] both, right? You throw in a few images: I want this style.[00:32:02] Yeah, I want it to look like this. So it’s a mixture, right?[00:32:05] Chris Manning: I think it’s a mixture. I mean, there’s clearly a visual component of this, and it’s not that everything can be text, ‘cause of course you want to give a visual look, but there’s also a massive amount of giving the overall picture of the look of the world and the behavior of things that you can express in a few words of text,[00:32:32] and it’d be very time-consuming and difficult to do via visual means. 
So I think, yeah, you want a combination of both.[00:32:40] Evaluating World Models[00:32:40] Vibhu: So one question I have is, how do we go about evaluating world models? There are many axes, right? One is, okay, I have preferences: how well do we adhere to prompts? One is the simulation.[00:32:50] One is, is there core logic that’s broken? We know how to evaluate diffusion: there’s fidelity, there’s [00:33:00] stuff like that. But what are some of the challenges that most people probably aren’t thinking about?[00:33:04] Fan-yun Sun: Yeah, I think this is a great question and probably one of the hardest questions in world models, because I think it always comes back to what you are building this world model for.[00:33:13] And depending on your end goal and purpose, the evaluation should differ. So in the context of games, the most direct way of measuring is how much time people are actually spending in this world that you create. And if your goal is, for example, in the context that we just talked about, deploying an embodied AI agent, then your end[00:33:33] metric is, okay, after training in these worlds that you generate, how robust it is when you actually deploy to the target environment. But it’s hard to measure these end metrics. So today people have what I call proxy metrics, which basically try to measure what we really care about, which is the end metrics, but frankly it’s different for every use case.[00:33:57] Yeah,[00:33:57] Vibhu: which seems like quite a challenge, right? Like in [00:34:00] language models or video models, image models, your benchmarks are proxies, right? People aren’t actually asking instruction-following or tool-use questions. They’re proxies of how well it will do downstream. 
But for this, should teams, should companies have their own individual benchmarks outside of games?[00:34:16] If you think of stuff like video production, movies, stuff like that, that also want to use world models, should they sort of internalize their own proxy? Is this something you guys do? Where does that connect?[00:34:28] Chris Manning: Yeah, I think this whole space is extremely difficult as things are emerging now.[00:34:35] And I mean, it’s not only for world models, I think it’s for everything, including text-based models, right? ‘Cause in the early days it seemed very easy to have good benchmarks, ‘cause we could do things like question answering benchmarks, could you answer the question based on these documents, and the various other kinds of pieces of logical reasoning or math.[00:34:58] But again, these are sort of, [00:35:00] and there were sort of visual equivalents of things like object recognition, right, for these small component tasks. These days so much of what people are wanting to do, also with language models, is nothing like that, right? You’re wanting to have an interaction with the language model and get some recommendations about which backpack would be best for you for your trip in Europe next month.[00:35:25] And it’s not the same kind of thing, right? And it’s not so easy to come up with a benchmark as to whether this large language model gives you an effective interaction for guiding you in a good way for shopping, right? And it’s the same problem with these world models. So if we take the game design case, well, success is that a game designer can[00:35:57] produce what they are [00:36:00] imagining in a reasonable amount of time. And that’s really the kind of macro task. That’s a very hard thing to turn into a benchmark, and I think a lot of this is actually going to turn into people walking with their feet. Right? 
I mean, I guess that’s what’s happening at the large language model level, right?[00:36:23] When people are choosing to use GPT-5 or Gemini or Claude, individuals are trying out these different models and deciding, oh, I like the kind of answers that GPT-5 gives me, or no, I feel like I get more accurate detail from Claude, right?[00:36:43] Vibhu: It’s a lot of[00:36:43] Chris Manning: vibe checks, a lot of people just using it.[00:36:45] It’s vibe checking, I realize that, but it’s actually whether people feel it’s giving them utility in what they want. Right.[00:36:52] Vibhu: And the interesting thing there is a lot of people prefer the visual, right? This looks pretty, which is not the objective of what this is [00:37:00] for, right? If a game designer is working on something, they care about the game engine, right?[00:37:04] The state. It can look like whatever; you can fix that up later. Or you can have a really good game state and you can quickly edit it into 20 different versions, like, keep the state,[00:37:14] Chris Manning: right?[00:37:14] Vibhu: So[00:37:14] Chris Manning: that’s a really important distinction, and for speaking to Moonlake’s strength, right? So, yeah, great visuals are lovely to look at for a few seconds, but games are really all about the concept, the gameplay.[00:37:33] And a lot of the time that doesn’t actually even require great visuals. I mean, there are just lots of very successful games which have relatively primitive visuals, and there are other games where people have spent millions producing photorealistic visuals, and the game sucks, right? So keeping those two axes apart is really important in thinking about what’s important in a [00:38:00] world model for different uses.[00:38:02] swyx: This conversation is reminding me of some game review and fiction discussions I’ve had in my sort of non-AI-related life. 
Some people might know Brandon Sanderson, who’s a very famous fiction author and is a big game reviewer. He’s a big fan of video games where you change one thing that you might normally assume about the world.[00:38:22] For example, Baba Is You, I don’t know if you might have come across that, where the rules change as you play the game. And also games where you can do things like reverse time selectively or change gravity selectively. And this also reminds me of other kinds of world models that are created by authors,[00:38:38] where Ted Chiang is my typical example: he’ll take the world that you know today but change one thing about it, and then create a consistent world based on that. Which is a long-winded way for me to ask: is it easy to create alternative worlds that don’t exist, where you change one thing, and then let’s run a whole bunch of people through it to see if it works?[00:38:58] Chris Manning: My first answer will [00:39:00] be that that seems a lot easier and more conceivable to do using technology like Moonlake’s than with some of the other world models out there. Whether Sun can actually make it happen, I’ll let him give a second answer.[00:39:15] swyx: I guess for you, you’re constrained by the game engine tool, right?[00:39:18] At the end of the day, that’s the thought partner that you have. If I ask for something where it never is allowed to reverse time, or if gravity only ever works one way, then, well, that’s it. 
But sometimes gravity might change,[00:39:33] Fan-yun Sun: but it’s a lot easier to change with code as opposed to a model that is learned primarily on data of[00:39:42] the real world and virtual worlds. For example, Genie is actually trained on a lot of real-world data and a lot of virtual gaming data, and it’s hard to say. Maybe it’s easier to say, okay, I wanna change the visuals or the time period of the world. But you can’t change gravity, for [00:40:00] example.[00:40:00] Vibhu: I feel like you can, to light bounds, right? Everything comes down to, code is a better way to execute it, but the models aren’t that diverse and creative, right? You can say, okay, make gravity slower. It can do that, but it’s limited to your representation of how you text it out, right? They’re only gonna do a few iterations, whereas programmatically, if there’s a game engine under the hood, you can kind of go wild, right?[00:40:22] So one of the limitations of most models is that they’re very overtrained to one style. Right. And extracting diversity is pretty difficult. At least that’s something we’ve seen.[00:40:35] Fan-yun Sun: I mean, are there examples you have in mind where, with existing models, it would be easier to do without using code?[00:40:43] Certain types of creative intent or state transitions?[00:40:47] swyx: Clipping. Other world models are very good at clipping through things. My legs clipping through a rock, because it’s just bad. [00:41:00] Like, you would have to struggle very hard with your stuff to actually make that happen.[00:41:04] Which I think is maybe a topic that you actually prepared on, Gaussian splatting versus the other stuff.[00:41:09] Vibhu: Yeah. Yeah. It’s just for those not super familiar, right? There’s Gaussian splatting, there is diffusion. Like what works, what scales up. 
I feel like in February when Sora 1 came out, the blog post was literally titled, like,[00:41:21] swyx: you bring it up.[00:41:22] You never know.[00:41:23] Vibhu: video generation models are world simulators. It’s super bitter-lesson-pilled. Yeah, a lot of it is emergence, right? So, not to go through their blog post, basically their whole thing was, as you scale up, all this consistency, all this stuff just kind of solves itself. It’s a very simple premise, right?[00:41:41] They just scaled up diffusion, and from there, this is Feb 2024, it’s already been two years, which is basically five years. How much more in AI time do we need to just scale up, or do we hit a data cap? But I think we already talked about this a lot, right? This is back to the beginning discussion of what’s [00:42:00] appropriate for the time.[00:42:01] And that seems like your approach, right?[00:42:03] Fan-yun Sun: Yeah. The point I’m trying to make is that there are many, many different types of world simulators, and having a world simulator that can produce pixel coherency is very, very useful for games and marketing and all these things, but it’s not as useful as people think when it comes to causal reasoning,[00:42:25] when it comes to embodied AI. Yeah, this title is true. We’re not saying that it’s not a great world simulator, but actually in the blog that we wrote, the bet is more so that there’s gonna be a disproportionately large share of value in real-world tasks and virtual tasks where high-resolution pixel fidelity is not needed.[00:42:47] Yes. Video models have their value.[00:42:50] swyx: Yeah. This is at the absolute limit of my physics understanding, but one example that comes to mind is basically having to solve the equivalent of a three- [00:43:00] body problem in a deterministic world, where the video models just approximate it well enough. Yeah.[00:43:08] Right. 
Like there’s some point at which your approach kind of runs into, like, you now have to simulate the world, please, thank you very much. And you’re trying to do that, but only to the extent that the game engine lets you, and game engines cannot do some things.[00:43:23] Fan-yun Sun: Yeah, no, I mean, I think the interesting or more technical question here actually is where do you draw the boundary between[00:43:32] what’s handled with, let’s say, diffusion priors and what’s handled with symbolic priors?[00:43:38] swyx: Yes.[00:43:38] Fan-yun Sun: Okay.[00:43:38] swyx: Okay.[00:43:39] Fan-yun Sun: Right. Let’s go there. Because this boundary can actually be fluid. I think maybe what you’re trying to get at is, okay, people are saying pixel prior for everything. But what we’re saying is, okay, there’s a boundary that we draw where we think it provides the most economic value for the domains and things that we care about today.[00:43:59] [00:44:00] And it’s something that we actually ask internally all the time: okay, given new equations that we learn, or new elements of the world that we learn, or maybe some other knowledge that we acquire in the process of developing the models, should we still be maintaining this line exactly as it is today?[00:44:22] Or should we move it a little bit left or a little bit right? Right. Sometimes we realize that, oh, maybe customers or folks want certain things that are better handled with pixel priors as opposed to symbolic priors.[00:44:34] swyx: Yeah. Your skin thing is an example of moving it right.[00:44:37] Yeah.[00:44:37] Or left. Yeah,[00:44:37] Fan-yun Sun: exactly.[00:44:38] swyx: I dunno what the left-right is.[00:44:39] Fan-yun Sun: Yeah, yeah, yeah. No, the model.[00:44:42] swyx: Yes.[00:44:42] Fan-yun Sun: Actually we have a few iterations of them. 
They’re actually at slightly different[00:44:45] swyx: I know, boundaries. You should do that. That’s a cool dimension to show.[00:44:49] Fan-yun Sun: Yeah.[00:44:50] swyx: Is quantum mechanics the diffusion prior of our world?[00:44:55] Right. It’s like that’s the boundary of classical mechanics versus quantum. Right? Like, that’s it. At one [00:45:00] point God plays dice and at the other point doesn’t.[00:45:02] Fan-yun Sun: I dunno if Chris, you wanna say it, but I think generally I feel like physics is better with symbolic priors.[00:45:08] Chris Manning: Even quantum physics.[00:45:09] Fan-yun Sun: Even quantum physics.[00:45:11] swyx: Yeah. This is starting to get into MLST territory, is what I call it, where he likes to get philosophical. We’re quite friendly.[00:45:18] Vibhu: I mean, we need to get, we need to get singularity. I heard some of that.[00:45:23] swyx: No, no, I think that is actually really helpful, and man, I just want you to productize this. As a product guy, I’m just like, oh, also[00:45:32] Vibhu: a gamer, I[00:45:33] swyx: wanna, it’s like, as a researcher, it’s cool.[00:45:35] Like this is the theoretical; you have a very good, I don’t know, way of thinking about these things, but I just wanna see you express it. I do think, fundamentally, when you open up new tools, like, okay, use human intent and incorporate it into how you render,[00:45:52] artists are gonna have to take two to three years to figure out what to do with this. And you just don’t know.[00:45:57] Chris Manning: Right. But I think this [00:46:00] gives a much more approachable and controllable world for the society, which is the beauty of NLP, and that will enable it to be adopted and used.[00:46:10] And we are very hopeful about that. Yeah,[00:46:13] Fan-yun Sun: yeah. Yeah. 
I mean, we are very focused actually on commercialization, in the sense that we do really believe in the data flywheel approach. Yeah. Where we put this in the hands of the creators and the users, and then they will teach us what capabilities our model should improve.[00:46:27] And that’s why we actually have, like, products in beta.[00:46:31] swyx: Yeah. Focusing on gaming. What’s the adjacent thing to gaming?[00:46:34] Fan-yun Sun: Embodied AI is adjacent, basically. So maybe I’ll start with where we see the platform in three years. Yeah. Which is, okay, the users would tell us what they want to achieve.[00:46:45] The end goal could be, hey, I wanna make something to teach my kids the value of humility. Or it could be, hey, I wanna fine-tune my drones to be really good at rescue situations. It could be vacuum robots: I want to train [00:47:00] my manipulation or vacuum robot to be very robust in my office, right?[00:47:04] But it’s like, whatever it is, scenario robust to[00:47:06] swyx: my office[00:47:07] Fan-yun Sun: or navigate very robustly in my office. But then, whatever end goal you want, our world model will say, okay, given what you want to achieve, let me generate a distribution of environments such that I can train and evaluate whatever it is you want.[00:47:24] Yeah. Right. Maybe for the purpose of games, it’s just the end simulation and that’s the end product. For certain policies, it’s like I can train it within these environments and then help you see where your policy is failing or not. Yeah. And then, so I think,[00:47:37] swyx: so in that case, much more of a training tool[00:47:40] than in other training[00:47:41] Vibhu: evaluation? Both. Right?[00:47:43] swyx: Sure. Same thing.[00:47:43] Fan-yun Sun: Yeah, same thing. 
I think it’s just this world model that allows people to train any policy that can act in any multimodal environment.[00:47:51] swyx: Would it be harder to reward hack? Is there an angle here where it is harder to reward hack? I’ll just put it generally, because that’s obviously a key [00:48:00] problem that a lot of people face when training agents in these environments, and I don’t know, can you solve it?[00:48:07] Chris Manning: I think not necessarily. To the extent that there’s a misspecified reward, it seems like it could be hacked in a more symbolic world or in a more pixel-based world. I dunno if Sun’s got any thoughts, but I don’t think that’s really being solved.[00:48:26] swyx: The other thing that comes to mind is you could just build a better Sora as a video generation model, right?[00:48:31] Because then you would move the diffusion side a bit further to the right, I think, if I got the directionality correct. And that’s it.[00:48:40] Vibhu: It’s better on domains, right? Like on consistency over now, or for sure it exists versus something doesn’t, right.[00:48:46] Chris Manning: So[00:48:46] swyx: yeah. Yeah. Is[00:48:49] Vibhu: is a question more like, like[00:48:51] swyx: I’m just riffing on, like, what can you build, you know,[00:48:54] with the stuff that you have. I do think that the mind of the academic does go immediately to training [00:49:00] and evaluation, but art tends to take unusual directions. Like you might end up,[00:49:06] Chris Manning: Okay. Yeah. But the question is, can you use this piece of software to develop compelling gameplay? And I don’t think you can take Sora and produce compelling gameplay, right?[00:49:19] If you want to have a world that you can wander around in a bit, you are good. 
But what are your abilities to have gameplay mechanics implemented the way you’d like them to be, and to have things stay, with the long-term history of your gameplay that influences future actions? I think there’s just nothing there for that.[00:49:39] swyx: Yeah, I do tend to agree. I’m just trying to sort of test the boundaries. I would also make the observation that as the AAA games industry has developed, the line between what is a movie and what is a game has blurred. And you do end up basically producing a two-hour movie as part of your game.[00:49:57] Fan-yun Sun: No, honestly, there are so many [00:50:00] applications in adjacent markets that our world model can go into. Yeah. It’s sort of fun to riff on, although on the execution side we need to stay focused on, okay, what are the capabilities we want to unlock over time?[00:50:11] And there’s a roadmap for that. But yeah, if we’re just riffing on the possibilities, I feel like it’s endless. Yeah, it’s like classic[00:50:18] swyx: and the embeddings for possibility and endless, in my mind, are very close. Yeah. I do wanna focus on one weird choice. I don’t know if it’s weird.[00:50:28] Maybe I’ve got something here. Audio, right? You could have just said no audio. And audio in my mind has a lot of recursion, whereas in video you can just do raycasting, and that’s computationally much simpler. Audio just seems way harder. I don’t know if you wanna comment on the spatial 3D audio[00:50:46] problem. Did you really have to do it? I guess you do, to be immersive, but a lot of people do treat it as, well, you just stick a TTS model on top of[00:50:57] Vibhu: Well, there’s a lot more to game audio than [00:51:00] just speech. Right. It’s not just[00:51:01] swyx: TTS. Yeah. TTS. 
SFX, BGM. Spatial, in my mind: echoes[00:51:06] Chris Manning: Yeah.[00:51:06] swyx: and reflections.[00:51:07] And I don’t even know what else. I don’t know what other problems there are in this space.[00:51:13] Fan-yun Sun: Yeah, I think this point is more pointing to the benefits of using a game engine as a tool that’s available to the model, right? Because part of the spatial audio comes from the code that is underlying the simulation.[00:51:32] And while we do give our model access to other types of audio models as tools,[00:51:39] swyx: None of them would be spatial, I think.[00:51:41] Fan-yun Sun: But that’s exactly sort of point two. We’re giving our model an abstraction or a suite of tools such that it’s able to achieve that. And you can argue that spatial audio is sort of an emergence out of the tools and abstractions that we provide to the agents.[00:51:59] And I think that’s the beauty of [00:52:00] this approach: there are a lot of things, kind of like how humans build technology, and they’re like Lego blocks that build on top of each other. And it’s the same thing here. There are gonna be things that just sort of emerge from being able to put these things together in combinatorially interesting ways,[00:52:14] Chris Manning: right?[00:52:15] So this integrated audio model exploits the understanding and semantics of the Moonlake world, right? Whereas in general for the gen AI video models, there’s no actual integration across to audio at all, right? Someone might stick some music or a soundscape or whatever else on top of their video,[00:52:44] so it’s not a silent video, but they’re in no way connected into a consistent world model. And there’s nothing that says, okay, an action is happening in the video. 
Therefore there should be a sound that’s [00:53:00] coming from this part of the visual field.[00:53:03] swyx: Yeah.[00:53:03] Vibhu: Is that different than Sora 2? Does it not have audio?[00:53:06] Not to say it’s not like[00:53:08] swyx: amazing[00:53:08] Vibhu: isn’t a spatial[00:53:09] swyx: audio.[00:53:09] Vibhu: It doesn’t,[00:53:10] swyx: no. I’ve played around with it enough. It just sounds like someone put an ElevenLabs voice on top of it and just tried to do the lip sync.[00:53:18] Vibhu: Oh, yeah. I’ve seen, okay, generate a dog at the beach and reactions to a big wave and move[00:53:23] swyx: around.[00:53:23] It’s definitely like, so have the dog move away from the camera and see if the sound goes down. It doesn’t, ‘cause they don’t have spatial audio.[00:53:32] Fan-yun Sun: Our multimodal model, the one we’re training, is basically working towards the goal of having a combined latent representation across all these different modalities,[00:53:42] right? Such that it can reason across these different modalities. So for example, if I close my eyes and you play a sound of a car skidding away from me, I can almost visually extrapolate that trajectory in my mind. And I think that’s the type of capability we want our model to be able to reason with, right?[00:53:59] And that’s the reason that [00:54:00] we’re taking this multimodal reasoning approach. We want this combined latent space that can[00:54:05] swyx: Yeah. Oh, you said latent space. We like that. Here we have to play the bell every time that someone says latent space. No, you gotta train a Daredevil one, where it’s only audio, but you have to work out[00:54:15] where everything is.[00:54:19] Cool. I think that was about it for our Moonlake coverage. 
I do think that we have a couple of Chris Manning questions on IR and any other sort of attention topics or NLP topics.[00:54:31] Vibhu: Okay.[00:54:31] swyx: Go ahead.[00:54:32] Chris Manning’s Journey: From NLP to World Models[00:54:32] Vibhu: Well, no, I mean, yeah, it’s just fun. We talked a bit about how you guys met, but you were basically the godfather of NLP per se, right?[00:54:39] You spent your whole career on it, from early embeddings, early attention. You did 2015 attention for machine translation, everything. You had information retrieval, so RAG before RAG. We just wanna shout that out and admire a lot of that. Right? So what prompted the switch over to world models?[00:54:56] How’d all that come about?[00:54:58] Chris Manning: To some extent it [00:55:00] is the enthusiasms and creativity of students, but there’s a bit of a history there, right? So, yeah. Clearly most of my career has been doing stuff with language, and how I got into research was thinking, ah, this is just so amazing how humans can produce speech and understand each other in real time.[00:55:21] And somehow they manage to learn languages as kids. How could this possibly happen? And so, yeah, starting off I was very focused on language, but as it got into the 2010s, I’d been working on question answering, and then I started to get interested in visual question answering.[00:55:42] And that was an area where it was very noticeable that the visual understanding was bad. Right. These were the days when it sort of seemed like there was almost no visual [00:56:00] understanding. You were just getting answers that came from priors. So if you asked how many people are sitting at the table, it’d always answer two, regardless of how many people you could see in the picture.[00:56:11] And so it seemed like, oh, these models actually aren’t able to get semantic information out of images. 
And so I was interested in that problem and tried to work more on that. And so then that required knowing more about what’s happening in vision and how you can represent visual information.[00:56:34] And then there started to be this revolution of doing generative AI images. And then I had students that started looking at that before the era of Moonlake. I was also working with Demi Guo, who founded Pika. And so, and[00:56:50] swyx: Ian obviously[00:56:52] Chris Manning: with GANs. Yeah. Though Ian was never my student, but yeah, I was very aware for the whole decade there of Ian with GANs.[00:56:59] [00:57:00] Yeah. And I mean, Ian was a Stanford undergrad, but yeah,[00:57:03] Vibhu: Richard Socher, of you.com, I believe he was your student.[00:57:06] Chris Manning: Yeah. Yeah. And there were, there were links across at that stage as well. So there were several papers in that era of doing, I mean, so Andrej Karpathy was a PhD student at the same time as Richard.[00:57:20] And so there was some joint language vision work in that era as well. It seems kind of ancient by modern standards, but yeah, we were trying to go from sort of textual dependency graphs to visual scenes[00:57:32] Vibhu: At the time, the GloVe embeddings really took over a lot of TF-IDF, like one-hot encoding, all that.[00:57:38] The early vision language models we saw were like LLaVA-style adapters, right? It’s, it’s technically still just embedding latent space. Let’s add image, let’s like mixed modality. So, and that, that’s one of the things you super put out there too, right?[00:57:51] swyx: Yeah.[00:57:51] Vibhu: Yeah.[00:57:52] swyx: Yeah.[00:57:52] Hiring, Closing & The Name “Moon Lake”[00:57:55] swyx: Well, thank you for all of that. Thank you for all advancing the worlds on, world modeling.[00:57:56] I honestly do think that if people deeply understand everything we just [00:58:00] covered, they will see what’s coming. 
I think you guys have made a really significant contribution here. What are you hiring for? What is the, what do people find? We, we agreed that the CTA was a hiring call.[00:58:10] Yeah. Don’t we have AGI? You don’t need, you don’t need engineers anymore, right?[00:58:14] Fan-yun Sun: Yeah. On the model side we are actually striving towards basically a self-improving system. But what that means is that we need people to set up the self-improving system. So more, more specifically people who have the intersection of knowledge within code generation and computer vision and graphics, right?[00:58:30] Yeah. That’s, that’s sort of the core research background that we look for within OTM and, and the majority of the team today do have like both backgrounds.[00:58:38] swyx: When you say computer vision and graphics, are they the same thing, or is computer vision one thing and graphics another thing? And how intertwined are they?[00:58:46] Chris Manning: They’re intertwined but different.[00:58:49] swyx: Yeah.[00:58:49] Chris Manning: And I think this relates to some of the themes that we’ve been talking about, that the more explicit underlying [00:59:00] world models that are being constructed inside Moonlake really draw on the computer graphics tradition. And so it’s then combining that with the visual understanding of vision.[00:59:16] swyx: Got it. Yeah. All right. So if you’ve written a game engine, come talk to us, right?[00:59:21] Fan-yun Sun: Oh yeah, definitely. Definitely. But I do think that the line is blurred, like increasingly blurred these days, where it’s like if you have a general understanding of computer vision and graphics,[00:59:31] swyx: I think for your standards it is, for me it feels like vision is, is.[00:59:35] 
I, I, I can get that, you would want to do that from more first principles, but vision, there’s so many vision models off the shelf that I can take, but probably not good enough for your[00:59:45] Fan-yun Sun: I see, I see. If, if you’re sort of like making that distinction then maybe we, we care a little bit more about having graphics[00:59:51] swyx: knowledge.[00:59:51] Yeah, exactly.[00:59:52] It could be like, sometimes a hiring call can be as simple as like, if you know the answer to blah, you should talk to me. Like the sort of core known hard [01:00:00] problem in, in your world.[01:00:01] Fan-yun Sun: Ah, I see. Yeah. In that case, if you, yeah, definitely. If you’ve written a game engine before, if you’ve RL’d a variety of coding models on different objectives, like[01:00:13] swyx: easy,[01:00:13] Many of those, yeah.[01:00:14] Fan-yun Sun: If you’ve done multimodal latent space alignment, I, I intentionally include[01:00:20] swyx: space.[01:00:20] Fan-yun Sun: Again,[01:00:21] swyx: a poor editor has a thing every time. Yeah. Latent space alignment. Honestly. Is it that hard?[01:00:26] I, I, there’s some scripts out there that I’ve saved for the day I someday have to do it, but I don’t have to do it.[01:00:31] But it’s[01:00:32] Fan-yun Sun: done, I think. Yeah. There, there’s, there are versions of that that are done. But I, I think we are aligning audio, text, language and video. Yeah. Right. Like, and basically we have these world models that are able to act as agents to like act in these worlds and extract long horizon videos and encoding that back to the model to sort of self-improve.[01:00:52] So it’s an insanely exciting, but also technically challenging problem. Yeah. So people who wanna do their life’s best work, [01:01:00] Moonlake’s the place.[01:01:01] Vibhu: How big are you guys? Where are you guys based?[01:01:02] Fan-yun Sun: We’re currently based in San Mateo, although we’re moving up to SF. 
We’re about 18 folks right now.[01:01:08] swyx: My ending question was gonna be why, what, what is the name?[01:01:10] What’s behind the name?[01:01:11] Vibhu: Yeah.[01:01:12] Fan-yun Sun: Oh,[01:01:14] Vibhu: Very cool. Graphics and design, by the way.[01:01:16] Fan-yun Sun: Actually at the, at the time when the, when the, when we started the company, we were thinking a lot about how do we make a company name that gives people the vibe of like, OpenAI, but for like, almost like Industrial Light & Magic vibes.[01:01:28] Wow. Because it’s like we care about creativity and using that as a funnel to solve AGI. So then we were, we, we brainstormed a lot around like DreamWorks, right? Like Industrial Light & Magic. And, so there’s a few, few basically, space of things that we feel like are very, very semantically close to the company’s identity.[01:01:47] swyx: Yeah.[01:01:48] Fan-yun Sun: And then it ended up being Moonlake, partly because of the DreamWorks vibe, the DreamWorks moon[01:01:54] swyx: Lake.[01:01:55] Fan-yun Sun: Exactly. Yep. So that was a little bit of that inspiration. And then the moon was sort of [01:02:00] like a, it basically was like about the reflection. The reflection part also implies the self-improvement loop.[01:02:07] Wow. That we sort of like, that we really believe in, and that’s the path towards multimodal general intelligence. So that’s, that’s that. I’ll leave that as I love a good[01:02:15] swyx: name. I love a good name. This is great. It’s a[01:02:16] Vibhu: very[01:02:17] swyx: good name. It’s very good. Lo I’m glad I asked the question. I will also say, one of my favorite story books or biographies ever is Creativity, Inc.,[01:02:24] with Ed Catmull’s story about Pixar and how he was rejected as a Disney animation artist. So then he went into computing and brute forced his way into back. No, I love that story. Yeah. Disney.[01:02:37] Fan-yun Sun: Yeah. And Walt Disney is also like one of my favorite founders. He’s like, his, his story. 
Like at the time you’re like, okay, I’m gonna create this, like,[01:02:44] immersive park. Like people can’t, don’t even have that technology to create it virtually, but they’re like, you know what, let’s just build it physically such that people can,[01:02:50] swyx: so he is the first world modeler.[01:02:52] Fan-yun Sun: No, I, I I tell people that like, theme parks are world models too.[01:02:56] swyx: Mm. Yeah. Yeah. Yeah. I mean, It’s a Small World or it’s [01:03:00] a, like the Epcot Center with all the little replicas of the countries.[01:03:03] Yeah. Those are very interesting. Okay. Well thank you, we’ve covered a huge amount. Thank you for your time and thank you for inspiring us.[01:03:10] Fan-yun Sun: Thank you[01:03:10] swyx: for having us. Thank you. It’s fun[01:03:11] Fan-yun Sun: chatting. Yeah. It’s been a good time. This is a public episode. If you'd like to discuss this with other subscribers or get access to bonus episodes, visit www.latent.space/subscribe

Transcript
Starting point is 00:00:00 I think this whole space is extremely difficult as things are emerging now. And I mean, it's not only for world models. I think it's for everything, including text-based models, right? Because, you know, in the early days, it seemed very easy to have good benchmarks, because we could do things like question answering benchmarks. But, you know, these days, so much of what people are wanting to do is nothing like that, right? You're wanting to get some recommendations about which backpack would be best for you for your trip in Europe next month. It's not so easy to come up with a benchmark and it's the same problem
Starting point is 00:00:40 with these world models. Before we get into today's episode, I just have a small message for listeners. Thank you. We would not be able to bring you the AI engineering, science and entertainment content that you so clearly want if you didn't choose to also click in and tune into our content. We're approached by sponsors on an almost daily basis. But fortunately, enough of you actually subscribe to us to keep all this sustainable without ads, and we want to keep it that way. But I just have one favor to ask all of you. The single most powerful, completely free thing you can do is to click that subscribe button. It's the only thing I'll ever ask of you, and it means absolutely everything to me and my team that works so hard to bring Latent Space to you each and every week.
Starting point is 00:01:22 If you do it, I promise you we'll never stop working to make the show even better. Now let's get into it. Okay, we're back in the studio with Moonlake's two leads. I guess there's other founders as well, but Sun and Chris Manning, welcome to the studio. Thanks for all right. Thanks for having us. You guys have, you know,
Starting point is 00:01:46 burst onto the scene with a really refreshing new take of world models. I would just want to sort of, I guess, ask how the two of you came together. Chris, you're a legend in NLP, and just AI in general, you're his grad student, I guess. Actually, my co-founder. Oh, yeah. I should give a lot of credit to my co-founder, Sharon.
Starting point is 00:02:05 Yeah. She was actually working with Professor Pavia Lin-Jajan, and then she ended up working with Ron and Chris Manning here. And then, so I got connected to Chris initially, actually, through my co-founder. What is Moonlake? Actually, I'm also very curious about the name, but like why going into world models?
Starting point is 00:02:25 So I was working a lot with actually NVIDIA research during my PhD years on essentially generating interactive worlds to train reinforcement learning agents or embodied AI agents. And then there's two observations, one in academia and one in industry. In industry, folks like NVIDIA are actually paying a lot of dollars to purchase these types of interactive worlds, whether it's for the sake of evaluation or training the robots or policies or models. And then in academia, the same thing is happening. And more specifically, when I was actually working with NVIDIA on the Synthetic Data Foundation Model Training project, we were actually generating a lot of these synthetic data and showing that, hey, you can actually,
Starting point is 00:03:05 these synthetic data are actually as useful as real-world data when it comes to multimodal pre-training. But then, like I said, there's a lot of dollars being paid out to external vendors or like other folks to manually create these types of data. It was very clear to us that, okay, on our way to, let's call it, embodied general intelligence, models need to learn the consequences behind their actions, which means that they need interactive data. And the demand for those types of data is growing exponentially, but everybody's sort of thinking about it from a pure, say, video generation perspective or something else.
Starting point is 00:03:37 But we feel like the true opportunity is actually building reasoning models that can do these things, like how humans do these things today. So that's a little bit on the genesis of Moonlake. And I think the reason I got into world models was partly a philosophical take on the world where I like, you know, believe in the simulation theory and stuff like that. But on the other hand, it's really just like, oh, like there's an opportunity there
Starting point is 00:04:01 that I feel like nobody's doing it the way I think should be done. I can say a little bit about that. Yeah. So the overall goal is the pursuit of artificial intelligence and, you know, most of my career has been doing that in the language space. And that's been just, extremely productive, as we all know the story of the last few years.
Starting point is 00:04:21 I don't have to tell about how much we've achieved with large language models. But although they're being extremely effective for ramping language and general intelligence, it's clearly not the whole world. There's this multimodal world of vision, sound, taste that you'd like to be dealing with more than just language. And then the question is how to do it. And despite, you know, a huge investment in the computer vision space, right, as the research field, computer vision has been for decades far, far larger than the language space, actually. I mean, I think it's fair to say that, you know, vision understanding sort of stalled out, right?
Starting point is 00:05:08 You got to object recognition and then progress just wasn't being made, right? If you look at any of these vision language models, it's the language that's doing 90% of the work and the vision barely works. And so there's really an interesting research question as to why that is. And at heart, the ideas behind Moonlake are an attempt to answer that, believing that there can be a really rich connection with a more symbolic layer of abstracted understanding of visual domains, which isn't in the mainstream vision models, which are still trying to operate on the surface level of pixels. I think in one of your blog posts, you put it as structure, not scale. Is that a general thesis?
Starting point is 00:05:56 Yeah. Well, scale is good, too. Yeah, scale is good too. Lots of data is good as well. But nevertheless, you want the structure to be able to much more efficiently learn. Yeah. The other thing I really liked also was you put out an example of what your kind of reasoning traces look like, right, which you wouldn't.
Starting point is 00:06:13 still is the word that comes to mind. I don't even think that's a good, good description, but it would involve, for example, geometry, physics, affordances, symbolic logic, perceptual mappings, and what, what have you. But like that that is the kind of example that involves, let's call it spatial reasoning, world model reasoning as compared to normal LM reasoning. Yeah. But also like taking it a step back. So how do you guys define world models?
Starting point is 00:06:40 You know, a lot of people see like, okay, you can do diffusion, you can do video generation, but you guys put out quite a few blog posts. You put out an essay recently, we can even pull it up, about efficient world models. You have a pretty like structural definition here, but for the general audience that don't super follow the space, right? What's the difference in what we see from like a video generation model to a world gen, a simulator? How do you kind of paint that?
Starting point is 00:07:04 Yeah. So I think this is actually a little bit subtle because, you know, people look at these amazing generative AI video models, Sora, Veo 3, Genie, one of these things, and they think, oh, this is amazing. This is sort of, you know, we've solved understanding the world because you can produce these generative AI videos. But the reality is that although the visuals do look fantastic, those visuals actually aren't accompanied by an understanding of the 3D world, understanding how objects can move, what the consequences of different actions are.
Starting point is 00:07:48 And that's what's really needed for spatial intelligence. So, I mean, a term we sometimes use is that you need action-conditioned world models, that you only actually have a world model if you can predict, given some action is taken, what is going to change in the world because of it. And in particular, that becomes hard over longer time scales. So if you're simply, you know, trying to predict the next video frame, that's not so difficult. But what you actually want to do is understand the consequences, likely consequences of actions, minutes into the future. And to do that, you actually need much more of an abstracted
Starting point is 00:08:33 semantic model of the world. Yeah. The question comes where you want to have more structure than is available in just predicting the next token. And typically, well, let's call it the experience of the last five years has been that it is just washed away by scale, right? So what is the right middle ground here? That you don't ignore the bitter lesson, but also you can be more efficient than what we're doing today. You know, one possibility is, look, if we just collect masses and masses and masses,
Starting point is 00:09:11 and masses of video data, this problem will be solved. Under certain assumptions that could be true, but there are sort of multiple avenues in which it could not be true. The first is what's really essential is understanding the consequences of actions, producing an action-conditioned world model. And if you're simply collecting observational video data, which is the easy stuff to collect when you're sort of mining online videos, you don't actually know the actions that are being taken to see how the video is changing. And so if you're never collecting actions directly and you're having to try and infer them from what happened in the observed video, that's not impossible, but it's very hard and it's not really established that you can get that to work at any scale yet.
Starting point is 00:10:11 And so there's a lot of premium on collecting action-conditioned video data, which is part of why there's been a lot of interest in using simulation so that you can be collecting data where you do know the actions, which is in quite limited supply. But there's also, in the limit of as much data as you could possibly have, you know, maybe the problem is eventually solvable, but even though we collect huge amounts of text data, text data is always at a great level of abstraction, right? Language is a human-designed, abstracted representation, where there's
Starting point is 00:10:53 meaning in each token, and it's representing an abstraction of the world, right? As soon as you're describing someone as a professor, and as soon as you're saying that they're condescending, right, you know, these are very abstracted descriptions of the world, not at all what you're observing at the pixel level. And so to get to that kind of degree of abstraction, starting from pixels, is orders of magnitude of extra data and processing. And so although, you know,
Starting point is 00:11:26 we absolutely want to exploit, get as much data as possible, use the bitter lesson. Nevertheless, if there are ways in which you can work with five orders of magnitude, less data than people working purely from pixels, you're going to be able to make a lot more progress, a lot more quickly, and that's the bet here. And so you could just say that's only wanting to be able to, you know,
Starting point is 00:11:52 do it more efficiently, do it more quickly, do it more cheaply. But I think it's actually more than that. I think one should be making the analogy to how human beings work. At one level, you know, yes, we have these high resolution eyes and we can look and see a scene like a video. But all of the evidence from neuroscience and psychology is that most of what comes into people's eyes is never processed, right? That you're doing fairly fine processing of exactly what you're focusing on, but, you know, as soon as it's away from that, of, yeah, there's another guy over there, that you're sort of only processing top down this very abstracted semantic
Starting point is 00:12:40 description of the world around you. And so, you know, that's what human beings are doing. They're working with semantic abstractions. And so I think it is just the right representation because we also have other goals. We want to be able to do, you know, real-time worlds. That means there's a limit to how much processing you can do. And we want to do long-term planning and consistency, and again that favors abstraction. I mean, I guess there was actually a recent blog post that came out from our friends at Physical Intelligence and, you know, they were sort of heading in the same direction. They were saying, oh, to maintain a long-term memory of what's happening in the world. So we can do longer term. We're actually storing text of what has, you know,
Starting point is 00:13:32 been happening in the world, right? It's not such a successful strategy of trying to keep it all at a pixel level. And yeah, I mean, you can see it in video models like that. Temporal consistency. We're at a scale of train on, you know, all the video data we have. We have it for maybe 30 seconds, a few minutes. It's not the same as a game state played for half an hour, right? I thought you guys break it down pretty well. You have a blog post about building multimodal worlds with an agent. I don't know if you guys want to talk about this. This is one of the things I read. I think the thing I talked about with the reasoning chain, yeah.
Starting point is 00:14:06 So there's like different phases to this. It seems like it's more of an agent, a scaffold, very different approach than just, you know, type in a prompt and you don't have the same consistency. It also like for people that are listening, you know, I would highly recommend reading it. It breaks down the problem in a different light, right? So like what do you need to consider when you're talking about video, like world game models, right? How would what do you need to consider? What are the factors? What are the elements? What's the state? So I don't know if you guys have stuff to talk about for this one. Yeah. Actually, I wanted to add on a little bit to our previous point, which is just like, I do feel like sometimes people confuse like, oh, like we're taking a method with abstraction.
Starting point is 00:14:48 That means they don't believe in the bitter lesson. Like that's just false. Right. We are believers in the bitter lesson. But then I feel like the question that we always discuss is like what is the right abstraction level today. The analogy I like to make is like, let's just say we can encode and decode, represent all of images, videos, audio, in bytes. Then the most bitter lesson approach is to train
Starting point is 00:15:11 a next byte prediction model as opposed to a next token prediction model where it's just like, okay, it's natively multimodal. But it's like, well, yeah, like to Chris's point, it's like the scale and compute you need to achieve that. So that's why we always come back to like, okay, what is the most efficient way to do it?
Starting point is 00:15:26 And reasoning models to the point of this blog post is a showcase of like, hey, we're actually just like reasoning about the world and reasoning about the aspects of the world that matter for me to learn what I want to learn from this world model. Yeah, it's like you're improving the encoder of whatever you're trying to model. And like a better representation would just represent the important things in less space. Yeah, which would just be more efficient. So yeah, I fully agree that it is not
Starting point is 00:15:59 antagonistic to the bitter lesson. I do want to mention one more thing. Are there any philosophical differences with the JEPA stuff that Yann LeCun is working on? I got to go there. You're mentioning like some latent abstraction. I'm like, okay, fine, let's talk about it, right? Like it's an elephant in the room. Yeah, there are philosophical differences. Yann LeCun is a dear friend of mine, but he has never appreciated the power of language in particular or symbolic representations in general. Yann is a very visual thinker. He always wants to claim that he thinks visually and there are no words, symbols, or math in his head.
Starting point is 00:16:46 Maybe that's true of Yann. It's certainly not the way I think. But at any rate, you know, the world according to Yann is that the basic stuff of the world and of intelligence is visual, and language is just this low bitrate communication mechanism between humans, and it doesn't have much other utility, and it's far inferior to the high bitrate video that comes into your eyes. And I think he's fundamentally missing a number of important things there, right? Think of this evolutionary argument looking at animals, right?
Starting point is 00:17:34 The closest analogies are things with chimps, right? So chimpanzees, you know, have fairly similar brains to human beings. They have great vision systems. They have great memory systems. They've got, you know, better memory than we do of short-term memories. They can plan, they can build primitive tools. But, you know, humans are massively ahead in what we understand about the world, what we can plan, what we can build. And essentially what took off for us was that humans managed to develop language, and that gave a symbolic knowledge representation and reasoning level,
Starting point is 00:18:20 which just gave this sort of vaulting of what could be done with the intelligence in brains. So the philosopher Dan Dennett refers to language as a cognitive tool and argues that humans unique among the creatures in the world have managed to build their own cognitive tool, And language is the famous first example, but other things like mathematics and programming languages are also cognitive tools. They give you an ability to think in abstractions, in extended causal reasoning chains, and that allows you to do much more.
Starting point is 00:19:02 And we use that for spatial representation and intelligence and planning and gameplay as well. So we believe, and this is underlying the specific technologies that Moonlake is making, that symbolic representations are powerful and you want to use them in your understanding of the visual world, when you want a causal understanding, when you want to maintain long-term consistency and prediction. And, you know, as I understand it, that's just not in Yann LeCun's worldview. So I think that's a fundamental philosophical difference. Then there's the specific model he's been advancing, JEPA. I mean, that's a reasonable research bet as a direction to head for building out a model of the visual world.
Starting point is 00:19:57 To my mind, it's sort of one reasonable research bet. It's not really established that it's the best one that everyone should be following. At least developed at scale at Meta, but it's not just vision, right? Like, I mean, JEPA is, you know, just joint embedding prediction; it can be applied to anything, really. And people have done it. If the argument is that there is a latent representation that is probably more suited to the task, then why not let machines do it for us instead of pre-defining it at all? And isn't something like a JEPA-shaped thing the right answer? And if not, why not?
Starting point is 00:20:25 And isn't something like a JEPA-shaped thing the right answer? And if not, why not? So I think there's a part of JEPA that's right, which is, you do want to have a joint embedding that gives you a consistent model of the world. And Yann's argument is you can never get that from auto-regressive language models because they're sort of left to right churning out one token at a time. I guess this is where we're, you know, the research arguments of the field. You know, I'm not actually convinced that's right,
Starting point is 00:21:05 because although the token production is this auto-regressive process that's heading, you know, left to right, I guess it doesn't have to be left to right, but anyway, in sequence of tokens, we could have right to left Arabic. But, you know, although that's true, all of the weights of the model that are internal to the transformer, they are a joint model
Starting point is 00:21:32 of the world. And so I think you can think of the weights of the model as a form of joint representation and therefore it is plausible to think that that could be the basis of a world model which avoids Yann's objections. I think I follow and obviously that will touch on what Moonlake eventually ends up doing as well right now, which it's hard to tell because you put out the end results but we don't know the inputs that go into it. So it's that's like you know that's that's something that we have to figure out over time. Yeah. I mean I guess this kind of breaks down some of the outputs. Do you want to walk us through it? Yeah. So this really just walks us through the
Starting point is 00:22:16 reasoning traces of like, okay, so let's just say we want to build the world. In this context, it's really just a game demo that shows the variety of interactions that this world model can build. And yeah, it's really just the reasoning traces of like, okay, you're prompted to create a bowling game. Like, how did it achieve what you saw, that level of causality, interaction, and consistency, right? So, yeah, this is almost just like an example of the reasoning traces. Very detailed. Very, very detailed.
Starting point is 00:22:46 You got to like, you don't even realize it. Right? Like when a video is generated, what happens when a ball strikes a pin, right? So first, like, there's audio in that, like audio triggers happen, score increments. The world changes like pins up to start dropping. There's a timer that goes on. You know, it's just like very similar to how now we're used to reasoning for language models. There's a whole state of what happened. So geometry, physics, all this stuff. And then there's kind of that single prompt. So asset, um,
Starting point is 00:23:15 Physication, all this stuff. It's like a, it's a nice view to see what's going on. I think Sun is also too polite to point out that, uh, both Google's Genie demos as well as, uh, World Labs' Marble do not have interactive worlds. That's the benefit of having a reasoning model, right? Because you can say, oh, like maybe in this particular context, I want to learn how to bowl. And then you can say, okay, then what is important when it comes to learning how to bowl? Okay, maybe it's like, I need to understand the basics of like physics and I want to throw it over them. I want to know that when I, when it resets, it's a new game. So I know that, yeah, basically, you know, you know to pick up the ball, you know that ball's going to cause the pins to fall down.
Starting point is 00:24:00 You know that what's important to this particular bowling game is to score. And you know that the score corresponds to the number of pins that fell down. So it's just like if it's a model that sort of knows what it looks like, knows what a bowling game looks like, but doesn't actually allow you to practice over and over again and to understand, oh, like what it takes to actually get a high score, then it sort of doesn't actually allow you to learn what you set out to learn within the world model. Right. And I think this is really just one example of showing like the advantages of the approach that we're taking over most of, let's call it, the zeitgeist today when people talk about clinical world models, right? So it sort of seems like the question to ask when there's a world model
Starting point is 00:24:49 is: can I not only just wander around the world and look at the beautiful graphics, but can I interact with the objects in the world and see the right consequences of actions? And you also understand what the consequences would be if you do something, right? So it's not just like, okay, there's one thing. If I pick it up, something will happen. But, you know, there's 50 options. And I know I can expect, I can infer what would happen if I do any of them, right? So it's very different when you can actually see it, play around with it. There's two cheeky elements of that. I mean, the sort of, I guess, less ambitious one is, let's really establish it for listeners.
Starting point is 00:25:30 Why is this fundamentally different than writing Unity code, right? Just creating a model to translate a prompt into Unity code. So there is an underlying physics engine. In that sense, there are some overlapping things with Unity. But the way we think about it is like physics engines or tools or code are cognitive tools, like borrowing Chris's term, right? Like tools that the model can employ as means to an end. So today maybe you say, okay, in this particular context, we care about
Starting point is 00:26:02 physics, we care about the long-term causal consequences, then yes, we deploy a physics engine. And then maybe tomorrow we say, okay, we're training, let's just say, drones, where we only care about really fluid dynamics and the visual aspect of the world, then yeah, maybe the model actually doesn't have to use a physics engine, or maybe it employs other types of representation or physics engine to achieve the task. So yes, writing code for Unity is sort of similar to a tool that our model can employ, but our goal is for the model to take a representation-conditioned reasoning approach or process internally. Yeah. Using these things as just like general tool calls, right, which I think is very interesting.
Starting point is 00:26:47 The other more ambitious one is some kind of recursive element where it becomes multiplayer, right? Like here there's a single player element. You're not modeling any other people involved and that is a whole other thing. But in fact, we can already do multiplayer. Oh yeah. Okay. I haven't seen any demo. So if you actually just prompt our model to say, hey, like, configure multiplayer, then it'll do that. It'll be able to configure multiplayer persistence in your database for you. Easy. Yeah. So what are like some of the current limitations in where we're at? So there's one approach of like, okay, scale up video predictors, obviously there's data issues. You know, with approaches like this, is it data constraints? What are like the next
Starting point is 00:27:26 steps? Is it real time? Like, so there's one side of, you know, write an agent to write Unity code, but okay, I want to be streaming a game in real time. I want to have characters also being like agentic, but where do we kind of see this scaling up, right? Yeah, there's definitely a data constraint. Like the more data, the better this reasoning model can almost basically act as humans do to operate a variety of tools and software to build whatever is necessary. And then there's a sort of fidelity constraint, which we're actually solving with another model, Reverie, which we can
Starting point is 00:28:00 talk about later. But it's like, well, it's not as easy to get to photorealism with the approach that we're taking. But we think there are better solutions to that, which we can dive into later. One thing you note here is it's a diffusion model, right? So
Starting point is 00:28:15 there's a few approaches, diffusion, Gaussian splatting. Yeah, so Reverie, the diffusion model, do you guys want to introduce it? Yeah, totally. So within our world modeling framework, we think there are two models that we train, right? Like there's the multimodal reasoning model that we just talked about that essentially handles mainly the causality, the persistency, and the logic and determinism of the world. And then Reverie is our bet on saying, okay, like, while that model can take care of all these things that we just talked about, its limitation compared to existing video models is that it doesn't have as high of a pixel fidelity right out of the gate, right?
Starting point is 00:29:01 And Reverie is to say, hey, we can actually take whatever persistent representation that we generate with our multimodal reasoning model and learn to restyle it into photorealistic styles or arbitrary styles you want. So this model is almost to say, hey, I'm going to respect the persistency and interactivity of the world that you created. But my only job is to make sure that its pixel distribution is close to what we want. Yeah. Yeah, for example, right there. You cap the KL divergence. Oh, where?
Starting point is 00:29:34 No, no, I mean, this is a classic, like, how you don't stray too far from the source material as you cap the KL, which is kind of cool. Yeah, yeah. I mean, and the difference is, and I mean, Sun was pointing at this, where it's sort of saying it's in one way a more difficult path but a better path. You know, typically the diffusion models are producing the whole scene and it looks lovely, but there isn't spatial understanding behind it, which is what allows for the real-time graphics gameplay, the spatial intelligence, understanding the consequences of worlds. Whereas this is taking a path where it is assuming an abstracted semantic model of the world, the world's state. And then the diffusion model is then being used on top of that to produce the high-quality graphics.
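The host's "cap the KL" remark can be made concrete with a small sketch. It computes the KL divergence between two discrete distributions (for instance, coarse histograms over pixel intensities) and checks it against a cap. Nothing in the conversation confirms that Reverie is trained with a KL constraint; the function names, the direction of the divergence, and the cap value are all assumptions for illustration.

```python
import math


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions given as equal-length lists."""
    # Terms with p_i == 0 contribute nothing; eps guards against log(0).
    return sum(
        pi * math.log((pi + eps) / (qi + eps))
        for pi, qi in zip(p, q)
        if pi > 0
    )


def within_kl_cap(restyled, source, cap):
    """True if the restyled distribution stays within `cap` nats of the source."""
    return kl_divergence(restyled, source) <= cap


# Hypothetical 2-bin histograms: the restyled frame drifts toward bin 0.
source_hist = [0.5, 0.5]
restyled_hist = [0.9, 0.1]
drift = kl_divergence(restyled_hist, source_hist)  # about 0.37 nats
```

The idea is only that restyling is free to change appearance as long as a measured distance from the source stays under a chosen budget.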
Starting point is 00:30:28 Is there an intended practical or business use for this, or is it like a demonstration of capabilities? We actually believe that this is going to be the next paradigm of rendering. So it's going to replace rasterizers. It's going to replace DLSS today. Because it not only has this pixel prior that's learned from the world, such that you can literally play any game in photorealistic styles, which is a lot of people's desire when they do GTA, right? Like all the mods, all the people adding perfect lighting and all this. So skins for worlds, let's call it.
Starting point is 00:30:58 Skins. That's called skins. That's called skins for worlds. You can call it customization. You can play it how you want, right? Yeah, exactly. And I think another thing that we really pointed out specifically in this blog is the programmability of it. Right. So what this means is that this renderer, well, historically renderer is always a derivative of the game state.
Starting point is 00:31:17 Right. You're saying, oh, here's the game state. I'm rendering out a frame. But here I'm saying, actually, this renderer can be part of the gameplay loop. I can say something along the lines of: upon getting 10 apples, for my weapon of choice, my bullets are going to turn into apples. And that's possible because we can basically dynamically have certain game states that trigger the preconditions to the renderer, such that the rendering is now part of the game
Starting point is 00:31:45 loop too. One thing is to just say, okay, it's appearance. But the second thing is also to say there's these novel interactions that are possible because this renderer now actually has priors of the world. It is up to the artist to figure out what to do with it. It is up to the creators. Yeah. Yeah. And I also think that's actually another big argument that we're making, and the reason that we're taking the bet we're taking is that a lot of the time, whether it's for embodied AI or gaming, you want a layer where humans can inject their intentions,
Starting point is 00:32:17 right? So for example, let's just say in the context of gaming, it's obviously like my creative intent, but maybe in the context of embodied AI, it's like, oh, like I take this foundational policy and I want to actually fine-tune it to deploy in my house. So you want to almost have a layer where a human can say, oh, here's the distribution of things I want to create to achieve my goal. And I think 3D graphics as it is today is basically the layer for people to say, hey, what do I care about in this world? And it allows, um, basically human intent to be expressed in these worlds much more explicitly and distributionally, as opposed to just saying, hey, I'm going to generate like arbitrary. And it's like just prompts,
Starting point is 00:32:59 you know. It's one of those things where like I think you're going to build up a series of models, right? This is just one of them. This is probably like the highest utility or highest frequency one. I don't know what to call this, where like, yeah, you can immediately drop this in on any game and you don't need anything else that you guys do. I could see that. I think the human intent is something that people are not even used to because we're so used to static worlds or, you know, worlds that just don't react or, I don't know. You're kind of blowing my mind right now. Well, I wonder if you've talked to people at GDC, and what are they going to do with it? Yeah. Now the stance that we take on this front is like we're not going to be more creative than
Starting point is 00:33:41 our users. Ship it out. Yeah. But we want to make sure that we're building things in a way that really allows them to express their intent. The thing that you said about here's the distribution that I want: I think text may be too low of a bandwidth to really demonstrate, because, you know, I'm probably just going to want to drop in a bunch of reference assets and then you can figure it out from there. You probably want to do a mixture of both, right? Like you throw in a few images: I want this style, I want it to look like this. It's a mixture, right? I think it's a mixture. I mean, yeah, there's clearly a visual component of this and it's not that
Starting point is 00:34:20 you know, everything can be text because of course you want to give a visual look. But there's also a massive amount of giving the overall picture of the look of the world and the behavior of things that you can express in a few words of text and that would be very time-consuming and difficult to do via visual means. So I think, yeah, you want a combination of both. So one question I kind of have is how do we go about evaluating world models? So like there's many axes, right? One is like, okay, I have preferences.
Starting point is 00:34:57 How well do we adhere to prompts? One is the simulation. One is like do things, is there core logic that's broken? So coming from we know how to evaluate diffusion, there's fidelity, there's stuff like that. But what are some of the challenges that most people probably aren't thinking about? Yeah, I think this is like a great question and probably one of the hardest questions in world models. because I think it always comes back to what are you building this world model for? And depending on your end goal and purpose, the evaluation should differ.
Starting point is 00:35:26 So in the context of games, the most direct way of measuring is how much time people are actually spending in this world that you create. And if your goal is, for example, in the context that we just talked about, deploying an embodied agent, then your end metric is, okay, after training in these worlds that you generate, how robust it is when you actually deploy it to the target environment. But then, you know, it's hard to measure these end metrics. So today people have these proxy metrics, as I call them, that basically try to measure what we really care about, which is the end metrics. But then frankly, it's different for every use case. Yeah.
Starting point is 00:36:07 Which seems like quite a challenge, right? Like in language models or video models, image models, your benchmarks are proxies, right? People aren't actually asking instruction-following or tool-use questions. They're proxies of how well it will do downstream. But for this, like, you know, should teams, should companies have their own individual benchmarks? Outside of games, if you think of stuff like, okay, video production, movies, stuff like that, that also want to use world models, should they sort of internalize their own proxies? Is this something you guys do? Where does that kind of
Starting point is 00:36:39 Yeah, I think this whole space is extremely difficult as things are emerging now. And I mean, it's not only for world models. I think it's for everything, including text-based models, right? Because, you know, in the early days, it seemed very easy to have good benchmarks, because we could do things like question answering benchmarks, and could you answer the question based on these documents, and the various other kinds of, you know, pieces of logical reasoning or math. But again, these are sort of, and there are sort of visual equivalents of things like object recognition, right?
Starting point is 00:37:16 for these small component tasks. But these days, so much of what people are wanting to do, also with language models, is nothing like that, right? You're wanting to have an interaction with the language model and get some recommendations about which backpack would be best for you for your trip in Europe next month. And it's not the same kind of thing, right? And it's not so easy to come up with a benchmark as to whether this large language model gives you an effective interaction for guiding you in a good way for
Starting point is 00:37:54 shopping, right? So, and it's the same problem with these world models. So if we take the game design case, well, success is that a game designer can produce what they are imagining in a reasonable amount of time. And that's really the kind of macro task. But that's a very hard thing to turn into a benchmark. And I think a lot of this is actually going to turn into people voting with their feet, right? I mean, I guess that's what's happening at the large language model level, right, when people are choosing to use, you know, GPT-5 or Gemini
Starting point is 00:38:43 or Claude, you know, individuals are trying out these different models and deciding, oh, I like the kind of answers that GPT-5 gives me, or no, I feel like I get more accurate detail from Claude, right? It's a lot of vibe checking. I realize that, but it's actually whether people feel it's giving them utility and what they want, right? And the interesting thing there is like a lot of people prefer the visual, right? This looks pretty, which is not the objective of what this is for, right? If a game designer is working on something, they care about the game engine state. It can look whatever, you can fix that up later. Or you can have a really good game state and you can quickly edit it to 20 different versions that keep state, right? So that's a really important distinction, um, and for speaking to Moonlake's strength,
Starting point is 00:39:36 right? So yeah, I mean, you know, great visuals are lovely to look at for a few seconds, but games are really all about the concept, the gameplay. And, you know, a lot of the time that doesn't actually even require great visuals. I mean, there are just lots of very successful games which have relatively primitive visuals. And there are other games where people have spent millions producing photorealistic visuals and the game sucks, right? So keeping those two axes apart is really important in thinking about what's important in a world model for different uses. This conversation is reminding me of some game review and fiction discussions I've had in my sort of non-AI related life. Some people might know Brandon Sanderson, who's a very famous fiction author, is a big, big game reviewer.
Starting point is 00:40:38 And he's a big fan of video games where you change one thing about what you might normally assume about the world. For example, Baba Is You. I don't know if you might have come across that, where the rules change as you play the game. And also where you can do things like reverse time selectively or change gravity selectively. I think this also reminds me of other kinds of world models
Starting point is 00:41:01 that are created by authors, where Ted Chiang is my typical example, where he will take the world that you know today, but change one thing about it, and then create a consistent world based on that. Which is a long-winded way for me to say it's easy to create alternative worlds that don't exist, where you change one thing. And then let's run a whole bunch of people through it to see if it works.
Starting point is 00:41:22 My first answer will be that seems a lot easier and more conceivable to do using technology like Moonlake's than with some of the other world models out there, where Sun can actually make it happen. I'll let him give the second answer. I guess for you, you're constrained by the game engine tool, right? Like at the end of the day, that's the thought partner that you have. If I ask for something where like it never is allowed to reverse time, or if gravity only ever works one way, then well, that's it.
Starting point is 00:41:56 But sometimes gravity might change. But it's a lot easier to change with code as opposed to a model that is learned primarily on data of real worlds and virtual worlds that are, I guess like, for example, Genie, like, it's actually trained on a lot of real-world data and a lot of virtual gaming data. And it's hard to say, well, maybe it's easy to say, okay, I want to change the visuals or like the time period of the world. But you can't change gravity, for example. I feel like you can, to light bounds, right? Everything comes down to like code is a better way to execute it. But the models aren't that diverse and creative, right?
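To illustrate why "it's a lot easier to change with code," here is a minimal sketch of a falling object where gravity is just a function that can be swapped out. The integrator, the function names, and the numeric values are assumptions made for illustration and do not correspond to any particular game engine.

```python
def simulate_fall(height, gravity_fn, dt=0.01, max_time=60.0):
    """Drop an object from `height` metres; return seconds until it lands.

    Uses semi-implicit Euler integration. `gravity_fn(t, y)` returns the
    downward acceleration, so the rule can vary with time or position.
    `max_time` guards against rules under which the object never lands.
    """
    y, v, t = height, 0.0, 0.0
    while y > 0.0 and t < max_time:
        v -= gravity_fn(t, y) * dt  # update velocity first
        y += v * dt                 # then position with the new velocity
        t += dt
    return t


def earth(t, y):
    return 9.81


def quarter_gravity(t, y):
    return 9.81 / 4


def weak_above_five_metres(t, y):
    # A "selective" rule: gravity is weak high up, normal near the ground.
    return 1.0 if y > 5.0 else 9.81


t_earth = simulate_fall(10.0, earth)
t_slow = simulate_fall(10.0, quarter_gravity)
t_selective = simulate_fall(10.0, weak_above_five_metres)
```

Changing the physics here is a one-line swap of `gravity_fn`, whereas a model trained mostly on footage of ordinary gravity has no equivalent switch to flip.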
Starting point is 00:42:34 You can say, okay, make gravity slower. It can do that. but it's limited to your representation of how you text it out, right? Like they're only going to do a few iterations, whereas programmatically, you know, if there's a game engine under the hood, you can, you can kind of go wild, right? So one of the, I don't know, one of the limitations of most models is that they're very overtrained to one style, right? And extracting diversity is pretty difficult, at least.
Starting point is 00:43:00 That's something we've seen. I mean, are there other examples you have in mind where, with existing models, it would be easier to do something that's not using code, like certain types of creative intent? Or like, you know, clipping. Uh, other world models are very good at clipping through things, my legs clipping through a rock, because, you know, it's just bad. Like you would have to struggle very hard with your stuff to actually make that happen, which I think may be a topic that you actually prepared on, uh, Gaussian splatting versus the other stuff.
Starting point is 00:43:39 Yeah, yeah. It's just for those not super familiar, right? There's Gaussian splatting. There is diffusion. Like what works, what scales up? I feel like in February when Sora 1 came out, the blog post was literally titled, like, you bring it up, you never know,
Starting point is 00:43:54 "Video generation models as world simulators." It's super bitter-lesson-pilled. Yeah, a lot of it is emergence, right? So not to go through their blog post. Basically, their whole thing was, as you scale up, all this consistency, all this stuff just kind of solves. It's a very simple premise, right?
Starting point is 00:44:12 They just scaled up diffusion. And from there, you know, this is Feb 2024. How much can we, it's already been two years, which is basically five years, you know, how much more AI time do we need to just scale up? Or do we hit a data cap? But I think we already talked about this a lot, right? Like this is back to the beginning discussion
Starting point is 00:44:30 of what's appropriate for the time. And that seems like your approach, right? Yeah. The point I'm trying to make is, that there are many, many different types of world simulators. And like having a world simulator that can produce pixel coherency is very, very useful for games and, you know, marketing and all these things. But it's not as useful as people think when it comes to causal reasoning,
Starting point is 00:44:57 when it comes to embodied AI. And yeah, like, this title is true. Like we're not saying that it's, you know, not a great world simulator. But actually in the blog that we wrote, the bet is more so that there's going to be a disproportionately large share of value in real-world tasks or in virtual tasks where high-resolution pixel fidelity is not needed. And yes, video models have their value. Yeah. This is at the absolute limit of my physics understanding, but one example that comes to mind is basically having to solve the equivalent of a three-body problem in a deterministic world,
Starting point is 00:45:36 whereas the video models would just approximate it good enough. Yeah. Right, like there's some point at which your approach kind of runs into, like, the, well, you now have to simulate the world, please. Thank you very much. And like, you're trying to do that, but only to the extent that the game engine lets you. And like, game engines cannot do some things.
Starting point is 00:45:57 Yeah. No, I mean, I think the interesting or more technical question here actually is where do you draw the boundary between what's handled with, let's say, a diffusion prior and what's handled with symbolic priors? Yes. Okay. Right. Because like this boundary can actually be fluid.
Starting point is 00:46:17 Like I think maybe what you're trying to get at is like, okay, people are saying pixel prior for everything. But what we're saying is, okay, there's a boundary that we draw where we think it provides the most economic value for the domains and things that we care about today. And I actually do think, and it's just something that we do internally all the time, which is like, okay, given new equations that we learn or new elements of the world that we learn or maybe some other knowledge that we acquire in the process of developing the models, should we still be maintaining this line exactly as it is today or should we move it a little bit left or a little bit right?
Starting point is 00:46:59 Right? Like sometimes we realize that, oh, like, maybe customers or folks want certain things that are better handled with a pixel prior as opposed to a symbolic prior. Yeah, your skin thing is an example of moving it right. Or left. Yeah, exactly. I don't know what the left-right is. Yeah, yeah. No, the Reverie model, yes. Actually, we have a few iterations of them.
Starting point is 00:47:21 They're actually slightly different. I know. You should do that. That's a cool dimension to show. Yeah. Is quantum mechanics the, the diffusion prior of our world? Right?
Starting point is 00:47:34 It's like that's the boundary of classical mechanics versus quantum, right? Like that's it, right? At one point, God plays dice and the other point doesn't. I don't know. I don't know if Chris wants to say, but I think generally I feel like physics is better with symbolic priors. Even quantum physics. Even quantum physics.
Starting point is 00:47:50 Yeah. This starts getting into MLST territory. That's what I call it, where he likes to get philosophical. We're quite friendly. I mean, we need to get singularity. I heard some of that. No, I think that is actually really helpful. And, man, I just want you to productize this.
Starting point is 00:48:10 Like, as a product guy, I'm just like, well, okay. Like, a researcher, you know, like, it's cool. Like, this is a theoretical. Like, you have a very good, I don't know, like, the way of thinking about these things. But I just want to see you, like, you know, express it. I do think, like, you're fundamentally things. when you leave open new tools like okay use human intent to incorporate it into how you render well artists are going to have to take like two to three years to figure out what to do with this
Starting point is 00:48:39 and you just don't know. But I think, you know, this, um, gives a much more approachable and controllable world for the beauty of, uh, NLP that will enable it to be adopted and used, and we're very hopeful about that. Yeah. I mean, we are very focused actually on commercialization in the sense that, like, we do really believe in the data flywheel approach. Yeah. Where we put this in the hands of the creators and the users and then they will teach us what capability our model should improve. And that's why we are actually, you know, like a product in beta. Yeah. Focusing on gaming.
Starting point is 00:49:18 What's like the adjacent thing to gaming? Embodied AI, basically. So we can, I'll maybe start with where we see the platform in three years, which is like, okay, the users would tell us what they want to achieve. The end goal could be, hey, I want to make something to teach my kids the value of humility. Or it could be, hey, I want to fine-tune my drones to be really good at rescue situations. It could be vacuum robots. I want to like train my manipulation or like vacuum robot to be very robust to my office.
Starting point is 00:49:50 Right. But it's like whatever the scenario is. Robust to my office? Like navigate the area very robustly within my office. But then it's like whatever end goal you want, our world model will say, okay, given what you want to achieve, let me generate a distribution of environments such that I can train and evaluate whatever it is you want. Maybe for the purpose of games, it's just the end simulation, and that's the end product.
Starting point is 00:50:15 For certain policies, it's like, I can train it within these environments and then help you see where your policy is failing or not. And then, you know, so I think... So in that case, much more of a training tool than in other applications. Training, evaluation, both, right? Sure. Same thing. Yeah. I think it's just this world model that allows people to train any policy that can act in any multimodal environments.
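The "generate a distribution of environments, then train and evaluate a policy in them" idea can be sketched in a few lines. Everything here is hypothetical: the environment parameters, the stand-in policy, and the success criterion are placeholders chosen only to show the shape of the loop, including how failures are collected so you can see where a policy breaks.

```python
import random


def sample_environment(rng):
    """Draw one hypothetical office layout for a vacuum-robot style task."""
    return {
        "obstacles": rng.randint(0, 10),
        "floor_friction": rng.uniform(0.2, 1.0),
    }


def evaluate(policy, n_envs=100, seed=0):
    """Run `policy` across sampled environments; return success rate and failures."""
    rng = random.Random(seed)  # seeded so evaluations are reproducible
    failures = []
    for _ in range(n_envs):
        env = sample_environment(rng)
        if not policy(env):
            failures.append(env)  # keep the failing cases for inspection
    success_rate = 1.0 - len(failures) / n_envs
    return success_rate, failures


def cautious_policy(env):
    # Stand-in policy: pretend it only succeeds in uncluttered rooms.
    return env["obstacles"] <= 6


rate, failed_envs = evaluate(cautious_policy, n_envs=200, seed=1)
```

The list of failing environments is the useful part: it shows which slice of the distribution the policy is not robust to, which matches the "help you see where your policy is failing" framing.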
Starting point is 00:50:37 Would it be harder to reward hack? Is there an angle here where it is harder to reward hack? I'll just put it generally. Because I think that's obviously a key problem that a lot of people face when training agents in these environments. And I don't know. Can you solve it? I think not necessarily.
Starting point is 00:50:56 I mean, to the extent that there's a misspecified reward, it seems like it could be hacked in a more symbolic world or in a more pixel-based world. I don't know if Sun's got any thoughts, but I don't think that's really being solved. The other thing that comes to mind is you could just build a better Sora as a video generation model, right? Because then you would move the diffusion side a bit further to the right, I think, if I got the directionality correct. And that's it. It's better on domains, right? Like on consistency over an hour, for sure.
Starting point is 00:51:33 It exists versus something doesn't, right? Yeah. Is your question more like? I'm just riffing on like, how do you, what can you build, you know, with the stuff that you have. I do think that the mind of the academic does go immediately to training and in evaluation. But like art tends to take unusual directions like you might end up. Okay.
Starting point is 00:51:55 Yeah. But the question is, can you use this piece of software to develop compelling gameplay? And I don't think you can take Sora and produce compelling gameplay, right? If you want to have a world that you can wander around in a bit, you're good. But what are your abilities to have gameplay mechanics implemented the way you'd like them to be and to have things stay, you know, with the long-term history of your gameplay that influences future actions? I think there's just nothing there for that. Yeah, I do tend to agree.
Starting point is 00:52:29 I'm just trying to sort of test the boundaries. I would also make the observation that as the AAA games industry has developed, the line between what is a movie and what is a game has blurred. And you do end up basically producing a two-hour movie as part of your game. No, honestly, there's actually so many applications in adjacent markets that our model can go into. Yeah. But yeah, it's sort of fun to riff on, although on the execution side, we need to stay focused with like, okay, what are the capabilities we want to unlock over time, and there's a roadmap for that.
Starting point is 00:53:03 But yeah, we're just riffing on sort of like the possibilities. I feel like it's endless. Yeah, it's like classic. And the embedding for "possibility" and "endless" in my mind is very close. Yeah. I do want to focus on one like weird choice. I don't know if it's weird. Maybe I've got something here. Audio, right? You could have just said no audio, and audio in my mind has a lot of recursion, whereas in video, you can just do raycasting and that's computationally much simpler. Audio just seems way harder. I don't know if you want to just comment on just the spatial 3D audio problem.
Starting point is 00:53:40 Did you really have to do it? I guess you do to be immersive, but like a lot of people do treat it as like, well, you just stick a TTS model on top of... Well, there's a lot more to game audio than just speech, right? It's not just TTS. Yeah. Spatial, in my mind, is echoes and reflections, and I don't even know what else. I don't know what the problems in this space are. Yeah, I think this point is sort of more pointing to the benefits of using a game engine as a tool that's available to the model, right? Because like part of the spatial audio is from the code that is underlying the simulation.
Starting point is 00:54:25 And while we do give our model access to other types of audio models as tools, none of them would be spatial, I think. Right, but that's exactly the point: we're giving our model an abstraction or a suite of tools such that it's able to achieve that. And you can argue that sort of spatial is like an emergence out of the tools, an abstraction that we provide to the agents. And I think that's the beauty of this approach. It's like, there's a lot of things, kind of like how humanity's built technology, and they're like Lego blocks that build on top of each other. And it's the same thing here. Like there's going to be things that
Starting point is 00:55:04 just sort of emerge from being able to put these things together in like combinatorially interesting ways. Right. So this integrated audio model exploits the understanding and semantics of the Moonlake world, right? And whereas in general, for the Gen AI video models, there's no actual integration across to audio at all, right? Someone might stick some music or stick a soundscape or whatever else on top of their video, so it's not a silent video.
Starting point is 00:55:40 But they're in no way connected into a consistent world model, and there's nothing that says, okay, an action is happening in the video, therefore there should be a sound that's coming from this part of the visual field. Yeah. Is that different than Sora? Does it not have audio? Not to say it's not like... There's no spatial audio. It doesn't? No.
Starting point is 00:56:06 I've played around with it enough. It just sounds like someone put an ElevenLabs voice on top of it and just tried to do the lip sync. I mean, I've seen, okay, generate a dog at the beach and reactions to a big wave and move around. It's definitely like early, early... Have the dog move away from camera and see if the sound goes down. It doesn't, right? Because they don't have spatial audio.
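The "have the dog move away from camera and see if the sound goes down" test is straightforward when audio is derived from simulation state. A minimal sketch, assuming a 2D world, a listener facing the +y direction, inverse-distance attenuation, and equal-power stereo panning; the function name and parameters are illustrative and not taken from any specific engine.

```python
import math


def spatial_gains(listener_xy, source_xy, ref_distance=1.0):
    """Return (left_gain, right_gain) for a sound source in a 2D world."""
    dx = source_xy[0] - listener_xy[0]
    dy = source_xy[1] - listener_xy[1]
    distance = math.hypot(dx, dy)
    # Inverse-distance attenuation, with full volume inside ref_distance.
    attenuation = ref_distance / max(distance, ref_distance)
    # Pan in [-1, 1]: -1 is fully left, +1 is fully right.
    pan = 0.0 if distance == 0 else dx / distance
    # Equal-power panning maps the pan value to an angle in [0, pi/2].
    angle = (pan + 1.0) * math.pi / 4.0
    return attenuation * math.cos(angle), attenuation * math.sin(angle)


near_dog = spatial_gains((0.0, 0.0), (0.0, 2.0))    # dog 2 m straight ahead
far_dog = spatial_gains((0.0, 0.0), (0.0, 20.0))    # dog 20 m straight ahead
dog_on_right = spatial_gains((0.0, 0.0), (3.0, 0.0))
```

Because the gains are computed from positions the simulation already tracks, the sound gets quieter as the source moves away without any separate audio model needing to understand the scene.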
Starting point is 00:56:27 Basically, our multimodal model, the one we're training, is working towards the goal of having a combined latent representation across all these different modalities, such that you can reason across them. So for example, if I close my eyes and you play a sound of cars skidding away from me, I can almost visually extrapolate
Starting point is 00:56:50 that trajectory in my mind. And I think that's the type of capability we want our model to have, right? And that's the reason we're taking this multimodal reasoning approach. We want a combined latent space that can... Yeah. Oh, you said latent space. We like that here. We have to play the bell every time someone says latent space. No, you've got to train a Daredevil one, where it's only audio but you have to work out where everything is. Cool. I think that's about it for our Moonlake coverage. I do think that we have a couple of Chris Manning questions on IR and any other attention topics or NLP topics. Okay.
Starting point is 00:57:29 I mean, yeah, it's just fun. You know, we talked a bit about how you guys met, but basically you were like the godfather of NLP per se, right? You spent your whole career on it, from early embeddings to early attention. You did 2015 attention for machine translation, everything. You had information retrieval, so RAG before RAG. You know, we just want to shout that out and admire a lot of that, right? So what prompted the switch over to world models?
Starting point is 00:57:55 How'd all that come about? To some extent, the answer is the enthusiasm and creativity of students. But there's a bit of a history there, right? So yeah, clearly most of my career has been doing stuff with language, and how I got into research was thinking, oh, this is just so amazing, how humans can produce speech and understand each other in real time. And somehow they manage to learn languages as kids. How can this possibly happen?
Starting point is 00:58:26 And so, yeah, starting off, I was very focused on language. But as it got into the 2010s, I'd been working on question answering, and then I started to get interested in visual question answering. And that was an area where it was very noticeable that the visual understanding was bad, right? You know, these were the days when it sort of seemed like
Starting point is 00:59:00 there was almost no visual understanding. You were just getting answers that came from priors. So if you asked how many people are sitting at the table, it'd always answer two, regardless of how many people you could see in the picture. And so it seemed like, oh, these models actually aren't able to get semantic information out of images. And so I was interested in that problem and tried to work more on that. And that required knowing more about what's happening in vision
Starting point is 00:59:33 and how you can represent visual information. And then there started to be this revolution of generative AI images. And I had students that started looking at that. Before the era of Moonlake, I was also working with Demi Guo, who founded Pika. And Ian, obviously, with GANs. Yeah, though Ian was never my student. But yeah, I was very aware for the whole decade there of Ian with GANs. Yeah, and I mean, Ian was a Stanford undergrad. But yeah, Richard, who does You.com, I believe he was your student. Yeah. And you know, there were links across at that stage as well. So I mean, you know, there were
Starting point is 01:00:18 several papers in that era. I mean, Andrej Karpathy was a PhD student at the same time as Richard, and so there was some joint language-vision work in that era as well. You know, it seems kind of ancient by modern standards, but yeah, we were trying to go from sort of textual dependency graphs to visual scenes. At the time, the GloVe embeddings really took over from a lot of TF-IDF, one-hot encoding, all that. The early vision-language models we saw were like LLaVA-style adapters, right? It's technically still just an embedding latent space; let's add images, and it's mixed modality.
Starting point is 01:00:55 So that's one of the things you sort of put out there too, right? Yeah. Yeah. Yeah. Well, thank you for all of that. Thank you for advancing the world on world modeling. Honestly, I do think that if people deeply understand everything we just covered, they will see what's coming.
Starting point is 01:01:10 And I think you guys have, you know, made a really significant contribution here. What are you hiring for? Where do people find you? You know, we agreed that the CTA was a hiring call. Yeah. I mean, don't we have AGI? You don't need engineers anymore, right? Yeah.
Starting point is 01:01:25 On the model side, we are actually striving towards basically a self-improving system. But what that means is that we need people to set up the self-improving system. More specifically, people who have the intersection of knowledge within code generation and computer vision and graphics, right? Yeah. That's sort of the core research background that we look for within our team. And the majority of the team today do have both backgrounds. When you say computer vision and graphics, are they the same thing?
Starting point is 01:01:53 Or is computer vision one thing, graphics another thing? How intertwined are they? They're intertwined, but different. Yeah. And I think this relates to some of the themes that we've been talking about: the more explicit underlying world models that are being constructed inside Moonlake really draw on the computer graphics tradition, and so it's then combining that with the visual understanding of vision. Got it. Yeah. All right, so if you've written a game engine, you
Starting point is 01:02:31 come talk to us, right? Oh yeah, definitely. But I do think that the line is increasingly blurred these days, where it's like, if you have a general understanding of vision and graphics... I think for your standards it is. For me, it feels like vision is, you know, I'll leave that to the big labs. Graphics, I can get that you would want to do that from more first principles. But vision, there's so many vision models off the shelf that I can take, but probably not good enough for your... I see, I see. If you're making that distinction, then maybe we care a little bit more about having graphics knowledge.
Starting point is 01:03:05 Yeah, exactly. It could be like, you know, sometimes a hiring call can be as simple as, if you know the answer to blah, you should talk to me. You know, like the sort of core known hard problem in your world. Ah, I see. Yeah. In that case, definitely if you've written a game engine before, if you've RLed a variety of coding models on different objectives... Easy. Many of those, yeah. If you've done multimodal latent space alignment. I intentionally included latent space again. Our poor editor has to edit that in every time. Yeah, latent space alignment. Honestly, is it that hard? Well, there's some scripts out there that I've saved for the day I
Starting point is 01:03:46 someday have to do it, but I don't have to do it. But it's done, I think. Yeah, there are versions of that that are done, but I think we are aligning audio, text, language, and video, right? And basically we have these world models that are able to act as agents in these worlds, extract long-horizon videos, and encode that back into the model to sort of self-improve. So it's an insanely exciting but also technically challenging problem. So for people who want to do their life's best work, you know, this is the place. How big are you guys? Where are you guys based?
Starting point is 01:04:20 We're currently based in San Mateo, although we're moving up to SF. We're about 18 folks right now. My ending question was going to be, what's behind the name? Oh. Very cool graphics and design, by the way. Actually, at the time when we started the company, we were thinking a lot about how do we make a company name that gives people the vibe of, like,
Starting point is 01:04:44 OpenAI, but with almost like Industrial Light & Magic vibes. Because we care about creativity and using that as a funnel to solve AGI. So then we brainstormed a lot around, like, DreamWorks, right? Like Industrial Light & Magic. So there's basically a space of a few things that we feel are very semantically close to the company's identity. Yeah. And then it ended up being Moonlake partly because of the DreamWorks vibe, you know, the DreamWorks vibe.
Starting point is 01:05:14 You know, the DreamWorks. Exactly. So that was a little bit of that inspiration. And then the moon was sort of, it basically was about the reflection. The reflection part also implies the self-improvement loop. Wow. That we really believe in, and that's the path towards multimodal general intelligence. So that's that. I love a good name. This is great. It's very good lore. I'm glad I asked the question. I will also say, you know, one of my favorite story books or biographies ever is Creativity, Inc., with Ed Catmull's story about Pixar and how he, you know, was rejected as a Disney animation artist, so then he went into computing and brute-forced his way back into Disney. Yeah. And Walt Disney is also one of my favorite founders. His story, at the time, is like, okay, I'm going to create this immersive park. People don't even have the technology to create it virtually, but, you know what, they just build it physically so that people can... So he's the first world modeler.
Starting point is 01:06:15 No, I tell people that. Like, theme parks are world models too. Yeah, yeah, yeah. I mean, you know, It's a Small World, or the Epcot Center with all the little replicas of the countries. Yeah, those are very interesting. Okay, well, thank you. We've covered, you know, a huge amount. Thank you for your time.
Starting point is 01:06:32 And thank you for inspiring us. Thank you. It's been fun chatting. Yeah, it's been a good time.
