Latent Space: The AI Engineer Podcast - World Models & General Intuition: Khosla's largest bet since LLMs & OpenAI

Starting point is 00:00:00 Hi, listeners. As you may know, I recently wrapped up the AIE Code Conference in New York, and while I'm traveling, I do like to visit top AI startups in person to bring you interviews that you don't find on any other podcast that just does a Zoom call. General Intuition, or GI for short, is a spinout of a 10-year-old game clipping company called Medal, which has 12 million users. In comparison, Twitch only has 7 million monthly active streamers. Medal collects this data by building the best retroactive clipping software in the world. In other words, you don't need to be consciously recording; you just have Medal on in the background while you're playing, and you hit a button to clip the last 30 seconds after something interesting happens. It's very similar to how Tesla does self-driving bug reporting, if you've ever done a self-driving bug report in a Tesla. The result is that Medal has accumulated 3.8 billion clips of the best moments in games, resulting in one of the most unique and diverse data sets of peak human behavior, actively mined for the interesting moments. They were also very prescient in navigating privacy and data collection concerns by mapping actions to these visual inputs and game outcomes.

Starting point is 00:01:04 As you saw on our Fei-Fei Li and Justin Johnson episode with World Labs, and with the recent departure of Yann LeCun from Meta, there's a lot of interest in world models as the next frontier after LLMs, to improve on spatial intelligence and to work on embodied robotics use cases. DeepMind has been working on this with Genie 1-2 and SIMA 1-2, and this year, OpenAI seemed to finally agree, because they have been spending on LLMs a lot, and they made news by offering $500 million for Medal's video game clip data. Our guest today, Pim, turned down that money and chose to build an independent world model lab instead. Khosla Ventures led the $134 million seed round, which is Vinod Khosla's largest single seed bet since OpenAI. We were able to get

Starting point is 00:01:46 an exclusive preview of GI's models, which unfortunately we cannot show you directly. But I can confirm they were incredibly humanlike, and we chose to include the first 11 minutes of the demo discussion even though I couldn't show it to you. It may be hard to follow, but I tried to call out what was noteworthy for you to know as your likely reaction if you were watching along with us. Now, enjoy the world's first look at my first look at General Intuition. So what I'm about to show you is a completely vision-based agent

Starting point is 00:02:11 that's just seeing pixels and predicting actions the exact same way a human would. And so, yeah, what I'll show you here is what this looked like four months ago. So again, this is just an agent that's receiving frames, and it's just predicting actions. So you can see it has like a decent sense of being able to navigate around. It tabs the scoreboard, just like gamers always tab the scoreboard.

Starting point is 00:02:37 So this is pure imitation learning. I see. So the LZE is slicing the knife. Yeah, exactly. So it's doing everything like humans would. In this case, here was the first interesting part that we saw. Like it gets stuck, and then, it has memory as well, so you see it can get unstuck.

Starting point is 00:02:50 How long is the memory? Four seconds. Yeah, four seconds for the straight and color. So this was four months ago. This was maybe a few weeks after that. So you can see it's still doing the scoreboard thing, but it's still quite... and these are bots too. So you can see it.

Starting point is 00:03:06 It's very human. Let's just say that. Yeah. And then, right, so this was really like the early days of research, where you can see it goes for one thing and then goes for another. And then we've been scaling on data and compute. And also we've just been making the models better.

Starting point is 00:03:24 And this is where we are now. So what you're seeing is pure, like I said, pure imitation learning. This is just a base model. There's no RL, no fine-tuning. This model sees no game states. It is purely capable of sequence. It's purely predicting the actions from the frames. That's it.
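The frames-in, actions-out setup described here is, in generic terms, behavior cloning. As a minimal sketch only: the function name `make_bc_examples`, the string stand-ins for frames, and the `context_len` value are illustrative assumptions, not GI's actual pipeline. It pairs each window of recent frames with the action a human took at the end of that window, which is the supervised target an imitation-learning model would be trained to predict.

```python
from typing import List, Sequence, Tuple

Example = Tuple[Tuple[str, ...], str]


def make_bc_examples(
    frames: Sequence[str], actions: Sequence[str], context_len: int
) -> List[Example]:
    """Pair each window of up to `context_len` frames with the action at its last frame."""
    if len(frames) != len(actions):
        raise ValueError("frames and actions must be aligned one-to-one")
    if context_len < 1:
        raise ValueError("context_len must be at least 1")
    examples: List[Example] = []
    for t in range(len(frames)):
        start = max(0, t - context_len + 1)  # shorter windows at the start of a clip
        window = tuple(frames[start : t + 1])
        examples.append((window, actions[t]))
    return examples


# Hypothetical toy clip: frame IDs stand in for images, strings for action labels.
frames = ["f0", "f1", "f2", "f3", "f4"]
actions = ["forward", "forward", "turn_left", "forward", "shoot"]
examples = make_bc_examples(frames, actions, context_len=3)
```

With the four-second memory mentioned earlier in the demo and an assumed 30 frames per second, `context_len` would be 120; the frame rate is an assumption, not something stated in the conversation.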

Starting point is 00:03:43 And this is playing against real humans, just like a human would play. And it's also running completely in real time. So absolutely everything here plays exactly like a human. Do you give it a goal? No. It just figures out it's like a goal because obviously it's trained on by saying yes. And I picked a sequence where it also doesn't do well

Starting point is 00:04:09 initially, so you can see this is just a random sequence. But, I mean, it looks like it's doing very well. So, um. Oh, okay. Yeah, watch. Yeah, this is pretty good. Maybe too good.

Starting point is 00:04:30 Um, this is my favorite part. So you can see, it does something that, like, here, a human would never do this, then gets unstuck, then has four realize this, which, and then in the distance.

Starting point is 00:04:49 So you're saying, one, it makes a mistake that a human will never make, but it unsticks itself. And two, what we just saw is it is doing superhuman things. Yeah. Okay. Yeah. I mean, there are things that demon sit, obviously. But because it is trained on

Starting point is 00:05:04 the highlights, all the exceptional things, it's inheriting those. Yeah. So it's not like Move 37, where it RL'd its way into something. Yeah, it's replicating superhuman. Yeah, exactly. Or like, peak human. The baseline of our data set is peak human performance. Yes.

Starting point is 00:05:18 Yeah. Okay, so that's the agent. So now what I'm going to show you is we then are able to take those action predictions and we're able to label any video on the internet using those actions.

Starting point is 00:05:37 And so this is just frames in, actions out. Yellow is the model prediction, or sorry, yellow is ground truth, purple is the model prediction.

Starting point is 00:05:50 And then bottom left is compound error over the entire sequence. And then this is reset per prediction. Reset meaning, you have you known to reset? Yeah, so this just means it resets to baseline. And so this basically, a single error

Starting point is 00:06:04 in the entire sequence compounds here, but it doesn't compound here. That makes sense. So, and again, this is just seeing frames, right? It's not seeing any of the actions. And so what we did, right, is we trained it on less realistic games, then we transferred it over to a more realistic game. And then, and this is where it gets really exciting, we transferred it over to a real-world video,

Starting point is 00:06:29 which means that you can use any video on the internet as pre-training. What was it for the big thing? It's predicting it as if you were controlling it using keyboard and mouse. So as if you were basically playing this sequence as the human. Is there some sense of error? So that's why you transfer it to more realistic games first. And then you transfer to real-world video, because you can't get a sense of ground truth from the real-world video yet.
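The two error readouts described in the demo, one that compounds over the whole sequence and one that resets per prediction, can be illustrated with a toy calculation. This is a sketch under assumptions: predictions and ground truth are modeled as integer per-step mouse deltas, and the function names are hypothetical. The compounding version accumulates both sequences before comparing, so one bad prediction shifts every later step; the reset version compares each step on its own.

```python
from itertools import accumulate
from typing import List, Sequence


def per_step_error(pred: Sequence[int], truth: Sequence[int]) -> List[int]:
    """Error that resets every prediction: each step is compared in isolation."""
    return [abs(p - t) for p, t in zip(pred, truth)]


def compounded_error(pred: Sequence[int], truth: Sequence[int]) -> List[int]:
    """Error over the whole sequence: compare running totals, so early mistakes persist."""
    cum_pred = accumulate(pred)
    cum_truth = accumulate(truth)
    return [abs(p - t) for p, t in zip(cum_pred, cum_truth)]


# Hypothetical horizontal mouse deltas; the model is wrong only at step 1.
truth = [1, 1, 1, 1]
pred = [1, 2, 1, 1]

reset_view = per_step_error(pred, truth)       # [0, 1, 0, 0]
compound_view = compounded_error(pred, truth)  # [0, 1, 1, 1]
```

The single mistake shows up once in the reset view but persists for the rest of the sequence in the compounded view.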

Starting point is 00:06:55 Let's see. And then, so let's show you here. This one is also, this is the same agent that I just showed you. This is playing against other AIs. This one's playing against bots, yeah. The previous one was against players. But with the sniper, it doesn't really matter that much, as you'll see.

Starting point is 00:07:19 So one thing that's really interesting is you notice that it behaves differently as it has, like, different items, right? That makes sense. Yeah. Intuitive. Yeah. I think there's also a question about egocentricity

Starting point is 00:07:34 versus, like, third person. Yeah. Does it matter? The third person, I think, will be very, very helpful if you're, for instance, trying to control multiple objects in an environment later on. Right now, I think having fully in perception, first person is quite helpful. This one's also, this is the policy itself. What do you mean?

Starting point is 00:07:54 This is the policy. The agent. Yeah, saying for the strains that I just told you about. Yeah. Like this, where it hides, that to me was just incredible, like just from being able to predict. But the appearance also high when you see it. Exactly.

Starting point is 00:08:12 Yeah, yeah. And it needs a spatial intuition to go, well, this is hiding, and that's not hiding. Exactly. And right while it was reloading, yeah. Okay, so that's the policy. And this is a completely general recipe, meaning we can scale this to any environment.

Starting point is 00:08:33 Is this work closest? Okay, now, let's keep going on demos until. I was going to go ahead to research. Yeah, yeah, that sounds good. Okay, so what I'm about to show you are the world models. There are a few really, really interesting parts about our world models. So the first is we actually made the decision to transfer,

Starting point is 00:08:55 sorry, we made the decision to pre-train world models from scratch, but also we've actually been able to fine-tune open source video models to get a better sense of physical transfer. And so one of the things that you'll notice here is that our world models have mouse sensitivity, which is something that gamers absolutely want, right? So you can have these very rapid movements, which you couldn't do in any other world model. And so this is a holdout set. So this clip was never seen before at training time.

Starting point is 00:09:24 As you can see, it has a spatial memory. This is about a 20-second-ish generation. Here's what's fascinating. This is an explosion that occurs, and you can see that in the physical world, right, the camera would shake, and in the game that would never happen. So you see the world model inherits the physical-world camera shake, but the actual game never does that,

Starting point is 00:09:49 which to us was quite fascinating, right? Also, the models that I just showed you that we use to transfer over from video: the two of those combined will allow us to push way beyond games in terms of training. This is another interesting one. So this is a world model. This is rapid camera motion. So again, we're literally just taking one second from here as the context, plus the actions, and replaying it here. Right.

Starting point is 00:10:12 And so what we're seeing is that the skill you see in the clips, the speed and the movement, also pays off at training time when you're doing world models. This is my favorite example. So this shows that the world model is capable of performing with partial observability. So what you're going to see is, again, you're replaying the actions from here and here, just using one second of video context. Everything after that is completely generated. So what you're going to see is the model is going to encounter, in this case, smoke. Normally, world models break down.

Starting point is 00:10:50 What you actually see is that it comes out at the same place. And so it's capable, even with partial observability, of still maintaining its position in the world. And then here it is also interesting. So this is typing. So this gives you like a reaction time. Like the fact that it can do depths and like sequences in completely different views, right? So this is a completely different view than if you were to be outside of that view, right? And so it's able to maintain consistency.
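The replay setup described here, a short real context followed by fully generated frames driven by recorded actions, is in generic terms an action-conditioned autoregressive rollout. The sketch below assumes a toy `step` function and integer "frames" standing in for a learned world model and images; none of these names come from GI.

```python
from typing import Callable, List, Sequence

Frame = int   # toy stand-in for an image: a 1-D position
Action = int  # toy stand-in for an input: a signed movement


def rollout(
    step: Callable[[Sequence[Frame], Action], Frame],
    context: Sequence[Frame],
    actions: Sequence[Action],
) -> List[Frame]:
    """Generate one frame per action, feeding each generated frame back in as history."""
    if not context:
        raise ValueError("at least one context frame is required")
    history = list(context)
    for action in actions:
        history.append(step(history, action))
    return history[len(context):]  # only the generated part


def toy_step(history: Sequence[Frame], action: Action) -> Frame:
    """Hypothetical 'world model': the next position is the last position plus the action."""
    return history[-1] + action


# Real context frames, then replayed actions; everything returned is generated.
generated = rollout(toy_step, context=[0, 1, 2], actions=[1, 1, -1])
```

Because every generated frame becomes input for the next step, any drift accumulates, which is why holding position through smoke or a zoom is a meaningful test.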

Starting point is 00:11:22 While zooming in. Yeah, exactly. And so, yeah, so you can see. So even while this goes out of scope, right, watch, and then it comes back and you'll see it's still there. Yeah. And so, this is the work that Anthony has been working on. I'm just wondering how much game footage you have to watch

Starting point is 00:11:47 in order to find these things. We can ask Anthony. I'm sure he's not going to be too excited to play these games afterwards. You're not playing, right? You're just watching. Yeah, yeah, yeah. Great. Okay, so those were world models. These are interesting.

Starting point is 00:12:05 So we also were able to distill into really, really tiny models. So this is, for instance, a long sequence on a very, very tiny one. You can see it makes like a bit more stupid mistakes. Like it does things that are not as optimal. But at the beginning it was running into a wall for free. Exactly. I mean, I do that too. Yeah.

Starting point is 00:12:31 Yeah. I mean, it's doing pretty well. Yeah. And again, all these models are running completely in real time. There's no. I was thinking your main model does real time anyway. What's the goal of distilling? Is it cost or?

Starting point is 00:12:45 Yeah, parameters. Yeah. Yeah. This is the industry one of peaks of corner. That's what we mean by the spatial-temporal reasoning aspect. Humans actually sort of simulate the optical dynamics of their eyes and how to actually especially reason of all the data, right?

Starting point is 00:13:00 You've seen all this. Yep, exactly. And so, this is kind of interesting, even in the real world, with, for instance, YouTube data, right? You have to first solve for pose estimation.

Starting point is 00:13:12 Then once you have pose estimation, maybe you do something like inverse dynamics, right? Where you basically are able to somehow label some of the actions that you're seeing. And then you still have to account for optical dynamics, like where are your eyes actually looking before the decision? So there's just three levels of information loss, whereas when you're playing video games, you're actually simulating the optical dynamics with your hand.

Starting point is 00:13:30 Right? And I think that's why games are a better representation of spatial-temporal reasoning initially than YouTube videos, for instance. Okay. We're in the GI offices with CEO Pim de Witte. Welcome. Thank you. Thanks for having us in your office.

Starting point is 00:13:49 Yeah, it's weird. If I'm in New York and you're one of the hottest raises of the year, I have to come and visit, and thanks for taking some time on the weekend. Yeah. Yeah. So you've raised a $133 million seed. So General Intuition,

Starting point is 00:14:03 most people won't have heard of you. I guess GI is new, but more gamers would have heard of Medal. And before that, you ran a private RuneScape server. Yes. The largest depth, we see it somewhere. What's your reflection on just that journey of like, now you're

Starting point is 00:14:19 in a half under? Yeah. And he started off like Roosstead. Yeah, I think I grew up with Tourette's. I spent most of my time as a teenager coding and playing video games. So in that sense, it doesn't feel that much different. But I think, so I started the largest private server of RuneScape, worked at Doctors Without Borders for three years, first on Ebola, and then on satellite-based map generation for disaster response, which was already very AI-adjacent. I built some models back

Starting point is 00:14:51 then, and then started Medal, which became one of the largest social networks in video games. I've always been kind of AI-adjacent. I'm a self-taught engineer, so for me the modeling itself always felt a little foreign. I actually had to take a ton of classes over the summer and early this year to get better at it, because it still felt like

Starting point is 00:15:13 I was really, really good at the infrastructure side, and I had written our transcoders for Medal myself, so I was very, very familiar with CUDA and the GPU side and all the video infrastructure that we were using for this stuff, but the modeling side itself was still quite foreign. Luckily, obviously, I have really, really good co-founders, and they essentially put a bunch of coursework together for me to go complete, to get really, really good at understanding the fundamentals better. I think for me, I had seen inside of the labs that had really, really good leadership

Starting point is 00:15:44 with fundamentals at the top and also the ones that didn't, and I think the ones that did were just much better. And so for me, yeah, I wanted to be more like that. So in that sense, it was first very foreign, and now I feel pretty comfortable with everything. But yeah, I think there's a lot to be explored starting in video games. And I think the interesting thing about reverse engineering is it kind of teaches you to look at problems very differently. It's like the ultimate form of deductive reasoning in a way. And so this is just how I think, how I operate. And so for me, it's been a really, really interesting journey.

Starting point is 00:16:20 I don't claim to have any of the credentials or skills that some of the other guests you have had on do, but hopefully it will make for a good time. Yeah, well, your co-founders definitely bring a lot of that credibility, and you bring a lot of the, I guess, gaining it certies? Mostly with truth trees. We'll see what I bring to the table.

Starting point is 00:16:40 Just a little bit of history of Medal. Let's establish Medal for those of us who don't know. The Liddy, Twix, the viewer. You have more active users, concurrent users, than Twitch? Something like that. Yeah, on the creator side, I think.

Starting point is 00:16:55 And the reason is because Medal is a lot more like Instagram than it is like Twitch. So the way to think about Medal is it's a native video recorder, unlike something like Twitch, where you actually have to use other software to record and stream to Twitch. It's not streaming software. It's actually video recording software. And a lot of gamers love to put things like overlays on top of their videos. And as a result of that, we have sort of the largest data set of ground truth, action-labeled video footage

Starting point is 00:17:20 on the internet by maybe one or two orders of magnitude. Yeah. What was an example of an overlay? The only over there I usually can know this implicase can. Yeah, yeah. Also, controller overlays, for instance, if you're playing, like, let's say you're playing console. Yeah, like flight simulator, you get like, you know, the joystick and all the things.

Starting point is 00:17:39 So you get the actual actions that people take inside the games as well as the frames of the games themselves, which is a loop, right? Because essentially you perceive, then you act, then there's a state update, and then you perceive again, you act, state update, which is precisely what you use in order to train these agents. Yeah, it's almost perfect training data. You were showing me in the demo

Starting point is 00:18:01 and we show some B-roll here on how you don't log keys. It's very important that you log actions. When did you figure this out? Maybe starting a year and a half ago. Yeah, and we realized that figuring out the side of the research for us was we very much never wanted to be in a position where we eroded privacy or something like that. So we never wanted to

Starting point is 00:18:23 actually log like a W or an A or an S or a D, which for researchers, the fact that we don't do that often sounds strange. Like, why wouldn't you do that? But I think for us, the privacy we don't let me get the data. Yeah, I think, you know, a lot of the researchers didn't quite understand yet that you can actually get away with just doing the actions. And the reason is, at training time, having the actual keys is noise anyways. It's like if there is text on the screen and you would want to, in theory, make that part of the training, then reading text from a frame is really easy. And so for us, basically, you hit the input.

Starting point is 00:18:58 We convert it to the actual action. So we had thousands of humans label every single action you can take in every single video game over the past year and a half, which is an enormous amount of action labels. Yeah. So when you act, we get the actual action itself. That being said, at training time, you can, for the general set of that game, convert back into computer inputs if you want to, but you can never do it for any individual person. And so that for us, from a design perspective, was important. So we figured all that stuff out. Then we actually started pushing,

Starting point is 00:19:31 like we already had features as well with this. So for instance, gamers already love to be able to navigate their clips by things that happened. So we have an events capture system. And we also have the overlays, where you actually just want to overlay and render the actions on top of your clip. We developed this kind of in tandem with the feature set itself. And then obviously, when world models became a thing, it was very, very clear that all the data for this was precisely like that sequence.
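The logging design described above, storing the in-game action rather than the raw key, can be sketched as follows. Everything here is hypothetical: the game names, the binding tables, and the function names are made up for illustration and do not reflect Medal's implementation. The point of the sketch is that an action log can be mapped back to a game's general default inputs, but not to the keys a specific player actually pressed.

```python
from typing import Dict, Optional

# Hypothetical default bindings per game: raw input -> semantic action.
DEFAULT_BINDS: Dict[str, Dict[str, str]] = {
    "game_a": {"W": "move_forward", "SPACE": "jump"},
    "game_b": {"UP": "move_forward", "X": "jump"},
}


def to_action(
    game: str, raw_input: str, custom_binds: Optional[Dict[str, str]] = None
) -> Optional[str]:
    """Convert a raw input to its action label; only the label would be stored."""
    binds = {**DEFAULT_BINDS[game], **(custom_binds or {})}
    return binds.get(raw_input)  # None if the input is not mapped to an action


def to_default_input(game: str, action: str) -> Optional[str]:
    """Map an action back to the game's default input, not any one player's key."""
    for raw_input, bound_action in DEFAULT_BINDS[game].items():
        if bound_action == action:
            return raw_input
    return None


# A player who rebound jump to "E": the log records "jump", never "E".
logged = to_action("game_a", "E", custom_binds={"E": "jump"})
recovered = to_default_input("game_a", logged)
```

Here `recovered` is the game's default key for jumping, so the player's personal binding cannot be reconstructed from the log alone.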

Starting point is 00:19:54 Yeah, we were able to sort of be first to market, recruit the best researchers, and start a lab. Yeah, that's terrible. One more question on Medal before I renew for some of the DA. It's been 10 years. Yeah. What is the... I don't even know how you grow something like this.

Starting point is 00:20:09 You know, right? I'm just kind of curious. And I like the opportunity to ask you, what really worked? Yeah. that you became so huge. Because you're not the only one. Yeah, but I have a choice performance and everything.

Starting point is 00:20:22 A few things that really worked. I think the first was a lot of our competitors were focused on solving the social network and a recorder at the same time. And our bet was really that we could get so many people to record with us that we could bootstrap the network on top of that. And that worked. So, while everyone was sort of distracted trying to bootstrap a social network, we were just focused on building a really, really good capture tool.

Starting point is 00:20:43 And then we got tens of millions of people to use that, which we were then able to use to bootstrap a network on top of the share behaviors. We already had the profile behaviors and the share behaviors, obviously, but the actual content consumption piece and the sharing piece really only came after we hit critical mass. It was actually early days during COVID when the network really accelerated. Fortnite happened, which was really important. And I think also the fact that Discord existed made it quite a different time than when other networks of these types had launched. Because Discord essentially was like the connective tissue already between gamers that never really

Starting point is 00:21:15 existed before. And so I think that combination of things really, really made it. I think we also built a product that, for instance, with most video recorders, you have to remember to start and stop the recorder. So you have to go into the application, then hit start, then start your game, and then, you know, maybe you'll play games for three hours, and you'll close the game, then you have to close your video application. Then you have to process a multi-gigabyte file. Then you have to upload it somewhere. And so this was a pain for people. And so what we did is we just ran this kind of recorder. When you hit that button, it does a retroactive video record. So all the recording initially is in memory. And then when you hit that button, it exports only

Starting point is 00:21:52 that sequence to disk and syncs it to your phone. And so that became super popular. What was also interesting about it is that it means you're not sort of behaving or acting differently, because it's always there and you can just export whatever happens, which is also very, very helpful for training, obviously. Um, the thing, you went to first to be there. Yeah. The thing you were explaining just before this was similar to how Tesla does their bug reports. You're driving from the having Disengage autopilot. Yeah. They're like, well, tell us what happens.
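The retroactive recording idea described above, keeping recent footage in memory and exporting only the last stretch when the button is pressed, maps naturally onto a fixed-size ring buffer. A minimal sketch, assuming a hypothetical `RetroactiveRecorder` class with integers standing in for encoded frames; this is not Medal's code.

```python
from collections import deque
from typing import Deque, List


class RetroactiveRecorder:
    """Keep only the most recent `fps * seconds` frames in memory."""

    def __init__(self, fps: int, seconds: int) -> None:
        if fps < 1 or seconds < 1:
            raise ValueError("fps and seconds must be positive")
        # A deque with maxlen silently drops the oldest frame once it is full.
        self._buffer: Deque[int] = deque(maxlen=fps * seconds)

    def push(self, frame: int) -> None:
        """Called for every captured frame while the game is running."""
        self._buffer.append(frame)

    def clip(self) -> List[int]:
        """Called when the user hits the clip button: snapshot the buffered frames."""
        return list(self._buffer)


# Toy numbers: 2 frames per second, 3-second window, 10 frames played so far.
recorder = RetroactiveRecorder(fps=2, seconds=3)
for frame_id in range(10):
    recorder.push(frame_id)
saved = recorder.clip()  # the last 6 frames: [4, 5, 6, 7, 8, 9]
```

In a real recorder the snapshot would be written to disk and uploaded; the sketch stops at returning the frames.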

Starting point is 00:22:21 Exactly. Exactly. See, you're driving. Tesla doesn't want to train on the like 10 hours of you driving through a desert where nothing interesting happens. You have the clip button on the steering wheel. Something interesting happens. Either while FSD is engaged. And I'm not sure if you can use it without FSD as well.

Starting point is 00:22:36 But you hit the clip button and it basically uses that precise sequence to mark, which is then more helpful for training because it's more unique at training time. Yeah. Yeah, I mean, one thing, and we're going to get to this on the agent side, one thing that does pop up as well: a lot of life is boring. A lot of life is going from me, a lot of playing games is doing the boring stuff that is not capable of... Yeah. Somehow you see the generalized fight. Yeah. It makes you think, right? It makes you think. Yeah. It's also quite interesting, like I showed you two models, what happens when you increase the size of the context window and how behaviors actually are largely shaped by the size of the context window.

Starting point is 00:23:15 That to me was one of the most interesting parts about the research; it made me think about our own behaviors in a way. Let's also talk about forming the team. On your website, you're 12. I don't know if that's changed. Before the three co-founders. Yeah. And just let's talk about how this team comes to them because you may not

Starting point is 00:23:34 visit yourself. You don't have that at the end of network. while you mostly elements people. Yeah. I started reading all the research papers. By that time, I was already pretty deep into having a decent understanding of, not world models in particular, but LLMs and transformer-based models. And so there was Genie, there was SIMA.

Starting point is 00:23:56 Those two were really, really interesting. SIMA, in particular, was interesting because what they do is they basically take 10 games, and then they have a graphic in SIMA where you can see kind of the precise actions that are inside of those games that they mapped. And I believe they found something like 100 which are actually actions that also exist in the real world. And what they did then, I believe it was specifically for navigation, was a 9-1 holdout set. So they trained an agent on the nine games, and then they had it play the 10th game, the holdout game. But then they also trained a specialized agent just on the 10th game, and they compared

Starting point is 00:24:34 how well they did. And if I recall correctly, the nine-game agent did roughly as well playing the held-out 10th game, on navigation specifically, as the one-game agent did. And that was really interesting, because that's precisely the type of data that we had. Right. And so for us, the thinking was, okay, what if we did exactly what LLMs did? So LLMs were trained on predicting text tokens on words on the internet. What if we predict action tokens on essentially what is the equivalent of the Common Crawl dataset, but for interactivity? Vision and clip?
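The 9-1 holdout evaluation recalled here is, in generic terms, a leave-one-game-out split. The sketch below only builds the splits; the game names and the function name are placeholders, game names are assumed to be unique, and the details of the SIMA setup are as recalled in the conversation rather than checked against the paper.

```python
from typing import Iterator, List, Sequence, Tuple


def leave_one_game_out(games: Sequence[str]) -> Iterator[Tuple[List[str], str]]:
    """Yield (training games, held-out game) pairs, holding out each game once."""
    for held_out in games:
        train = [game for game in games if game != held_out]
        yield train, held_out


# Placeholder names for a 10-game suite.
games = [f"game_{i}" for i in range(10)]
splits = list(leave_one_game_out(games))

# For each split, a generalist would train on the nine `train` games and be
# compared on `held_out` against a specialist trained only on `held_out`.
first_train, first_held_out = splits[0]
```

The comparison of interest is then generalist versus specialist performance on the same held-out game.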

Starting point is 00:25:09 Yeah, actually, no. But correct, that's it. But I think, well, actually, I'm going to double back a little bit to you. Thanks. A question I had, which is, one of the reasons why I thought you would want to prefer keyboard and mouse over actions is that the action space is potentially unbounded. Right. You can jump, walk left, walk right, but then also look up, look left, the bench. It's unbounded.

Starting point is 00:25:31 So it's huge, isn't it? Yeah, I think, problem. Yeah, there's benefits to the action space being small to start with. So I think we're going to start with anything that you can control using a game controller. But yeah, long term, we want to actually predict maybe like action embeddings and have models sit inside a general action space to be able to transfer out to other inputs as well. Yeah. Okay, and then let's see going on the research time.

Starting point is 00:25:52 So, Genie, SIMA. Yeah. And then the co-founders. Yeah. So there was the DIAMOND paper. There was Genie, and then there was SIMA. The DIAMOND paper for me was really interesting because they had actually managed to get this world model called DIAMOND running on a consumer GPU,

Starting point is 00:26:11 I believe it was a 4090, at 10 FPS, and you could play it. And they did that on like 90 hours of data, like 95 hours. I think it was 87 hours and I think 8 in the holdout set or something like that. That was just incredible, right? That they had something playable on that little data.

Starting point is 00:26:26 So I actually cold emailed the entire group of students. And I told them, hey, I think we have this thing. And then it was pretty interesting. So like right when that happened, a lot of the labs also started understanding what we had. And so we started very aggressively, multiple labs tried to bring us in in various ways. And they were part of that. Like they basically were seeing that happen.

Starting point is 00:26:47 I think for them that also kind of solidified how real it was. And then when we chose to do our own thing, you know, initially we thought that we were going to have to just work on world models, right? So we thought, okay, the main metaphor of this data set is, like Genie, world models. What we didn't realize at the time is that we have so much of this data that we can essentially do these world models in parallel and take the equivalent of the LLM bet, mostly on imitation learning, and then use the world models after that to get into the RL stage, right? And so for us...

Starting point is 00:27:15 And eventually get rid of the world models. This is something that evening. I mean, ideally you get rid of the imitation, yeah, the imitation learning, but yeah, we essentially realized that we could get so far on just imitation learning. The way to look at it is, let's take the LLM analogy. We essentially have sort of the internet, or like Common Crawl, if you will, and every single lab is trying to simulate that, right, in order to get similar data in order to train their agents. And so for us, the reason why we stayed independent and just did our own thing was we think we can essentially leapfrog every single company that is forced to either be consumers of world models or build world models, and take this foundation model for spatial-temporal agents and be in a place where, you know, we have a lot of customers years before any of the labs even get there. And maybe the most similar comparison is what Anthropic did with code, right?

Starting point is 00:28:05 Anthropic just focused really, really hard on nailing the code use case. Their models are incredible for it. A lot of their customers need it for that. So we just want to become incredible at this spatial-temporal agent use case. And likely that starts in game simulation, and then, using world models, we can start expanding out to other areas. So would you show me a little bit of how we think does generalize our victims? But although games is come to come in prayer. Yeah.

Starting point is 00:28:30 I would specify it as game engines in particular. So even if you're, for instance, simulating human behavior in Omniverse because they're trying to create better training data for factory floors, you can use it. Yeah. Maybe Meta has a similar data set because of the Quest. I never really asked them,

Starting point is 00:28:47 and I never really looked into the Meta Quest specifically. So you need a few things. You can't just, like, there's lots of companies that have, like, maybe recorders, but you also need the public graph. Otherwise, you can't train on the data, right? You can't train on people's, like, private videos that they have saved somewhere, right?

Starting point is 00:29:02 And so I think you need the social network graph components because these videos need to be on the internet. To its rank? No, to train on them. Yeah, I mean, I think generally people don't want to train on. Because these things, they live on your device usually, right? And you can't train on anything that lives on your device. Like, you actually need to go and upload and do your thing, right?

Starting point is 00:29:22 For Meta specifically, I think also VR, the scale of VR is still pretty small. The amount of environments in VR that have consumption at scale is probably in the hundreds, whereas on PC it's probably in the tens of thousands, right? And so you get a lot less diversity. The three-dimensional input space of VR is pretty interesting. We see some of this too, obviously. And so, yeah, I do suspect, you know, Meta starts using these types of things, but it's unclear to me whether they can get to like a similar scale of data or diversity

Starting point is 00:29:55 on the environment as we can. Yeah, there's a lot of challenges there. Yeah. Okay. I want to take this in a few different things. But I guess let's fill up the papers. Maybe one more to mention is GAIA. Yeah, which, I actually interviewed the GAIA authors,

Starting point is 00:30:10 but that too seems like the particular insight that brought it overseas. Yeah. So Anthony, who led the research on GAIA-2, is also one of the engineers to join our team. So it's all the DIAMOND core contributors, and then Anthony. And we just had three more researchers join this week. It's been a good week.

Starting point is 00:30:29 And yes, I think a lot of the approaches in GAIA-2 were heavily inspired by DIAMOND. And then Ginn-Sah, who was one of the authors of DIAMOND, also already was at Wayve by the time that I emailed them. Anthony also realized what this was and realized that, you know, you could scale world models to a much larger, like, scale and decided just to make the leap as well. So I think everybody that sees the dataset makes a leap. Because it's, but it takes a while to wrap your head around it because it's like, oh, it's video games, right? Like, intuitively, it doesn't make sense. And then when you actually understand and you see, right, how we've been able to transfer it to physical world video and things like that, then it makes sense.

Starting point is 00:31:05 And then everybody tends to jump. They don't call the video games follow that are around around there. Yeah. If I lived in San Francisco, maybe I would. Just a quick note, because we actually cover all these papers in the latest-day student club. SIMA 2 did not seem to have as much impact as SIMA 1, and I don't really know why they did it allow more word. Genie 3 had a ton of impact, but I also felt like,

Starting point is 00:31:28 because you could play with the model or people, it just seems the extension of all those days. I guess any quick takes on SIMA 2, Genie 3, which were both this year. Yeah, I'll talk about SIMA 2. The steerability of SIMA 2 was to me the most impressive part, because lining up the action sequences and the text conditioning is quite hard to do, right?

Starting point is 00:31:52 And so that, and the fact that they were, like, it's also quite interesting that means that they can sort of use Gemini as part of the flywheel, right? Where you can sort of scale this orchestrator as like an independent, almost like a puppet master, if you will. And then, like in theory, Gemini could orchestrate many instances of SIMA, right? That to me is the most interesting part, and it's where I tend to agree with this, where, like, I think our models will initially be used as like, like, you'll have like an orchestrator

Starting point is 00:32:19 VLM of sorts that's kind of like managing instances and instructing them and I think for SIMA showing that you can do this was fascinating. Also the fact that they didn't just have text conditioning but they also were able to do like drawings and markings

Starting point is 00:32:35 of where to go. They really took an interesting end-to-end approach to me that I look forward to seeing a lot more of. But you're talking to them like you said it. Is that the one collaborative room? Yeah, I think the, yeah, we're very friendly with DeepMind. We like them a lot. I just saw the team not too long ago, and I think, you know, big fans of their work. The main line that I can take from Alex Heath's coverage of you, yeah, is you're the biggest bet that Vinod has personally made since OpenAI. Yeah. How did that conversation start? Okay, so Vinod's style, and maybe I'll get slapped on the fingers for revealing this or whatever, but, uh...

Starting point is 00:33:12 Forgive me if I'm a... Matt. He asks you to draw a 2030 picture of your company, and I think he just picks N plus five years, whatever. I don't know. I did the same for you. Yeah. He asks you to like walk that back from first principles all the way from today. And he expects you to do that flawlessly, where he can challenge any assumption, any part of the vision, and he asks questions, right? He has a very technical background. He also has a bunch of technical people in see. And he truly backs people that have these like very large visions on that vision and the ability to defend it alone. And that's what he did for us. And I think that's why he made that bet.

Starting point is 00:33:53 So I think also through this question, he gets to know a lot of things about how technical you are. He gets to know how well you think from first principles because if that vision is not connected to something real, it's very easy to suss it out by asking good questions. And then he just backs fully, I think. Like he really gets in your corner if it's the right fit. And yeah, they've been incredible partners. They've opened so many doors for us. I have the after question.

Starting point is 00:34:24 I think it's a very notable story. Obviously, a lot of work went into it. But it's also worth him and come out of the side. For sure. One of the things also wanted to, I think I kind of asked this question out of sequence, but one of the things that are exciting about telling you is there are a lot of people like you who are founders of business and businesses that along the way have a ton of data. And yours happens to be highly valuable.

Starting point is 00:34:51 Before deciding to do an independent journey, you also talked to other companies about potential licensing or acquisition and something back. What is your learning from those periods? It's like, one version of this is very simply, how do you value data? Yeah. I don't think you can value it unless you actually model it yourself and see what the capabilities are. That's my real outcome. You see model, but train them up.

Starting point is 00:35:18 Yeah. But that's obviously like not doable for everyone. And also I think my general advice would be, as model capabilities increase, models are also like, you know, these VLMs, they're very, very good at labeling as well, generally, right? What I was afraid of when I was having some of these conversations was, okay, like, you know, as the capabilities increase, you're just going to need less ground truth data and like you can do more model-based data generation or synthetic data generation. I would recommend if you're going to do large data deals, like just try to get like a large

Starting point is 00:35:52 chunk of equity in the company that you're doing it with if you can. Now, a lot of them won't do this, but I think that to me would or just go do the research, figure out what's actually possible. In our case, we were quite lucky in the sense that this is actually the foundation data. Right? And I think, right? Like, that's not true for every data set. I think, you know, we just

Starting point is 00:36:13 happened to hit a particular gold mine. But you also did, you read Kuwaiti, you did the action Ving one or five years ago. Yeah. So you, you did word. Yeah, that's the thing. Like, you have to be grounded, right? And I think a lot of the, and I think that's the hard part.

Starting point is 00:36:29 And I think a lot of that's interesting is you can also kind of look for if like scaling laws already exist on your data type, which like for video there were some, but for these like input action labeled sets, there really weren't any. The other question is like, does it go into LLMs? Does it go into world models? Does it go into like, what type of model is it going to be used for? And I think that's an important thing to know.

Starting point is 00:36:53 And so I just want to, you know, if you're having these conversations with labs about data, just like make sure that you actually understand like what it's going to be used for because that's a very, very good way for you to make the decision yourself about whether you want to pursue that. Now, a lot of them won't tell you that. And I think, you know, I think in that case, you generally just don't want to do it because, like, I think for our case, like, we really cared that, like, for instance, there weren't going to be competing products with what game developers built, right? Because we didn't want to, like, bite the hand that feeds us. And I think we are part of the games industry. So those questions, I think, are normal. And then we eventually

Starting point is 00:37:24 decided, you know, you just have the data. We're just going to go do it ourselves. And That's when the rest happened. And he assembled the team and maintained. I think about it and said that. I feel like that's, you've aligned a lot of stars in order to make GI happen. Yeah. That other data founders, they are at the beginning of restoring me. Yes.

Starting point is 00:37:42 Oh, I'm a data founder. Founders who happen to have data. But they had a main business, right? I don't know if you have another. There's two sides to this, right? It's really easy to be super naive about it. And like, I had a lot of people tell me initially, oh, it's not that valuable. You're just like making this up.

Starting point is 00:37:56 And so for me, like, doing the work and actually understanding it myself was a really, really big part of building that confidence to go start the company. But a lot of times it is true that like model capabilities increase so quickly that, like, certain data you just don't need anymore. And so I think it's really important to like get people to do the work such that you can make these types of distinctions. And so my recommendation would be go build models with your data, see if you can create any sort of capabilities that aren't clearly already there or on path to being there and then figure out where you go. Yeah. I did want to ask this earlier, but you give me an opportunity to. We usually do the learning, do coursework and all that.

Starting point is 00:38:36 And your co-founders gave you some homework. Yeah. Is this like some books? I mean, Coursera? No, this was François Fleuret. So he has The Little Book of Deep Learning. And then he also has a full course that he's published on his website. I went through the entire course over the summer.

Starting point is 00:38:54 I believe it's something like 30 or 40 lectures, with also take-home projects and things like that. And I would recommend anybody does this. It goes through, right, history of deep learning, like the topology. It takes you through the linear algebra, the calculus, eventually you end up with, like, chain rule. And by this time, you've done, like, all the more important concepts, it takes you through how do you create neural networks using these concepts that you've learned?

Starting point is 00:39:21 Wow, this is super first principles. This guy, and I've had the opportunity to spend some time with him as well. He is one of the most first principles people I've met in my entire life. I'm convinced, like I actually asked him why did you do this course? He said, oh, because I thought all the other courses weren't right. And because he's so first principles and he can only explain things from, like, everything you see and how he explains this thing, everything is from first principles, including like the history of deep learning itself was part of the course. And yes, so he goes through everything and then, and by the end of it, I think I now have like a pretty

Starting point is 00:39:56 good intuitive understanding of how everything works, but obviously still, right? Like, I like to describe it as, um, I'm like the guy who just got his driver's license. I can drive the car. And like my co-founders are like the F1 drivers that like have done this for years. They know where all the, um, uh, where all the gaps are. And so I enjoy getting to learn from them. The cool thing is also that world models is just like a very, very new space. And so, you know, I got to bring ideas to the table that like, you know, no one's thought of, and not because I'm great at this, it's just because it's such a new space that like people just haven't tried it yet.

Starting point is 00:40:28 So, let's get a hit on definition. Yeah. What are world models to you? You know, in a video model, you might predict the next likely sequence or the next most entertaining frame. What world models do is they actually have to understand the full range of possibilities and outcomes from the current state. And based on the action that you take, generate the next state, right?
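The definition just given (current state plus the action you take produces the next state) can be sketched in a few lines of Python. This is an illustrative toy only, not anything General Intuition has described: `ToyGridWorld`, `step`, and `rollout` are hypothetical names, and a hand-written grid rule stands in for what would really be a learned, frame-generating model.

```python
from dataclasses import dataclass
from typing import List, Tuple

State = Tuple[int, int]  # hypothetical toy state: (x, y) position on a grid

# Hypothetical action vocabulary mapped to grid offsets.
MOVES = {"left": (-1, 0), "right": (1, 0), "up": (0, -1), "down": (0, 1)}


@dataclass
class ToyGridWorld:
    """Stand-in for a world model: next state depends on state AND action."""
    width: int = 5
    height: int = 5

    def step(self, state: State, action: str) -> State:
        dx, dy = MOVES.get(action, (0, 0))  # unknown actions leave the state unchanged
        x = min(max(state[0] + dx, 0), self.width - 1)   # clamp to the grid
        y = min(max(state[1] + dy, 0), self.height - 1)
        return (x, y)


def rollout(model: ToyGridWorld, state: State, actions: List[str]) -> List[State]:
    """Generate a trajectory by feeding each new state back in with the next action."""
    states = [state]
    for action in actions:
        state = model.step(state, action)
        states.append(state)
    return states


if __name__ == "__main__":
    world = ToyGridWorld()
    print(rollout(world, (0, 0), ["right", "right", "down"]))
    print(rollout(world, (0, 0), ["down", "down", "right"]))
```

The same starting state yields different trajectories under different action sequences, which is the property that separates this interface from a video model that only continues a clip.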

Starting point is 00:40:54 So the next frame. And so it is a much more sort of complex problem than traditional video models. So to me, it is a world that is accurately generated based on the actions that you take as a result of what's already been generated. And just a fact check, that is it needs to understand physics. It needs to understand if I'm building a type of material, you need to know how it interacts with some type of material. Yeah, I think the interactions is the most important part. I think the reasons why world models are so fascinating, one of the things that I did when I was studying over the summer was I tried to actually build a super rudimentary PyTorch-based physics engine,

Starting point is 00:41:31 which I would not recommend, writing a physics engine in PyTorch, for obvious reasons, but I wanted to be able to, because it's differentiable, so you can generate the... Sure, for the... Yeah, exactly. You can... And then you can train. And so I wanted to... You know, I got so many people ask me about, you know, why aren't you just using an engine for simulating or generating this data. And I really wanted to understand from first principles why.
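For readers wondering what "differentiable" buys you in a physics engine, here is a minimal sketch under stated assumptions. It is not the speaker's PyTorch code: it uses plain Python with a tiny hand-rolled forward-mode autodiff class (`Dual`, a hypothetical name) in place of PyTorch autograd, and a one-dimensional point mass with linear drag as the "physics". The point is only that gradients flow through the simulation steps, so a parameter (here the launch velocity) can be trained by gradient descent.

```python
from dataclasses import dataclass


@dataclass
class Dual:
    """A value plus its derivative with respect to one chosen input."""
    val: float
    grad: float = 0.0

    def __add__(self, other):
        other = other if isinstance(other, Dual) else Dual(other)
        return Dual(self.val + other.val, self.grad + other.grad)

    __radd__ = __add__

    def __mul__(self, other):
        other = other if isinstance(other, Dual) else Dual(other)
        # Product rule: (uv)' = u v' + u' v
        return Dual(self.val * other.val,
                    self.val * other.grad + self.grad * other.val)

    __rmul__ = __mul__


def simulate(v0: Dual, steps: int = 20, dt: float = 0.1, drag: float = 0.5) -> Dual:
    """Euler-integrate a 1-D point mass with linear drag; return final position."""
    x = Dual(0.0)
    v = v0
    for _ in range(steps):
        x = x + v * dt              # position update
        v = v + v * (-drag * dt)    # velocity decays under drag
    return x


def fit_launch_velocity(target: float, lr: float = 0.1, iters: int = 100) -> float:
    """Gradient descent on (x_final - target)^2, differentiating through the sim."""
    v0 = 0.0
    for _ in range(iters):
        x_final = simulate(Dual(v0, 1.0))  # seed d(v0)/d(v0) = 1
        err = x_final + (-target)
        loss = err * err
        v0 -= lr * loss.grad
    return v0


if __name__ == "__main__":
    v0 = fit_launch_velocity(target=2.0)
    print(v0, simulate(Dual(v0)).val)
```

In a real setup the autodiff would come from a framework and the state would be far larger; the toy only shows the mechanism of training through a simulator.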

Starting point is 00:41:53 And I think the most important thing that I figured out was the compute complexity of simulation goes up really, really rapidly with three variables. First, the number of agents in an environment. Second, their DoF. So their individual... Degrees of freedom. Yeah. And then third, the information that each action reveals. So like, for instance, if you have a text action or a speech action, the environment can change so much based on whether you say, right, water or fight, that the outcomes are going to be completely different of how a human would behave in that type of situation. And so it goes up so quickly with those three variables that at some point you just hit a point where you just want to maximally bet on either video transfer or generation of these environments using world models. Because that type of stochasticity is just incredibly difficult. But it's already very, very present in a lot of the video pre-training that goes into these world models. Right. And so I think for us, it is more so about making a maximal bet on video transfer and interacting

Starting point is 00:42:53 with things that are difficult to simulate, and the steerability is also really interesting with text, than it is about betting against simulation or something like that. And so I think there's still a large market for traditional simulation engines. It's specifically in areas where video is really hard to get. Is this exactly what the big labs are also saying when you're talking to them? I honestly haven't talked to the big labs. Since we started working on them ourselves, I think people are more reserved with what they share with us.

Starting point is 00:43:21 Yeah. Of course. with him, make sense. That's funny about question. How would you contrast your version of world models with that they read, the Yombe. Yeah. So I don't know exactly what Yann LeCun is doing today.

Starting point is 00:43:33 My understanding is it's based on JEPA, like the LeJEPA approach, which is, so I'll start with Fei-Fei Li. I think what's really interesting about Fei-Fei Li's approach is that you in some way are able to reuse the, um, the splats, right, in game engines and in things that let you stay in a verifiable domain, um, which I think is a really interesting approach. However, my understanding is they're currently not interactive, which in my opinion is like the whole point of world models, right? It's environments. They're great environments. And I think from a business perspective, I think they picked a really important part of the tool chain.

Starting point is 00:44:05 But to me, that's not really a world model. But my guess is they'll get there, right? They'll start generating. Yeah, just have been reused it. Yeah, exactly, exactly. And I think, right, Fei-Fei is one of the like founders of the entire space. So I think it's going to be really interesting to me on what maybe that interactive piece looks like for me to really judge their approach. I think... We interviewed... Just before you moved to Yann, we interviewed her with Justin Johnson, her co-founder. He was more focused on the physics side of the...

Starting point is 00:44:40 Yeah. And the interactivity... I do think that basically, the splats, if you just add more dimensions on, I guess, the forces acting on them, then you get... interactivity out of the box. Because you are basically, these are virtual atoms that then have all the normal physics applied to them. Yeah, I'm excited to see what that looks like when they actually release it. It's really hard for me to comment on anything.

Starting point is 00:45:08 I really like the frame-based approach because all of our video or all of our training data is in this format. Yes. Yeah, so we actually asked them about this and they were like, yeah, it's possible, about literally choosing the SPAPR. Yeah. Yeah. And you can also go from splats to frames, right? I'm sure you can, like, it wouldn't be easy. Like you'd have to actually render out the environment. Sure, it's not going to be a simple problem. But like in theory, it has to be something that you can do if you really wanted to. So like it's almost like having a more sort of grounded three-dimensional

Starting point is 00:45:39 representation of the underlying world. Yeah. Right. So I think it's an interesting approach. It might be overkill, right? You're also dealing with like a much larger like degrees of freedom on the output space, right? So who knows how well it scales? I like the fact that, like, I think these video models also use things like auto encoders, right? You can actually have the world models predict, like, much smaller, maybe like a... Rism machine or size.

Starting point is 00:46:04 Yeah, exactly. And then you can use, like, diffusion upscaling or methods like this to actually enrich. And so I think that world models just allow a much more, or world models in my sense, for much more, like, controlled space that we know really well. I'm not suggesting their approach is wrong. I'm just, you know, like this is, I think, what we really like about it. Honestly, Yann's podcast that he did, I don't remember which one it was, but a long time ago where he basically proclaimed LLMs to be a dead end,

Starting point is 00:46:33 was one of the things that inspired me to do this. I think this is very consensus around world models. Basically, everyone who heard this is like stops with their LLMs and just goes through to world models. I would say that the main perspective, I asked this exact question to Noam Brown from OpenAI and he was like, well, be learning this at all moments, right? So it's basically that we didn't see

Starting point is 00:46:55 including the narcissistic. Or what are you on in? Yeah, yeah, I'm not one to proclaim LLMs a dead end, personally. I think they're actually quite useful, in particular as orchestrators. Like the way I think about it is, as humans, right, we had sort of a three-dimensional

Starting point is 00:47:10 world, then we invented text as, like, in a way, a compression method, right? So we invented text in order to communicate with each other in a common way, in a way that actually compresses all this information that we are perceiving in three-dimensional space into just like a single sequence. And I think that allowed science to emerge, right? It allowed so many literature, like so many parts of the world that we charge. So I think it's a critical part of the whole picture. I also agree that it's very, very clear

Starting point is 00:47:45 that they do build sort of the internal implicit world models inside LLMs. And so I think they'll be very helpful as things like orchestrators. The problem is when it comes to the generalization, I think, text as a generalization backbone. When most of the pre-training is text, right, or largely text sequences, then I think you want that backbone to be kind of more spatial in nature and then also just have text as one of the, as part of that.

Starting point is 00:48:13 And I think the actual argument of LLMs is also, for instance, the auto-regressive nature of the prediction itself. So the fact that it's running the entire output, right, through the transformer, in order to predict the next token, which doesn't, like, the environment in the real world is continuous, right? It's always changing. And LLMs kind of just forget about that, right? I think a lot of the argument is in my first, right? So I think the fact that text doesn't necessarily generalize well to a situation of moral context and then the auto-regressive nature of the prediction and using text for that, right?

Starting point is 00:48:50 So I think those are the two main arguments. I think text prediction is just one of the actions that is going to come out of these, you know, these policies and world models. I think speech and text generation will just be one of the actions that can be a part of that. I think that there will just be labs coming at this problem from both sides. And everyone ends up in roughly the same place, and the same place will be whatever people think is cool. Right?

Starting point is 00:49:20 Like, whatever the consumer is closest to AI. Yeah. And so I don't think there's like a clear answer. I think it's really interesting to come at it from the world modeling side, but it's also because we have to, right? Because like text has largely commoditized. We can import all the text. I think it's interesting intending.

Starting point is 00:49:38 Yeah. Lime detecting, it makes sense that you can probably recover. It's kind of like you're taking a step back. You're studying your branch of the ML Research sheet, but you might just end up recovering all the other tech stuff for merging. Yeah, yeah, we can import a lot of that research, right? A lot of that is on. That's really cool on the research side.

Starting point is 00:49:57 Let's talk about the stuff that GI is producing or like, I guess, the sort of research and product output. You mentioned the word customer. Who are your current customers? Yeah, so we're already working with some of the largest game developers in the world. Yeah. We're also working with game engines directly. And so really what we're doing at the moment is replacing essentially the player controller inside of a game engine.

Starting point is 00:50:20 So anything that you're currently, that maybe like behavior trees or things that you're deterministically coding, we hope to replace with a single API, which is just you stream us frames and we predict actions. And that can be inside an engine or it can be on a single API, or it can be eventually even inside the real world. Hopefully, those are then also steerable. So, well, say you saw word text steerable yet, but I think we want to get to a point where they're fully text-steerable. Well, text-steerable means like, well, I want you to build to share.
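The "you stream us frames and we predict actions" interface described above can be sketched as a small Python loop. Everything here is a hypothetical illustration rather than General Intuition's actual API: `Frame`, `Action`, `BrightnessPolicy`, and `act_on_stream` are made-up names, and a trivial brightness rule stands in for the learned policy that would replace a behavior tree.

```python
from dataclasses import dataclass
from typing import Iterable, Iterator, List

Frame = List[List[int]]  # hypothetical frame: rows of grayscale pixel values


@dataclass
class Action:
    """Hypothetical controller-style output for one frame."""
    move_x: float  # -1.0 = steer left, 0.0 = hold, 1.0 = steer right
    jump: bool


class BrightnessPolicy:
    """Stand-in for a learned policy: steer toward the brighter half of the frame."""

    def predict(self, frame: Frame) -> Action:
        mid = len(frame[0]) // 2
        left = sum(sum(row[:mid]) for row in frame)
        right = sum(sum(row[mid:]) for row in frame)
        if right > left:
            move = 1.0
        elif left > right:
            move = -1.0
        else:
            move = 0.0
        return Action(move_x=move, jump=False)


def act_on_stream(policy: BrightnessPolicy, frames: Iterable[Frame]) -> Iterator[Action]:
    """Consume frames one at a time and emit one action per frame."""
    for frame in frames:
        yield policy.predict(frame)


if __name__ == "__main__":
    stream = [[[0, 0, 9, 9]], [[9, 9, 0, 0]], [[1, 1, 1, 1]]]
    for action in act_on_stream(BrightnessPolicy(), stream):
        print(action)
```

The caller only ever sees frames going in and actions coming out, so the same loop shape could sit inside a game engine or behind a network endpoint; which of those a real deployment uses is not specified in the conversation.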

Starting point is 00:50:49 If you're there are anything else, I agree. Yeah, I think it's text conditioning on the generation. So, yeah, the ability to, you're right. We want to get to a point where you can generally, and that's why it's called General Intuition, where we can sort of mimic the intuition of all these gamers into human-like behaviors in any situation. As I mentioned, also,

Starting point is 00:51:09 the lab is named after the Demis Hassabis quote from AlphaFold, which is, wouldn't it be amazing if we could mimic the intuition of these gamers who are, by the way, only amateur biologists, on his path to, he tried to get an AI to train on Foldit, to generate a lot of data for AlphaFold. And so for us, really, the North Star, right,

Starting point is 00:51:27 what we hope to get to one day is being able to represent scientific problems in three-dimensional space and then have a spatial-temporal agent capable of perceiving that space and using, hopefully, also the text reasoning capabilities that LLMs have today, in addition to the spatial-temporal capabilities, to be able to work on the other side of that problem. So that for us is sort of the North Star. That's why we're sort of trying to be hyper-focused,

Starting point is 00:51:48 on spatial-temporal workloads, the same way that Anthropic was hyper-focused on code, and use that to then get into organizations and expand from there. Just as a side note, since you mentioned Anthropic, any idea what they did on this to solve what idea? No, out of any lab, I probably know Anthropic the least. Yeah.

Starting point is 00:52:07 I admire them, though. Yeah. Well, the current working theory is that they had a super lucky roll of the dice. But, well, and it compounds from there. That sounds like a nice story. I'm sure I saw that. Yeah. Okay. So, why do the game developers want this?

Starting point is 00:52:25 So if you're a game developer, how well you're actually retaining players is, like, if you have a game that's already at scale, decently dependent on how good your bots are. So if you're logging in at an obscure time, let's say 3 a.m. in America, and your player liquidity is low, then you need really, really good bots to keep those players engaged. Is this a thing? Yeah. For sure. For that, and whatever.

Starting point is 00:52:49 A lot of human worth it is. Yeah. And so if you're like, as a human, do I want to play with bots? Usually it's not just bots. It's like players mixed in with bots because you don't want to play just against bots. But it's better to have a full game than to have like an empty game. Yeah. And so I think it's only as it's part of the end.

Starting point is 00:53:05 environment, I think it's okay. That means you also have to sort of grade that skill level. Yeah, yeah, which we can do. Because we know exactly how good people are at these games. Yeah, I think for us, bots is kind of like step one, right? So what I was showing is we're building a general agent that can sort of play any game in real time. But really that extends into all of simulation, right? Like in GTA 5, for instance, people are generally role playing real life.

Starting point is 00:53:29 Building. Right? And so they're actually behaving in quite aligned ways with the goals they set for themselves. So you have all these examples represented in video games, right? You have Truck Simulator, PowerWash Simulator. PowerWash Simulator. Where, like, actually, the behaviors that you'd want an agent to be able to perceive, they're all there. Okay.

Starting point is 00:53:47 Yeah. It's really how seriously some gamers take Truck Simulator. If you haven't seen these clips, you should watch it. Yeah. They buy the whole truck driving set and they're doing the job of a truck driver. Yeah. What I mentioned to you, we have more people at any given time

Starting point is 00:54:05 on Medal playing with steering wheels in like Truck Simulator and these types of games, than Waymo has cars on the road. It's a ridiculous stat, but it's true. Yeah, yeah. I mean, so, you know, I used to think that while to self-soldriving, he kind of just, the interplayed on a GT5. Yeah, I mean, it's not bad to this. Yeah, our bet is not that we can zero shot any of these things.

Starting point is 00:54:25 It's just that, like, the next self-driving company can maybe collect 1% of the data. Because, right, also, for instance, clips already self-select into negative events and adversity, right? And so a lot of our data set, because it's already highlights, is really precisely what a lot of these companies spend, like, their last 20% doing. Right. And I think that's the main argument if you're another company that's looking at what we're doing. I think the thing that people won't understand is that anything that you're currently doing in pre-training, as long as your robot can be controlled using a game controller, we hope that we can move that to post-training for you. So our bet is not that we can create the next self-driving car company. It's just that the next self-driving car company, hopefully, only needs 1% of the data or maybe 10% of the data, I don't know, right, to be able to deliver a really good product.

Starting point is 00:55:10 Yeah, yeah. It's also, the term that comes to mind a lot is active learning. I don't know if you've used to identify with that. It got less cool for a bit, and it seems like the only uptrend, which obviously you have the best data set for the high intensity or you say negative. But I feel like you thought negatively. It could be a little bit. Yeah, for sure. I think negative events is just because it's the most common term that people use for, like,

Starting point is 00:55:35 if you're Tesla, you want the crashes, you want like... Right, right, right, right. But it's only gaming. It's both. Yeah, yeah. So, you know, the model that you saw, obviously, had really, really incredible moments, and that was largely... Yeah.

Starting point is 00:55:47 Yeah. That, um, uh, that it had a large representation of people at their best. Yes. Yeah. And worst. Yeah. Yeah. Amazing. Okay, cool. Uh, anything else on the customer development side that you want to sort of flesh out?

Starting point is 00:55:59 Yeah. Um, uh, we're also already working with robotics companies, but again, and manufacturing, but the key is that the robot has to have gaming inputs. So our bet is not that we can transfer over to, like, high-DoF robots and the keyboard and mouse. It's really just that we can move the hard work of pre-training, hopefully, to post-training. Yeah, it's kind of like the foundation model that is a very good basis to stuff. Yeah, you're going to give us frames and likely some text. OEO license the model, because they've been to want to post-training.

Starting point is 00:56:28 Yeah. Our business model is initially going to be an API, like the Anthropic API. But you also saw, for instance, some of the video labeling models that we've been able to develop. So the goal is for any company to be able to take in their video data as well. And we can create first, obviously, custom versions of the policy for you, the agent. If that doesn't work, then we're already working with a customer that is doing this: we distill a model and they turn that into a product for themselves. So people can engage with you on the agent level,

Starting point is 00:56:58 the API level. People can engage with you on the model level. Can you sell data? No. All right. Yeah. We don't sell data. Okay, cool.

Starting point is 00:57:06 So that's the business. And is there a world in which, I mean, I think you're, you know, a frontier lab for world models. Is there a world in which there is a more sort of application layer thing that comes out, like a ChatGPT for whatever? Yeah. You're going to see us launch a few things on Medal itself that

Starting point is 00:57:28 are going to blow your mind as a result of this agent. I'll leave it to the imagination for now. If people will integrate out, you know, land. Yeah, on the world modeling side, like, I think one thing people underestimate is that Medal is already one of the largest, you know, video consumption platforms as well. People watch millions and millions of videos a day. Whistle.

Starting point is 00:57:47 So, world-model-based entertainment and things like that, while it's not like a focus for us right now, I think, like on the consumer side, we have the ability to move very, very quickly here and get it integrated in a way that I don't think anyone else can. Yeah. You could theoretically do a video gen, like a Sora,

Starting point is 00:58:04 like, what is the disabam? What's the middle one? Meda, the middle one? It was not real. All the vibes? Yeah.

Starting point is 00:58:13 Could theoretically generate clips that nobody played. But you know what's the device. Yeah, I think for us, the games being so human-centric is like a really big part of what makes it special. Like, I actually just don't think that would work. Like, one thing that we are really

Starting point is 00:58:27 excited about though, I'll give you one sneak peek of what we're thinking about, is what if you could literally replay any of the clips that you have inside a world model, or your friends can play them. Like I showed you a model that already took part of your clip as context. It seems to replay, enter that walls. But it's also how we go from imitation learning to RL, right? Because it's part of our research roadmap anyways to make every single clip on Medal playable. So yeah, who is to say that that doesn't apply to just the actual clips that you take? Yeah, yeah. He's seen one with the RL potential? We describe Medal as the episodic memory of humanity in simulation.

Starting point is 00:59:00 So when you take a clip, really the way to think about it is you get the highlight of what is maybe three hours of playtime, right? You maybe get like two to three minutes of the things that were the most out of distribution, right? It is genuinely your episodic memory of that playtime in simulation, of things that you most want to remember and share. We want to be able to load, and this is the work that Anthony Hu is doing, the reason why we built world models, every crash that you run into in Euro Truck Simulator or American Truck Simulator or a driving game. We want to be able, right? And again, these are ground-truth labels. So we know precisely the actions that lead up to the negative events. They're also title-labeled when people upload them onto the platform. They say, oh, good, it's a crash. Right. And so we can select

Starting point is 00:59:42 all these events. And if we can put them inside a world model, we can go into, right? We can train reward models to then reward based on how you perform in clips that actually contain negative events, for example. And so, for us, it's very much about, right, we can create this, like, LLM moment, I think, with imitation learning, but actually making every single clip on the platform playable at billions-of-clips scale is how we go from imitation learning to RL. Cool. We covered a lot of it.

Starting point is 01:00:11 Is there anything else that you want to do before we wrap up with the long-term vision stuff? Yeah, yeah. I think for us, this is a very, very ambitious, long-term bet. We need the best researchers in the world that want to work on this stuff. It's really exciting not being extremely data constrained. We really get to, like, we get so many learnings every week that we didn't think were possible, and it makes it a joy working here. Also, the other thing is because we have such a large data moat,

Starting point is 01:00:38 we don't have to be as concerned as the LLM companies are about publishing because we don't mean the ones that have been able to. Exactly. No one can replicate the models, right? And so for us, we really want to bring back, like, the original culture of open research, which is why we did the partnership with Kyutai in France. I said didn't. Yeah, we just announced our partnership with Kyutai in France,

Starting point is 01:00:59 which is an open science lab in Paris, one of the best research labs in the world. Eric Schmidt, I believe, funded it in addition to some French people. They are essentially acting as the partner that's currently doing a lot of open research on the data. We also want to partner with universities because we do believe this is the frontier, but it's so data constrained that really no,

Starting point is 01:01:19 everyone has their hands tied behind their back right now. And so we want to help fix that. So, for instance, we want to work with universities to build, like, negative event prediction models for maybe like trucks in India on all the truck data where all these crashes occur. We have all these things that we know we can do that we just haven't had the time to do. And so if you're listening to this and you're maybe an academic institution or something and you want access to some of this data in an educational research fashion, I think we're quite open to doing that because we want to educate people. And, yeah, other than that, we just want to work with the best infrastructure and research engineers on the planet as we're going into scaling, you know, runs at thousands, tens of thousands, eventually hundreds of thousands of GPUs. Yeah. Yeah. Amazing. I primed you on this as, like, the closing

Starting point is 01:02:01 question of like, it's a little bit that we know cost that 30, I didn't know. Yeah. So what does GI become in by the race? Yeah. In 2030, we want to be the gold standard of intelligence. And any sequence long enough is fundamentally spatial and temporal, right? So by nailing spatial-temporal reasoning, you go after the root node problem of intelligence itself. What the world looks like is we want to have eight, so I sort of group the sequences of AI in three stages, and I credit Andrej Karpathy for teaching this: bits to bits, atoms to bits, and then atoms to atoms. In the atoms-to-atoms stage, I want, like, I want GI models to be responsible for 80% of all the atoms-to-atoms interactions driven by

Starting point is 01:02:44 AI models. And the reason for that is because we were able to unblock intelligence so quickly in robotics, like, intelligence is the bottleneck, that supply chains actually converged on gaming inputs as their primary input methods, and they converged on essentially simpler systems that let us do a lot more, a lot quicker. So we are essentially the 80% market approach, and then you have lots of companies

Starting point is 01:03:06 that have kind of specialized, maybe humanoid robot OS stacks that are the other 20. And then so I want to be responsible for 80% of all the atoms-to-atoms interactions driven by these models and be the gold standard for intelligence, and maybe 100x more in simulation, because I think simulation will actually be the larger market initially.

Starting point is 01:03:23 So I think in simulation, because you have very few constraints, also from a safety perspective, simulation is much easier. So I think a lot of the takeoff initially is in simulation, so a lot of the simulation use cases, like what I mentioned, scientific use cases, I'm really, really excited about. And so, yeah, 80% of atoms-to-atoms interactions coming downstream from these types of space of the World Foundation models, and then 100x more in simulation. Yeah. It reminds me a lot of

Starting point is 01:03:50 what Mark and Priscilla from the Chan Zuckerberg Initiative are doing with virtual biology, because you can do a lot of it in simulation and you can do it. Yeah, oh, you can do a lot faster with interest. Amazing. Thank you for everybody. That's your office. Yeah, and thank you for sharing a little bit while you're turning. Thank you.

Starting point is 01:04:06 Yeah.
