Latent Space: The AI Engineer Podcast - World Models & General Intuition: Khosla's largest bet since LLMs & OpenAI
Episode Date: December 6, 2025

From building Medal into a 12M-user game clipping platform with 3.8B highlight moments to turning down a reported $500M offer from OpenAI (https://www.theinformation.com/articles/openai-offered-pay-50...0-million-startup-videogame-data) and raising a $134M seed from Khosla (https://techcrunch.com/2025/10/16/general-intuition-lands-134m-seed-to-teach-agents-spatial-reasoning-using-video-game-clips/) to spin out General Intuition, Pim is betting that world models trained on peak human gameplay are the next frontier after LLMs.

We sat down with Pim to dig into why game highlights are “episodic memory for simulation” (and how Medal’s privacy-first action labels became a world-model goldmine https://medal.tv/blog/posts/enabling-state-of-the-art-security-and-protections-on-medals-new-apm-and-controller-overlay-features), what it takes to build fully vision-based agents that just see frames and output actions in real time, how General Intuition transfers from games to real-world video and then into robotics, why world models and LLMs are complementary rather than rivals, what founders with proprietary datasets should know before selling or licensing to labs, and his bet that spatial-temporal foundation models will power 80% of future atoms-to-atoms interactions in both simulation and the real world.

We discuss:
* How Medal’s 3.8B action-labeled highlight clips became a privacy-preserving goldmine for world models
* Building fully vision-based agents that only see frames and output actions yet play like (and sometimes better than) humans
* Transferring from arcade-style games to realistic games to real-world video using the same perception–action recipe
* Why world models need actions, memory, and partial observability (smoke, occlusion, camera shake) vs. “just” pretty video generation
* Distilling giant policies into tiny real-time models that still navigate, hide, and peek corners like real players
* Pim’s path from RuneScape private servers, Tourette’s, and reverse engineering to leading a frontier world-model lab
* How data-rich founders should think about valuing their datasets, negotiating with big labs, and deciding when to go independent
* GI’s first customers: replacing brittle behavior trees in games, engines, and controller-based robots with a “frames in, actions out” API
* Using Medal clips as “episodic memory of simulation” to move from imitation learning to RL via world models and negative events
* The 2030 vision: spatial–temporal foundation models that power the majority of atoms-to-atoms interactions in simulation and the real world

Pim
* X: https://x.com/PimDeWitte
* LinkedIn: https://www.linkedin.com/in/pimdw/

Where to find Latent Space
* X: https://x.com/latentspacepod

Timestamps
00:00:00 Introduction and Medal's Gaming Data Advantage
00:02:08 Exclusive Demo: Vision-Based Gaming Agents
00:06:17 Action Prediction and Real-World Video Transfer
00:08:41 World Models: Interactive Video Generation
00:13:42 From RuneScape to AI: Pim's Founder Journey
00:16:45 The Research Foundations: Diamond, Genie, and SIMA
00:33:03 Vinod Khosla's Largest Seed Bet Since OpenAI
00:35:04 Data Moats and Why GI Stayed Independent
00:38:42 Self-Teaching AI Fundamentals: The Francois Fleuret Course
00:40:28 Defining World Models vs Video Generation
00:41:52 Why Simulation Complexity Favors World Models
00:43:30 World Labs, Yann LeCun, and the Spatial Intelligence Race
00:50:08 Business Model: APIs, Agents, and Game Developer Partnerships
00:58:57 From Imitation Learning to RL: Making Clips Playable
01:00:15 Open Research, Academic Partnerships, and Hiring
01:02:09 2030 Vision: 80 Percent of Atoms-to-Atoms AI Interactions

This is a public episode. If you'd like to discuss this with other subscribers or get access to bonus episodes, visit www.latent.space/subscribe
Transcript
Hi, listeners. As you may know, I recently wrapped up the AIE Code Conference in New York,
and while I'm traveling, I do like to visit top AI startups in person to bring you interviews
that you don't find on any other podcast that just does a Zoom call. General Intuition, or GI for short,
is a spinout of a 10-year-old game clipping company called Medal, which has 12 million users. In
comparison, Twitch only has 7 million monthly active streamers. Medal collects this data by building
the best retroactive clipping software in the world. In other words, you don't need to be consciously
recording. You just have Medal on in the background while you're playing, and you hit a button to clip the last 30 seconds after something interesting happens. It's very similar to how Tesla does self-driving bug reporting, if you've ever done a self-driving bug report in a Tesla. The result is that Medal has accumulated 3.8 billion clips of the best moments in games, resulting in one of the most unique and diverse data sets of peak human behavior, actively mined for the interesting moments. They were also very prescient in
navigating privacy and data collection concerns by mapping actions to these visual inputs and game outcomes.
As you saw on our Fei-Fei Li and Justin Johnson episode with World Labs,
and with the recent departure of Yann LeCun from Meta, there's a lot of interest in world models as the next frontier after LLMs,
to improve on spatial intelligence and to work on embodied robotics use cases.
DeepMind has been working on this with Genie 1-2 and SIMA 1-2, and this year, OpenAI seemed to finally agree,
because they have been betting on LLMs a lot, and they made news by offering
$500 million for Medal's video game clip data. Our guest today, Pim, turned down that money and
chose to build an independent world model lab instead. Khosla Ventures led the $134 million
seed round, which is Vinod Khosla's largest single seed bet since OpenAI. We were able to get
an exclusive preview of GI's models, which unfortunately we cannot show you directly. But I can confirm
they were incredibly human-like, and we chose to include the first 11 minutes of the demo discussion
even though I couldn't show it to you.
It may be hard to follow,
but I tried to call out what was noteworthy for you to know
as your likely reaction if you were watching along with us.
Now, enjoy the world's first look at my first look at General Intuition.
So what I'm about to show you is a completely vision-based agent
that's just seeing pixels and predicting actions
the exact same way a human would.
And so, yeah, what I'll show you here is what this looks like four months ago.
So again, this is just an agent that's seeing, that's receiving frames,
and it's just predicting action.
So you can see it has like a decent sense
of being able to navigate around.
It tabs the scoreboard, just like gamers always tab the scoreboard.
So this is pure imitation learning.
I see.
So the LZE is slicing the knife.
Yeah, exactly.
So it's doing everything like humans would.
In this case, here was the first interesting part that we saw.
Like it gets stuck, and then, they have memory as well.
So you see it can get unstuck.
How long is the memory?
Four seconds.
Yeah, four seconds for the straight and color.
So this was four months ago.
This was maybe a few weeks after that.
So you can see there is like, it's still doing the scoreboard thing, but it's,
they're still, they're still, uh, quite like, and these are bots too.
So you can see it.
It's very human.
Let's just say that.
Yeah.
And then, um, right?
So this was really like the early days of research where you can see right.
There's one thing and then goes for another.
Um, and then we've been scaling right, um, on, on data and compute.
And also we've just been making the models better.
And this is where we are now.
So what you're seeing is, like I said, pure imitation learning.
This is just a base model.
There's no RL, no fine-tuning.
This model sees no game states.
It is purely capable of sequence.
It's purely predicting the actions from the frames.
That's it.
And this is playing against real humans, just like a human would play.
And it's also, it's running completely in real time.
So absolutely everything here plays exactly
like a human. Do you give it a goal?
No. It just figures out
it's like a goal because obviously it's trained on by saying yes.
And I picked
right, I picked the sequence where also it doesn't do well
initially so you can see like this is just
like a sequence, a random sequence.
But this is the, I mean it looks like
it's very well.
So, um. Oh, okay.
Yeah, watch.
Yeah, this is pretty good.
Maybe too good.
Um, this is my favorite part.
So you can see,
it does something that like, here, like,
human would never do this,
then gets unstuck, then
has four
realize this, which,
and then in the distance.
So you're saying, one, it makes a mistake
that a human will never make, but it unsticks
itself. And two, what we just saw
is it is doing superhuman
things. Yeah. Okay.
Yeah. I mean, there are things
that demon sit, obviously.
But because it is trained on
the highlights, the things that all the
exceptional things, it's inheriting those
yeah. So it's not like Move 37
where we RL our way into something.
Yeah, we're replicating it's superhuman.
Yeah, exactly. Or like, peak human.
The baseline of our data set is
peak human performance. Yes.
Yeah.
Okay, so that's the agent.
So now what I'm going to show you is
we then are able to
take those action predictions
and we're able to
label any video on the internet
using those actions.
So,
and so this is,
this is just frames in,
actions out.
Yellow is the model prediction,
or sorry,
yellow is ground truth,
purple is the model prediction.
And then bottom left is compound error
over the entire sequence.
And then this is reset per prediction.
Reset meaning,
you have you known to reset?
Yeah, so this just means it resets to baseline.
And so basically,
a single error in
the entire sequence compounds here, but it doesn't compound here.
That makes sense.
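One way to read the two error panels described above is sketched below in Python. This is an illustrative assumption, not General Intuition's actual metric: actions are treated as 1-D mouse deltas, the "reset" view scores each prediction on its own, and the "compound" view accumulates the deltas so that one early mistake keeps showing up later. The function names are hypothetical.

```python
from itertools import accumulate


def per_prediction_error(pred, truth):
    """Error that resets to baseline at every step (each prediction scored alone)."""
    return [abs(p - t) for p, t in zip(pred, truth)]


def compounded_error(pred, truth):
    """Error between the accumulated actions, so a single miss carries forward."""
    pred_pos = accumulate(pred)
    true_pos = accumulate(truth)
    return [abs(p - t) for p, t in zip(pred_pos, true_pos)]


# Toy sequence: the model is wrong exactly once, at the second step.
truth = [1, 1, 1, 1]
pred = [1, 2, 1, 1]

print(per_prediction_error(pred, truth))  # [0, 1, 0, 0]
print(compounded_error(pred, truth))      # [0, 1, 1, 1]
```

In the toy run the per-prediction error returns to zero right after the single miss, while the compounded error stays offset for the rest of the sequence.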
So, and again, this is just seeing frames, right?
It's not, it's not seeing any of the actions.
And so, you know, so what we did, right, is we trained it on less realistic games,
so we transferred it over to a more realistic game.
And then, and this is where it gets really exciting,
we transferred it over to a real world video,
which means that you can use any video on the internet as pre-training.
What was it for the big thing?
It's predicting it as if you were controlling it using keyboard and mouse.
So if you were basically playing this sequence as the human.
Is there some sense of error?
So that's why you transfer it to more realistic games first.
And then you transfer to real world video
because you can't get a sense of ground truth from the real world video yet.
Let's see.
And then, so we don't, so let's show you here.
This one is also, this is the same
agent that I just showed you.
This is playing against other AIs.
This one's playing against bots, yeah.
The previous one was against players.
But with the sniper, it doesn't really matter that much, as you'll see.
It's like,
so one thing that's really interesting is you notice that it behaves differently
as it has, like, different items, right?
That makes sense.
Yeah.
Inventory.
Yeah.
I think there's also a question about egocentricity
versus, like, the third person.
Yeah.
Does it matter?
The third person, I think, will be very, very helpful if you're, for instance,
trying to control multiple objects in an environment later on.
Right now, I think having fully in perception, first person is quite helpful.
This one's also, this is the policy itself.
What do you mean?
This is the policy.
The agent.
Yeah, saying for the strains that I just told you about.
Yeah.
Like this, where it hides, that to me was just incredible.
like just from knowing being able to predict.
But the appearance also high when you see it.
Exactly.
Yeah, yeah.
And it needs a spatial intuition to go, well, this is hiding.
And that's not hiding.
Exactly.
And right while it was reloading, yeah.
Okay, so that, so those are, that's a policy.
And this is a completely general recipe,
meaning we can scale this to any environment.
Is this work closest?
Okay, now, let's keep going on demos until.
I was going to go ahead to research.
Yeah, yeah, that sounds good.
Okay, so, and then this is, this is,
so what I'm about to show you are the world models.
There's a few really, really interesting parts about our world models.
So the first is we actually made the decision to transfer,
sorry, we made the decision to pre-train world models from scratch,
but also we've actually been able to fine-tune open source video models
to get a better sense of physical transfer.
And so one of the things that you'll notice here is like our world models have mouse sensitivity,
which is something that gamers absolutely want, right?
So you can have these very rapid movements, which you couldn't do in any other world model.
And so this is a holdout set.
So this clip was never seen before at training time.
As you can see, it has a spatial memory.
This is about a 20-second-ish generation.
Here's what's fascinating.
This is an explosion that occurs,
and you can see that in the physical world, right,
the camera would shake and in the game that would never happen.
So you see the world model inherits the physical world camera shake,
but the actual game never does that,
which is sort of that to us was quite fascinating, right?
Also did the models that I just showed you that we used to transfer over from video.
The two of those combined will allow us to push way beyond games in terms of training.
This is another interesting.
So this is a world model.
This is rapid camera motion.
So again, this is stuff that we're literally just taking one second from here in the context and the actions and replaying it here.
Right.
And so you'll, you never essentially have, like what we're saying is the skill that you see in the clips, that like the speed and the movement,
and that also pays off at training time when you're doing world models.
This is my favorite example.
So this shows that the world model is capable of performing with partial observability.
So what you're going to see is, again, you're replaying the actions from here and here, just using one second of video context.
Everything after that is completely generated.
So what you're going to see is the model is going to encounter, in this case, smoke.
Normally, world models break down.
What you actually see comes out at the same place.
And so it's capable of even with partial observability still maintaining its position in the world.
And then here it is also interesting.
So this is this is typing.
So this gives you like a reaction time.
Like the fact that it can do depths and like sequences in completely different views, right?
So this is a completely different view than if you were to be outside of that view, right?
And so it's able to maintain consistency.
While zooming in.
Yeah, exactly.
And so, yeah, so you can see.
So even while this goes out of scope, right, watch,
and then it comes back and you'll see it's still there.
Yeah.
And so, this is the work that Anthony has been working on.
I'm just wondering how much game footage you have to watch
in order to find these things.
We can ask Anthony.
I'm sure he's not going to be too excited to play these games afterwards.
You're not playing, right?
You're just watching.
Yeah, yeah, yeah.
Great. Okay, so those were world models.
These are interesting.
So we also were able to distill into really, really tiny models.
So this is, for instance, a long sequence on a very, very tiny one.
You can see it makes like a bit more stupid mistakes.
Like it does things that are not as optimal.
But at the beginning it was running into a wall for free.
Exactly.
I mean, I do that too.
Yeah.
Yeah.
I mean, it's doing pretty well.
Yeah.
And again, all these models are running completely in real time.
There's no.
I was thinking your main model does real time anyway.
What's the goal of distilling?
Is it cost or?
Yeah, parameters.
Yeah.
Yeah.
This is the industry one of peaks of corner.
That's what we mean by, like, the spatial-temporal reasoning aspect.
Humans actually, they sort of simulate the optical dynamics
of their eyes and how to actually
spatially reason over all the data, right?
You've seen all this.
Yep, exactly.
And so, like, even in, like, real,
this is kind of interesting,
even in, like, the real world,
um, with, uh,
for instance, YouTube data, right?
You have to first solve for pose estimation.
Then once you have pose estimation,
maybe you do something like inverse dynamics, right?
Where you basically are able to, like,
somehow label some of the actions that you're seeing.
And then you still have to account for optical dynamics of, like,
where are your eyes actually looking before the decision?
Because, like, there's just three levels of information loss,
whereas when you're playing video games, you're actually simulating the optical dynamics with your hand.
Right?
And I think that's why games are a better representation of spatial-temporal reasoning initially than,
um, uh, than YouTube videos, for instance.
Okay.
We're in the GI offices with CEO Pim de Witte.
Welcome.
Thank you.
Thanks for having us in your office.
Yeah, it's weird.
If I'm in New York and you're one of the hottest raises of the year, I have to come and visit, and
thanks for taking some time on the weekend.
Yeah.
Yeah.
So you've raised a
$133 million seed.
So General Intuition,
most people would be fair about you.
I guess GI is new, but more
gamers would have heard of Medal.
And before that, you ran
a RuneScape private server. Yes. The largest
depth, we see it somewhere.
What's your reflection on just that
journey of like, now you're
in a half under? Yeah. And he started
off like Roosstead. Yeah, I think
I grew up with Tourette's.
I spent most
of my time as a teenager coding and playing video games. So in that sense, it doesn't feel that
much different. But I think for, so I started the largest private server of RuneScape, worked at
Doctors Without Borders for three years, first on Ebola, and then on, like, satellite-based map generation
for disaster response, which was already, like, very AI-adjacent. I built some models back
then and then started Medal, which became one of the largest social networks in video games.
I've always been kind of, like, AI-adjacent.
I'm a self-taught engineer,
so for me
the modeling itself always felt a little foreign
I actually had to take tons of classes
over the summer and early this year to get better at it
because it still felt like,
I was really, really good at the infrastructure side
and I had written, like, our transcoders for Medal myself,
so I was very, very familiar with CUDA and, like, the GPU side
and all the video infrastructure that we were using
for this stuff, but the modeling side itself was still quite foreign. Luckily, obviously,
we have, I have really, really good co-founders, but they essentially put a bunch of course
work together for me to go complete, to get really, really good at understanding the fundamentals
better. I think for me, I had seen inside of the labs that had really, really good leadership
with fundamentals at top and also the ones that didn't, and I think the ones that did were just,
like, much better. And so for me, yeah, I wanted to be more like that. So in that sense, it was a bit,
it was first very foreign and then now I feel pretty comfortable with everything.
But yeah, like, I think for there's a lot to be explored starting in video games.
And also reverse engineering, like, I think the interesting thing about reverse engineering is it kind of teaches you to look at problems very differently.
It's like the ultimate form of deductive reasoning in a way.
And so this is just how I think, how I operate.
And so for me, it's been a really, really interesting journey.
I don't claim to have any of the credentials
or skills that some of the other guests you have
on do have, but hopefully it will make for a good time.
Yeah, well, your co-founders definitely bring
a lot of that different ability, and you bring a lot of
the, I guess, gaming expertise?
Mostly with truth trees.
We'll see what I bring to the table.
Just a little bit of history of Medal.
Let's establish Medal for those who don't know.
The Liddy,
Twix, the viewer.
You have more active
users, concurrent users, than Twitch?
Something like that.
Yeah, on the creator side, I think.
And the reason is because Medal is a lot more like Instagram than it is like Twitch.
So the way to think about Medal is it's a native video recorder, unlike something
like Twitch where you actually have to use other software to record and stream to Twitch.
It's not a streaming software.
It's actually a video recording software.
And a lot of gamers love to put things like overlays on top of their videos.
And as a result of that, we have sort of the largest data set of ground truth, action-labeled
video footage
on the internet by maybe one or two orders of magnitude.
Yeah.
What was an example of an overlay?
The only over there I usually can know this implicase can.
Yeah, yeah.
Also, controller overlays, for instance, if you're playing, like, let's say you're playing
console.
Yeah, like flight simulator, you get like, you know, the joystick and all the things.
So you get the actual actions that people take inside the games as well as the frames
of the games themselves, which is a loop, right?
Because it's essentially you perceive, then you act, then there's a state update,
and then you perceive again, you act state update,
which is roughly precisely what you use
in order to train these agents.
Yeah, it's almost perfect training data.
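The perceive, act, state-update loop described above can be sketched in a few lines of Python. This is a toy illustration only: `ToyEnv`, `policy`, and `record_episode` are hypothetical names, the "frame" is just a number standing in for pixels, and nothing here reflects Medal's or General Intuition's actual code.

```python
from dataclasses import dataclass


@dataclass
class ToyEnv:
    """Hypothetical 1-D world: the 'frame' is simply the player's position."""
    position: int = 0

    def observe(self) -> int:
        return self.position

    def step(self, action: str) -> None:
        self.position += {"left": -1, "right": 1, "stay": 0}[action]


def policy(frame: int, target: int = 3) -> str:
    """Stand-in for a human player: walk toward the target, then stay."""
    if frame < target:
        return "right"
    if frame > target:
        return "left"
    return "stay"


def record_episode(env: ToyEnv, steps: int) -> list:
    """Log (frame, action) pairs, the shape of data the loop produces."""
    trajectory = []
    for _ in range(steps):
        frame = env.observe()     # perceive
        action = policy(frame)    # act
        env.step(action)          # state update
        trajectory.append((frame, action))
    return trajectory


episode = record_episode(ToyEnv(), steps=5)
print(episode)
# [(0, 'right'), (1, 'right'), (2, 'right'), (3, 'stay'), (3, 'stay')]
```

Each logged pair is a frame together with the action taken on it, which is the supervision signal an imitation-learning agent needs.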
You were showing me in the demo
and we show some B-roll here on how you don't log keys.
It's very important to use a lot of action.
When did you figure this out?
Maybe starting a year and a half ago.
Yeah, and we realized that figuring out
the side of the research for us was
we very much never wanted to be in a
position where we eroded privacy or something like that. So we never wanted to
actually log, like, a W or A or S or a D, which for researchers, the fact that we don't do that,
like often it sounds strange. Like, why wouldn't you do that? But I think for us, the privacy
we don't let me get the data. Yeah, I think, you know, a lot of the, the researchers
they didn't quite understand yet that you can actually just get away with just doing the
actions. And the reason is like at training time, having the actual keys is noise anyways.
It's like if there is text on the screen and you would want to, in theory, make that part of the training.
Then like reading text from a frame is like really easy.
And so for us, basically, you hit the input,
we convert it to the actual action.
So we had thousands of humans label every single action you can take in every single video game over the past year and a half, which is an enormous amount of action labels.
Yeah.
So when you act, we get the actual action itself.
And then, that being said, at training time, you can,
for, like, the general set of that game, convert back into computer inputs if you want to,
but you can never do it for any individual person. And so that for us, from like a design
perspective, was important. So we figured all that stuff out. Then we actually started pushing,
like, we already had features as well with this. So for instance, gamers already love to be
able to navigate their clips by, like, things that happened. So we have an events capture system.
And we also have the overlays where you actually just want to overlay and render the actions on top of
your clip. We developed kind of in tandem
with the feature set itself.
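The input-to-action conversion described above might look roughly like the Python sketch below. Everything in it is an assumption made for illustration: the game names, the binding tables, and the helper functions are hypothetical and are not Medal's schema. The point is only that the stored label is the in-game action, so a player's personal key bindings and any unmapped keystrokes never reach the dataset.

```python
# Hypothetical default bindings per game; real games and bindings differ.
DEFAULT_BINDINGS = {
    "shooter_a": {"W": "move_forward", "SPACE": "jump"},
    "shooter_b": {"E": "move_forward", "SPACE": "jump"},
}


def to_action(game, key, user_bindings=None):
    """Map one raw key to an in-game action, or None if the key is unmapped."""
    bindings = {**DEFAULT_BINDINGS[game], **(user_bindings or {})}
    return bindings.get(key)


def label_inputs(game, keys, user_bindings=None):
    """Keep only action labels; raw keys (including typed text) are dropped."""
    actions = []
    for key in keys:
        action = to_action(game, key, user_bindings)
        if action is not None:
            actions.append(action)
    return actions


def to_default_keys(game, actions):
    """At training time, map actions back to the game's generic default inputs."""
    inverse = {action: key for key, action in DEFAULT_BINDINGS[game].items()}
    return [inverse[action] for action in actions]


# A player who rebound "move forward" to the UP arrow and also typed an "H".
labels = label_inputs("shooter_a", ["UP", "H", "SPACE"], {"UP": "move_forward"})
print(labels)                                # ['move_forward', 'jump']
print(to_default_keys("shooter_a", labels))  # ['W', 'SPACE']
```

Mapping back yields the game's generic defaults (`W`, `SPACE`), not the individual's `UP` key, which mirrors the point that inputs can be reconstructed for the game in general but not for any one person.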
And then obviously, when world models became a thing,
and it's very, very clear that all the data for this
was precisely like that sequence.
Yeah, we were able to sort of be first to market,
recruit the best researchers, and start a lab.
Yeah, that's terrible.
One more question on Medal before I renew for some of the DA.
It's in 10 years.
Yeah.
What is the...
I don't even know how you bro or something like this.
You know, right?
I'm just kind of curious.
And I like the opportunity to ask you,
what really worked?
Yeah.
that you became so huge.
Because you're not the only one.
Yeah, but I have a choice performance and everything.
A few things that really worked.
I think the first was a lot of our competitors were focused on solving the social network
and a recorder at the same time.
And that never, like, our bet was really that we could get so many people to record with us
that we could bootstrap the network on top of that.
And that worked.
So, while everyone was sort of distracted trying to bootstrap a social network,
we were just focused on building a really, really good capture tool.
And then we got tens of millions of people to use that,
which we were then able to use to bootstrap a network on top of the share behaviors.
We already had the profile behaviors and the share behaviors, obviously,
but the actual content consumption piece and the sharing piece really only came after we hit critical mass.
It was actually early days during COVID when, like, the network really accelerated.
Fortnite happened, which was really important.
And I think also the fact that Discord existed made it quite a different time than when other types of networks of these types had launched.
Because Discord essentially was like the connective tissue already between gamers that like never really
existed before. And so I think those combinations of things really, really made it. I think we also
built a product that, for instance, with most video recorders, you have to remember to start and stop
the recorder. So you have to go into the application, then hit start, then start your game,
and then, you know, maybe you'll play games for three hours, and you'll close the game,
then you have to close your video application. Then you have to process like a multi-gigabyte
file. Then you have to upload those somewhere. And so, like, this was a pain for people. And so what
we did is we just ran this kind of recorder. When you hit that button, it does a retroactive video
record. So all the recording initially is in memory. And then when you hit that button, it exports only
that sequence to disk and syncs it to your phone. And so that became super popular. Also,
what was interesting about it is it means that you're not sort of behaving or acting differently
because it's always there and you can just export whatever happens, which is also very, very
helpful for training, obviously. Um, the thing, you went to first to be there. Yeah.
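The retroactive recording idea described above, where recent frames live in memory and only the last stretch is exported on a button press, can be sketched with a fixed-size ring buffer. This is a minimal sketch under stated assumptions: `RetroactiveRecorder` is a hypothetical class, frames are plain integers rather than encoded video, and exporting just returns a list instead of writing to disk.

```python
from collections import deque


class RetroactiveRecorder:
    """Keeps only the most recent `fps * seconds` frames in memory."""

    def __init__(self, fps: int, seconds: int):
        # A deque with maxlen silently drops the oldest item once it is full.
        self._buffer = deque(maxlen=fps * seconds)

    def push(self, frame) -> None:
        """Called for every captured frame while the game is running."""
        self._buffer.append(frame)

    def clip(self) -> list:
        """Called on the hotkey: snapshot what is currently buffered."""
        return list(self._buffer)


recorder = RetroactiveRecorder(fps=2, seconds=3)  # holds the last 6 frames
for frame_id in range(10):
    recorder.push(frame_id)

print(recorder.clip())  # [4, 5, 6, 7, 8, 9]
```

Because the buffer is bounded, memory use stays constant no matter how long the session runs, and nothing is written out unless the player asks for a clip.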
The thing you were explaining just before this was similar to how Tesla does their bug reports.
You're driving from the having Disengage autopilot.
Yeah.
They're like, well, tell us what happens.
Exactly.
Exactly.
So, you're driving.
Tesla doesn't want to train on the like 10 hours of you driving through a desert where nothing interesting happens.
You have the clip button on the steering wheel.
Something interesting happens.
Either while FSD is engaged.
And I'm not sure if you can use it without FSD as well.
But you hit the clip button and it basically uses that precise sequence.
to mark, which is then more helpful for training because it's more unique at training time.
Yeah, yeah. I mean, so one thing that I, we're going to get to this on the eight inside, one thing that
I, that does pop up as well: a lot of life is boring, a lot of life is going from me, a lot of
playing games is doing the boring stuff that is not capable of. Yeah. Somehow you see the
generalized fight. Yeah, yeah, yeah. It makes you think, right? It makes you think. Yeah, yeah. It's also
quite interesting, like I showed you, two models, like what happens when you increase the size of the context
window and how behaviors actually are largely shaped by the size of the context window.
That to me was like one of the most interesting parts about the research made me think about
our own behaviors in a way.
Let's talk about also the forming of the team.
On your website, you're 12.
I don't know that's changed out.
Before the three co-founders.
Yeah.
And just let's talk about how this team comes to them because you may not
visit yourself.
You don't have that at the end of network.
while you mostly elements people.
Yeah.
I started reading all the research papers.
By that time, I was already pretty deep into, like, having a decent understanding of,
not world models in particular, but LLMs and transformer-based models.
And so there was Genie, there was SIMA.
Those two were really, really interesting.
SIMA, in particular, was interesting because what they do is they basically take 10 games,
and then they have a graphic in SIMA where you can
see kind of the precise actions that are inside of those games that they mapped. And I believe
they found something like 100, which are actually actions that also exist in the real world.
And what they did was they then, I believe it was specifically for navigation, they did a 9-1 holdout
set. So they trained an agent on the nine games, and then they had it play the 10th game,
the holdout game. But then they also trained a specialized agent just on the 10th game and they compared
how well they did. And if I recall correctly, it
did roughly as well playing the 10th game, on navigation specifically, with the holdout
nine-game agent, as it did with the one-game agent. And that was really interesting
because that's precisely the type of data that we had. Right. And so for us, the thinking was,
okay, what if we did exactly what LLMs did? What if we used, right, this, right? So LLMs were
trained on predicting, like, text tokens on words on the internet. What if we predict action tokens
on essentially what is the equivalent of the Common Crawl dataset, but for interactivity?
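The analogy to next-token prediction can be made concrete with a small Python sketch of how a recorded clip might be turned into training examples: a window of recent frames is the context and the action taken at the end of that window is the target. This is an assumption about the general shape of such data, not General Intuition's pipeline; `make_training_pairs` and the string "frames" are hypothetical stand-ins.

```python
def make_training_pairs(frames, actions, context):
    """Pair each window of `context` frames with the action taken on its last frame."""
    if len(frames) != len(actions):
        raise ValueError("expected one action label per frame")
    pairs = []
    for end in range(context, len(frames) + 1):
        window = frames[end - context:end]        # what the model gets to see
        pairs.append((window, actions[end - 1]))  # what it must predict
    return pairs


# Toy clip: four frames, each labeled with the action taken on that frame.
frames = ["f0", "f1", "f2", "f3"]
actions = ["forward", "forward", "jump", "left"]

for window, target in make_training_pairs(frames, actions, context=2):
    print(window, "->", target)
# ['f0', 'f1'] -> forward
# ['f1', 'f2'] -> jump
# ['f2', 'f3'] -> left
```

The `context` argument plays the role of the memory window mentioned in the demo: a longer window gives the model more history to condition on when predicting the next action.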
Vision and clip?
Yeah, actually, no.
But correct, that's it.
But I think, well, actually, I'm going to double back a little bit to you.
Thanks.
A question I had, which is, one of the reasons why I thought you would prefer keyboard and mouse over actions is that the action space is potentially unbounded.
Right.
You can jump, walk left, walk right, but then also look up, look left, the bench.
It's unbounded.
So it's huge, isn't it?
Yeah, I think, problem.
Yeah, there's benefits to the action space being small to start with.
So I think we're going to start with anything that you can control using a game controller.
But yeah, long term, we want to actually predict maybe like action embeddings
and have models sit inside a general action space to be able to transfer out to other inputs as well.
Yeah.
Okay, and then let's keep going on the research timeline.
So, Genie, SIMA.
Yeah.
And then your co-founders.
Yeah.
So there was the Diamond paper.
There was Genie and then there was SIMA.
The Diamond paper for me was really interesting because they had actually managed to get this world model
called Diamond running on a consumer GPU,
I believe it was a 4090 at 10 FPS,
and you could play it.
And they did that on like 90 hours of data,
like 95 hours.
I think it was 87 hours and I think 8 in the holdout set
or something like that.
That was just incredible, right?
That they had something playable on that little data.
So I actually cold emailed the entire group of students.
And I told them, hey, I think we have this thing.
And then it was pretty interesting.
So like right when that happened,
a lot of the labs also started understanding what we had.
And so we started very aggressively, multiple labs tried to bring us in in various ways.
And they were part of that.
Like they basically were seeing that happen.
I think for them that also kind of like solidified how real it was.
And then when we chose to do our own thing, you know, initially we thought that we
were going to have to just work on world models, right?
So we thought, okay, the main use of this dataset is, like Genie, world models.
What we didn't realize at the time is that we have so much of this data that we can
essentially do these world models in
parallel and take the equivalent of, like, the LLM bet, mostly on imitation learning, and then use
the world models after that to get into, like, the RL stage, right? And so for us...
And eventually get rid of the world models. This is something that evening. I mean, ideally
you get rid of the imitation, yeah, the imitation learning, but yeah, we essentially realized
that we could get so far on just imitation learning. The way to look at it is we essentially, like,
let's take the LLM analogy. We essentially have sort of the internet or like Common Crawl, if you will,
And every single lab is trying to simulate that, right, in order to get similar data in order to train their agents.
And so for us, the reason why we stayed independent and just did our own thing was that we think we can essentially leapfrog every single company that is forced to either be consumers of world models or build world models, and take this foundation model for spatial-temporal agents.
And be in a place where, you know, we have a lot of customers years before any of the labs even get there.
And maybe the most similar comparison is, like, what Anthropic did with code, right?
Anthropic just focused really, really hard on nailing the code use case.
Their models are incredible for it.
A lot of their customers need it for that.
So we just want to become incredible at this spatial-temporal agent use case.
And likely that starts in, like, game simulation, and then, using world models, we can start expanding out to other areas.
So you showed me a little bit of how you think this generalizes outside of games.
But games is kind of the primary one.
Yeah.
I would specify it as game engines in particular.
So even if you're, for instance,
simulating human behavior in Omniverse
because they're trying to create better training data
for factory floors, you can use it.
Yeah. Maybe Meta has a similar dataset
because of the Quest.
I never really asked them,
and I never really looked into the Meta Quest specifically.
So you need a few things.
You can't just, like, there's lots of companies
that have, like, maybe recorders,
but you also need the public graph.
Otherwise, you can't train on the data, right?
You can't train on people's, like, private videos
that they have saved somewhere, right?
And so I think you need the social network graph components
because these videos need to be on the internet.
To its rank?
No, to train on them.
Yeah, I mean, I think generally people don't want to train on.
Because these things, they live on your device usually, right?
And you can't train on anything that lives on your device.
Like, you actually need to go and upload and do your thing, right?
For Meta specifically, I think also the scale of VR is still pretty small.
The amount of environments in VR that have consumption at scale is probably in the hundreds,
whereas on PC it's probably in the tens of thousands, right?
And so you get a lot less diversity.
The three-dimensional input space of VR is pretty interesting.
We see some of this too, obviously.
And so, yeah, I do suspect, you know, Meta starts using these types of things,
but it's unclear to me whether they can get to like a similar scale of data or diversity
of environments as we can.
Yeah, there's a lot of challenges there.
Yeah.
Okay.
I want to take this in a few different directions.
But I guess let's finish up the papers.
Maybe one more to mention is GAIA.
Yeah, which, I actually interviewed a GAIA author,
but that too seems like the particular insight that brought it overseas.
Yeah.
So Anthony, who led the research on GAIA-2,
is also one of the engineers to join our team.
So it's all the core contributors for Diamond,
and then Anthony.
And we just had three more researchers joining this week.
It's been a good week.
And yes, I think a lot of the approaches in GAIA-2 were heavily inspired by Diamond.
And then Ginn-Sah, who was one of the authors of Diamond, also already was at Wayve by the time that I emailed them.
Anthony also realized what this was and realized that, you know, you could scale world models to a much larger scale and decided just to make the leap as well.
So I think everybody that sees the dataset makes a leap.
But it takes a while to wrap your head around it because it's like, oh, it's video games, right?
Like, intuitively, it doesn't make sense.
And then when you actually understand and you see, right, how we've been able to transfer it to physical world video and things like that.
Then it makes sense.
And then everybody tends to jump.
They don't call the video games follow that are around around there.
Yeah.
If I lived in San Francisco, maybe I would.
Just a quick note, because we actually cover all these papers in the Latent Space paper club.
SIMA 2 did not seem to have as much impact as SIMA 1,
and I don't really know why they did it allow more word.
Genie 3 had a ton of impact, but I also felt like,
because you could play with the model or people,
it just seems the extension of all those days.
I guess any quick takes on SIMA 2, Genie 3,
which were both this year.
Yeah, I'll talk about SIMA 2.
The steerability of SIMA 2 was to me the most impressive part,
because lining up the action sequences
and the text conditioning is quite hard to do, right?
And so that, and the fact that they were, like, it's also quite interesting that it
means that they can sort of use Gemini as part of the flywheel, right?
Where you can sort of scale this orchestrator as like an independent, almost like a puppet
master, if you will.
And then, like in theory, Gemini could orchestrate many instances of SIMA, right?
That to me is the most interesting part, and it's where I tend to agree with this, where, like,
I think our models will initially be used as, like,
you'll have like an orchestrator
VLM of sorts that's kind of
like managing instances and
instructing them, and I think
for SIMA, showing that you can do this
was fascinating. Also
the fact that you could, they didn't just
have text conditioning but they also were able
to do like drawings and markings
of where to go. They really took
an interesting end-to-end approach to me
that I look forward to seeing a lot more of.
But you're talking to them like you said it. Is that the one
collaborative room? Yeah, I think
the, yeah, we're very friendly with DeepMind. We like them a lot. I just saw the team not too long ago, and I think, you know, big fans of their work.
The headline that I can't shake from Alex Heath's coverage of you, yeah, is that you're the biggest bet that Vinod has personally made since OpenAI.
Yeah. How did that conversation start? Okay, so, Vinod's style, and maybe I'll get slapped on the fingers for revealing this or whatever, but, uh...
Forgive me if I'm a...
Matt. He asks you to draw a 2030 picture of your company, and I think he just picks
N plus five years, whatever. I don't know. I did the same for you. Yeah. He asks you to like
walk that back from first principles all the way to today. And he expects you to do that
flawlessly, where he can challenge any assumption, any part of the vision, and he asks
questions, right? He has a very technical background. He also has a bunch of technical people in
see. And he truly backs people that have these like very large visions, on that vision and the ability
to defend it alone. And that's what he did for us. And I think that's why he made that bet.
So I think also through this, through through this question, he gets to know a lot of things about
how technical you are. He gets to know how well you think from first principles because if that
vision is not connected to something real, it's very easy to suss it out by asking good questions.
And then he just backs fully, I think.
Like he really gets in your corner if it's the right fit.
And yeah, they've been incredible partners.
They've opened so many doors for us.
I have the after question.
I think it's a very notable story.
Obviously, a lot of work went into it.
But it's also worth him and come out of the side.
For sure.
One of the things I also wanted to, I think I kind of asked this question
out of sequence, but one of the things that is exciting about talking to you is there are a lot of
people like you who are founders of businesses that along the way have a ton of data.
And yours happens to be highly valuable.
Before deciding to pursue an independent journey, you also talked to other companies
about potential licensing or acquisition and so on.
What is your learning from that period?
It's like, one version of this is very simply, how do you value data?
Yeah.
I don't think you can value it unless you actually model it yourself and see what the capabilities are.
That's my real outcome.
You see model, but train them up.
Yeah.
But that's obviously like not doable for everyone.
And also I think my general advice would be, as model capabilities increase, you know, these VLMs,
they're very, very good at labeling as well,
generally, right? What I was afraid of when I was having some of these conversations was, okay,
like, you know, as the capabilities increase, you're just going to need less ground-truth
data, and like you can do more model-based data generation or synthetic data generation.
I would recommend if you're going to do large data deals, like just try to get like a large
chunk of equity in the company that you're doing it with if you can. Now, a lot of them won't
do this, but I think that to me would be the way, or just go do the research, figure out what's actually
possible. In our case, we were quite lucky
in the sense that this is actually the
foundation data.
Right? And I think, right? Like, that's
not true for every
data set. I think, you know, we just
happened to hit a particular gold mine.
But you also did, you read
Kuwaiti, you did the action
Ving one or five years ago. Yeah.
So you, you did the work. Yeah, that's the thing.
Like, you have to be grounded, right?
And I think a lot of the,
and I think that's the hard part.
And I think a lot of that's interesting
is you can also kind of look for whether, like, scaling laws already exist on your data type,
which, like, for video there were some, but for these, like, input-action-labeled sets,
there really weren't any.
The other question is like, does it go into LLMs?
Does it go into world models?
Like, what type of model is it going to be used for?
And I think that's an important thing to know.
And so I just want to, you know, if you're having these conversations with labs about data,
just like make sure that you actually understand like what it's going to be used for
because that's a very, very good way for you to make the decision yourself about whether you
want to pursue that. Now, a lot of them won't tell you that. And I think, you know, I think in that
case, you generally just don't want to do it because, like, I think for our case, like, we really
cared that, like, for instance, there weren't going to be products competing with what game developers
built, right? Because we didn't want to, like, bite the hand that feeds us. And I think we are
part of the games industry. So those questions, I think, are normal. And then we eventually
decided, you know, we just have the data. We're just going to go do it ourselves. And
That's when the rest happened.
And he assembled the team and maintained.
I think about it and said that.
I feel like that's, you've aligned a lot of stars in order to make GI happen.
Yeah.
That other data founders, they are at the beginning of restoring me.
Yes.
Oh, I'm a data founder.
Founders who happen to have data.
But they had a main business, right?
I don't know if you have another.
There's two sides to this, right?
It's really easy to be super naive about it.
And like, I had a lot of people tell me initially, oh, it's not that valuable.
You're just like making this up.
And so for me, like, doing the work and actually understanding it myself was a really, really big part of building that confidence to go start the company.
But a lot of times it is true that like model capabilities increase so quickly that, like, certain data you just don't need anymore.
And so I think it's really important to, like, get people to do the work such that you can make these types of distinctions.
And so my recommendation would be go build models with your data, see if you can create any sort of capabilities that aren't
clearly already there or on path to being there, and then figure out where you go.
Yeah.
I did want to ask this earlier, but you give me an opportunity to.
We usually do the learning, do coursework and all that.
And your co-founders gave you some homework.
Yeah.
Is this like some books?
I mean, Coursera?
No, this was François Fleuret.
So he has The Little Book of Deep Learning.
And then he also has a full course that he's published on his website.
I went through the entire course over the summer.
I believe it's something like 30 or 40
lectures, with also take-home projects and things like that.
And I would recommend anybody does this.
It goes through, right, history of deep learning, like the topology.
It takes you through the linear algebra, the calculus, and eventually you end up with, like, the chain
rule.
And by this time, you've done, like, all the most important concepts, and it takes you through
how you create neural networks using these concepts that you've learned.
Wow, this is super first principles.
This guy, and I've had the opportunity to
spend some time with him as well, he is one of the most first-principles people I've met in my entire
life. I'm convinced. Like, I actually asked him, why did you do this course? He said, oh, because I thought all the
other courses weren't right. And because he's so first principles, he can only explain things
from, like, everything you see in how he explains things, everything is from first principles,
including, like, the history of deep learning itself was part of the course. And yes,
so he goes through everything, and by the end of it, I think I now have like a pretty
good intuitive understanding of how everything works, but obviously still, right?
Like, I like to describe it as, um, I'm like the, the guy who just got his driver's license.
I can drive the car. And like my co-founders are like the F1 drivers that like have done this for
years. They know where all the, um, uh, where all the gaps are. And so I enjoy getting to learn
from them. The cool thing is also that world models is just like a very, very new space.
And so, you know, I got to bring ideas to the table that, like, you know, no one's thought
of, and not because I'm great at this, it's just because it's such a new space that, like, people
just haven't tried it yet.
So, let's get a handle on the definition.
Yeah.
What are world models to you?
You know, in a video model, you might predict the next likely sequence or
the next most entertaining frame.
What world models do is they actually have to understand the full range of possibilities
and outcomes from the current state,
and based on the action that you take, generate the next state, right?
So the next frame.
And so it is a much more sort of complex problem than traditional video models.
So to me, it is a world that is accurately generated based on the actions that you take as a result of what's already been generated.
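The distinction drawn here, between predicting the next likely frame and generating the next state based on the action taken, can be shown with a deliberately tiny counting model. Everything below is an illustrative assumption: the corridor states, the `LOG` of transitions, and the class names are made up, and real world models are learned neural networks over frames rather than lookup tables.

```python
from collections import Counter, defaultdict


class ActionFreeModel:
    """Toy 'video model': estimates the next state from the current state only."""

    def __init__(self):
        self.counts = defaultdict(Counter)

    def observe(self, state, action, next_state):
        self.counts[state][next_state] += 1  # the action is ignored on purpose

    def predict(self, state):
        seen = self.counts.get(state)
        return seen.most_common(1)[0][0] if seen else None


class ActionConditionedModel:
    """Toy 'world model': estimates the next state from (state, action)."""

    def __init__(self):
        self.counts = defaultdict(Counter)

    def observe(self, state, action, next_state):
        self.counts[(state, action)][next_state] += 1

    def predict(self, state, action):
        seen = self.counts.get((state, action))
        return seen.most_common(1)[0][0] if seen else None


# Hypothetical logged play in a 1-D corridor: players at position 2
# usually step right, occasionally step left.
LOG = [(2, "right", 3), (2, "right", 3), (2, "right", 3), (2, "left", 1)]

video_model = ActionFreeModel()
world_model = ActionConditionedModel()
for state, action, next_state in LOG:
    video_model.observe(state, action, next_state)
    world_model.observe(state, action, next_state)
```

On this log the action-free model can only ever return the most common continuation from position 2, while the action-conditioned model returns a different next state for "left" and for "right", which is the interactivity the definition above is asking for.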
And just a fact check, that is it needs to understand physics.
It needs to understand, if I'm building with a type of material, how it interacts with some other type of material.
Yeah, I think the interactions is the most important part.
I think the reason why world models are so fascinating... one of the things that I did when I was studying over
the summer was I tried to actually build a super rudimentary PyTorch-based physics engine,
and I would not recommend writing a physics engine in PyTorch for obvious reasons,
but I wanted to be able to, because it's differentiable, so you can generate the...
Sure, for the...
Yeah, exactly. You can... And then you can train.
And so I wanted to... You know, I got so many people ask me about, you know,
why aren't you just using...
an engine for simulating or generating this data?
And I really wanted to understand from first principles why.
And I think the most important thing that I figured out was the compute complexity of simulation goes up really, really rapidly with three variables. First, the number of agents in an environment. Second, their DoF. So their individual...
Degrees of freedom. Yeah. And then third, the information that each action reveals. So like, for instance, if you have a text action or a speech action, the environment can change so much based on whether you say, right, water or fight,
that the outcomes are going to be completely different in how a human would behave in that type of situation.
And so it goes up so quickly with those three variables that at some point you just hit a point where you just want to maximally bet on either video transfer or generation of these environments using world models.
Because that type of stochasticity is just incredibly difficult.
But it's already very, very present in a lot of the video pre-training that goes into these world models.
Right.
And so I think for us, it is more so about making a maximal bet on video transfer and interacting
with things that are difficult to simulate
(and the steerability with text is also really interesting)
than it is about betting against simulation or something like that.
And so I think there's still a large market for traditional simulation engines,
specifically in areas where video is really hard to get.
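One way to get a feel for why simulation cost climbs so quickly with the number of agents, their degrees of freedom, and the information each action reveals is a back-of-the-envelope count of joint outcomes per step. The formula below is a simplifying assumption made for illustration (independent agents, discretized DoF, a fixed number of distinguishable responses per action); it is not a formula given in the conversation, and the parameter names are hypothetical.

```python
def joint_branching(agents: int, dof: int, choices_per_dof: int,
                    outcomes_per_action: int = 1) -> int:
    """Count distinct joint outcomes for one simulation step.

    Assumes each agent independently picks one of choices_per_dof ** dof
    actions, and each action can trigger one of outcomes_per_action
    distinguishable environment responses (a crude stand-in for how much
    information an action such as speech or text reveals).
    """
    per_agent = (choices_per_dof ** dof) * outcomes_per_action
    return per_agent ** agents


def rollout_count(horizon: int, **step_kwargs) -> int:
    """Count distinct trajectories of length `horizon` under the same assumptions."""
    return joint_branching(**step_kwargs) ** horizon
```

Under these assumptions, going from one agent to two squares the per-step count, and lengthening the horizon raises it to a further power, so exhaustively simulating rich multi-agent scenes becomes impractical long before a learned model trained on video of such scenes does.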
Is this exactly what the big labs are also saying when you're talking to them?
I honestly haven't talked to the big labs.
Since we started working on them ourselves, I think people are more reserved with what they share with us.
Yeah.
Of course.
with him, make sense.
That's funny about question.
How would you contrast your version of world models with Fei-Fei's
and Yann's?
Yeah.
So I don't know exactly what Yann LeCun is doing today.
My understanding is it's based on JEPA, like the LeJEPA approach, which is,
so I'll start with Fei-Fei Li.
I think what's really interesting about Fei-Fei Li's approach is that you in some way
are able to reuse the, um, the splats, right, in game engines and in things that let you
stay in a verifiable domain, um, which I think is a really interesting
approach. However, my understanding is they're currently not interactive, which in my opinion is like
the whole point of world models, right? It's environments. They're great environments. And I think
from a business perspective, I think they picked a really important part of the tool chain.
But to me, that's not really a world model. But my guess is they'll get there, right? They'll start
generating. Yeah, just have been reused it. Yeah, exactly, exactly. And I think, right,
Fei-Fei is one of the, like, founders of the entire space. So I think it's going to be
really interesting to see what maybe that interactive piece looks like for me to really
judge their approach. I think...
We interviewed...
Just before you move to Yann, we interviewed her with Justin Johnson, her co-founder.
He was more focused on the physics side of the...
Yeah.
And the interactivity...
I do think that basically, the splats, if you just add more dimensions on, I guess, the forces
acting on them, then you get...
interactivity out of the box.
Because basically these are virtual atoms that then have all the normal physics applied to them.
Yeah, I'm excited to see what that looks like when they actually release it.
It's really hard for me to comment on anything.
I really like the frame-based approach because all of our video or all of our training data is in this format.
Yes.
Yeah, so we actually asked them about this and they were like, yeah, it's possible,
about literally choosing the SPAPR. Yeah. Yeah. And you can also go from
splat to frames, right? I'm sure you can, right? Like, it wouldn't be easy.
Like you'd have to actually render out the environment. Sure, it's not going to be
a simple problem. But like in theory, it has to be something that you can do if you really
wanted to. So like it's almost like having a more sort of grounded three-dimensional
representation of the underlying world. Yeah. Right. So I think it's an interesting approach.
It might be overkill, right? You're also dealing with, like, many more degrees of freedom
on the output space, right?
So who knows how well it scales?
I like the fact that, like, I think these video models also use things like autoencoders,
right?
You can actually have the world models predict, like, much smaller, maybe like a...
Rism machine or size.
Yeah, exactly.
And then you can use, like, diffusion upscaling or methods like this to actually enrich.
And so I think that world models, or world models in my sense, just allow
for a much more, like, controlled space that we know really well.
I'm not suggesting their approach is wrong.
I'm just, you know, like this is, I think, what we really like about it.
Honestly, Yann's podcast that he did, I don't remember which one it was,
but a long time ago, where he basically proclaimed LLMs to be a dead end,
was one of the things that inspired me to do this.
I think this is very consensus around world models.
Basically, everyone who heard this, like, stops with LLMs and just goes to world models.
I would say that the main perspective,
I asked this exact question to Noam Brown
from OpenAI, and he was like, well,
be learning this at all moments, right?
So it's basically that we didn't see
including the narcissistic.
Or what are you on in?
Yeah, yeah, I'm not one to proclaim LLMs a dead end,
personally.
I think they're actually quite useful,
particularly as orchestrators.
Like the way I think about it is,
as humans, right, we had sort of a three-dimensional
world, then we invented text as like,
in a way, a compression method, right?
So we invented text
in order to communicate with each
other in a common way, in a way that actually compresses all this information that we are
perceiving in three-dimensional space into just like a single sequence. And I think that allowed
science to emerge, right? It allowed literature, like so many parts of the world that we
cherish. So I think it's a critical part of the whole picture. I also agree that it's very, very clear
that they do build sort of the internal implicit world models inside LLMs.
And so I think they'll be very helpful as things like orchestrators.
The problem is when it comes to the generalization, I think,
text as a generalization backbone.
When most of the pre-training is text, right,
or largely text sequences... I think you want that backbone
to be kind of more spatial in nature and then also just have text as
part of that.
And I think the actual argument against LLMs is also, for instance, the autoregressive nature of the prediction itself.
So the fact that it's running the entire output, right, through the transformer
in order to predict the next token, whereas, like, the environment in the real world is continuous, right?
It's always, it's always changing.
And LLMs kind of just forget about that, right?
I think a lot of the argument is in my first, right?
So I think the fact that text doesn't necessarily generalize well to a situation of moral context
and then the autoregressive nature of the prediction and using text for that, right?
So I think those are the two main arguments.
I think text prediction is just one of the actions that is going to come out of these, you know,
these policies and world models.
I think speech and text generation will just be one of the actions that can be a part of that.
I think that there will just be labs coming at this problem from both sides.
And everyone ends up in roughly the same place,
and the same place will be whatever people think is cool.
Right?
Like, whatever the consumer is closest to AI.
Yeah.
And so I don't think there's like a clear answer.
I think it's really interesting to come at it from the world modeling side,
but it's also because we have to, right?
Because, like, text has largely been commoditized.
We can import all the texts.
I think it's interesting intending.
Yeah.
Lime detecting, it makes sense that you can probably recover.
It's kind of like you're taking a step back.
You're studying your branch of the ML research tree,
but you might just end up recovering all the other tech stuff for merging.
Yeah, yeah, we can import a lot of that research, right?
A lot of that is on.
That's really cool on the research side.
Let's talk about the stuff that GI is producing, or, like, I guess the sort of research
and product output.
You mentioned the word customer.
Who are your current customers?
Yeah, so we're already working with some of the largest game developers in the world.
Yeah.
We're also working with game engines directly.
And so really what we're doing at the moment is replacing essentially the player controller inside of a game engine.
So anything that you're currently doing with, maybe, like, behavior trees, or things that you're deterministically coding,
we hope to replace with a single API, which is just you stream us frames and we predict actions.
And that can be inside an engine, or it can be on a single API,
or it can be eventually even inside the real world.
Hopefully those are then also steerable.
So, what you saw wasn't text-steerable yet,
but I think we want to get to a point where they're fully text-steerable.
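The "you stream us frames and we predict actions" interface described above could be sketched as a small protocol that a game loop calls once per frame, in place of a hand-written behavior tree. This is only an illustration: `FramePolicy`, `StubVisionPolicy`, the `Frame` and `Action` types, and the brightness rule are hypothetical placeholders, not General Intuition's API or model.

```python
from typing import Dict, Iterable, Iterator, Protocol, Union

Frame = bytes  # placeholder for one frame of raw pixel data
Action = Dict[str, Union[float, bool]]  # e.g. {"move_x": 1.0, "jump": False}


class FramePolicy(Protocol):
    """Anything that maps the latest frame to a controller action."""

    def act(self, frame: Frame) -> Action: ...


class StubVisionPolicy:
    """Stand-in for a learned policy; a real one would run a model here.

    The rule below (steer right on bright frames) is arbitrary and exists
    only so the example produces deterministic output.
    """

    def act(self, frame: Frame) -> Action:
        brightness = sum(frame) / max(len(frame), 1)
        return {"move_x": 1.0 if brightness > 127 else -1.0, "jump": False}


def run(policy: FramePolicy, frames: Iterable[Frame]) -> Iterator[Action]:
    """Feed frames to the policy one at a time and yield its actions."""
    for frame in frames:
        yield policy.act(frame)
```

A caller would iterate `run(StubVisionPolicy(), frame_source)` and apply each yielded action to the player controller. Because the loop depends only on the `act` method, the same game-side code would work whether the policy runs inside the engine or behind a remote endpoint.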
Well, to see steerable muse like, well, I want you to build to share.
If you're there are anything else, I agree.
Yeah, I think it's text conditioning on the generation.
So, yeah, the ability to, you're right.
We want to get to a point where you can generally,
and that's why it's called general intuition,
where we can sort of mimic the intuition of all these gamers into human-like
behaviors in any situation.
As I mentioned, also, the
lab is named after the
Demis Hassabis quote from AlphaFold, which is,
wouldn't it be amazing if we could mimic the intuition
of these gamers who are, by the way, only amateur
biologists, on his path to...
he tried to get an AI to train on Foldit,
to generate a lot of data for AlphaFold.
And so for us, really, the North Star, right,
what we hope to get to one day is being able to represent
scientific problems in three-dimensional space and then
have a spatial-temporal agent capable of
perceiving that space and using,
hopefully, also the text reasoning capabilities that LLMs have today,
in addition to the spatial-temporal capabilities, to be able to work on the other side of that problem.
So that for us is sort of the North Star.
That's why we're sort of trying to be hyper-focused on
spatial-temporal workloads,
the same way that Anthropic was hyper-focused on code,
and use that to then get into organizations and expand from there.
Just as a side note, since you mentioned Anthropic,
any idea what they did on this?
No, out of any lab, I
probably know Anthropic the least.
Yeah.
I admire them, though.
Yeah.
Well, the current working theory is that they had a super lucky roll of the dice.
But, well, and it compounds from there.
That sounds like a nice story. I'm sure I saw that.
Yeah.
Okay.
So, why do the game developers want this?
So if you're a game developer, how well you're actually retaining players,
if you have a game that's already at scale, is,
like, decently dependent on how good your bots are.
So if you're logging in at an obscure time, let's say 3 a.m. in America and your player liquidity is low, then you need really, really good bots to keep those players engaged.
Is this a thing?
Yeah.
For sure.
For that, and whatever.
A lot of human worth it is.
Yeah.
And so if you're like, as a human, do I want to play with bots?
Usually it's not just bots.
It's like players mixed in with bots, because you don't want to play just against bots.
But it's better to have a full game than to have like an empty game.
Yeah.
And so I think as long as it's part of the
environment, I think it's okay.
That means you also have to sort of grade that skill level.
Yeah, yeah, which we can do.
Because we know exactly how good people are at these games.
Yeah, I think for us, bots is kind of like step one, right?
So what I was showing is we're building a general agent that can sort of play any game in real time.
But really that extends into all of simulation, right?
Like in GTA 5, for instance, people are generally role-playing real life.
Building.
Right?
And so they're actually behaving in quite aligned ways with the goals they set for themselves.
So you have all these examples represented in video games, right?
You have Truck Simulator, PowerWash Simulator.
PowerWash Simulator.
Where, like, actually, the behaviors that you'd want an agent to be able to perceive, they're all there.
Okay.
Yeah.
It's really something how seriously some gamers take Truck Simulator.
If you haven't seen these clips,
you should watch them.
Yeah.
They buy the whole truck driving set and they're doing the job of a truck driver.
Yeah.
Like I mentioned to you, we have more people at any given time
on Medal playing with steering wheels in, like, Truck Simulator and these types of games
than Waymo has cars on the road.
It's a ridiculous stat, but it's true.
Yeah, yeah.
I mean, so, you know, I used to think that, to solve self-driving,
you'd kind of just train on GTA 5.
Yeah, I mean, it's not bad to this.
Yeah, our bet is not that we can zero shot any of these things.
It's just that, like, the next self-driving company can maybe
collect 1% of the data.
Because, right, also, for instance, clips already self-select into negative events and adversity, right?
And so a lot of our dataset, because it's already highlights, is really precisely what a lot of these companies spend, like, their last 20% doing.
Right. And I think that's the main argument. If you're another company that's looking at what we're doing, I think the thing that people won't understand is that anything that you're currently doing in pre-training,
as long as your robot can be controlled using a game controller, we hope that we can move that to post-training for you.
So our bet is not that we can create the next self-driving car company.
It's just that the next self-driving car company, hopefully, only needs 1% of the data or maybe 10% of the data, I don't know, right, to be able to deliver a really good product.
Yeah, yeah.
It's also, the term that comes to mind a lot is active learning.
I don't know if you identify with that.
It got less cool for a bit, and it seems like it's on the uptrend, for which obviously you have the best dataset, for the high-intensity or, as you say, negative events.
But I feel like you thought negatively.
It could be a little bit.
Yeah, for sure.
I think negative events is just because it's the most common term that people use for, like,
if you're Tesla, you want the crashes, you want like...
Right, right, right, right.
But it's only gaming.
It's both.
Yeah, yeah.
So, you know, the model that you saw, obviously, had really, really incredible moments,
and that was largely...
Yeah.
Yeah.
That, um, uh, that it had a large representation of people at their best.
Yes.
Yeah.
And worst.
Yeah.
Yeah. Amazing. Okay, cool.
Uh, anything else on the customer development side that you want to sort of flesh out?
Yeah.
Um, we're also already working with robotics companies, and manufacturing, but again, the key is that the robot has to have gaming inputs.
So our bet is not that we can transfer over to, like, high-DoF robots from the keyboard and mouse.
It's really just that we can move the hard work of pre-training, hopefully, to post-training.
Yeah, it's kind of like the foundation model that is a very good basis to start from.
Yeah, you're going to give us frames and likely some text.
Do you license the model, because they're going to want to do post-training?
Yeah.
Our business model is initially going to be an API, like the Anthropic API.
But you also saw, for instance, some of the video labeling models that we've been able to develop.
So the goal is for any company to be able to take in their video data as well.
And we can create first, obviously, custom versions of the policy for you, the agent.
If that doesn't work, then we're already working with a customer that is doing this:
we distill a model and they turn that into a product for themselves.
So people can engage with you on the agent level,
on the API level.
People can engage with you on the model level.
Can you sell data?
No.
All right.
Yeah.
We don't sell data.
Okay, cool.
So that's the business.
And is there a world in which,
I mean, I think you're, you know,
a frontier lab for world models.
Is there a world in which there is a more sort of application-layer thing
that comes out, like a ChatGPT for whatever?
Yeah.
You're going to see us launch a few things on Medal itself that
are going to blow your mind as a result of this agent.
I'll leave it to the imagination for now.
If people will integrate out, you know, land.
Yeah, on the world modeling side,
like, I think one thing people underestimate is that Medal is already one of the largest,
you know, video consumption platforms as well.
People watch millions and millions of videos a day.
Whistle.
So, world-model-based entertainment and things like that,
while it's not like a focus for us right now,
I think, like on the consumer side,
we have the ability to move very, very quickly here and get it integrated in a way
that I don't think anyone else can.
Yeah.
You could theoretically do
a video gen, like a Sora,
like,
what is the disabam?
What's the Meta one?
Meta,
the Meta one?
It was not real.
Oh, Vibes?
Yeah.
Could theoretically generate clips
that nobody played.
But you know what's the device.
Yeah, I think for us,
the games being so human-centric
is like a really big part of what makes it special.
Like, I actually just don't think that would work.
Like, one thing that we are really
excited about though, I'll give you one sneak peek of what we're thinking about is
what if you could literally replay any of the clips that you have inside a world model
or your friends can play them. Like I showed you a model that already took part of your
clip as context. It seems to replay, enter that walls. But it's also how we go from
imitation learning to RL, right? Because it's part of our research roadmap anyways to make
every single clip on Medal playable. So yeah, who is to say that that doesn't apply to
just the actual clips that you take? Yeah, yeah. He's seen one with the RL potential?
We describe Medal as the episodic memory of humanity in simulation.
So when you take a clip, really the way to think about it is you get the highlight of what is maybe three hours of playtime, right?
You maybe get like two to three minutes of the things that were the most out of distribution, right?
It is genuinely your episodic memory of that playtime in simulation, of the things that you most want to remember and share.
We want to be able to load, and this is the work that Anthony Hu is doing, the reason why we built world models, every
crash that you run into in Euro Truck Simulator or American Truck Simulator or a driving game.
And again, these are ground-truth labels. So we know precisely the
actions that lead up to the negative events. They're also title-labeled when people
upload them onto the platform. They say, oh, good, it's a crash. Right. And so we can select
all these events. And if we can put them inside a world model, we can
train reward models to then reward based on how you perform in clips that actually
contain negative events, for example. And so,
for us, it's very much about, right, we can create this, like, LLM moment with,
I think, imitation learning, but actually making every single clip on the platform playable
at billions-of-clips scale is how we go from imitation learning to RL.
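[Editor's sketch, not from the episode: one way to picture the selection step described above, where title labels flag negative events and the ground-truth action labels give the inputs leading up to them. The keyword list, the Clip fields, and the action names are all hypothetical placeholders, not Medal's or General Intuition's actual schema.]

```python
from dataclasses import dataclass

# Hypothetical keywords; the episode only mentions titles that say "crash".
NEGATIVE_KEYWORDS = ("crash", "wreck", "fail")


@dataclass
class Clip:
    title: str          # user-written title, assumed to describe the event
    game: str
    actions: list[str]  # assumed: one ground-truth input label per step


def is_negative_event(clip: Clip) -> bool:
    """Flag a clip as a negative event if its title contains a keyword."""
    title = clip.title.lower()
    return any(keyword in title for keyword in NEGATIVE_KEYWORDS)


def lead_up_actions(clip: Clip, window: int) -> list[str]:
    """Return the last `window` actions, i.e. the inputs just before the event.

    Assumes window >= 1 and that the clip ends at the event.
    """
    return clip.actions[-window:]


def select_negative_examples(clips: list[Clip], window: int = 3):
    """Collect (game, lead-up actions) pairs for every negative-event clip."""
    return [
        (clip.game, lead_up_actions(clip, window))
        for clip in clips
        if is_negative_event(clip)
    ]


clips = [
    Clip("oh no, it's a crash", "Euro Truck Simulator",
         ["accelerate", "accelerate", "steer_left", "brake"]),
    Clip("clean parking job", "American Truck Simulator",
         ["brake", "steer_right"]),
]
examples = select_negative_examples(clips, window=3)
```

[In this toy setup, pairs like these would be the raw material for a reward model; how the actual reward models are trained is not described in the episode.]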
Cool.
We covered a lot of it.
Is there anything else that you want to cover before we wrap up with the long-term vision stuff?
Yeah, yeah.
I think for us, this is a very, very ambitious, long-term bet.
We need the best researchers in the world that want to work on this stuff.
It's really exciting not being extremely data constrained.
We really get to, like, we get so many learnings every week that we didn't think
were possible, and it makes it a joy working here.
Also, the other thing is, because we have such a large data moat,
we don't have to be as concerned as the LLM companies are about publishing because we don't
mean the ones that have been able to.
Exactly.
No one can replicate the models, right?
And so for us, we really want to bring back, like, the original culture of open research,
which is why we did the partnership with Kyutai in France.
I said didn't.
Yeah, we just announced our partnership with Kyutai in France,
which is an open science lab in Paris,
one of the best research labs in the world.
Eric Schmidt, I believe, funded it, in addition to some French people.
They are essentially acting as the partner
that's currently doing a lot of open research on the data.
We also want to partner with universities
because we do believe this is the frontier,
but it's so data-constrained that really,
everyone has their hands tied behind their back right now.
And so we want to help fix that.
So, for instance, we want to work with universities to build, like, negative event prediction models for maybe like trucks in India on all the truck data where all these crashes occur.
We have all these things that we know we can do that we just haven't had the time to do.
And so if you're listening to this and you're maybe an academic institution or something and you want access to some of this data in an educational research fashion, I think we're quite open to doing that because we want to educate people.
And, yeah, and other than that, we just want to work with the best infrastructure and research
engineers on the planet as we're going into scaling, you know, runs at thousands, tens of thousands,
eventually hundreds of thousands of GPUs. Yeah. Yeah. Amazing. I primed you with this as like the closing
question of like, it's a little bit that we know cost that 30, I didn't know. Yeah. So what does
GI become by 2030? Yeah. In 2030, we want to be the gold standard of intelligence.
And any sequence long enough is fundamentally spatial and temporal, right? Which I think is,
um, so by nailing spatial-temporal reasoning, you go after the root-node problem
of intelligence itself. What the world looks like is, so I sort of group
the sequences of AI in three stages, and I credit Andrej Karpathy for teaching this: bits to bits,
atoms to bits, and then atoms to atoms. In the atoms-to-atoms stage, I want, like,
I want GI models to be responsible for 80% of all the atoms-to-atoms interactions driven by
AI models. And the reason for that is because we were able to unblock intelligence so quickly
in robotics, where intelligence is the bottleneck,
that supply chains actually converged on gaming inputs
as their primary input methods,
and they converged on essentially simpler systems
that let us do a lot more, a lot quicker.
So we are essentially the 80% market approach,
and then you have lots of companies
that have kind of specialized,
maybe humanoid robot OS stacks that are the other 20.
And then so I want to be responsible for 80%
of all the atoms-to-atoms interactions
driven by these models
and be the gold standard for intelligence
and maybe 100x more in simulation,
because I think simulation will actually be the larger market initially.
So I think in simulation, because you have very few constraints, also from a safety perspective,
simulation is much easier.
So I think a lot of the takeoff initially is in simulation, so a lot of the simulation use cases,
like what I mentioned, scientific use cases, I'm really, really excited about.
And so, yeah, 80% of atoms-to-atoms interactions coming downstream from these types of
spatial-temporal foundation models, and then 100x more in simulation.
Yeah.
It reminds me a lot of
what Mark and Priscilla from the Chan Zuckerberg Initiative are doing with virtual biology,
because you can do a lot in simulation.
Yeah, oh, you can do a lot faster with interest.
Amazing.
Thank you for everybody.
That's your office.
Yeah, and thank you sharing a little bit while you're turning.
Thank you.
Yeah.
