a16z Podcast - Google DeepMind Lead Researchers on Genie 3 & the Future of World-Building
Episode Date: August 16, 2025

Genie 3 can generate fully interactive, persistent worlds from just text, in real time. In this episode, Google DeepMind's Jack Parker-Holder (Research Scientist) and Shlomi Fruchter (Research Director) join Anjney Midha, Marco Mascorro, and Justine Moore of a16z, with host Erik Torenberg, to discuss how they built it, the breakthrough "special memory" feature, and the future of AI-powered gaming, robotics, and world models.

They share:
- How Genie 3 generates interactive environments in real time
- Why its "special memory" feature is such a breakthrough
- The evolution of generative models and emergent behaviors
- Instruction following, text adherence, and model comparisons
- Potential applications in gaming, robotics, simulation, and more
- What's next: Genie 4, Genie 5, and the future of world models

This conversation offers a first-hand look at one of the most advanced world models ever created.

Timecodes:
0:00 Introduction & The Magic of Genie 3
0:41 Real-Time World Generation Breakthroughs
1:22 The Team's Journey: From Genie 1 to Genie 3
5:03 Interactive Applications & Use Cases
8:03 Special Memory and World Consistency
12:29 Emergent Behaviors and Model Surprises
18:37 Instruction Following and Text Adherence
19:53 Comparing Genie 3 and Other Models
21:25 The Future of World Models & Modality Convergence
27:35 Downstream Applications and Open Questions
31:42 Robotics, Simulation, and Real-World Impact
39:33 Closing Thoughts & Philosophical Reflections

Resources:
Find Shlomi on X: https://x.com/shlomifruchter
Find Jack on X: https://x.com/jparkerholder
Find Anjney on X: https://x.com/anjneymidha
Find Justine on X: https://x.com/venturetwins
Find Marco on X: https://x.com/Mascobot

Stay Updated:
Let us know what you think: https://ratethispodcast.com/a16z
Find a16z on Twitter: https://twitter.com/a16z
Find a16z on LinkedIn: https://www.linkedin.com/company/a16z
Subscribe on your favorite podcast app: https://a16z.simplecast.com/
Follow our host: https://x.com/eriktorenberg

Please note that the content here is for informational purposes only; should NOT be taken as legal, business, tax, or investment advice or be used to evaluate any investment or security; and is not directed at any investors or potential investors in any a16z fund. a16z and its affiliates may maintain investments in the companies discussed. For more details please see a16z.com/disclosures.
Transcript
All of the applications basically stem from the ability to generate a world
that just from a few words, you look at it and there's a world that's generated in front
of your eyes and it's amazing that it's happening.
I was very excited about how far can we push that.
And it's at the point where a human who is not an expert will watch it and think it looks
real, right?
And I think that's pretty incredible.
Genie 3 from Google DeepMind can create fully interactive, persistent worlds in real time from just a few words.
Today, we're joined by the team behind it:
Shlomi Fruchter and Jack Parker-Holder from Google DeepMind,
plus Anjney Midha, Marco Mascorro, and Justine Moore from a16z.
We'll talk about how it works, the special memory that keeps worlds consistent,
the surprising behaviors it has learned, and where world models are headed next.
Let's get into it.
Jack, Shlomi, Genie 3 has taken over the internet.
We're honored to have you on the podcast today.
Has the response surprised you?
Reflect a little bit on the reaction.
We weren't sure how big it was going to be,
but we definitely felt that we had something
that was a long time coming,
basically being able to generate environments in real time.
I think a lot of work that was done
in Google DeepMind and outside
pointed to that direction,
but we really wanted to make it happen
and I hope we have, yeah.
Team, why don't we reflect internally a little bit
about what we found so game-changing about Genie 3
and why we're so excited to have this conversation,
Mark?
Yeah, for sure.
I mean, first of all, it's an amazing model.
I think there's a lot of excitement around the special memory, the consistency across all the frames.
I think this is the first time I can see, like, you can have some sort of interactive way of doing this stuff with videos.
Because it used to be like, you would do one prompt and you would have 15 seconds of a video.
But now you can actually have some sort of interactive kind of element to it, which I think is very exciting.
So can you elaborate a little bit more on your insights here?
Like, how did you, for example, figure out what data you should collect,
how you make it very interactive while keeping the flow of the whole video,
which I thought was phenomenal.
Sure, yeah.
So I think you kind of highlighted a few capabilities, sort of the length of the generation,
the consistency of the world, maybe diversity as well,
of the kind of things you can generate.
I think the main thing is that obviously we made progress on quite a few different fronts,
right, in separate efforts.
So we had this Genie 2 project that was much more sort of, like, 3D environments
that it could generate.
And it wasn't super high quality.
It felt like a big step up coming from Genie 1,
but it wasn't the same quality as things like Veo 2,
which was the state-of-the-art video model at the time.
It came out in December,
roughly exactly the same time.
It came out a week later than Genie 2.
And obviously, internally,
there was a lot of discussion between the two projects
about the different directions we were pursuing.
And then,
Shlomi had also worked on GameNGen, right,
which is the Doom paper, as people know it,
which I think you guys also wrote a nice piece
on straight after that came out.
So I think that also attracted a lot of attention.
And so we felt that across these different projects,
we had quite a lot of interesting things
that would naturally kind of combine
and we could basically take the most ambitious version
of the combined project and see if it was possible.
And fortunately it was.
I think the timeline is probably the bit that surprised many of us,
because obviously we set ourselves these goals
and we tried very hard to achieve them.
But you can never be totally sure how it's going to actually
feel when you've got to that point.
I think it ended up being something
that resonated with people a lot more
than maybe we expected, but we
always believed in it. Yeah,
I'll just add to this, that I think the real-time
component is really
important, and not many people
experienced it first hand, but we really tried
in the release to at least
have a few trusted testers interact
with it, and also
to give a feel of it by adding these overlays
that show what happens, how people can use
the keyboard to control it.
And I think there is something magical about the real-time aspect.
I felt it for the first time when our model, the game engine model,
started working fast enough.
And we were just like, oh, my God, it's actually, I can actually walk around.
And it was a bit of a wow moment.
And, yeah, I think there is something when it responds immediately that is really magical.
I think that kind of sparked the imagination of many people when the Doom simulation came out.
And here we really wanted to push it further.
We weren't sure it's going to work.
So it was definitely at the edge of what's possible, I think.
That's how we felt.
So we just said, yeah, let's try and see if we can make it happen.
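To make the real-time constraint concrete: for interaction to feel immediate, each frame (input read, model inference, display) has to fit a fixed time budget. Below is a minimal, hypothetical sketch of such a loop; the frame rate and all function names are illustrative assumptions, not Genie 3's actual implementation.

```python
import time

# Hypothetical real-time interaction loop. TARGET_FPS and the callables
# passed in are illustrative assumptions, not Genie 3's API.
TARGET_FPS = 24
FRAME_BUDGET = 1.0 / TARGET_FPS  # ~42 ms per frame for it to feel immediate

def interactive_loop(generate_frame, read_keyboard, display, num_frames=1000):
    for _ in range(num_frames):
        start = time.monotonic()
        action = read_keyboard()        # e.g. WASD-style navigation input
        frame = generate_frame(action)  # model inference must fit the budget
        display(frame)
        elapsed = time.monotonic() - start
        if elapsed < FRAME_BUDGET:
            # hold steady frame pacing; if inference overruns the budget,
            # the world stops feeling immediately responsive
            time.sleep(FRAME_BUDGET - elapsed)
```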
I think you guys, I don't know if this was on purpose or not, but you perfectly timed it
when everyone on X and Reddit and everywhere was making those videos of characters walking through games.
But they obviously weren't interactive.
They weren't real time.
And then you guys came out with this release that was like, now this is an actual product and it blew folks away.
I'm curious, because you can imagine so many different applications for this, right?
Like more controllable video generation or making it much easier to create games,
even personal gaming, where someone's just kind of creating their own world,
they walk through, like, RL environments for agents, robotics.
Are there any particular use cases that you're most excited about?
I think all of the applications basically stem from the ability to generate a world,
just from a few words.
And I think, for me, I kind of saw this potential...
I started looking at video models pretty early, I think when one of the first
models was Imagen Video, which was a model by Google Research.
There were a lot of models that were very basic compared to what we have today,
but the ability to simulate something, like you look at it and like there's a world that's
generated in front of your eyes, and it's amazing that it's happening.
And I think at this point, I was very excited about how far can we push that, right?
So I think Veo was one way to do it, and Genie is definitely another way to make it a bit
more interactive.
So I think all of the applications
basically stemmed from
this core capability.
So it can be entertainment,
of course, as you said,
it can be training agents,
it can be helping agents
to reason about the world,
education.
So I don't think any particular application
is more important than the others.
I think it's really up to
how developers in the future
will build on top of that.
Yeah, I would give basically
the same answer in the end,
with a different journey to get there, right,
which is,
I personally worked in reinforcement learning
for a few years before starting the Genie project in 2022,
and the motivation originally was that in RL at the time,
we had this problem where we'd say,
which environment shall we try and solve, right?
Because once you've already done Go,
which people thought was years or decades away
and then was solved, or we reached superhuman level, in 2016,
and then StarCraft three years later,
which is not a particularly long time
for something incredibly significant.
So around 2021,
it was a big question of what we should try and do with RL.
We knew that the algorithms could learn superhuman capabilities
if they have the right environment,
but we didn't know what the environment should be.
And so we were working on designing our own ones with colleagues.
But then instead, it seemed like the more promising path,
when you had the first text-to-image models coming out,
was, like, what if we just think long term:
what's the way to really unlock unlimited environments?
That being said, over the course of the project,
and originally we started it, I guess, in 2022,
it was very focused on that one application,
but it seems quite clear now that this could have a big impact
in all those other areas you mentioned, right?
So I think it's like language models in 2021, maybe.
You probably wouldn't have guessed like an IMO gold medal a few years later
would come that fast as a direct application of that technology, right?
It was probably, oh, it can help me with my emails or whatever it was.
And I think it's really cool to build these kind of new class of foundation models
and then see what people can imagine doing with it.
And that's one of the very exciting things about sharing the research preview, right?
So you've got this kind of feedback.
So we're hoping a lot of these things can happen.
One of the things in the research preview post,
Jack, that blew me away was this,
and it wasn't even your first GIF, I think, in the blog post.
It was either second or third.
You had this visual of somebody painting the wall
with the paintbrush,
and then the character moves away, right?
Like, away to a different part of the wall, paints,
and then moves back.
And the original paint is
still there. And I didn't believe it. I was like, there's no way. And then I read, and you're
right, it's described as a special memory. So the persistence part for me, I'm not taking away
from all the other stuff. The interactivity is amazing. But I think, broadly speaking, folks expected
that at some point, video generation, for example, would become real time. When I saw the
Genie 3 posts, it was like, okay, they actually went and did it. But the special memory,
the persistence was when I kind of sat up in my chair and I was like, how did that happen?
Could you talk a little bit about when did you discover that as an emergent property? Or was
that a specific design goal? What's the backstory on that? Because that feels like a big
unlock, Jack. Why don't we start with you? Yeah, so that's a great question. I'll say a few things.
So the TLDR is it was totally planned for, but still incredibly surprising when it worked that
well, right? So that specific sample, when I saw it, it was hard to believe. For a second, I actually wasn't
sure that the model had generated it. I had to watch it a few times
and, like, really check and freeze the frames and look back and check that it was the same. But
to go back a few steps. So obviously, Genie 2 had some memory, right? So this got kind of lost
because, I mean, Genie 2 came at a time when there were lots of announcements, very exciting
announcements. I mean, VO2, only a few days later. It was a busy time of the year. And the main
headline act was that we could generate new worlds at all, right? So that was the thing that
we wanted to emphasize. But it did have a few seconds of memory. And we had a couple of examples,
like, I created a robot near a pyramid, looked away, looked back, and the pyramid's there.
but it's like kind of blurry
and it's not perfect
but some other models around the same time
or more recently didn't have this feature
right so people kind of indexed to that
because they didn't notice the early signs of it
in the Genie 2 work
and then for Genie 3
we basically went
much more ambitious on the same sort of approach
right and we made it like a headline goal for ourselves
— can we really push the memory much further, right?
We said we want minute-plus memory
and real time
and this higher resolution
all in the same model
and those are kind of conflicting objectives
right so we set ourselves
this kind of technical challenge
and we said if we target this
then it's just about feasible
and it'll be pretty incredible
and then you still don't know, obviously, how
it's going to pan out. So then when you get to the
end of the research run
seven months later,
to see the samples still is quite
mind-blowing, to be honest. So,
yeah, it was kind of planned for,
but still pretty
cool and exciting when you see it, because
these research projects aren't
sure things, are they?
So one thing that we didn't want to do is build an explicit
representation, right? So there are
definitely methods that are able to achieve
consistency, and they do that through
an explicit 3D representation,
you know, like NeRFs and Gaussian
splatting and other methods that
basically say, okay, if we know how the world
looks, we use this kind of, like, prior
assumption that
the world remains static pretty much,
and then we can build a representation of what you're looking at.
So that's great, I think, for some applications,
but we didn't want to go down this path because we felt it's somewhat limiting.
And so we can definitely say that the model doesn't do that.
And it does generate like frame by frame.
And we think this is really key for the generalization to actually work.
Every time someone interacts with it for the first time and they, like, test it,
they look away and then look back, I'm always, like, holding my breath.
And then it looks back and it's the same.
I'm like, whoa.
It's really cool.
And how long is this special memory?
I don't know if you can talk about it.
You mentioned a minute plus,
but is there some sort of, like,
measure that you have?
Is it, like, can you keep it for
half an hour, or what is the limit on that?
there is no like fundamental
limitation, but with the current design
we're limited to one minute
of this type of memory.
Yeah, it's also a real-time tradeoff,
I guess, as well. We felt that,
because of the breadth
and the other capabilities, a minute
was sufficient.
So for this version,
it's quite a significant leap,
but obviously,
eventually you'd want to
extend this.
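To picture the approach described here — frame-by-frame generation with no explicit 3D representation, where consistency comes from conditioning on roughly a minute of prior frames and actions — here is a minimal sketch. The class, window size, and frame rate are assumptions for illustration, not Genie 3's actual architecture or API.

```python
from collections import deque

FPS = 24
WINDOW_SECONDS = 60                  # the "minute plus" memory discussed above
WINDOW_FRAMES = FPS * WINDOW_SECONDS

class NextFramePredictor:
    """Stand-in for a learned model: prompt + recent frames/actions -> frame."""
    def predict(self, prompt, frames, actions):
        raise NotImplementedError    # real model inference would happen here

def run_world(model, prompt, get_action, num_frames):
    frames = deque(maxlen=WINDOW_FRAMES)    # no explicit 3D scene is stored;
    actions = deque(maxlen=WINDOW_FRAMES)   # consistency comes from this window
    for _ in range(num_frames):
        action = get_action()               # e.g. "turn left", "walk forward"
        frame = model.predict(prompt, list(frames), list(actions) + [action])
        frames.append(frame)                # content older than ~1 minute
        actions.append(action)              # falls out of the memory window
        yield frame
```

On this sketch, looking away and back stays consistent because the earlier frames are still in the conditioning window; beyond the window, the model would have to re-imagine what it saw.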
One more question, related to going
from Genie 2 to 3.
Like, for example,
with LLMs,
like DeepSeek R1,
they saw in that paper that the longer
they keep it running,
they suddenly see
these interesting behaviors,
like the model
will start reasoning,
or will say,
oh,
I'm wrong on this,
I should self-correct.
Do you see anything
in this scaling
from two to three, any sort of interesting behavior that you were not expecting
that suddenly just appeared by increasing the amount of data and the amount of compute?
Yeah, I'll just say, overall, like many generative
models, we definitely see that improvements happen with scale. I think that's no secret. And I don't think
it's the same type of intelligence, I would say, as an LLM has. I'm not sure if reasoning is
the right term. But we definitely do see things like it can infer from, you know,
if you approach the door, it makes sense for the agent to maybe open it.
So you might see that it's starting to do that, for example.
Or there's some better world understanding that happens over time.
And it just, like, things look better and more realistic.
So I think these are the trends that we've observed.
Yeah.
And from Genie 2 to 3, I think the real-world capability really increased, right?
So on the physics side, some of the water simulations, some of the lighting as well,
are really breathtaking.
I think we have this example of the storm on the blog,
and that one I think is super cool.
And it's at the point where like a human who is not an expert
will watch it and think it looks real, right?
And I think that's pretty incredible.
Whereas with Genie 2, it was like,
it kind of understands roughly what these things should do,
but you know it's not real, right?
You can look at it and you can clearly see that it's sort of
not completely photorealistic.
So I think that's quite a big leap on the quality in that side.
Yeah, one of the things that
was really cool in all the examples was the water. It's sort of a great way to see, like,
does it understand, like, what the world is and how objects interact? And that example,
someone posted of the feet going in the puddle was amazing. But then there was also that
example of, like, a cartoon character. It was more of an animated style who was, like,
running across this kind of green patch of land, and then ran into this blue kind of wavy
thing that looks like water, and he started swimming, which I thought was really interesting.
Like, were there particular things you had to do around that for the model to be able to understand, like, how characters should interact in different environments and different styles?
What you're basically describing is, like, the real breadth of different kinds of environment terrains and worlds and things like that, like water, or walking on sand versus going downhill in snow, and how the agent's interactions should differ given the, like, terrain that they're in?
And I think that that really is a property of scale and breadth of training.
So this is very much like an emergent thing.
I don't think there's anything like really specific we do for this, right?
You again, like you hope the model has learned this because it should have like a general world knowledge.
It doesn't always work perfectly.
But in general it's pretty good.
So for the skiing examples, you do go fast when you go downhill.
And then when you try and go back uphill, it's very slow,
if not impossible.
When you go into water, obviously,
you hope, as you said,
that the agent will start swimming and splashing.
And this does typically happen.
When you look down near a puddle,
hopefully you're wearing Wellington boots.
Like, this kind of stuff does just kind of make sense.
And I think it feels pretty magical
because it very much aligns with what you were thinking
about the world and the models just generated it all.
So, yeah, that's also one of the really exciting things for sure.
Yeah, and on top of that, one kind of trade-off that typically we have is that we want the model to do two things.
We want the model to create the world in a way that looks consistent.
As I just said, like, if you walk in rain or in puddles, then you're probably wearing boots.
But if we provide it with a different description or, like, the prompt is saying something else,
we want it to still follow the prompt.
And there is some tension here because some things are very unlikely, right?
You might say, I want to wear flip-flops and jump in the rain or whatever,
and then the model still has to try and create something that is very unlikely.
And that's where typically, you know, video models maybe find it more challenging,
and that's where, you know, our models might find it more challenging,
but it's still successful to a surprising degree in going into this kind of low-probability area.
And I think that's really, in a way, that's what we want, right?
Like many people, they don't want to just look at a video that looks like their own,
you know, maybe this room, but something a bit more exciting.
And that's where I think this is the magic of the models that they can take you to places
that maybe are not so likely to be in reality.
The text following is really amazing in this model.
And that does feel really magical.
I think that's something that Veo does really well as well, right?
You get pretty much what you ask for.
It's really well aligned with text.
And we have that with Genie 3.
So you could describe very specific worlds
and really kind of like arbitrary, silly things.
And it pretty much works.
Like we actually had this discussion
because people were very disappointed to find out
that the video I made of my dog actually wasn't made from my dog's photograph.
I just described her in text.
And, yeah, I don't know if that's a big secret,
but it looks exactly like her.
And the model just kind of knows, right?
I think that's pretty amazing.
So I think that that's actually a really important capability
that we didn't have with Genie 2 as well, right?
Because we relied on image prompting.
And so there was some transfer issue, like,
where you rely on Imagen to generate the image.
And that often does look really good,
but it's not necessarily a good image for starting the world.
Whereas, like, going directly from text,
you get the controllability to prompt anything you want.
Plus, it just kind of naturally works
because it's in the, like, correct space for the model to do its thing.
And that's something really powerful.
And why is that, Jack?
What do you think led to such a massive instruction following
or text adherence gain?
Because it's a pretty hard thing to do.
Well, I mean, our team had never really worked on this.
And so Genie 1 and 2 both worked with image prompting.
And so obviously, like, for this next phase,
we leverage a lot of the research done internally on other projects
and personnel-wise.
I mean, Shlomi's obviously been co-leading the VEO project.
And so we were able to kind of build on a lot of other work and ideas internally
and that basically allowed us to kind of turbocharge progress.
Right.
So if we'd done this sort of by incrementally building ourselves on an island,
it would have taken, I think, a lot longer.
Whereas being part of Google DeepMind
where we have these teams
that have a lot of knowledge
in different areas
that we can sort of lean on and build on,
which I think is super exciting
about our being in the company right now
is that we have so many experts
in different areas
that we can seek out advice and help from.
And Shlomi, a question for you on that
is, having led the Veo 3 work,
which is kind of mind-blowing.
Is there a reason why this is Genie 3
and not, like, Veo 3 real-time?
So I think it's definitely a bit different, right?
Like, Genie allows you to navigate an environment and then maybe take actions, right?
And that's not something that Veo at this point can do.
But there are other aspects where they're different, things that Genie doesn't have, right?
It doesn't have audio, for example, right?
So we just think it's, while definitely there are potential similarities, it's sufficiently different.
Also, another thing is that at this point
Genie 3 is not available
as a product, while Veo, we do think about as a product
that is kind of mainstream and became very,
very popular. And, you know, what the future holds,
I don't know. But I mean, at this point we just felt it's
sufficiently different in terms of capabilities
and how we think about it.
So Genie 3 is pretty much a research preview,
right? It's not something we are releasing as a product at this point.
You know, something we think about a lot
is what the edges of a modality are.
We talk about it all the time,
which is, you know, the lines start blurring pretty quickly
between real-time image and video,
and then real-time video and interactive,
whatever, world generation.
I don't think we have a good word for what Genie 3 is yet,
but you guys called it a world model,
which is, I think, a great term.
But in your mind, like,
where does the video generation
modality stop
and real-time worlds,
you know, start?
And do you think in the future,
are these converging into basically one modality?
Or if you had to predict, over the next few years,
do you guys think, actually, yeah,
these will diverge into completely different disciplines?
It seems like they share kind of one parent today,
which is, you know, video generation.
But where is the world going, do you think?
Are these two completely different fields?
From my perspective, they're different.
So I would say modality is one thing, right?
We have text, we have audio.
Even within audio, there are different types of sub-modalities.
Speech is not the same as music.
We have different products for music generation.
We have other models for speech generation, speech understanding.
So even within one modality, you can have different flavors.
And then, of course, you have video and other things.
So I think basically I would say modality is one dimension,
another is how fast or how quickly we can create new samples,
and a completely orthogonal dimension, maybe, is how
much control we have, right? So I think we kind of picked a specific direction or a specific
vector in that space for Genie 3. I think different products, different models can try and go in
different directions. I think the space is pretty big and there are a lot of trade-offs to be made.
So, yeah, I don't know.
I think it really depends.
Some people believe there is, you know, one model that does everything.
I think it's still an open question whether that's the best way.
Like, we're in a place where engineering is a big part of our research, right,
and actually making those models work. It's not just a paper, right,
where we want to build something that people can actually use.
So an abstract idea can get you to some point,
but to actually build things, we have to make some concrete
decisions, and I think it kind of forces you to decide what you want to do and what
you're not going to do. Yeah, I think this is a really interesting point, and
ultimately it has to be driven by, like, technical decisions and also, like, the goals, right?
So if you look at the models right now, we obviously made a choice that we wanted Veo 3 and
Genie 3 to be separate projects this year, right? And if you look at them both as they are right
now, they each have very different capabilities that the other model does not have. And technically to
combine all of that already into one model, right, would be, I think, very challenging. I mean,
Veo 3 is at a totally higher quality threshold than Genie 3, right? And it has very different priorities, right?
So then the natural thing you could say is, oh, well, you know, what if we just took these
together and combined them? But that may not be the best next step for either of those two
models, right?
So it may not be the case
that the thing that the other one has
is actually the most compelling thing
for a completely different experience.
And I think that given the breadth
of interest in both models,
right, there's actually
quite a small set of people
that are like really actively using both
and they tend to be more folks like yourselves
who are just more broadly interested in AI, right,
rather than like really downstream use cases.
So, like, you mentioned agent training
for one,
which is a very sort of
high-action-frequency setting,
requires more egocentric,
sort of, I guess,
more like
worlds where tasks can be achieved,
but doesn't require,
you know,
the high-quality cinema-style
videos you could generate
with the Veo model. It's
quite different.
And then on the filmmaking element,
I mean,
I'm also not sure that Genie 3
is really there at this point.
And that wouldn't necessarily be the goal.
I don't know.
On filmmaking,
Justine can do some pretty incredible things with the filmmaking tools today.
You'd be surprised.
Give me access, and I will make amazing films with Genie 3.
I guess I did kind of get to one of my questions, though, which is the work you guys are
doing is incredible, and you clearly probably have so much going on in your brains just
to coordinate training these models and managing these teams.
How much do you also have to think about, like, what are the downstream use cases of the
model when you're training it?
Because you could imagine a world in which you're just like,
we don't really know or care what people are going to do with it yet.
We're just going to go in the research direction.
We think we should go and see what happens.
But based on how you guys are talking about it,
it sounds like you've also been pretty thoughtful around
what are the different capabilities or features needed
for different potential use cases, at least, of different models.
Yeah, I'll say that basically we have some applications in mind,
but that's not what's driving the research.
It's more about
how far we can push
in this particular direction.
Can we make all of that work:
really great quality,
really fast generation,
real time, very controllable?
I think that's kind of like
what drives us.
I think that's what drove us
to develop Genie 3,
and the applications kind of
follow. And,
to be honest, I don't know
what the applications will be.
Like, I think we were very surprised,
I'd like to mention, like,
with Veo 3,
people find new ways
it can be useful,
and to prompt it
with visual stuff,
people just discovered that, right?
We didn't even think about it initially.
So I expect kind of the same thing,
and I think that's why I am excited
for more people to be able to access it in the future.
make sure that
over time
there is more access to
the models we build
and I think that's the only way
to discover the real potential.
I guess one, somewhat related to that, like,
how do you think about going forward, like Genie 4 or 5
or any other models? Like, what is top of mind right now?
Like if you wanted, for example, to focus on, I don't know,
like it seems like gaming could be one of the applications,
having multiplayer type of games,
where you have two special memories or two completely different views,
but at some point they merge.
How are you thinking on like going forward?
Like, what's next?
Is it like scaling these models just on more
data more compute, is it creating this sort of like multi-universe type of things where you have
multiple players, multiple people looking at the same model, putting different views. What's like a top
of mind for you guys? Top of mind, I think for the next few days might be a vacation. After that,
maybe walking my dog in the real world. And then I think you mentioned a bunch of really interesting
things, to be honest. And like, I think we are, we're still collecting a lot of feedback on this
current model, right? And I think that in general, we are most interested in building
the most capable models, right? And so we would hope to have even broader impact in future
and really enable other teams to do cool things with it, right? Both internally and externally.
And for me, it's like, I just started this with like a very, very focused vision about AI.
And I still think, honestly, what I'm excited about for AGI is more embodied agents,
I really believe this is the fastest path to getting these agents in the real world.
And I think we made a big step towards that.
But I'm still like sometimes even more excited about applications I never thought of
that come up from other people seeing the model, right?
So I think it's kind of this like trade off of, you know, obviously you want to focus on some applications,
but then you want to be open-minded about others.
And I think that's the real joy of building models like this, right?
Is you get to see all of these people can be way more creative than me with it.
So I think that there's always really cool things that we can do.
And I honestly can't really tell you in one year what the biggest application will be.
But we'll definitely be trying to build better models.
Yeah, I'm really excited.
But I think, as impressive, you know, as the model maybe is,
I think we're still very far from actually simulating the world, actually
being able to kind of put a person in there and have them do whatever they want.
And, I mean, when I say far, it doesn't mean it's far in terms of, you know,
calendar time, because we are on a really accelerated timeline.
But it feels like there is more work to do to get there.
And I think I just imagine, like once we can actually, you know,
whatever the form factor would be, but step into this world and just kind of, like, maybe tell it
what we want to experience.
There are so many applications.
Imagine, for example,
someone is afraid of talking to people on a stage or in a podcast, right?
They can simulate that, right?
Or you can have someone who is afraid of spiders.
They can maybe actually see themselves getting over that.
So that's, like, you know, just one example of something that actually my wife thought about.
It's not my idea.
So I think it's really, like there's so many things, right?
So I think this all just
hinges on the ability to simulate the world
and put ourselves in it,
maybe seeing ourselves from the side,
and potentially having agents
interacting with things.
And yeah, the realism, really making it work
in a way that is similar to our world,
I think is really key.
I am actually personally petrified of skiing
And the model is already quite good at that
So I might, when things quiet down,
spend some time,
because I promised my wife
that our children would grow up
knowing how to ski
and we're getting close to the age
where I have to live up to my promise
and I'm not sure if I want to do it yet.
We have to improve the model for you, Jack,
so you can actually...
Get that in distribution.
I hope so.
We were just talking, before we started, about how
we might see applications
like in robotics.
I mean, Jack, you were talking about embodied AI
and now, like, the limitation in robotics
is the data, right?
Like how much data you can collect
and now probably you can just generate
a lot of different scenes
that you were not able
to do before purely from like
just recording videos or so
so I think that's another thing that is pretty
exciting and I mean
congrats on the model it's phenomenal
on the robotics
application there was a
conversation that I was listening to
from Demis yesterday where he was talking about
your guys' work on Genie 3
and he mentioned that
there's an agent, I think you guys call it SIMA,
right which can then
interact with the Genie agent
and as I was hearing him describe it
which was kind of breaking my mind,
which is that you had one simulation agent
asking the Genie agent
to essentially create a real-time environment
for it to interact in, right?
Which was when I realized,
oh, the way you guys have built it,
it's composable with other agents.
Can you talk a little bit about why that's so important
for robotics like Marco was saying?
And what are the major limitations today
that you think we'd have to overcome as a space
to make the
rate of progress in robotics much faster than it is now.
So we designed it to be an environment rather than an agent, right?
So Genie 3 is very much like an environment model.
Like we don't see it as like an agent itself that can like think and act in the world.
It's more just a general purpose sort of simulator in a sense, right?
That can actually simulate experiences for agents.
And we know that like learning from experience is a really important paradigm for agents, right?
That's how we got AlphaGo because the agent,
AlphaGo learned by playing Go by itself trying new things, right?
And then learning from feedback with reinforcement learning,
learning to improve itself and actually discover new things.
Like it discovered new moves, like Move 37, that humans didn't think was a worthwhile move,
right?
But actually AlphaGo learned that it was because it could experience and try things for itself.
And in robotics, we have this paradigm right now where there are some data-driven approaches,
where you can collect
data in quite a laborious way,
but it looks like the downstream tasks
and looks real, and there's not so much of a mismatch
between the two domains.
Or you can learn in simulation,
right, but the robotic simulations, even the best ones,
and we have some of the best ones,
like MuJoCo, which we work with,
they're still quite far away from the real world,
right, and so you have the sim-to-real gap.
But even the sim-to-real gap itself
I think is kind of poorly named,
because what people consider to be real in robotics
is typically still a lab
or some very constrained environment
where you've got a bunch of spotlights on a robot
and then tons of researchers crowding around watching
you know
whereas really real, for me,
my main reference
is the ability to walk my dog
when I'm too busy
to hold the lead:
cross the street,
you know, see someone who's scared of dogs,
know to go around them,
see someone with a ball, change directions,
like all these challenging situations in the real world, right?
And of course, you still have gripping.
You still have these other tasks.
But you need to really discover your own behaviors
from your own experience, right?
And doing that in physical embodied worlds
is super challenging because there's so many reasons
why firstly that could be expensive
to collect data in those settings.
You'd have to keep moving the robot back
to where it started every time it doesn't do something right.
And also it could be unsafe, right?
So there's many reasons why we can't really do
learning from experience in the physical world, right?
So we do it in simulation.
But really what we think with Genie 3 is it's the best of both, right?
Because you're taking a real world data-driven approach, right?
But then you've got the ability to learn in simulation.
So it kind of combines the good parts of each of those.
And so that's why I think it could be super powerful.
Not just for a robot example,
but I really love this idea of having, when it rains in London a lot,
not having to take my dog for the second walk would be great.
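For a concrete picture of the "world model as environment" idea: below is a hedged sketch of the classic agent-environment loop with a generated world standing in for the simulator. The wrapper, the reward stub, and the agent interface are all hypothetical illustrations of the paradigm described above, not DeepMind's actual training setup.

```python
# Sketch of using a learned world model as an RL environment. The model is
# a next-frame predictor as sketched earlier; all names are assumptions.

class WorldModelEnv:
    """Wraps a next-frame predictor so an agent can act inside it."""
    def __init__(self, model, prompt):
        self.model = model             # learned next-frame predictor
        self.prompt = prompt           # text description of the training world
        self.frames, self.actions = [], []

    def reset(self):
        self.frames, self.actions = [], []
        obs = self.model.predict(self.prompt, [], [])
        self.frames.append(obs)
        return obs

    def step(self, action):
        obs = self.model.predict(self.prompt, self.frames, self.actions + [action])
        self.actions.append(action)
        self.frames.append(obs)
        reward = 0.0                   # a task-specific evaluator would go here
        return obs, reward

def collect_experience(agent, env, num_steps):
    # learning from experience: act, observe the generated world, improve
    obs = env.reset()
    for _ in range(num_steps):
        action = agent.act(obs)        # e.g. a SIMA-style instructable agent
        obs, reward = env.step(action)
        agent.update(obs, action, reward)
```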
And as you can see, we built the model basically for Jack personally.
Vacations, that's what's driving the product, is the point.
There's a lot of dog owners out there.
Yeah.
I'm just saying, clearly, Jack, it's time to move to California.
Yeah.
That's the solution.
Less rain.
Less lag.
I mean, I personally love
California, but my wife's not quite convinced.
We're convinced here.
Yeah, just to touch on, you know, maybe a final point on the
robotics part. I think it's definitely, you know, robotics is
more than visual, right? I think this is an important point.
We can drive the decisions of the robot by looking around, but still it has
to, kind of, you know, take actions, decide where to move, how to respond to the
environment.
So I think there are definitely some gaps.
But still, at the core of the problem,
being able to reason about the environment,
we think this is something
that world models,
or general-purpose world models
such as Genie 3,
can really help with
and maybe with future research
we can actually bridge those gaps
of physical understanding
and actually getting responses
physical responses from the world
which is a very interesting direction to explore
one last question from my side
I don't know if you can answer this, but like, is it going to become public?
Like, can developers access it at some point?
Or is there, like, some sort of idea on this?
So, as you can see, we are very excited about having more people accessing it.
So we definitely want to make it happen.
There is no kind of like a concrete timeline at the moment.
But, you know, I'm sure once we have more to share, we will do.
Awesome.
One of the things I've been thinking about a lot is we see sort of with every, like,
modality, like, you know, maybe first LLMs and then image and video and audio.
There's, like, early kind of glimmers of something really exciting in a project or a research
preview.
And then there's, like, a ton of data and compute and researchers kind of poured into the
problem.
And you hopefully see this sort of, like, exponential progress until you eventually get to the
point where, like, you're out of data or the improvements don't come as easily.
I'm wondering your thoughts on, like, where we are on sort of that curve for world models.
That's a really good question.
I actually have a super hand-wavy, somewhat swerving answer, right?
And I think it's actually both.
So I think the current capabilities are actually already quite compelling.
And so you could make the case that, like, if what you wanted was a minute of
photorealistic any-world generation with memory, that could actually be the end goal, right?
And two or three years ago, I probably would have said that was a five-year goal.
And so at that point, if you just wanted to improve that,
I think you'd probably end up with, maybe, like...
I think the jump from Genie 2 to Genie 3 was absolutely massive.
It went from being, like, kind of a cool bit of research
that was, like, showing signs of life,
to something that could already be very compelling.
But I think there's a lot more that you can do with this.
And Shlomi kind of referenced this himself, right?
Like, it's not the case that dropping yourself in the world
is like really being in the real world, for example.
It's actually quite different to that.
When you do, you know, take a minute to look away from the screen,
it's quite a bit richer out there.
And that's just for the real world.
We also want this ability to generate completely new things, right?
So I think we've got a huge gap to close, right,
with the new capabilities that we want to add.
But I think it's maybe a bit different to language models.
Or actually, maybe it is similar to language models,
but with language models there's been like lots of new steps
that have actually come on top, right?
That maybe we didn't think were possible.
We thought things were plateauing.
And then a new idea came that made a significant change.
And that has happened a couple of times in the past few years.
So I think that there's a few more of those left, for sure.
My final question for you guys is, are we living in a simulation?
Oh, yeah, you know, my previous thinking about that is...
actually, yeah, I have thought about it.
I think,
if we live in a simulation,
my take is that it doesn't
run on our current
hardware,
because it's analog,
and not like...
it's continuous,
all of the observations
are continuous,
and there is nothing
like... but maybe
the quantum level
is, you know,
if you wanted to go
philosophical,
some kind of
hardware limitation
of the simulation
we run on.
So, yeah, take it or leave it.
That's a great answer.
Clearly there's a lot of work for the TPU team to do.
Yeah, maybe quantum computing will actually be running our simulation.
So, yeah, yeah.
That's a great place to wrap.
Shlomi, Jack, thank you so much for coming on the podcast.
Thank you, guys.
Thanks, guys.
Thanks for having us.
Thanks for listening to the A16Z podcast.
If you enjoyed the episode, let us know by leaving a review at ratethispodcast.com/a16z.
We've got more great conversations coming your way. See you next time.
As a reminder, the content here is for informational purposes only. It should not be taken as
legal, business, tax, or investment advice, or be used to evaluate any investment or security,
and is not directed at any investors or potential investors in any a16z fund.
Please note that a16z and its affiliates may also maintain investments in the companies
discussed in this podcast. For more details, including a link to our investments, please see
a16z.com/disclosures.
