The AI Daily Brief: Artificial Intelligence News and Analysis - Are World Models the Key to AGI?
Episode Date: July 22, 2025
A groundbreaking Harvard study trained AI on 10 million solar systems and found it perfectly predicted orbits but completely failed to understand gravity, raising questions about whether LLMs can develop true world models. While companies pour billions into scaling, Meta's Yann LeCun believes current systems will be "obsolete within 3-5 years" and argues that world models could revolutionize AI by giving machines genuine understanding of physical reality. This explores three competing paths to AGI, why Google DeepMind is hiring world model teams, and whether we're looking at the missing piece for human-level AI.
Brought to you by:
KPMG – Go to https://kpmg.com/ai to learn more about how KPMG can help you drive value with our AI solutions.
Blitzy.com - Go to https://blitzy.com/ to build enterprise software in days, not months
AGNTCY - The AGNTCY is an open-source collective dedicated to building the Internet of Agents, enabling AI agents to communicate and collaborate seamlessly across frameworks. Join a community of engineers focused on high-quality multi-agent software and support the initiative at agntcy.org
Vanta - Simplify compliance - https://vanta.com/nlw
Plumb - The automation platform for AI experts and consultants https://useplumb.com/
The Agent Readiness Audit from Superintelligent - Go to https://besuper.ai/ to request your company's agent readiness score.
The AI Daily Brief helps you understand the most important news and discussions in AI. Subscribe to the podcast version of The AI Daily Brief wherever you listen: https://pod.link/1680633614
Subscribe to the newsletter: https://aidailybrief.beehiiv.com/
Join our Discord: https://bit.ly/aibreakdown
Interested in sponsoring the show? nlw@breakdown.network
Transcript
Today on the AI Daily Brief, what are world models and why do they matter for AGI?
The AI Daily Brief is a daily podcast and video about the most important news and discussions in AI.
Hello, friends, quick announcements before we dive in.
First of all, thank you to today's sponsors, KPMG, Blitzy and Vanta.
And of course, as always, to get an ad-free version of the show, go to patreon.com slash AI Daily Brief.
Now, if you are interested in sponsoring the show, shoot me a note at nlw@breakdown.network,
and I will send you all the relevant information.
And lastly, when it comes to today's episode, one of the things that I have heard from you with some
frequency is that on your wish list for the show is occasional deeper dives and more technical deep dives
into big topics. There's been a really interesting paper from Harvard running around about LLMs
and their ability to generate world models. And I thought that provided a good excuse to go deeper
into what world models are in general. Now, this ended up being extremely information dense,
and I thought a perfect fit for that type of episode. However, because it's so information
dense, I decided that we're just going to run it as the entire episode today. We will be back
tomorrow with our normal split between headlines and main episode. For now, though, let's dive in and
talk about world models, why they matter for AGI, and what this new skeptical Harvard paper has
to say. Welcome back to the AI Daily Brief. World models are a concept that we've talked about a few
times here at the show, but maybe not in as much technical depth, at least from a primer perspective,
as would be helpful. Now, recently there's been a lot of discussion around this interesting paper out
of Harvard, that's all about world models and specifically whether foundation models are able
to develop a world model out of their training sets. Maybe a better way to describe this is just
the abstract of the paper itself. The researchers write, foundation models are premised on the idea
that sequence prediction can uncover deeper domain understanding, much like how Kepler's
predictions of planetary motion later led to the discovery of Newtonian mechanics. However,
evaluating whether these models truly capture deeper structure remains a challenge. We develop a
technique for evaluating foundation models that examines how they adapt to synthetic data sets
generated from some postulated world model. Our technique measures whether the foundation model's
inductive bias aligns with the world model, and so we refer to it as an inductive bias probe.
Across multiple domains, we found that foundation models can excel at their training tasks,
yet fail to develop inductive biases towards the underlying world model when adapted to new
tasks. We particularly find that foundation models trained on orbital trajectories
consistently failed to apply Newtonian mechanics when adapted to new physics tasks.
Further analysis reveals that these models behave as if they develop task-specific heuristics
that fail to generalize. Basically, what the researchers are trying to understand is
whether LLMs, which are, reductively stated of course, prediction machines that predict the next
token that's going to make sense in the context of the training data, can move from
that sort of prediction to a generalized understanding of the world that
they are operating in. Specifically, they're trying to figure out not just if a foundation model can
predict orbital trajectories based on training data around orbital trajectories it's had,
but whether it can actually understand the physics principles that underlie those orbital
trajectories in a way that allows it to apply those physics to other types of problem domains.
In other words, the world model that they're interested in with this particular experiment is the
physics that underpin orbits, and they're trying to see if the foundation model can figure out those
physics without necessarily knowing about them in advance. Now, this might all sound like researcher
gobbledygook, but let me try to convince you that this is actually fairly integral to understanding
how LLMs are likely to develop and what pathways are most likely to produce big advances.
And to the extent that you are a business person who's only interested in what models can actually
do for you, I would contend that this still matters in the sense that a lot of the next set of
use cases to be unlocked will require advances in LLM capabilities that are going
to need to come either from existing LLM scaling approaches or fundamentally new approaches like this
focus on world models. So that's the context of why we're having this world model conversation now,
but let's dig a little bit deeper into what we actually mean when we say world model. World models
in AI refer to systems that create internal representations of the external environment,
allowing them to simulate and predict future states based on observations, actions, and underlying
dynamics like physics, as we were just discussing, causality and spatial relationships. These models
draw inspiration from how humans subconsciously build mental models to anticipate outcomes,
such as a baseball player predicting a pitch's trajectory without consciously simulating every
possibility. By the way, that example comes from a paper we're going to talk about in just a minute.
Essentially, they act as an AI's internal map of reality, enabling it to handle uncertainty,
forecast events, and make decisions more efficiently by rehearsing scenarios in a simulated space
rather than relying solely on real-world trial and error.
Now, as I mentioned, the concept originated in a 2018 paper by David Ha and Jürgen Schmidhuber,
which introduced a framework that consisted of three key components:
a vision model to compress high-dimensional sensory input like images
into a compact latent representation,
a memory model to predict future latent states based on past information,
and a controller model to decide actions using these representations.
In their experiments, this architecture was applied to a car racing simulation,
where the agent learned to navigate tracks by hallucinating and planning within its internal model,
demonstrating how world models can train controllers in a dreamlike simulated environment to improve performance.
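As a rough illustration of that three-part split, here is a toy sketch. It is not the paper's implementation: in the paper the vision model is a variational autoencoder, the memory model is a recurrent network, and the controller is a small learned policy. The version below swaps in hand-written stand-ins so that it runs with only the Python standard library. The function names, the one-number latent, and the "each action shifts the offset by one unit" dynamics are all assumptions made for illustration.

```python
from typing import Sequence

# Toy action set (assumed): each action shifts the latent offset by this amount.
ACTIONS = (-1.0, 0.0, 1.0)


def vision_encode(frame: Sequence[float]) -> float:
    """V: compress a row of pixel intensities into one latent number,
    the offset of the brightness centroid from the frame centre.
    Assumes the frame is non-empty and has positive total intensity."""
    total = sum(frame)
    centroid = sum(i * p for i, p in enumerate(frame)) / total
    return centroid - (len(frame) - 1) / 2


def memory_predict(z: float, action: float) -> float:
    """M: predict the next latent state from the current one and an action.
    The dynamics here are a hand-written assumption, not a learned model."""
    return z + action


def controller_act(z: float, horizon: int = 3) -> float:
    """C: choose the first action whose imagined rollout has the lowest
    cumulative distance from the centre. No real environment is touched;
    the rollout happens entirely inside memory_predict."""

    def imagined_cost(first_action: float) -> float:
        z_sim = memory_predict(z, first_action)
        cost = abs(z_sim)
        for _ in range(horizon - 1):
            # Continue greedily inside the imagined model.
            best = min(ACTIONS, key=lambda a: abs(memory_predict(z_sim, a)))
            z_sim = memory_predict(z_sim, best)
            cost += abs(z_sim)
        return cost

    return min(ACTIONS, key=imagined_cost)


frame = [0, 0, 0, 0, 0, 0, 1, 0, 0]  # bright spot two pixels right of centre
z = vision_encode(frame)             # latent offset
action = controller_act(z)           # action chosen by "dreaming" ahead
```

The point of the design is that the controller only ever sees the compact latent and only ever plans against the memory model's predictions, which is what makes training "inside the dream" possible.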
One of world models' biggest and loudest proponents is Yann LeCun, who is the chief AI scientist at Meta,
and he wrote a highly technical definition on LinkedIn about a year ago of world models.
But trying to simplify it just a little bit, world models typically involve an encoder to process inputs
(e.g. observations and actions) into a state representation,
followed by a predictor to forecast the next states,
often incorporating latent variables to account for unknowns
and generate a distribution of plausible outcomes rather than a single prediction.
Training occurs on large datasets of real-world data such as videos or images,
using techniques like diffusion models or transformers to learn dynamics.
Modern variants extend this to multimodal inputs like text, images, and videos,
and outputs that simulate environments as predicted videos or 3D spaces.
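To make the encoder, predictor, and latent-variable idea a bit more concrete, here is a minimal sketch under stated assumptions. It is not LeCun's formulation or any library's API. The state is a made-up (position, velocity) pair, the latent variable is a random disturbance standing in for whatever the model cannot observe, and the names `encode`, `predict`, and `predict_distribution` are hypothetical.

```python
import random
import statistics
from typing import List, Tuple

State = Tuple[float, float]  # (position, velocity), an assumed toy state


def encode(observation: State, action: float) -> State:
    """Encoder: fold the observation and the action into a state representation.
    Here the action is simply treated as a change in velocity."""
    position, velocity = observation
    return (position, velocity + action)


def predict(state: State, latent: float, dt: float = 1.0) -> State:
    """Predictor: next state given the current state and a latent variable.
    The latent stands in for unknowns (say, wind) that perturb the velocity."""
    position, velocity = state
    new_velocity = velocity + latent
    return (position + new_velocity * dt, new_velocity)


def predict_distribution(observation: State, action: float,
                         n_samples: int = 1000, seed: int = 0) -> List[State]:
    """Sample many latents to get a spread of plausible next states
    instead of a single point prediction."""
    rng = random.Random(seed)
    state = encode(observation, action)
    return [predict(state, rng.gauss(0.0, 0.5)) for _ in range(n_samples)]


def summarize(outcomes: List[State]) -> Tuple[float, float]:
    """Mean and standard deviation of the predicted positions."""
    positions = [position for position, _ in outcomes]
    return statistics.mean(positions), statistics.stdev(positions)


outcomes = predict_distribution(observation=(0.0, 1.0), action=0.5)
mean_position, spread = summarize(outcomes)
```

The spread is the useful part: a planner can ask not just where things will probably end up, but how uncertain that is.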
Now, for their proponents, world models are crucial for advancing AI towards human-level
intelligence because they enable reasoning, planning, and adaptation in complex, uncertain settings.
LeCun said, we need machines to understand the world, machines that can remember things, that have
intuition, have common sense, things that can reason and plan to the same level as humans.
He also added hinting at a place we'll go in just a minute. Despite what you might have heard
from some of the most enthusiastic people, current AI systems are not capable of any of this.
Which I guess brings us to the next question, which is how do world models differ from pre-training
or test time compute-style approaches to scaling LLMs and achieving the next levels of advanced
artificial intelligence, AGI, whatever you want to call it. And the short answer is that these
are fundamentally different approaches. In other words, different paradigms for advancing LLMs.
Pre-training and test-time compute focus on optimizing compute and data within the current
LLM architecture, which is primarily auto-regressive next token prediction, whereas world models
emphasize a fundamental architectural shift to enable deeper world understanding. So,
pre-training scaling is an approach that relies on the scaling laws hypothesis, where LLM performance
improves predictably by increasing model parameters, training data volume, and computational resources
during the pre-training phase. Models like GPT-4 and Grok are trained on vast data sets to learn patterns,
enabling emergent capabilities like zero-shot reasoning or few-shot learning. Now, the strength of
this approach for AGI is that it builds broad knowledge and generalization from data, allowing
models to handle language, math, and creative tasks. Scaling so far has driven significant
and frankly rapid progress, where what you would expect to happen happens, which is larger models
outperforming smaller ones on benchmarks. The problem is that there are indications that we're seeing
diminishing returns. Basically, we're hitting performance plateaus despite significant compute increases
due to data quality issues, biases, and other sets of factors that we're still trying to figure out.
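One common way to write down the scaling-laws hypothesis is as a power law, where loss falls as a power of model size. The sketch below uses that functional form with made-up constants: `n_c` and `alpha` are hypothetical values chosen for illustration, not fitted numbers from any paper or lab. Note that a power law by itself already implies shrinking absolute gains for each 10x increase in size, so this is not evidence either way on whether the plateaus people are reporting are real. It only shows what "predictable improvement" looks like on paper.

```python
from typing import List, Sequence


def power_law_loss(n_params: float, n_c: float = 1e14, alpha: float = 0.08) -> float:
    """Assumed power-law form: loss = (n_c / n_params) ** alpha.
    n_c and alpha are illustrative constants, not measured values."""
    return (n_c / n_params) ** alpha


def gains_per_step(sizes: Sequence[float]) -> List[float]:
    """Absolute loss improvement between each consecutive pair of model sizes."""
    losses = [power_law_loss(n) for n in sizes]
    return [bigger - smaller for bigger, smaller in zip(losses, losses[1:])]


sizes = [1e8, 1e9, 1e10, 1e11]   # each model is 10x the previous one
gains = gains_per_step(sizes)    # every gain is positive, but each is smaller than the last
```

Under this form every 10x multiplies the loss by the same factor, which is why the absolute gains keep getting smaller even when nothing has gone wrong.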
Critics like Yann LeCun basically argue that this path is fundamentally flawed for AGI and can't
ever produce human-like intelligence. He put it really bluntly. If you're interested in human-level
AI, don't work on LLMs. Now, of course, it was last fall that we started to talk about hitting
these performance plateaus, and a new approach that was associated with the reasoning models
started to become the thing that everyone was talking about. That was test time compute or inference
time scaling, which shifted the focus from pre-training to allocating more compute during
model use or inference. Techniques include chain of thought prompting where models generate
intermediate steps, tree of thoughts for exploring multiple paths, self-consistency via sampling
or voting, adaptive looping, basically all strategies that allow models to quote-unquote think
harder on problems, which can in some cases allow them to outperform much larger pre-trained models.
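Of the techniques just listed, self-consistency is the easiest to sketch: sample several independent answers and take a majority vote. In the example below there is no real model. `sample_answer` is a hypothetical stand-in that returns the right answer to "17 + 25" with an assumed probability of 0.6 and otherwise picks one of a few wrong answers, which is enough to show why voting over more samples helps.

```python
import random
from collections import Counter


def sample_answer(rng: random.Random, p_correct: float = 0.6) -> int:
    """Hypothetical stand-in for one sampled chain-of-thought answer to '17 + 25'.
    A real system would call a model here; the 0.6 accuracy is an assumption."""
    if rng.random() < p_correct:
        return 42
    return rng.choice([32, 41, 52])  # plausible-looking wrong answers


def self_consistency(n_samples: int, seed: int = 0) -> int:
    """Sample n_samples answers and return the most common one."""
    rng = random.Random(seed)
    votes = Counter(sample_answer(rng) for _ in range(n_samples))
    return votes.most_common(1)[0][0]


answer = self_consistency(n_samples=101)
```

Each extra sample is extra inference compute, which is the trade this whole approach makes: no retraining, but a bigger bill every time the model is used.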
So the strength of test-time compute when it comes to reaching AGI is that it enhances reasoning,
error correction, and adaptability without retraining, which makes it better for complex tasks like
math or coding. However, there are still some challenges. Given that we have these models thinking
for minutes or in some cases even longer, it's computationally expensive. It still relies on the
underlying pre-trained model, and it doesn't address core LLM flaws like lack of causality or
world grounding. That means even this approach can struggle with ambiguity, long context reasoning,
and scalability for real-world AGI applications. World models, on the other hand, as we've
discussed, create these internal, simulatable representations of the environment that can predict
future states based on observations, actions, physics, and causality. Basically, we're drawing
on the inspiration of human cognition where we mentally simulate outcomes. These approaches
use architectures like joint embedding predictive architecture, or JEPA, where models learn latent
representations by predicting future inputs in a non-generative way. So the strengths here are that
world models enable common sense, and can in some cases handle uncertainty in long-term planning,
which is why people like Yann LeCun view them as the missing link for human-level AI. However, they
have limitations as well. There are high computational demands for training on multimodal data like
videos. There are the risks of hallucinations in simulations, and there are challenges in scaling
these models to real-world dynamics. They're also, frankly, just less mature than LLM scaling,
requiring new datasets and new architectures. So this is the background on world models
and how they differ from our current mainstream approaches to LLMs.
Today's episode is brought to you by KPMG. In today's fiercely competitive market,
unlocking AI's potential could help give you a competitive edge, foster growth, and drive new value.
But here's the key. You don't need an AI strategy. You need to embed AI into your overall business
strategy to truly power it up. KPMG can show you how to integrate AI and AI agents into your
business strategy in a way that truly works and is built on trusted AI principles and platforms.
Check out real stories from KPMG to hear how AI is driving success with its clients at www.kpmg.us/ai.
Again, that's www.kpmg.us/ai.
This episode is brought to you by Blitzy.
Now, I talk to a lot of technical and business leaders
who are eager to implement cutting-edge AI,
but instead of building competitive moats,
their best engineers are stuck modernizing ancient code bases
or updating frameworks just to keep the lights on.
These projects, like migrating Java 17 to Java 21,
often mean staffing a team for a year or more.
And sure, copilots help,
but we all know they hit context limits fast,
especially on large legacy systems.
Blitzy flips the script.
Instead of engineers doing 80% of the work,
Blitzy's autonomous platform handles the heavy lifting,
processing millions of lines of code
and making 80% of the required changes automatically.
One major financial firm used Blitzy to modernize
a 20 million line Java code base in just three and a half months,
cutting 30,000 engineering hours
and accelerating their entire roadmap.
Email jack@blitzy.com
with Modernize in the subject line for prioritized onboarding.
Visit blitzy.com today
before your competitors do.
As a founder, you're moving fast towards product market fit, your next round, or your first big
enterprise deal. But with AI accelerating how quickly startups build and ship, security expectations
are higher earlier than ever. Getting security and compliance right can unlock growth or stall
it if you wait too long. With deep integrations and automated workflows built for fast-moving teams,
Vanta gets you audit-ready fast and keeps you secure with continuous monitoring as your models,
infra and customers evolve.
Fast-growing customers like LangChain,
Writer, and Cursor trusted Vanta to build a scalable foundation from the start.
And look, as someone who lives in the world of enterprise procurement,
I love how Vanta makes it easy to get compliance right.
The last thing you need when you're trying to win that big deal
is to have it scuttled by something that Vanta has solved for over 10,000 companies.
Go to vanta.com slash NLW to save $1,000 today through the Vanta for Startups program
and join over 10,000 ambitious companies already scaling with Vanta.
That's V-A-N-T-A dot com slash NLW to save $1,000 for a limited time.
There are lots of interesting experiments in world models right now.
Fei-Fei Li's World Labs, for example, shared in December an AI system that could generate 3D worlds from a single image.
If you go to worldlabs.ai slash blog, you can actually click around and experiment with this.
We've also seen examples of more accurate physics simulations, like this droplet of condensation running down this beer bottle,
and some have even suggested that the major labs are backing into world models by developing
highly capable video models like Veo 3.
And this, I think, is particularly pertinent to the conversation about this paper.
In June, Ethan Mollick wrote, AI video tools really do seem to be able to simulate physics
well but not perfectly without having an underlying physics engine.
A world model?
So that brings us back to this Harvard paper.
And I think the best way to understand it is actually just to dig into the long thread on
Twitter from one of the researchers, Keyon Vafa.
Basically, what Keyon and his fellow researchers were interested in is whether you can get a generalized
world model from a more limited training set.
In other words, can a well-trained model that can make accurate predictions extrapolate that
knowledge into a general understanding of the world?
Keyon writes, one result tells the story.
A transformer trained on 10 million solar systems nails planetary orbits, but it botches gravitational
laws.
Basically, Vafa and co. trained a small AI model using data from the orbits of
10 million different solar systems, which led to exactly what you'd expect in terms of its ability
to make predictions about planetary orbits. However, it had no ability to generalize those predictions
into a general theory of gravity or any other known physics models. Keyon writes,
our paper aims to answer two questions. One, what's the difference between prediction and world
models? And two, are there straightforward metrics that can test this distinction? Now, interestingly,
and you will know if you are a regular listener that this is where I got interested, he continues,
our paper is about AI, but it's helpful to go back 400 years to answer these questions.
Consider the interest of my inner history major piqued. Keyon continues,
perhaps the most influential world model had its start as a predictive model.
Before we had Newton's laws of gravity, we had Kepler's predictions of planetary orbits.
Kepler's predictions led to Newton's laws, so what did Newton add?
If you only care about orbits, Newton didn't add much. His laws give the same predictions.
But Newton's laws went beyond orbits. The same laws explained pendulums,
cannonballs, and rockets. This motivates our framework. Predictions apply to one task,
world models generalize to many. Which, by the way, I think is a really nice, crisp summary
of how to think about the distinction. What Vafa and his researchers found was that the model
couldn't transfer its knowledge about orbits to other related physics problems. It failed to
produce Newton's general rule of gravity and instead seemed to believe that gravity worked
differently across different galaxies. Vafa also tested leading commercial reasoning models,
which have Newton's laws in their training sets.
When they were given series of orbital data without being told to apply Newton's laws,
they failed to develop a generalized theory and make a successful prediction.
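To put the Kepler versus Newton distinction in concrete terms, here is a small sketch. It is not the paper's method. The Kepler-style predictor fits one constant from one known orbit and can then predict other orbital periods around the same star via T² = k·a³, but it has nothing to say about anything that isn't an orbit. The Newton-style functions all come from the same law of gravitation and so reach orbits, surface gravity, and (through g) a pendulum. The sketch assumes circular orbits and a planet mass that is negligible next to the star's; the physical constants are standard rounded values.

```python
import math

G = 6.674e-11          # gravitational constant, m^3 kg^-1 s^-2
SUN_MASS = 1.989e30    # kg
EARTH_MASS = 5.972e24  # kg
EARTH_RADIUS = 6.371e6 # m
DAY = 86400.0          # seconds


def fit_kepler_constant(period_s: float, radius_m: float) -> float:
    """'Training' the predictive model: fit k in T^2 = k * a^3 from one orbit."""
    return period_s ** 2 / radius_m ** 3


def kepler_period(radius_m: float, k: float) -> float:
    """Prediction only: orbital period from orbital radius. One task."""
    return math.sqrt(k * radius_m ** 3)


def newton_orbit_period(radius_m: float, central_mass_kg: float) -> float:
    """From G*M*m/r^2 = m*v^2/r with v = 2*pi*r/T (circular orbit assumed)."""
    return 2 * math.pi * math.sqrt(radius_m ** 3 / (G * central_mass_kg))


def newton_surface_gravity(mass_kg: float, radius_m: float) -> float:
    """The same law applied to a cannonball at a planet's surface: g = G*M/R^2."""
    return G * mass_kg / radius_m ** 2


def pendulum_period(length_m: float, g: float) -> float:
    """Small-angle pendulum period, using the g derived from the same law."""
    return 2 * math.pi * math.sqrt(length_m / g)


# Fit Kepler's constant on Earth's orbit, then predict Mars's year.
k = fit_kepler_constant(365.25 * DAY, 1.496e11)
mars_kepler_days = kepler_period(2.279e11, k) / DAY
mars_newton_days = newton_orbit_period(2.279e11, SUN_MASS) / DAY

# Only the Newton-style model reaches beyond orbits.
g_earth = newton_surface_gravity(EARTH_MASS, EARTH_RADIUS)
one_metre_pendulum_s = pendulum_period(1.0, g_earth)
```

On the orbit task the two approaches agree almost exactly, which is the point Vafa is making: you cannot tell them apart by prediction accuracy alone. You can only tell them apart by asking for something new.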
Vafa was trying to uncover the inductive bias of these models.
That is, test the default set of assumptions used to make predictions.
He asked,
If a foundation model's inductive bias isn't towards a given world model, what is it towards?
One hypothesis, models confuse sequences that belong to different states
but have the same legal next tokens.
This theory was tested using a board for the game Othello.
The models that Vafa trained were unable to reconstruct a board based on a description of moves,
but they often produced a single legal next move even if the reconstruction was incorrect.
To link it back to the orbital prediction problem, Vafa suggests that LLMs get confused
when two states share a common next step, that is, they conflate the two different states,
making their predictions inaccurate.
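Here is a small sketch of what that conflation looks like. It uses tic-tac-toe as a simpler stand-in for Othello, which is a substitution made purely for illustration and not what the paper tested. In tic-tac-toe the legal next moves are just the empty squares, so two different boards with the same squares filled share exactly the same legal next tokens. A model that only tracks which squares are taken would predict legal moves perfectly while being unable to say who holds which square. Win conditions are ignored here, which is safe because no game can end within three moves.

```python
from collections import defaultdict
from itertools import permutations
from typing import Dict, FrozenSet, Sequence, Set


def board_after(moves: Sequence[int]) -> str:
    """The true state: a 9-character board with X and O placed in turn order."""
    board = ["."] * 9
    for turn, square in enumerate(moves):
        board[square] = "X" if turn % 2 == 0 else "O"
    return "".join(board)


def legal_next(moves: Sequence[int]) -> FrozenSet[int]:
    """The legal next tokens: every square that has not been played yet."""
    return frozenset(range(9)) - frozenset(moves)


def boards_by_legal_moves(n_moves: int = 3) -> Dict[FrozenSet[int], Set[str]]:
    """Group every reachable board after n_moves by its set of legal next moves."""
    groups: Dict[FrozenSet[int], Set[str]] = defaultdict(set)
    for moves in permutations(range(9), n_moves):
        groups[legal_next(moves)].add(board_after(moves))
    return groups


groups = boards_by_legal_moves()
same_next = legal_next((0, 4, 8)) == legal_next((4, 0, 8))      # identical legal moves
same_board = board_after((0, 4, 8)) == board_after((4, 0, 8))   # but different boards
```

After three moves, every set of legal next moves corresponds to three distinct boards. Next-token accuracy cannot distinguish a model that knows which board it is on from one that has merged all three.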
Vafa concluded, one, we propose inductive bias probes, a model's inductive bias reveals its
world model. Two, foundation models can have great predictions with poor world models. Three,
one reason world models are poor, models group together distinct states that have similar
allowed next tokens. In essence, Vafa is claiming that transformer-based LLMs don't have or aren't
able to develop a strong world model that can be transferred to make predictions about related
tasks. And this seems to cut to the quick of an LLM's ability to transfer from next token prediction
to more generalized intelligence.
Except for the fact that maybe the result is much less generalized than it appears.
Nathan Labenz from the Cognitive Revolution podcast wrote a long LinkedIn post
arguing exactly this.
With a wink and a nod to the Princess Bride, he wrote,
you keep sharing this paper, but I do not think it means what you think it means.
First he explained what they had done, and then wrote,
The trouble is, what do you do when your first experiments fail?
At a company pushing the AI capabilities frontier, you would try, try again.
In this case, the authors declare victory and invoke Isaac Newton to promote their no-world-models
world model.
The critical mistake is simple.
You can't generalize from a few failed experiments to the conclusion that something is impossible.
Basically, he's saying that if this were a lab context, rather than declaring a generalized
critique on the basis of the LLM failing to generalize a physics model from orbital training
data, the labs would just try again in some different way to see if there was a way to go
from specific dataset and prediction to a more generalized world model.
He also points out that the models and data sets that are used here are small.
He writes,
For orbital mechanics, they used a 109 million parameter transformer and 2 billion tokens,
roughly 1/10,000th the size of current frontier models and data sets.
For Othello, the dataset is only 7.7 million tokens.
For comparison, the original 2022 work showing that models trained on Othello move sequences
do learn board state world models used a synthetic dataset with 20 million games,
50 times more data. In other words, he says, these are not really foundation models at all.
Labenz then listed a handful of other papers that showed generalized emergent world models
from LLM pre-training, but they all required either larger models or more training data.
To steelman the paper, he said, that a model can predict the next token in a sequence
does not imply that it has a robust world model. That much is true. Just don't make the mistake of
believing that they can't develop world models. They clearly can and do.
Ultimately, as you've seen, the verdict is out right now on what the right approach to getting
to the next AI unlocks really is.
The field hasn't settled on a single answer for what new architecture should look like.
What is clear is that there are really interesting things happening in the world model approach.
Fei-Fei Li's World Labs has made some big strides since those early demos last year.
Martin Casado of Andreessen Horowitz recently showed off what it's now capable of when attached
to a traditional 3D rendering engine.
And even if world models aren't the right pathway to AGI,
whatever that means, solving the issue of transferable knowledge would still be a massive unlock.
As a simple example, it would allow media generation to be much more consistent because
models could transfer their understanding from one context to another. a16z's Justine Moore is eager
for the things this would unlock, posting, so this is the dream: a video world model that takes
an image as input and renders an environment you can explore and interact with. It could be a constant
video stream, like your own Lofi Girl, or you could jump in and play as a character.
Now, she believes that this is already possible with modern video models, but requires a lot of
consistency hacks. In other words, this kind of product would be far more viable if models had a
transferable understanding of the world. Now, as to this question that was brought up by Ethan Mollick,
of whether generative video models are a backdoor to broader world models, a Google research
paper from earlier this year sort of argues that the answer is no. A research team found that
video models don't really learn about physical reality, they just learn about visual realism.
That lets them create believable videos, but it does little to help them make realistic predictions
across other domains.
Still, it's a super exciting field, one that feels to me almost inevitably likely to
contribute significantly to the advancement of AI in some way, and so, of course, we will
continue to cover it here.
Hopefully now you not only understand this paper and the discussion around it a little bit
better, but have a better framework for understanding world models and how they relate to
other approaches to LLM scaling.
For now, that's going to do it for today's AI Daily Brief.
Until next time, peace.
