The AI Daily Brief: Artificial Intelligence News and Analysis - Are World Models the Key to AGI?

Starting point is 00:00:00 Today on the AI Daily Brief, what are world models and why do they matter for AGI? The AI Daily Brief is a daily podcast and video about the most important news and discussions in AI. Hello, friends, quick announcements before we dive in. First of all, thank you to today's sponsors, KPMG, Blitzy and Vanta. And of course, as always, to get an ad-free version of the show, go to patreon.com slash AI Daily Brief. Now, if you are interested in sponsoring the show, shoot me a note at NLW at breakdown.network, and I will send you all the relevant information. And lastly, when it comes to today's episode, one of the things that I have heard from you with some

Starting point is 00:00:36 frequency is that on your wish list for the show is occasional deeper dives and more technical deep dives into big topics. There's been a really interesting paper from Harvard running around about LLMs and their ability to generate world models. And I thought that provided a good excuse to go deeper into what world models are in general. Now, this ended up being extremely information dense, and I thought a perfect fit for that type of episode. However, because it's so information dense, I decided that we're just going to run it as the entire episode today. We will be back tomorrow with our normal split between headlines and the main episode. For now, though, let's dive in and talk about world models, why they matter for AGI, and what this new skeptical Harvard paper has

Starting point is 00:01:15 to say. Welcome back to the AI Daily Brief. World models are a concept that we've talked about a few times here at the show, but maybe not in as much technical depth, at least from a primer perspective, as would be helpful. Now, recently there's been a lot of discussion around this interesting paper out of Harvard that's all about world models, and specifically whether foundation models are able to develop a world model out of their training sets. Maybe a better way to describe this is just the abstract of the paper itself. The researchers write, foundation models are premised on the idea that sequence prediction can uncover deeper domain understanding, much like how Kepler's predictions of planetary motion later led to the discovery of Newtonian mechanics. However,

Starting point is 00:01:56 evaluating whether these models truly capture deeper structure remains a challenge. We develop a technique for evaluating foundation models that examines how they adapt to synthetic data sets generated from some postulated world model. Our technique measures whether the foundation model's inductive bias aligns with the world model, and so we refer to it as an inductive bias probe. Across multiple domains, we found that foundation models can excel at their training tasks, yet fail to develop inductive biases towards the underlying world model when adapted to new tasks. We particularly find that foundation models trained on orbital trajectories consistently failed to apply Newtonian mechanics when adapted to new physics tasks.

Starting point is 00:02:33 Further analysis reveals that these models behave as if they develop task-specific heuristics that fail to generalize. Basically, what the researchers are looking for here is to understand whether LLMs, which are, reductively stated, of course, prediction machines that predict the next token that's going to make sense in the context of the training data, can move from that sort of prediction to adopting a generalized approach to, or understanding of, the world that they are operating in. Specifically, they're trying to figure out not just if a foundation model can predict orbital trajectories based on training data around orbital trajectories it's had, but whether it can actually understand the physics principles that underlie those orbital

Starting point is 00:03:13 trajectories in a way that allows it to apply those physics to other types of problem domains. In other words, the world model that they're interested in with this particular experiment is the physics that underpin orbits, and they're trying to see if the foundation model can figure out those physics without necessarily knowing about them in advance. Now, this might all sound like researcher gobbledygook, but let me try to convince you that this is actually fairly integral to understanding how LLMs are likely to develop and what pathways are most likely to produce big advances. And to the extent that you are a business person who's only interested in what models can actually do for you, I would contend that this still matters in the sense that a lot of the next set of

Starting point is 00:03:51 use cases to be unlocked will require advances in LLM capabilities that are going to need to come either from existing LLM scaling approaches or fundamentally new approaches like this focus on world models. So that's the context of why we're having this world model conversation now, but let's dig a little bit deeper into what we actually mean when we say world model. World models in AI refer to systems that create internal representations of the external environment, allowing them to simulate and predict future states based on observations, actions, and underlying dynamics like physics, as we were just discussing, causality, and spatial relationships. These models draw inspiration from how humans subconsciously build mental models to anticipate outcomes,

Starting point is 00:04:33 such as a baseball player predicting a pitch's trajectory without consciously simulating every possibility. By the way, that example comes from a paper we're going to talk about in just a minute. Essentially, they act as an AI's internal map of reality, enabling it to handle uncertainty, forecast events, and make decisions more efficiently by rehearsing scenarios in a simulated space rather than relying solely on real-world trial and error. Now, as I mentioned, the concept originated in a 2018 paper by David Ha and Jürgen Schmidhuber, which introduced a framework that consisted of three key components. The first was a vision model to compress high-dimensional sensory input like images

Starting point is 00:05:07 into a compact latent representation, the second a memory model to predict future latent states based on past information, and the third a controller model to decide actions using these representations. In their experiments, this architecture was applied to a car racing simulation, where the agent learned to navigate tracks by hallucinating and planning within its internal model, demonstrating how world models can train controllers in a dreamlike simulated environment to improve performance. One of world models' biggest and loudest proponents is Yann LeCun, who is the chief AI scientist at Meta, and he wrote a highly technical definition on LinkedIn about a year ago of world models.
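
The three-part framework described above (vision, memory, controller) can be sketched in a few lines of Python. This is only a toy illustration: in the paper the components were learned neural networks, whereas here each one is a hand-written stand-in, and every class and function name is an assumption made for this example.

```python
class Vision:
    """Toy stand-in for the vision model: compress an observation by block-averaging."""

    def __init__(self, latent_size):
        self.latent_size = latent_size

    def encode(self, observation):
        chunk = len(observation) // self.latent_size
        return [
            sum(observation[i * chunk:(i + 1) * chunk]) / chunk
            for i in range(self.latent_size)
        ]


class Memory:
    """Toy stand-in for the memory model: predict the next latent from latent + action."""

    def predict(self, latent, action):
        # Assumed dynamics: every latent dimension drifts by 0.1 * action.
        return [z + 0.1 * action for z in latent]


class Controller:
    """Toy stand-in for the controller: choose an action from the latent state."""

    def act(self, latent):
        mean = sum(latent) / len(latent)
        return -1.0 if mean > 0 else 1.0  # push the latent mean toward zero


def dream_rollout(vision, memory, controller, observation, steps):
    """Plan inside the model: after the first encode, no real observation is used."""
    latent = vision.encode(observation)
    trajectory = [latent]
    for _ in range(steps):
        action = controller.act(latent)
        latent = memory.predict(latent, action)
        trajectory.append(latent)
    return trajectory


trajectory = dream_rollout(Vision(2), Memory(), Controller(), [1.0] * 8, steps=3)
```

The point of the sketch is the loop in `dream_rollout`: the controller is exercised against the memory model's predictions rather than the real environment, which is the "dreamlike" training idea mentioned above.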

Starting point is 00:05:42 But trying to simplify it just a little bit, world models typically involve an encoder to process inputs, e.g. observations and actions, into a state representation, followed by a predictor to forecast the next states, often incorporating latent variables to account for unknowns and generate a distribution of plausible outcomes rather than a single prediction. Training occurs on large datasets of real-world data such as videos or images, using techniques like diffusion models or transformers to learn dynamics. Modern variants extend this to multimodal inputs like text, images, and videos,

Starting point is 00:06:12 and outputs that simulate environments as predicted videos or 3D spaces. Now, for their proponents, world models are crucial for advancing AI towards human-level intelligence because they enable reasoning, planning, and adaptation in complex, uncertain settings. LeCun said, we need machines to understand the world, machines that can remember things, that have intuition, have common sense, things that can reason and plan to the same level as humans. He also added, hinting at a place we'll go in just a minute: despite what you might have heard from some of the most enthusiastic people, current AI systems are not capable of any of this. Which I guess brings us to the next question, which is how do world models differ from pre-training

Starting point is 00:06:48 or test-time compute-style approaches to scaling LLMs and achieving the next levels of advanced artificial intelligence, AGI, whatever you want to call it. And the short answer is that these are fundamentally different approaches. In other words, different paradigms for advancing LLMs. Pre-training and test-time compute focus on optimizing compute and data within the current LLM architecture, which is primarily autoregressive next-token prediction, whereas world models emphasize a fundamental architectural shift to enable deeper world understanding. So, pre-training scaling is an approach that relies on the scaling laws hypothesis, where LLM performance improves predictably by increasing model parameters, training data volume, and computational resources

Starting point is 00:07:28 during the pre-training phase. Models like GPT-4 and Grok are trained on vast datasets to learn patterns, enabling emergent capabilities like zero-shot reasoning or few-shot learning. Now, the strength of this approach for AGI is that it builds broad knowledge and generalization from data, allowing models to handle language, math, and creative tasks. Scaling so far has driven significant and frankly rapid progress, where what you would expect to happen happens, which is larger models outperforming smaller ones on benchmarks. The problem is that there are indications that we're seeing diminishing returns. Basically, we're hitting performance plateaus despite significant compute increases due to data quality issues, biases, and other sets of factors that we're still trying to figure out.
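
One simple way to picture diminishing returns from scaling is a power-law curve, where each doubling of compute buys a smaller absolute improvement than the last. The sketch below assumes that shape purely for illustration; the constant `a` and exponent `alpha` are made-up values, not measurements from any real model, and this does not capture the data quality or bias issues mentioned above.

```python
def loss(compute, a=10.0, alpha=0.1):
    """Hypothetical power-law loss curve: loss = a * compute ** -alpha."""
    return a * compute ** -alpha


def gain_from_doubling(compute):
    """Absolute loss reduction obtained by doubling compute at a given budget."""
    return loss(compute) - loss(2 * compute)


# Illustrative compute budgets in arbitrary units.
budgets = [1e3, 1e6, 1e9]
gains = [gain_from_doubling(c) for c in budgets]

for c, g in zip(budgets, gains):
    print(f"compute={c:.0e}  loss={loss(c):.3f}  gain from doubling={g:.4f}")
```

Under this assumed curve the loss keeps falling, which matches "larger models outperform smaller ones," but the gain per doubling shrinks steadily as the budget grows.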

Starting point is 00:08:10 Critics like Yann LeCun basically argue that this path is fundamentally flawed for AGI and can't ever produce human-like intelligence. He put it really bluntly. If you're interested in human-level AI, don't work on LLMs. Now, of course, it was last fall that we started to talk about hitting these performance plateaus, and a new approach that was associated with the reasoning models started to become the thing that everyone was talking about. That was test-time compute, or inference-time scaling, which shifted the focus from pre-training to allocating more compute during model use, or inference. Techniques include chain-of-thought prompting, where models generate intermediate steps, tree of thoughts for exploring multiple paths, self-consistency via sampling

Starting point is 00:08:51 or voting, adaptive looping, basically all strategies that allow models to quote-unquote think harder on problems, which can in some cases allow them to outperform much larger pre-trained models. So the strength of test-time compute when it comes to reaching AGI is that it enhances reasoning, error correction, and adaptability without retraining, which makes it better for complex tasks like math or coding. However, there are still some challenges. Given that we have these models thinking for minutes or in some cases even longer, it's computationally expensive. It still relies on the underlying pre-trained model, and it doesn't address core LLM flaws like lack of causality or world grounding. That means even this approach can struggle with ambiguity, long-context reasoning,

Starting point is 00:09:32 and scalability for real-world AGI applications. World models, on the other hand, as we've discussed, create these internal, simulatable representations of the environment that can predict future states based on observations, actions, physics, and causality. Basically, we're drawing on the inspiration of human cognition, where we mentally simulate outcomes. These approaches use architectures like joint embedding predictive architecture, or JEPA, where models learn latent representations by predicting future inputs in a non-generative way. So the strengths here are that world models enable common sense, and can in some cases handle uncertainty in long-term planning, which is why people like Yann LeCun view them as the missing link for human-level AI. However, they

Starting point is 00:10:12 have limitations as well. There are high computational demands for training on multimodal data like videos. There are the risks of hallucinations in simulations, and there are challenges in scaling these models to real-world dynamics. They're also, frankly, just less mature than LLM scaling, requiring new datasets and new architectures. So this is the background on world models and how they differ from our current mainstream approaches to LLMs. Today's episode is brought to you by KPMG. In today's fiercely competitive market, unlocking AI's potential could help give you a competitive edge, foster growth, and drive new value. But here's the key. You don't need an AI strategy. You need to embed AI into your overall business

Starting point is 00:10:52 strategy to truly power it up. KPMG can show you how to integrate AI and AI agents into your business strategy in a way that truly works and is built on trusted AI principles and platforms. Check out real stories from KPMG to hear how AI is driving success with its clients at www.kpmg.us slash AI. Again, that's www.kpmg.us slash AI. This episode is brought to you by Blitzy. Now, I talk to a lot of technical and business leaders who are eager to implement cutting-edge AI,

Starting point is 00:11:24 but instead of building competitive moats, their best engineers are stuck modernizing ancient code bases or updating frameworks just to keep the lights on. These projects, like migrating Java 17 to Java 21, often mean staffing a team for a year or more. And sure, copilots help, but we all know they hit context limits fast, especially on large legacy systems.

Starting point is 00:11:43 Blitzy flips the script. Instead of engineers doing 80% of the work, Blitzy's autonomous platform handles the heavy lifting, processing millions of lines of code and making 80% of the required changes automatically. One major financial firm used Blitzy to modernize a 20 million line Java code base in just three and a half months, cutting 30,000 engineering hours

Starting point is 00:12:01 and accelerating their entire roadmap. Email Jack at blitzy.com with Modernize in the subject line for prioritized onboarding. Visit blitzy.com today before your competitors do. As a founder, you're moving fast towards product market fit, your next round, or your first big enterprise deal. But with AI accelerating how quickly startups build and ship, security expectations are higher earlier than ever. Getting security and compliance right can unlock growth or stall

Starting point is 00:12:28 it if you wait too long. With deep integrations and automated workflows built for fast-moving teams, Vanta gets you audit-ready fast and keeps you secure with continuous monitoring as your models, infra, and customers evolve. Fast-growing customers like LangChain, Writer, and Cursor trusted Vanta to build a scalable foundation from the start. And look, as someone who lives in the world of enterprise procurement, I love how Vanta makes it easy to get compliance right. The last thing you need when you're trying to win that big deal

Starting point is 00:12:54 is to have it scuttled by something that Vanta has solved for over 10,000 companies. Go to vanta.com slash NLW to save $1,000 today through the Vanta for Startups program and join over 10,000 ambitious companies already scaling with Vanta. That's V-A-N-T-A dot com slash NLW to save $1,000 for a limited time. There are lots of interesting experiments in world models right now. Fei-Fei Li's World Labs, for example, shared in December an AI system that could generate 3D worlds from a single image. If you go to worldlabs.ai slash blog, you can actually click around and experiment with this. We've also seen examples of more accurate physics simulations, like this droplet of condensation running down this beer bottle.

Starting point is 00:13:35 And some have even suggested that the major labs are backing into world models by developing highly capable video models like Veo 3. And this, I think, is particularly pertinent to the conversation about this paper. In June, Ethan Mollick wrote, AI video tools really do seem to be able to simulate physics well, but not perfectly, without having an underlying physics engine. A world model? So that brings us back to this Harvard paper. And I think the best way to understand it is actually just to dig into the long thread on

Starting point is 00:14:04 Twitter from one of the researchers, Keyon Vafa. Basically, what Keyon and his fellow researchers were interested in is whether you can get a generalized world model from a more limited training set. In other words, can a well-trained model that can make accurate predictions extrapolate that knowledge into a general understanding of the world? Keyon writes, one result tells the story. A transformer trained on 10 million solar systems nails planetary orbits, but it botches gravitational laws.
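
To make the gap between nailing orbits and botching gravitational laws concrete, here is a small worked example. It contrasts a Kepler-style rule, which only maps orbit size to orbital period, with Newton's law of gravitation, which gives the same period and also answers a question the orbit rule cannot, such as the gravitational acceleration at Earth's surface. The constants are standard approximate values; the function names are illustrative assumptions and nothing here comes from the paper's code.

```python
import math

G = 6.674e-11               # gravitational constant, m^3 kg^-1 s^-2
M_SUN = 1.989e30            # mass of the Sun, kg
M_EARTH = 5.972e24          # mass of the Earth, kg
R_EARTH = 6.371e6           # mean radius of the Earth, m
AU = 1.496e11               # astronomical unit, m
YEAR = 365.25 * 24 * 3600   # one year, s


def kepler_period_years(a_au):
    """Kepler's third law for our solar system: T^2 = a^3 (T in years, a in AU)."""
    return a_au ** 1.5


def newton_period_years(a_au, central_mass=M_SUN):
    """The same prediction from Newton: T = 2*pi*sqrt(a^3 / (G*M))."""
    a = a_au * AU
    return 2 * math.pi * math.sqrt(a ** 3 / (G * central_mass)) / YEAR


def newton_surface_gravity(mass, radius):
    """A different task the same law handles: g = G*M / r^2."""
    return G * mass / radius ** 2


mars_kepler = kepler_period_years(1.524)   # Mars orbits at roughly 1.524 AU
mars_newton = newton_period_years(1.524)
earth_g = newton_surface_gravity(M_EARTH, R_EARTH)
```

Both period functions agree on Mars to within a fraction of a percent, but only the Newtonian version has anything to say about surface gravity. A model that has learned only the first function can predict orbits well while holding no usable law of gravity.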

Starting point is 00:14:30 Basically, Vafa and co. trained a small AI model using data from the orbits of 10 million different solar systems, which led to exactly what you'd expect in terms of its ability to make predictions about planetary orbits. However, it had no ability to generalize those predictions into a general theory of gravity or any other known physics models. Keyon writes, our paper aims to answer two questions. One, what's the difference between prediction and world models? And two, are there straightforward metrics that can test this distinction? Now, interestingly, and you will know if you are a regular listener that this is where I got interested, he continues, our paper is about AI, but it's helpful to go back 400 years to answer these questions.

Starting point is 00:15:07 Consider the interest of my inner history major piqued. Keyon continues, perhaps the most influential world model had its start as a predictive model. Before we had Newton's laws of gravity, we had Kepler's predictions of planetary orbits. Kepler's predictions led to Newton's laws, so what did Newton add? If you only care about orbits, Newton didn't add much. His laws give the same predictions. But Newton's laws went beyond orbits. The same laws explained pendulums, cannonballs, and rockets. This motivates our framework. Predictions apply to one task, world models generalize to many. Which, by the way, I think is a really nice, crisp summary

Starting point is 00:15:42 of how to think about the distinction. What Vafa and his researchers found was that the model couldn't transfer its knowledge about orbits to other related physics problems. It failed to produce Newton's general rule of gravity and instead seemed to believe that gravity worked differently across different galaxies. Vafa also tested leading commercial reasoning models, which have Newton's laws in their training sets. When they were given series of orbital data without being told to apply Newton's laws, they failed to develop a generalized theory and make a successful prediction. Vafa was trying to uncover the inductive bias of these models.

Starting point is 00:16:14 That is, test the default set of assumptions used to make predictions. He asked, if a foundation model's inductive bias isn't towards a given world model, what is it towards? One hypothesis: models confuse sequences that belong to different states but have the same legal next tokens. This theory was tested using a board for the game of Othello. The models that Vafa trained were unable to reconstruct a board based on a description of moves, but they often produced a single legal next move even if the reconstruction was incorrect.
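
The "different states, same legal next tokens" hypothesis can be illustrated with a toy game that is much simpler than Othello. In this sketch, a piece moves left or right along a five-cell track; the game, the function names, and the grouping procedure are all assumptions made for illustration and are not the paper's method.

```python
from collections import defaultdict
from itertools import product

TRACK_LENGTH = 5  # positions 0..4; the piece starts in the middle


def play(moves, start=2):
    """Return the final position, or None if any move leaves the track."""
    pos = start
    for move in moves:
        pos += 1 if move == "R" else -1
        if not 0 <= pos < TRACK_LENGTH:
            return None
    return pos


def legal_next(pos):
    """The set of legal next tokens ('L' or 'R') from a given position."""
    moves = set()
    if pos > 0:
        moves.add("L")
    if pos < TRACK_LENGTH - 1:
        moves.add("R")
    return frozenset(moves)


def states_by_legal_moves(max_len=4):
    """Group true states by the only thing a pure next-token view can see."""
    groups = defaultdict(set)
    for n in range(max_len + 1):
        for moves in product("LR", repeat=n):
            pos = play(moves)
            if pos is not None:
                groups[legal_next(pos)].add(pos)
    return groups


groups = states_by_legal_moves()
```

Positions 1, 2, and 3 are genuinely different states, yet they all land in the same group because both moves are legal from each of them. A model that only needs to emit a legal next move has no pressure to tell them apart, which is the kind of conflation the hypothesis describes.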

Starting point is 00:16:41 To link it back to the orbital prediction problem, Vafa suggests that LLMs get confused when two states share a common next step, that is, they conflate the two different states, making their predictions inaccurate. Vafa concluded, one, we propose inductive bias probes, a model's inductive bias reveals its world model. Two, foundation models can have great predictions with poor world models. Three, one reason world models are poor: models group together distinct states that have similar allowed next tokens. In essence, Vafa is claiming that transformer-based LLMs don't have or aren't able to develop a strong world model that can be transferred to make predictions about related

Starting point is 00:17:16 tasks. And this seems to cut to the quick of an LLM's ability to transfer from next-token prediction to more generalized intelligence. Except for the fact that maybe the result is much less generalized than it appears. Nathan Labenz from the Cognitive Revolution podcast wrote a long LinkedIn post arguing exactly this. With a wink and a nod to The Princess Bride, he wrote, you keep sharing this paper, but I do not think it means what you think it means. First he explained what they had done, and then wrote,

Starting point is 00:17:44 The trouble is, what do you do when your first experiments fail? At a company pushing the AI capabilities frontier, you would try, try again. In this case, the authors declare victory and invoke Isaac Newton to promote their "no world model" world model. The critical mistake is simple. You can't generalize from a few failed experiments to the conclusion that something is impossible. Basically, he's saying that if this were a lab context, rather than declaring a generalized critique on the basis of the LLM failing to generalize a physics model from orbital training

Starting point is 00:18:13 data, the labs would just try again in some different way to see if there was a way to go from a specific dataset and prediction to a more generalized world model. He also points out that the models and datasets that are used here are small. He writes, For orbital mechanics, they used a 109 million parameter transformer and 2 billion tokens, roughly 1/10,000th the size of current frontier models and datasets. For Othello, the dataset is only 7.7 million tokens. For comparison, the original 2022 work showing that models trained on Othello move sequences

Starting point is 00:18:42 do learn board state world models used a synthetic dataset with 20 million games, 50 times more data. In other words, he says, these are not really foundation models at all. Labenz then listed a handful of other papers that showed generalized emergent world models from LLM pre-training, but they all required either larger models or more training data. To steelman the paper, he said, that a model can predict the next token in a sequence does not imply that it has a robust world model. That much is true. Just don't make the mistake of believing that they can't develop world models. They clearly can and do. Ultimately, as you've seen, the jury is still out right now on what the right approach to getting

Starting point is 00:19:22 to the next AI unlocks really is. The field hasn't settled on a single answer for what new architectures should look like. What is clear is that there are really interesting things happening in the world model approach. Fei-Fei Li's World Labs has made some big strides since those early demos last year. Martin Casado of Andreessen Horowitz recently showed off what it's now capable of when attached to a traditional 3D rendering engine. And even if world models aren't the right pathway to AGI, whatever that means, solving the issue of transferable knowledge would still be a massive unlock.

Starting point is 00:19:52 As a simple example, it would allow media generation to be much more consistent because models could transfer their understanding from one context to another. a16z's Justine Moore is eager for the things this would unlock, posting, So this is the dream: a video world model that takes an image as input and renders an environment you can explore and interact with. It could be a constant video stream, like your own Lofi Girl, or you could jump in and play as a character. Now, she believes that this is already possible with modern video models, but requires a lot of consistency hacks. In other words, this kind of product would be far more viable if models had a transferable understanding of the world. Now, as to this question that was brought up by Ethan Mollick,

Starting point is 00:20:29 of whether generative video models are a backdoor to broader world models, a Google research paper from earlier this year sort of argues that the answer is no. A research team found that video models don't really learn about physical reality, they just learn about visual realism. That lets them create believable videos, but it does little to help them make realistic predictions across other domains. Still, it's a super exciting field, one that feels to me almost inevitably likely to contribute significantly to the advancement of AI in some way, and so, of course, we will continue to cover it here.

Starting point is 00:21:01 Hopefully now you not only understand this paper and the discussion around it a little bit better, but have a better framework for understanding world models and how they relate to other approaches to LLM scaling. For now, that's going to do it for today's AI Daily Brief. Until next time, peace.
