a16z Podcast - The Quest for AGI: Q*, Self-Play, and Synthetic Data
Episode Date: December 4, 2023

One topic at the center of the AI universe this week is a potential breakthrough called Q*. Little has been revealed about this OpenAI project, other than its likely relationship to solving certain grade-school mathematical problems. Amid much speculation, we decided to bring in our new general partner, Anjney Midha – focused on all things AI – to sift through the sea of noise.

Today, we discuss the key frontier research areas that AI labs are exploring on their path toward generalizable intelligence, from self-play, to model-free reinforcement learning, to synthetic data. Anjney also shares his insights on which approach he expects to be most influential in the next wave of LLMs and why math problems are even a suitable testing ground for this kind of research.

Topics Covered:
02:03 - What is Q*?
06:21 - Applying model-free reinforcement learning to complex spaces
13:17 - The role of self-play
19:04 - Synthetic data's big unlock
24:44 - What does this unlock for society?

Resources:
Follow Anjney on Twitter: https://twitter.com/AnjneyMidha

Stay Updated:
Find a16z on Twitter: https://twitter.com/a16z
Find a16z on LinkedIn: https://www.linkedin.com/company/a16z
Subscribe on your favorite podcast app: https://a16z.simplecast.com/
Follow our host: https://twitter.com/stephsmithio

Please note that the content here is for informational purposes only; should NOT be taken as legal, business, tax, or investment advice or be used to evaluate any investment or security; and is not directed at any investors or potential investors in any a16z fund. a16z and its affiliates may maintain investments in the companies discussed. For more details please see a16z.com/disclosures.
Transcript
Highly ambiguous problems with unclear reward functions that do have correct answers.
That's sort of the elusive goal right now.
The big idea for why people are so interested in understanding what Q* is,
is that if you can produce an AI system that is four to six orders of magnitude better than GPT-4,
right? So 10,000 times better or 100,000 times better,
then you start approaching this North Star of an AGI.
Synthetic data and self-play come into relevance here because when you're having an AI
score each individual step of your reasoning, then you're generating a bunch of really valuable data
that then you can train the system on. We don't actually know how much data is required to get
to and surpass human-level intelligence. ChatGPT was launched on November 30th, 2022. Little did we know
just how much would change in the whirlwind year that followed. And quite frankly, the speed of
change during the last few weeks has been no different. One topic at the center of the AI
universe this week was a potential breakthrough called Q*. Now, little has been revealed
about this OpenAI project other than its likely relationship to solving certain grade school
math problems. Amid much speculation, we decided to bring in our new general partner at a16z,
who's focused on all things AI to sift through this sea of noise. So today, together with
Anjney Midha, we discuss the key frontier research areas that AI labs are exploring
on their path toward a potential generalizable intelligence, from self-play to model-free
reinforcement learning to synthetic data. Anjney also shares his insights on which approach he expects
will be the most influential in the next wave of LLMs, and why math problems are even a suitable
testing ground for this kind of research. And if you like this timely coverage on this
very fast-moving topic, let us know, and we'll produce more content just like this. All right, let's dive in.
As a reminder, the content here is for informational purposes only, should not be taken as legal, business, tax, or investment advice, or be used to evaluate any investment or security, and is not directed at any investors or potential investors in any A16Z fund.
Please note that A16Z and its affiliates may also maintain investments in the companies discussed in this podcast.
For more details, including a link to our investments, please see a16z.com slash disclosures.
All right. Anj, thank you so much for jumping on. There is so much happening within the space of AI and LLMs and you are seeing so much of this action. And since there's a lot of speculation, I didn't necessarily want to add to the speculation, but I wanted to get your take on maybe the next wave of LLMs or what we might see coming. And all of this has culminated in people speculating about something called Q*. And so maybe we could just start there and specifically touch on what it might be and more so why the few things that have been said about it might be important, right? People are saying that it can solve certain mathematical problems at the level of grade school students. So why would that even be important when it comes to these LLMs and the unlocks on the way? Right. Yeah. So I think it might make sense to set a little bit of context here in that I think the big idea for why
people are so interested in understanding what Q* is, is that if you can produce an AI system
that is four to six orders of magnitude better than GPT-4, right? So 10,000 times better or 100,000
times better, then you start approaching this North Star of an AGI that everybody is super interested
in because that's the sort of North Star for labs like OpenAI and so on. And if you ask, well,
what's the big missing piece? Why are we not there yet? It's
that these models, the frontier models, don't yet really exhibit complex, multi-step reasoning
of the kind that humans are capable of, right? The kind we take for granted. And so I think that's
the big piece of speculation going on is, are there breakthroughs right now that unlock that kind
of complex multi-step reasoning that we're missing? That's the big gap between where we are today
with the current generation of models and a future generation. And so that's sort of the context,
right, for why all of this is going on: can we get to complex multi-step reasoning and planning? And the most common attack vector to make progress against complex
multi-step reasoning is often to find the right prototype of a problem, sort of a tiny
version of a problem that can serve as a model for a much bigger, large-scale version of that
effort.
And so I think in the space of complex multi-step problems, there have been sort of two buckets
of prototypes that have proved to be pretty useful petri dishes for researchers to work on.
One is formal games of logic and reasoning, like Go and poker and diplomacy.
And a second bucket is sort of well-reasoned, well-defined, and well-constrained problems like grade school math.
The big question is, are there solutions that work so well at those prototypes that we can then scale them up to solve sort of human-level intelligence, right?
General-purpose intelligence.
And so I think the solutions that have worked well so far, at least for games, are systems that look like self-play, where not only do you train the model on the way humans have played the game successfully before, but then you can go even one step further and say, actually, now you as an AI, we're going to have you play against yourself and then reward your outcomes
whenever you teach yourself to get better and better than any past level you've seen from humans.
And so that set of techniques is broadly called self-play.
And there are a number of breakthroughs that have happened over the years at specific games.
So AlphaGo famously was a system by DeepMind that solved this by having the system play against itself recursively and improve.
It's a pretty popular branch of reinforcement learning.
Yep.
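To make the self-play idea more concrete, here is a minimal illustrative sketch of the core loop, not AlphaGo's or any lab's actual code: the current policy plays against a copy of itself, the moves made by the eventual winner are rewarded, and the policy is retrained on that self-generated data. `play_game` and `update_policy` are hypothetical placeholders.

```python
# Minimal self-play loop (illustrative sketch, not any lab's real code).
# Assumes a two-player, zero-sum game and a trainable `policy` object;
# `play_game` and `update_policy` are hypothetical helpers.

def self_play_training(policy, num_iterations=1000):
    for _ in range(num_iterations):
        # The current policy plays a full game against a frozen copy of itself.
        trajectory, winner = play_game(policy, policy.copy())

        # Label every (state, move) pair with +1 if it was played by the
        # eventual winner, -1 otherwise -- the "reward your outcomes" step.
        labeled = [(state, move, +1 if player == winner else -1)
                   for state, move, player in trajectory]

        # Update the policy on its own self-generated data, so it can
        # eventually exceed the best human play it started from.
        update_policy(policy, labeled)
    return policy
```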
And then a number of researchers tried to get these systems to generalize beyond individual games.
That is the big open question, right?
I think there have been some recent developments that show that it's possible,
and there's some promising signals there that you can get these self-play systems to generalize beyond at least one game.
You know, I used to play chess competitively, and that's a game where there is a correct answer, right?
The ratings are definitive.
A 1900-rated player beats a 1000-rated player every single time, right?
versus something like diplomacy is more complex and something like creativity is perhaps even further
along that spectrum where there probably is not a correct answer. And many LLMs are being applied
to creative spaces or spaces where, again, it's a little more gray. And so maybe we could break down
some of the things that you mentioned. I heard self-play. I heard reinforcement learning. I also know
there are some other aspects that are related to Q* or the speculation there, things like
synthetic data, planning. So maybe we can talk about what each of those pieces means and also
why that facet might be important for the next stage of LLMs. So why don't we start with this idea
of reinforcement learning and this idea specifically of model-free reinforcement learning, which a lot
of people are kind of appending to this idea of Q*. What does that mean? Model-free reinforcement
learning, and why would that be important if we introduce that to the existing kind of
LLMs?
Yeah, this is a pretty important, I think, foundation to cover, which is that reinforcement
learning is this very powerful set of machine learning techniques that offer a mechanism by
which these algorithms can get dramatically better at a task or a set of tasks really efficiently.
And sort of classical reinforcement learning basically says you take an agent that needs to perform
some set of tasks. You then create a model of the world that the agent needs to interact with,
and then you assign very explicit reward functions to certain outcomes. And then you tell the agent,
please go maximize your reward function. So the example here would be, if you're trying to train
a chef agent to be really good at baking cakes, you'd say, your job is to bake cakes. Here's how cakes
are baked. You get the flour together and so on, and then you put it in the oven, and then you take it
out and you put icing on it. So that is the model that you construct: a certain set of steps to
bake a cake. And then you serve the cake to your customers. And if your customers give you a score
of 10 on 10, then that's your reward for taste. And if they give you two on 10, you haven't maximized
your reward. And now go figure out how to bake the best cakes that maximize the amount of reward
you get. And this works really well in systems where you can sort of enumerate all the steps
in the system. Yeah. Right. Because then you can assign explicit rewards to the system. But
the problem with that approach is you actually need to define the system in great detail.
But if you wanted to just generalize to any system, any model, where you didn't have to write the 50
steps required to bake a cake, and you said, actually, I just want an agent that could bake a cake
and drive a car and that could walk around the streets of San Francisco delivering your dinner.
More cakes. Then you kind of need a system of reinforcement learning that just works for any arbitrary
environment. And that's what model-free reinforcement learning is so great at. And Q-learning has
historically been a type of reinforcement learning that is more robust and flexible, and doesn't require
an explicit model of the environment the way traditional reinforcement learning does.
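As a concrete illustration of what "model-free" means, here is a minimal tabular Q-learning sketch: it never builds a model of the environment's dynamics, it only updates value estimates from the transitions it actually observes. The `env` object and its `reset`/`step`/`actions` interface are assumptions in the style of common RL toolkits, not anything specific from the episode.

```python
import random
from collections import defaultdict

def q_learning(env, episodes=5000, alpha=0.1, gamma=0.99, epsilon=0.1):
    """Tabular Q-learning: model-free, because it never learns or uses a
    model of the environment's transitions -- it only updates Q(s, a)
    from the rewards it observes while acting."""
    q = defaultdict(float)  # Q-values keyed by (state, action)

    for _ in range(episodes):
        state = env.reset()
        done = False
        while not done:
            # Epsilon-greedy: mostly exploit the best known action, sometimes explore.
            if random.random() < epsilon:
                action = random.choice(env.actions)
            else:
                action = max(env.actions, key=lambda a: q[(state, a)])

            next_state, reward, done = env.step(action)

            # Q-learning update: nudge Q(s, a) toward reward + discounted best next value.
            best_next = 0.0 if done else max(q[(next_state, a)] for a in env.actions)
            q[(state, action)] += alpha * (reward + gamma * best_next - q[(state, action)])
            state = next_state
    return q
```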
Right. And I think why it's so relevant to this conversation, why people are, other than the name Q*
leaking, of course, why people are so excited about this branch of machine learning, is because
if you can unlock a way for these models to get better at any set of tasks,
then you're dramatically closer to a generalizable intelligence as opposed to a specific
intelligence. And remember, artificial general intelligence is sort of the goal here for all these labs,
right? So that's what reinforcement learning and model-free reinforcement learning are relevant for.
But I think when you combine them with, you know, what we were talking about earlier, how unlocking complex,
multi-step reasoning is the North Star.
The way model-free reinforcement learning and Q-learning become relevant is that you can use them to
generalize beyond grade school math problems that are described as word problems and
train a system to get better and better at solving those.
And I was talking earlier about how there are two buckets of prototypes that have proved pretty
useful.
One is these games of logic, like Go and poker.
And the second is grade school math.
The reason grade school math is so interesting is because it's pretty well scoped.
And you can ask the model to break down a word problem, you know, one of these classics: Jimmy has 10 apples and John has 16 apples; if they combine their apples, how many do they have in total?
And you can ask a model to break down its reasoning step by step.
And when it does that, it doesn't just give you the end answer.
It gives you all the series of steps in the middle that it took to get to the final answer.
And what you can do with reinforcement learning applied to sort of chain-of-thought prompting
is you can say, hey, instead of traditional reinforcement learning from human feedback where you
just asked a human grader to grade the outcome of the model, the final step, or you asked
another AI system to grade the final step and say, hey, is this answer right or wrong?
You can actually have an AI now score every intermediary step and score whether each step of
the way was correct or not. And then what you get is a much more granular set of rewards. And so in the
case of the cake baking, instead of just scoring whether the end cake was good or not, you can start
scoring the individual steps. Did they bake it at the right temperature? Did they put the right amount of
dough in? And the big question is, does that allow the model to start learning about reasoning,
not just producing some end outcome, right? And if you can get it to reason, then it can start
solving, planning, and doing higher order sort of thinking of a kind that humans have.
So that's where these sort of techniques all play together is reinforcement learning traditionally
has been a way to get these models to improve. You know, Q-learning or model-free learning
is a way to generalize reinforcement learning beyond just well-enumerated systems like games.
And then lastly, synthetic data and self-play come into relevance here because when you're having
an AI score each individual step of your reasoning, then you're generating a bunch of really
valuable data that then you can train the system on. That's synthetic. It's not generated by a human.
It's generated by an existing system. So these are sort of the big components that are working together
to potentially ultimately produce a model that's 10,000 times or 100,000 times better than GPT-4.
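Here is a small, hypothetical sketch of the step-by-step scoring being described: a solver model writes out its reasoning for a word problem, a grader model scores each intermediate step rather than only the final answer, and the scored steps become synthetic training data. `solver`, `grader`, and their methods are placeholders, not any lab's real API.

```python
def score_reasoning_steps(solver, grader, problem):
    """Illustrative sketch of process-level (per-step) AI feedback.
    Instead of one thumbs-up/down on the final answer, every intermediate
    reasoning step gets its own score, and the scored steps become
    synthetic training data."""
    # 1. Ask the solver for a chain-of-thought solution, one step at a time.
    steps = solver.solve_step_by_step(problem)
    # e.g. ["Jimmy has 10 apples.", "John has 16 apples.", "10 + 16 = 26."]

    # 2. Have a grader model score each intermediate step, not just the end answer.
    scored_steps = []
    for i, step in enumerate(steps):
        score = grader.score_step(problem, steps[:i], step)  # e.g. a 1-10 score
        scored_steps.append({"step": step, "score": score})

    # 3. The (step, score) pairs are synthetic data: generated by models,
    #    not humans, and reusable to train the solver to reason better.
    return scored_steps
```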
Right. And let's double click on that element of self-play and, potentially, its synthetic data.
And to underscore that, right now, many of these models do have an element of reinforcement learning,
but that comes from human feedback, right? So RLHF
is one of the methods that these different labs are using to orient the existing models,
but obviously that requires humans in the loop.
And so what you're getting at is potentially the ability to obviate that and have AIs in the loop.
Can you speak a little bit more to that idea of self-play and the ability to create synthetic data?
What does that unlock other than just scale and speed?
Or is that enough for us to get to those next levels?
Yeah, scale and speed are pretty valuable.
Yes, definitely.
I think if you just go back to first principles on how these models are trained and get better,
the rate at which they improve is remarkably predictable.
It's predicted by a set of AI scaling laws that basically say,
hey, you've got three ingredients to producing machine intelligence.
You've got compute, you've got data, and you've got sort of algorithmic innovation.
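For reference, the scaling laws being alluded to are often written in a form like the one below, a standard parameterization from the research literature rather than something quoted in the episode, where loss falls predictably as model size and data grow:

```latex
% Loss as a function of model parameters N and training tokens D;
% E, A, B, \alpha, \beta are empirically fitted constants.
L(N, D) = E + \frac{A}{N^{\alpha}} + \frac{B}{D^{\beta}}
```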
And I think what self-play and synthetic data do is they allow the scaling of these models
much more rapidly, because whereas with reinforcement learning
from human feedback, you're kind of constrained by the number of
humans you can have providing that feedback.
Synthetic data, and in particular
the types of AI feedback scoring that we're talking about here,
are sort of orders of magnitude more scalable.
It's not 2x or 3x.
The real constraint is actually how much compute you have
to run these AI models to do all this scoring.
To make this a little bit more concrete,
it might be helpful to just talk a little bit about reinforcement learning from human feedback.
Yeah, please.
In the 1.0 version of this world, a common example of reinforcement learning from human feedback is a system
like ChatGPT or DALL-E or Midjourney. When Midjourney generates an image or when ChatGPT
generates a message for you, there's a little thumbs up, thumbs down button. And they track
when users give a thumbs up or thumbs down, and they use the number of times people give
thumbs up or thumbs down as a signal for reinforcement to improve the kinds
of messages it gives you next time around. Reinforcement learning from AI feedback in this case
replaces that thumbs up and thumbs down from users like you and me, and instead substitutes that
with increasingly smarter AI models that these labs have built to then provide much more granular
scoring than just the thumbs up, thumbs down. You can now start providing a score of 1 to 10
on the intermediary steps.
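A hypothetical sketch of that substitution, with `human_raters` and `ai_grader` as made-up stand-ins: the human signal is coarse and limited by headcount, while the AI signal is fine-grained and limited mainly by compute.

```python
def collect_feedback(responses, human_raters=None, ai_grader=None):
    """Illustrative contrast between RLHF-style and AI-feedback-style labels.
    RLHF: a limited pool of humans gives coarse thumbs-up/down labels.
    AI feedback: a grader model runs around the clock and returns finer-grained scores."""
    labels = []
    for response in responses:
        if ai_grader is not None:
            # AI feedback: granular score, bounded mainly by available compute.
            labels.append(ai_grader.score(response))  # e.g. 7.5 out of 10
        else:
            # Human feedback: binary signal, bounded by how many raters you can recruit.
            labels.append(human_raters.thumbs_up_or_down(response))  # e.g. +1 / -1
    return labels  # used downstream as the reward signal for fine-tuning
```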
And so you're right, scale is really the biggest unlock here.
When you replace humans with models that provide this feedback 24-7
at orders of magnitude larger volume than we could do with just humans,
then the speed at which we get to an AGI is dramatically faster.
So any breakthrough in those systems represents, I think,
a non-linear increase or acceleration towards this future.
Yeah.
And I mean, on that note, some people use the term safety,
some people use alignment.
If we, instead of having humans, have AIs grading themselves or each other,
then I guess the natural question is,
how do we ensure that what's happening at that scale,
which is much greater than what we can currently do with human feedback, is actually aligned?
What are the mechanisms really to instruct the AI to give feedback effectively?
Does that make sense?
It does.
And I think this is a raging debate, right, in the industry.
The short answer is there are a number
of promising techniques. It's not clear any one of them is sort of a silver bullet, but you could break down
sort of all of AI research into two big categories, capabilities research, and then there's
alignment research, right? The goal of capabilities research is to get the model to be as smart as
possible and to either match human intelligence or ultimately exceed it at all kinds of tasks
in a general purpose way, planning, reasoning, and so on. The alignment research is sort of the
flip side of that and says, well, when you've got a really smart model, how do we make sure that it's
aligned, or we're able to control the outcomes here, and it doesn't do things that we didn't want it
to do. And I think the kinds of techniques we've been talking about today, you know, reinforcement learning,
the Q-learning we've talked about, and the chain-of-thought-based scoring, they all largely are
forms of unsupervised deep learning that continue treating these models as black boxes. And my belief is
that as long as we keep modeling these systems as black boxes, their reliability is somewhat limited.
Because if you don't know how they work, and they're not open source, it's kind of hard to steer them.
And that's why on the alignment side, my belief continues to be when you've got a black box,
step one is x-ray the box, open source the box, and figure out what's going on inside, trace the problematic parts of the model.
And then you can actually start to steer it, edit it, trim it, control it.
And that's what all the interpretability research that's happening in the field is focused on.
And so I think that's a related but separate body of work that is trying to keep pace with the capabilities research.
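For a flavor of what "x-raying the box" can look like at the simplest level, here is a short sketch that captures a model's intermediate activations with standard PyTorch forward hooks so they can be inspected. The model and layer names are left as user-supplied placeholders, and this is only one common starting point for interpretability work, not a full method.

```python
import torch

def capture_activations(model, layer_names, inputs):
    """Record intermediate activations ("x-ray the box") for the given layers.
    Once you can see what's going on inside, you can start tracing which
    parts of the model are responsible for problematic behavior."""
    activations = {}
    handles = []

    def make_hook(name):
        def hook(module, inp, out):
            activations[name] = out.detach()
        return hook

    # Attach a forward hook to each layer we want to inspect.
    for name, module in model.named_modules():
        if name in layer_names:
            handles.append(module.register_forward_hook(make_hook(name)))

    with torch.no_grad():
        model(inputs)

    # Clean up hooks so the model behaves normally afterwards.
    for h in handles:
        h.remove()
    return activations
```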
One of the sort of meta observations here is that when you have a model that gets capable
enough, at some point, if it's capable of complex reasoning, you can just ask it to help
you solve the alignment problem.
Yeah.
Right.
And that is, in fact, a whole set of techniques and experiments that other folks are working on.
And it's still open-ended research.
But there's reasons to believe that that may end up being the most efficient path.
That'll probably scale in the most general purpose way.
Could you speak a little bit more to the impact of synthetic data?
Because we've discussed scaling laws, and as these models get bigger and bigger and bigger,
there's constantly this question of like, what will the next waves look like?
And do we even have enough data to train the next echelon of LLMs?
And one potential answer to that is the use of synthetic data.
And so can you speak to what value that has and also what we're seeing in that world?
Yeah.
So this is a big open-ended question.
I think because of the way scaling laws work, right, where you have compute data and
sort of algorithmic innovations as the three core ingredients, at least one out of those three
is just a direct function of investment, right? Whether you can get compute or not is a function
of how quickly we can as a species produce enough silicon to power these models. I think
the data is an interesting question because we don't actually know how much data is required
to get to and surpass human-level intelligence. And I'm not sure that it's clear to us as an industry
that the current techniques that the frontier models rely on,
which are largely transformer-based and sort of next token prediction-based,
will continue being able to scale with the current data bottlenecks that we have.
And so that's why there's a tremendous sort of amount of investment
going into figuring out whether these systems can extract reasoning
and learn how to plan and think like humans from very, very small data sets.
Because if we can teach these systems to actually reason,
instead of just parroting.
The most critical view of this would be that these systems don't reason at all.
They don't generalize at all beyond the training data they've seen.
And therefore, for them to be able to surpass human intelligence,
you're going to have to keep feeding them bigger and bigger and bigger data sets.
The opposing point of view is that, no, actually, all the data we have at the moment is sufficient.
And all we need to figure out is an algorithmic basis by which the model can learn how to reason from existing data.
Now, I think where synthetic data comes in handy is that if we need bigger and bigger and bigger
data sets, we will need to find a way to generate more data than exists today.
And there's, you know, lots of arguments about why that won't work, probably the leading
one being what's called mode collapse, where if you're just going to generate more data of
the kinds we already have, you're not teaching these systems anything new.
You're just mirroring all the info we already have.
There's a fundamental information constraint.
That may or may not be true.
Again, it's a line of research that's sort of unproven, but in a world where actually these systems
can reason and we can teach them to think and do sort of complex multi-step reasoning and
planning, then actually the value of synthetic data is tremendously clear. It's just to do what
we were describing earlier: the reward-based feedback that humans would be offering naturally
over time T, you can just collapse that time T to a tenth of it. And you can get there
much, much, much, much faster. So it's a method of acceleration
to get to that intelligence, which is very compelling, right? And what would have otherwise taken a team of like 10,000 humans could be done in, you know, 10 days with synthetic data pipelines. And so in that world, it becomes very clear what the value of synthetic data is. In a world where these models actually can't reason and aren't understanding the world and reasoning about it, it is unlikely that synthetic data will help with the scaling laws.
Yep. And it sounds like based on everything you've shared, that is the key unlock determining whether they can reason. And we've mentioned a few different things like multi-step planning, reinforcement learning, especially model-free reinforcement learning. We talked about self-play. I know those all somewhat fit together, but is there one of those that you're particularly excited about or you think is especially important in this next wave?
Yes, I'm most interested in generalizable self-play, because I think if we solve that one, there's a very clear path to getting these models to reason more and more like humans do, and plan their approach to problem-solving the way humans do, far beyond what they're currently capable of.
And I think a lot of the biggest value and impact that we could get from these large scaling-laws-based models is if we can start to rely on them for some of the most challenging problems that we as humans haven't been able to solve yet.
And so I think solving self-play gets us, you know, to the happy path of solving cancer and discovering novel cures for diseases that we just have not been able to figure out ourselves.
And to do that, I do think we need to unlock really robust, generalizable self-play.
And that's why these games at first blush seem almost toy-like and orthogonal, just interesting research experiments, right?
Like, what's the relationship between poker and cancer?
Well, it turns out if we can find the right prototype for a problem as complex as cancer at a really, really small scale,
then self-play might be the path to getting us to those discoveries that we haven't been able to unlock ourselves.
Totally.
And you used the games analogy, but I mean, for chess, I think a bunch of people were surprised not just when it beat the best humans,
but when all of a sudden we were seeing moves that we didn't understand. And I think that's emblematic of that
pleasant surprise: we had not thought of this before. Even though it could have existed, we are
learning from this system. And so, yeah, I do think it's maybe at least an example of where things can be
headed. We've got to hope. Yeah, definitely. I mean, just to close out, you touched on a few things like
the potential of curing cancer, but any other second, third order effects or implications that come to
mind of such an advancement, of really having these models be able to reason? Like, that's pretty
fundamental. Yeah, I think, you know, the big one, of course, is that these models can start to
solve problems where precision is really important and correctness is important. Today, I think
we're seeing just tons and tons of value being unlocked in use cases like creativity, right,
where hallucination is really a feature, not a bug, where when the models can get creative
and tell you things that you didn't expect,
that's actually a great outcome for use cases like storytelling
or helping you sharpen or rewrite a piece of prose that you're working on.
And so that's why we're finding such explosive demand
from people using these tools as creative tools.
Then there are cases where more formal verification of the answer
or the output of the model is really important.
And that's why solving math problems is such a great litmus test for intelligence,
because in a formal domain like math,
there are often 10 different ways to get to the right answer,
but there is usually only one verifiable answer.
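To illustrate why math is such a clean litmus test, here is a tiny, self-contained sketch: many different reasoning paths can be proposed, but each one is graded against a single verifiable answer. The problem and helper names are made up for illustration.

```python
def verify_solution(reasoning_steps, proposed_answer, expected_answer):
    """Math is a convenient testbed because verification is cheap: however
    the model got there, the final answer either matches the one verifiable
    ground truth or it doesn't."""
    return proposed_answer == expected_answer

# Many reasoning paths, one verifiable answer.
problem = "Jimmy has 10 apples and John has 16. How many in total?"
paths = [
    (["10 + 16 = 26"], 26),                 # direct addition
    (["16 + 10 = 26"], 26),                 # same sum, different order
    (["10 + 10 = 20", "20 + 6 = 26"], 26),  # decomposed into two steps
    (["10 + 16 = 25"], 25),                 # arithmetic slip -- rejected
]
for steps, answer in paths:
    print(steps, "->", verify_solution(steps, answer, expected_answer=26))
```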
And the frontier models today are remarkably poor and brittle at correctness, right?
The most obvious and highest-impact value of us cracking sort of general-purpose reasoning via self-play,
or via any of these techniques we've talked about today, would be that these models can start actually solving
a whole class of precision problems that they're not capable of solving today.
Every field has a set of these high precision problems in physics, in engineering, in health care, where a wrong answer is in fact a bug, not a feature.
And that's where formal reasoning and multi-step planning and so on are just basics that these models aren't capable of today.
That's what I'm most sort of excited for: when we can rely on the models actually to start doing science themselves and start discovering new capabilities, because they're able to approach highly ambiguous
problems with unclear reward functions that do have correct answers.
That's sort of the elusive goal right now.
I love that framing.
Highly ambiguous problems with unclear reward functions.
All right, that's all for now.
If you did make it to the end of this episode and enjoyed this kind of coverage,
be sure to let us know by leaving a review at ratethispodcast.com/a16z.
Or you can also email us at podpitches@a16z.com with timely topics that you'd
like to see us cover, just like this. All right, we'll see you next time.