The a16z Show - Columbia CS Professor: Why LLMs Can’t Discover New Science

Starting point is 00:00:00 Any LLM that was trained on pre-1915 physics would never have come up with a theory of relativity. Einstein had to sort of reject the Newtonian physics and come up with this space-time continuum. He completely rewrote the rules. AGI will be when we're able to create new science, new results, new math. When an AGI comes up with a theory of relativity, it has to go beyond what it has been trained on to come up with new paradigms, new science. That's, by definition, AGI. Yeah.

Starting point is 00:00:31 Vishal Misra was trying to fix a broken cricket stats page and accidentally helped spark one of AI's biggest breakthroughs. On this episode of the a16z podcast, I talk with Vishal and a16z's Martin Casado about how that moment led to retrieval-augmented generation and how Vishal's formal models explain what large language models can and can't do. We discuss why LLMs might be hitting their limits, what real reasoning looks like,

Starting point is 00:00:54 and what it would take to go beyond them. Let's get into it. Martin, I knew you wanted to have Vishal on. What do you find so remarkable about him and his contributions that inspired this? Vishal and I actually have very similar backgrounds. We both come from networking. He's a much more accomplished networking guy than I am. That's a high bar given your field. And so we actually view the world in an information-theoretic way. It is actually part of networking. And with all this AI stuff, there's so much work trying to create models that can help us understand how these LLMs work.

Starting point is 00:01:28 And in my experience over the last three years, the ones that have most impacted my understanding, and I think have been the most predictive, are the ones that Vishal has come up with. He did a previous one that we're going to talk about, called Matrix, is it? Beyond the Black Box, but yeah. Beyond the Black Box. Actually, we should put this in the notes for this,

Starting point is 00:01:46 but the single best talk I've ever seen on trying to understand how LLMs work is one that Vishal did at MIT, which Hari Balakrishnan pointed me to, and I watched that. So he did that work, and then he's doing more recent work that's actually trying to scope out not only how LLMs reason, but it has some reflections on how humans reason too.

Starting point is 00:02:05 And so I just think he's doing some of the more profound work in trying to understand and come up with models, formal models for how LLMs reason. On that note, you said his most recent work helped change how you think about how humans think. Can you flesh that out a little bit? How did it sort of... Well, okay, so can I just try to take a rough sketch at it

Starting point is 00:02:22 and then you just tell me how wrong I am? Go right ahead. You're trying to describe how LLMs work. And one thing that you've found is that they reduce a very, very complex multidimensional space into basically a geometric manifold that's a reduced state space.

Starting point is 00:02:43 So it's reduced degrees of freedom, but you can actually predict where in the manifolds the reasoning can move to, roughly. So you've reduced the dimensionality of the problem to a geometric manifold, and then you can actually formally specify kind of how far you can reason within that manifold.

Starting point is 00:03:01 And the articulation, or one of the intuitions, is that we as humans do the same thing: we take this very complex, heavy-tailed stochastic universe and we reduce it to kind of this geometric manifold, and then when we reason, we just move along that manifold. Yeah, I think you captured it accurately. That's kind of the spirit of the work. Wait, wait, can I just hear it in your words?

Starting point is 00:03:24 Because I'm a VC, so. You know, a VC with an h-index of what, 60? True. Yeah, so ultimately what all these LLMs are doing, whether the early LLMs or the LLMs that we have today with all sorts of post-training, RLHF, whatever you do, at the end of the day, what they do is they create a distribution for the next token.

Starting point is 00:03:51 So given a prompt, these LLMs create a distribution for the next token, or the next word, and then they pick something from that distribution using some kind of algorithm to predict the next token, pick it, and then keep going. Now, what happens because of the way we train these LLMs,

Starting point is 00:04:12 the architecture of the transformers, and the loss function, the way you put it is right, it sort of reduces the world into these Bayesian manifolds. Yeah. And as long as the LLM is going in, sort of traversing through these manifolds,

Starting point is 00:04:29 it is confident and it can produce something which makes sense. The moment it sort of veers away from the manifold, then it starts hallucinating and starts spouting nonsense. Confident nonsense, but nonsense.
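The loop described above (build a next-token distribution, pick a token from it, append it, repeat) can be sketched in a few lines. This is only a toy: the `TOY_LOGITS` table is an invented stand-in for a real model, it conditions on just the last token rather than the whole prompt, and plain weighted sampling is assumed as the "some kind of algorithm" for picking.

```python
import math
import random

# Hypothetical toy "model": maps the last token to unnormalized scores (logits)
# over a tiny vocabulary. A real LLM conditions on the entire prompt.
TOY_LOGITS = {
    "the": {"cat": 2.0, "mat": 1.5, "the": -2.0, "sat": -1.0, "on": -1.0},
    "cat": {"sat": 2.5, "on": 0.0, "the": -1.0, "cat": -2.0, "mat": -1.0},
    "sat": {"on": 3.0, "the": 0.0, "cat": -2.0, "sat": -2.0, "mat": -1.0},
    "on": {"the": 3.0, "mat": 0.5, "cat": -1.0, "sat": -2.0, "on": -2.0},
    "mat": {"the": 1.0, "on": 0.5, "cat": 0.0, "sat": 0.0, "mat": -2.0},
}


def softmax(logits):
    """Turn raw scores into a probability distribution that sums to 1."""
    peak = max(logits.values())
    exps = {tok: math.exp(score - peak) for tok, score in logits.items()}
    total = sum(exps.values())
    return {tok: e / total for tok, e in exps.items()}


def next_token_distribution(tokens):
    """Distribution over the next token, given the tokens so far."""
    return softmax(TOY_LOGITS[tokens[-1]])


def generate(prompt, steps, rng):
    """Repeatedly sample a next token and append it to the sequence."""
    tokens = list(prompt)
    for _ in range(steps):
        dist = next_token_distribution(tokens)
        choices, weights = zip(*dist.items())
        tokens.append(rng.choices(choices, weights=weights, k=1)[0])
    return tokens


if __name__ == "__main__":
    print(" ".join(generate(["the"], 5, random.Random(0))))
```

Greedy decoding (always taking the most probable token) or temperature scaling would slot into `generate` in place of the weighted draw; the transcript does not say which is meant.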

Starting point is 00:04:43 So it creates these manifolds and the trick is the distribution that is produced. You can measure the entropy of the distribution. Right? Entropy the way Shannon...

Starting point is 00:04:55 Shannon entropy. It's Shannon entropy, not thermodynamic entropy. So suppose you have a vocabulary of, let's say, 50,000 different tokens, and you have a next-token distribution over these 50,000 tokens. So let's say "the cat sat on the", right? If that is the prompt, then the distribution will have a high probability for mat or hat or table and a very low probability of, let's say, ship or whale

Starting point is 00:05:26 or something like that, right? So because of the way it's trained, it has these distributions. Now, these distributions can be low entropy or high entropy. A high entropy distribution means that there are many different ways that the LLM can go, with high enough probability for all those paths. Low entropy means that there is only a small set of choices for the next token. And the prompts also, I think,

Starting point is 00:05:56 you can categorize into two kinds of prompts. One prompt has, as you can say, high information entropy. Yeah. And one prompt has low information entropy. So the way these manifolds work, the LLMs start paying attention to prompts that have high information entropy and low prediction entropy. So what do I mean by that? So when I say I'm going out for dinner. Yeah. Right. So when I say I'm going out for dinner, that phrase, the LLMs have been trained on. They've seen it a lot.

Starting point is 00:06:36 And there are many different directions I can go with it. I can say, I'm going for dinner tonight. I'm going for dinner to McDonald's or I'm going to dinner, blah, blah, blah. There are many different ones. But when I say I'm going to dinner with Martin Casado, you know, for the LLM, now this is information rich. This is sort of a rare phrase. And now the sort of realm of possibilities

Starting point is 00:07:00 reduces, because Martin is only going to take me to Michelin-star restaurants. I'm not going to go to McDonald's. You get what I'm saying. The moment you add more context, you make the prompt information rich, the prediction entropy reduces. Yep, yep, yep.
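The prediction entropy being discussed is the Shannon entropy of the next-token distribution, H = -sum(p * log2(p)). A minimal sketch follows; the two distributions are made-up numbers chosen only to contrast a vague prompt with an information-rich one, not measurements from any model.

```python
import math


def shannon_entropy(dist):
    """Shannon entropy in bits of a distribution given as {token: probability}."""
    return -sum(p * math.log2(p) for p in dist.values() if p > 0)


# Illustrative, invented probabilities for the token after the prompt.
# Vague prompt: many continuations are about equally plausible.
vague_prompt = {"tonight": 0.25, "at": 0.25, "to": 0.25, "with": 0.25}
# Information-rich prompt: the added context concentrates the probability mass.
rich_prompt = {"at": 0.85, "tonight": 0.05, "to": 0.05, "with": 0.05}

print(shannon_entropy(vague_prompt))  # 2.0 bits: uniform over four options
print(shannon_entropy(rich_prompt))   # under 1 bit: one option dominates
```

A uniform distribution over a 50,000-token vocabulary would sit at the other extreme, at log2(50000) bits, which is the "many ways to go" case.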

Starting point is 00:07:18 And another example that I often cite... But just quickly, what is your takeaway? What is the implication of that? So, sorry, I forgot how you described it, but the more precise you are, the more tokens you use, I presume, the fewer options you have for the next token. Is that correct or not correct? Yeah, yeah, essentially.

Starting point is 00:07:45 So you're reducing it. You're reducing it to a very specific state space when it comes to confidence in an answer. And this is kind of a manifold that you can go on. And then, I mean, do you have kind of a conclusion of what that means for systems or what that means for reasoning? Or is it just a nice way to articulate the bounds of LLMs? No, there is something, I don't know if I should say profound, but there is something

Starting point is 00:08:17 about it which tells us what these LLMs can or cannot do. Right? So one of the examples that I often give is: suppose I ask you, what is 769 times 1025? You have no idea. You can have some vague idea given the two numbers, right? And so in your mind, the next-token distribution of the answer is going to be diffuse, right? You don't know. You have maybe a vague guess. If you are mathematically very good, maybe your guess is more precise, but it is going to be diffuse and it's not going to be the correct answer. But if you say, can I write it down and do it the way we have learned multiplication tables, now you know exactly what to do at the next step, right? You write 769 and then 1025, and then you know exactly. So at each stage of

Starting point is 00:09:11 that process your prediction entropy is very low. You know exactly what to do because you have been taught this algorithm. And by invoking this algorithm, saying, okay, I'm not going to just guess the answer, but I'm going to do it step by step, then your prediction entropy reduces. And you can arrive at an answer which you're confident of and which is correct.
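The step-by-step procedure referred to here is ordinary schoolbook multiplication: one digit of the multiplier at a time, each partial product shifted by its place value, then summed. The sketch below writes those steps out for the two numbers from the example; the function name and the step record format are choices made for this illustration.

```python
def long_multiply(a, b):
    """Multiply two non-negative integers the schoolbook way.

    Returns the product and a list of steps, one per digit of b:
    (digit, place, partial_product, running_total).
    """
    steps = []
    total = 0
    # Walk the digits of b from least significant to most significant.
    for place, digit_char in enumerate(reversed(str(b))):
        digit = int(digit_char)
        partial = a * digit * 10 ** place  # one-digit multiply, then shift
        total += partial
        steps.append((digit, place, partial, total))
    return total, steps


result, steps = long_multiply(769, 1025)
for digit, place, partial, running in steps:
    print(f"769 x {digit} x 10^{place} = {partial:>6}   running total = {running}")
print(result)  # 788225
```

No single step requires guessing: each one is a small, well-rehearsed operation, which is the low-entropy property the speaker attributes to working through an algorithm.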

Starting point is 00:09:39 And the LLMs are pretty much the same way. That's why chain of thought works. What happens with chain of thought is you ask the LLM to do something with chain of thought, and it starts breaking the problem into small steps. These steps, it has seen in the past. It has been trained on.

Starting point is 00:09:57 Maybe with some different numbers, but the concept it has been trained on. And once it breaks it down, then it's confident. Okay, now I need to do A, B, C, D, and then I arrive at this answer. Whatever it is. Let's zoom back out. I want to get into LLMs,

Starting point is 00:10:13 but first, Vishal, maybe you can give more context on your background and how that informs your work here. Okay. So yeah, as Martin said, my background is very similar to his. We, you know, we come from doing networking. So my PhD thesis, my sort of early work at Columbia has all been in networking. But there's another side of me, another hat that I wear, which is both an entrepreneur and a cricket fan. I was going to say, don't you own a cricket team or something? I'm a minority owner of your local cricket team, the San Francisco Unicorns. That's right. I'm very proud to have you.

Starting point is 00:10:55 But, so in the 90s, I was one of the people who started this portal called Cricinfo. And Cricinfo, at one point it was the most popular website in the world. It had more hits than Yahoo. That was before India came online. And so, you know, cricket is a very stats-rich sport. Think baseball multiplied by 1,000. And we had built this free searchable stats database on cricket called Statsguru. And this has been available on Cricinfo since 2000.

Starting point is 00:11:36 But because you can search for anything, everything was made available on Statsguru. And, you know, you can't expect people to write SQL queries to query everything. So how would we do it? Well, it was a web form, you know, where you could formulate your query using that form, and in the back end that was translated into a SQL query, which got the results and sent them back. But as a result, because you could do everything and everything was made available, the web form had like 25 different checkboxes, 15 text fields, 18 different dropdowns. The interface was a mess. It was very daunting. So ESPN acquired Cricinfo

Starting point is 00:12:20 in mid-2006, I think, but they still kept the same interface. And that has always sort of nagged me. And so I still know the people... Wait, wait, what nagged you? Is it that Cricinfo did not have natural language and had a web form for doing queries? That web form was terrible.

Starting point is 00:12:43 Because of that, only the real nerds... Of all the things in the world to bother you, the fact that an old website had a web form. I appreciate your commitment to aesthetics. So I'm still friendly with the people who run ESPNcricinfo. The editor-in-chief, whenever he comes to New York, you know, we meet up, we go out for a drink. And so he was here in 2020.

Starting point is 00:13:08 So now the story shifts to how LLMs and I sort of met. So January 2020, right before the pandemic, he was here and I again said, why don't you do something about Statsguru? And he looks at me and says, why don't you do something about Statsguru? He was kind of joking, but he thought maybe, you know, I had some ways to fix the interface. So anyway, then the pandemic hit, the world stopped. But in July of 2020, the first version of GPT-3 was released.

Starting point is 00:13:40 And I saw someone use GPT-3 to write a SQL query for their own database using natural language. And I thought, can I use this to fix Statsguru? So I got early access to

Starting point is 00:14:00 GPT-3. You know, getting access in those days was difficult, but somehow I got it. But soon I realized that, you know, no, I cannot really do it. Because Statsguru's backend databases were so complex, and if you remember, GPT-3 had only a 2,048-token context window. There was no way in hell I could fit the complexities of that database in that context window.

Starting point is 00:14:24 And GPT-3 also did not do instruction following at that time. But then in trying to solve this problem, I accidentally invented what's now called RAG. I created a database of natural language queries and the corresponding structured queries. I created a DSL, which then translated into a REST call to Statsguru. So based on the new query, I would look through my set of natural language queries.

Starting point is 00:15:00 I had about 1,500 examples, and I would pick the six or seven most relevant ones. And then those, along with their structured queries, I would send as a prefix, followed by the new query, and GPT-3 magically completed it, and the accuracy was very high. So that has been running in production since September 2021, you know, about 15 months before ChatGPT came

Starting point is 00:15:23 and, you know, the whole revolution in some sense started, and RAG became very popular. I didn't call it RAG, but this is something I sort of accidentally did in trying to solve that problem for Cricinfo. Now, once I

Starting point is 00:15:37 built it, you know, I was thrilled that this worked, but I had no idea why it worked. You know, I stared at that transformer architecture diagram. I read those papers, but I couldn't understand how or why it worked. So then I started on this journey of developing a mathematical model, trying to understand how it worked. So that's been sort of my journey through this world of AI and LLMs, because I was trying

Starting point is 00:16:11 to solve this cricket problem. Yeah. Amazing. And so maybe reflecting back since the release of GPT-3, what has most surprised you about how LLMs have developed? So what has most surprised me is the pace of development. So GPT-3 was, you know, it was a nice parlor trick and you had to jump through hoops to get it to do something useful. But starting with ChatGPT, which was an advance over GPT-3,

Starting point is 00:16:39 and then you had all these things like chain of thought, instruction following. GPT-4 really made it polished and, you know, the pace of development has really surprised me. Now, you know, when I started working with GPT-3, I could sort of see what its limitations were, what I could make it do, what I couldn't make it do.

Starting point is 00:17:00 But I never thought of it as, you know, what these LLMs have become for me now and what they have become for millions of people around the world. We treat these models as our co-workers, almost like an intern: you're constantly chatting with them, brainstorming, making them do all sorts of work, which we couldn't imagine, you know,

Starting point is 00:17:24 just when ChatGPT was released. It was nice, it could write poems, it could write limericks, it could answer some questions, with hallucinations, but the capabilities that have emerged now, that pace has been very sort of surprising to me. Do you see progress plateauing?

Starting point is 00:17:42 Or how do you, either now or in the near future, how do you see it going? Yes. In some sense, progress is plateauing. It's like the iPhone. You know, when the iPhone came out, wow, what is this thing? And the early iterations, you know,

Starting point is 00:18:02 constantly we were amazed by new capabilities. But the last, you know, seven, eight, nine years, it's maybe the camera got a little bit better, or, you know, one thing changed here, or the memory is more, but there has been

Starting point is 00:18:15 no fundamental advance in what it's capable of. You can sort of see a similar thing happening with these LLMs, and this is not true for just one

Starting point is 00:18:27 company or one model, right? You look at what OpenAI is coming up with, or what Anthropic, Google, or all these open-source

Starting point is 00:18:38 Chinese models or Mistral are doing, the capabilities of LLMs have not fundamentally changed. They've become better, right? They've improved. But they have not crossed into a different realm. So this is something that I really appreciate about your work. And so the thing that really struck me is as soon as these things showed up, you actually got busy trying to have a formal model of what they're capable of, which was in stark contrast to what everybody else was doing. Everybody else was like, AGI, these things are going to, you know, recursively self-improve, or they'll say, oh, these are just stochastic parrots, which doesn't mean anything. So everybody had rhetoric,

Starting point is 00:19:25 and sometimes this rhetoric was fanciful. And sometimes this rhetoric was almost reductionist, like, oh, it's just a database, which is clearly not true. And the thing that really struck me about your work is you're like, no, let's figure out exactly what's going on. Let's come up with a formal model, and once we have a formal model, we can reason about what that means. And then, you know, in my reading of your work, I kind of break it up into two pieces. There's the first one where you basically came up with this, you know, matrix abstraction. I think it's worth you talking through. And then you took in-context learning as an example, and you mapped it to Bayesian reasoning,

Starting point is 00:19:59 which to me was incredibly powerful because at the time, nobody knew why in-context learning worked. So I think it would be great for you to discuss that because again, I think it was the first real kind of formal take on how these things are working. And then the more recent work that you're working on now is a kind of more generalized version of what is the state space that these models output when it comes to confidence, which is the manifold that we were talking about before.

Starting point is 00:20:32 So I think it would be great if you just described your matrix model and then how you use that to provide some bounds on what in-context learning is doing. What's happening? Okay. So, so yeah, let's start with that matrix abstraction. So the idea behind the matrix is you have this gigantic matrix where every row corresponds to a prompt.

Starting point is 00:20:59 And then the number of columns of this matrix is the vocabulary of the LLM. the number of tokens it has that it can emit. So for every prompt, this matrix contains the distribution over this vocabulary. So when you say the cat sat on the, you know, the column that corresponds to mat will have a high probability. Most of them will be zero.

Starting point is 00:21:27 But, you know, reasonable continuations will have a non-zero probability. And so you can imagine that there's this gigantic matrix. Now, the size of this matrix is, you know, if we just take the old first-generation GPT-3 model, which had a context window of 2,000 tokens and a vocabulary of 50,000 tokens, then the number of rows in this matrix is more than the number of atoms across all galaxies

Starting point is 00:22:12 that we know of. So clearly we cannot represent it exactly. Now, fortunately, a lot of these rows do not appear in real life. Right? An arbitrary collection of tokens, you are not going to use that as a prompt. So a lot of these rows are absent, and a lot of the column values are also zero. Right? When you say "the cat sat on the", it's unlikely to be followed by the token corresponding to, let's say, numbers. Or, you know, an arbitrary collection of tokens.
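The size claim can be checked with back-of-the-envelope arithmetic. The sketch below uses the round figures quoted in the conversation (a 2,000-token context and a 50,000-token vocabulary) and counts only full-length prompts; the 10^80 figure for atoms in the observable universe is a commonly cited rough estimate and is an assumption added here, not something stated by the speaker.

```python
import math

VOCAB_SIZE = 50_000      # vocabulary size quoted in the conversation
CONTEXT_TOKENS = 2_000   # context window quoted in the conversation
LOG10_ATOMS = 80         # assumption: roughly 10^80 atoms in the observable universe

# Each row is one possible prompt. Counting only prompts that fill the whole
# context window, there are VOCAB_SIZE ** CONTEXT_TOKENS of them.
# Work in log10 so the result stays readable.
log10_rows = CONTEXT_TOKENS * math.log10(VOCAB_SIZE)

print(f"rows    ~ 10^{log10_rows:.0f}")
print(f"columns = {VOCAB_SIZE}")
print(f"rows exceed the atom estimate by a factor of ~10^{log10_rows - LOG10_ATOMS:.0f}")
```

Shorter prompts would add still more rows, so this is a lower bound on the row count of the idealized matrix.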

Starting point is 00:22:41 There will only be a very small subset of tokens that can follow a particular prompt. So this matrix is very, very sparse. But even after that sparsity and even after removing the sort of gibberish prompts, the size of this matrix is too much for these models to represent, even with a trillion parameters. So, in an abstract sense, what is happening is the models get trained on certain, you know, data from the training set. And for a small subset of these rows, you have reasonable values for the next-token distribution. Whenever you give it something new as a prompt, then it will try to interpolate between what

Starting point is 00:23:34 it has learned and what's there in the new prompt and come up with a new distribution. But it's basically, so it's more than a stochastic parrot. It is sort of Bayesian on this subset of the matrix that it has been trained on. So when I say, you know, I'm going out for dinner with Martin tonight. Now, I'm reasonably sure that it has never encountered that phrase in its training data, right? But it has encountered variants of this phrase. And given that I'm going out with Martin, it can produce a Bayesian posterior. It uses that evidence that Martin is the one that I'm going for dinner with, and it will produce a next-token distribution that will focus on the likely places that we are going.
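The "Bayesian posterior" language can be made concrete with a discrete Bayes update: start from a prior over continuations, weight each by how well it explains the new evidence in the prompt, and renormalize. The sketch below is only an analogy for the idea, not the formal model from the speaker's papers, and every number in it is invented for illustration.

```python
def posterior(prior, likelihood):
    """Bayes' rule over a discrete set of hypotheses.

    prior[h]      = P(h)
    likelihood[h] = P(evidence | h)
    returns         P(h | evidence)
    """
    unnormalized = {h: prior[h] * likelihood[h] for h in prior}
    evidence = sum(unnormalized.values())
    return {h: value / evidence for h, value in unnormalized.items()}


# Invented prior: where does "I'm going out for dinner ..." usually end up?
prior = {"fast food": 0.5, "casual": 0.4, "fine dining": 0.1}
# Invented likelihood of seeing the extra context "with Martin" in each case.
likelihood_with_martin = {"fast food": 0.01, "casual": 0.1, "fine dining": 0.9}

post = posterior(prior, likelihood_with_martin)
for place, p in sorted(post.items(), key=lambda kv: -kv[1]):
    print(f"{place:<12} {p:.3f}")
```

The extra context does not add a new hypothesis; it only shifts probability mass among the ones already represented, which matches the "interpolate over what it has been trained on" description.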

Starting point is 00:24:28 So this matrix is represented in a compressed way, yet the models respond to everything, every prompt. How do they do it? Well, they go back to what they've been trained on, interpolate there, and use the prompt as sort of some evidence to compute a new distribution. Right, so right. So the context of the prompt impacts the posterior distribution. Exactly, yeah. Right. And you mapped it to Bayesian learning, where the context is the new evidence.

Starting point is 00:25:10 New evidence, exactly. So I'll give you, so for instance, the cricket example that I spoke about earlier. So I created my own DSL, which, you know, mapped a natural language query in cricket to this DSL, which then I can translate into a SQL query or a REST API, whatever. But getting the DSL is important. Now, these LLMs have never seen that DSL. I designed it. Yeah.

Starting point is 00:25:36 Right? But yet after showing a few examples, it learned it. How did it learn it? And this is in the prompt. You didn't do any training. It's in the prompt, right? So, like, the weights are standard. Yeah, yeah.

Starting point is 00:25:51 Yeah, yeah. This was happening in October 2020. Right. I had no access to the internals of OpenAI. I could just access their API. OpenAI had no access to the internal structure of Statsguru or the DSL that I cooked up in my head. Yet, after showing it only a few examples, it learned it right away.
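The recipe described for Statsguru (keep a store of natural-language queries paired with structured queries, pick the most relevant few for a new question, and send them as a prefix followed by the new query) can be sketched as below. Everything concrete here is hypothetical: the example pairs, the `key=value` DSL syntax, and the word-overlap similarity are invented for illustration, since the conversation does not show the real DSL or say how relevance was computed.

```python
import re

# Hypothetical example store: (natural-language query, structured query) pairs.
# The DSL strings are invented; the real Statsguru DSL is not shown in the talk.
EXAMPLES = [
    ("most runs by an indian batsman in tests",
     "stat=runs;team=india;format=test;sort=desc"),
    ("most wickets by an australian bowler in odis",
     "stat=wickets;team=australia;format=odi;sort=desc"),
    ("highest score by an english batsman in tests",
     "stat=high_score;team=england;format=test;sort=desc"),
    ("best bowling average in t20 internationals",
     "stat=bowling_avg;format=t20i;sort=asc"),
]


def tokens(text):
    """Lowercased set of words and numbers in the text."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def similarity(a, b):
    """Jaccard overlap of word sets; an assumed stand-in for the relevance measure."""
    ta, tb = tokens(a), tokens(b)
    return len(ta & tb) / len(ta | tb) if ta | tb else 0.0


def build_prompt(new_query, k=2):
    """Retrieve the k most similar examples and lay them out as a few-shot prefix."""
    ranked = sorted(EXAMPLES, key=lambda ex: similarity(new_query, ex[0]), reverse=True)
    shots = [f"Q: {question}\nDSL: {dsl}" for question, dsl in ranked[:k]]
    # The prompt ends with an open "DSL:" slot for the model to complete.
    return "\n\n".join(shots + [f"Q: {new_query}\nDSL:"])


print(build_prompt("most wickets by an indian bowler in tests"))
```

The result is an ordinary text prompt; the model's weights are untouched, and the retrieved examples are the only place the DSL appears. A production system would more likely rank examples with embeddings than with word overlap.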

Starting point is 00:26:12 So that's an example where it has seen DSLs or structures in the past. And now using this evidence that I show, okay, this is what my DSL looks like, now given a new natural language query, it is able to create the right posterior distribution for the tokens that map to the examples that it has seen. Now, the other beautiful thing about this is, this is an example of few-shot learning or in-context learning, right? But when I give that prompt,

Starting point is 00:26:47 along with these examples to this LLM, I'm not saying to the LLM, okay, this is an example of few-shot learning, so learn from these examples. You just pass this to the LLM as a prompt and it processes it exactly the way it would process any other prompt which is not an example of in-context learning. So that really means that the underlying mechanism is the same.

Starting point is 00:27:14 Whether you give a set of examples and then ask it to complete a task, like in in-context learning, or just give it some prompt for continuation, like "I'm going out for dinner with Martin tonight." There's no in-context learning there. But the process with which it's generating or doing this inferencing is exactly the same.

Starting point is 00:27:37 And that's what I have been trying to model and come up with a formal model of. What I've found very impressive is you've used this basic model to show a number of things, right, to describe in-context learning and to map it to Bayesian learning. But you did it for another one where you sketched out this almost glib argument on Twitter, on X,

Starting point is 00:27:59 where you made a rough argument for why recursive self-improvement can't happen without additional information. And so maybe just walk through very quickly how, with this same model, you can just very quickly show that a model can never recursively self-improve. So, uh,

Starting point is 00:28:23 You know, another phrase that we've been using recently is, you know, the output of the LLM is the inductive closure of what it has been trained on. So when you say that it can recursively self-improve, it could mean one of two things. So let's get back to the... Well, actually, you know what's kind of interesting is, most people agree that if you have one LLM and you just feed the output into the input, it's not going to do anything. But then often people will say,

Starting point is 00:28:58 well, what if you have two LLMs? You have no external information, but you have two LLMs talking to each other. Maybe they can improve each other and then you can have, you know, a takeoff scenario. But again, you even address this, even in the case of N LLMs,

Starting point is 00:29:11 using kind of the matrix model to show that you just aren't gaining any information. Yeah. Yeah. So you can represent the sort of information contained in these models, and let's go back to that matrix analogy, the matrix abstraction. So like I said, you know,

Starting point is 00:29:32 these models represent a subset of the rows. But some of these rows are able to help fill out some of the missing rows. For instance, you know, if the model knows how to do multiplication step by step, then for every row corresponding to, let's say, 769 times 1025 or whatever, all those multiplications, it can fill out the answer.

Starting point is 00:30:03 Because it has those algorithms sort of embedded in it, you just need to unroll them. So it can sort of self-improve up to a point. But beyond a point, these models can only sort of generate what they have been trained on. So let me give you three examples. So any model, any LLM that was trained on pre-1915 physics would never have come up with a theory of relativity.

Starting point is 00:30:40 Einstein had to sort of reject the Newtonian physics and come up with this space-time continuum. He completely rewrote the rules, right? So that is an example of, you know, AGI, where you are generating new knowledge. It's not simply unrolling what's already there, right? It's not computing something. It's actually discovering something fundamental about the universe.

Starting point is 00:31:04 And for that, you have to go outside your training set. Similarly, you know, any LLM that was trained on Newtonian physics would not have come up with quantum mechanics. Right? That's where wave-particle duality, or this whole probabilistic notion, or that, you know, energy is not continuous, but it is quantized, comes in. You had to reject Newtonian physics. Or Gödel's incompleteness theorem. He had to go outside the axioms to say that, okay, it is incomplete.

Starting point is 00:31:31 So those are examples where you're creating new science or fundamentally new results. That kind of self-improvement is not possible with these architectures. They can refine these, they can fill out these rows where the answer already exists. Another example, which has received a lot of press these days, is these IMO results,

Starting point is 00:31:55 the International Math Olympiad. You know, whether it's a human solving it or the LLM solving it, they are not inventing new kinds of math. They are able to connect known results in a sequence of steps to come up with the answer. So even the LLMs, what they are doing

Starting point is 00:32:17 is they are exploring all sorts of solutions, and in some of these solutions, they start going on this path where their next-token entropy is low. So that's where I say they are in that Bayesian manifold, where you have this entropy collapse. And by doing those steps, you arrive at the answer. But you're not inventing new math. You're not inventing new axioms or new branches of mathematics.
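The "inductive closure" phrase used earlier in this exchange has a simple computational analogy: start from a set of known facts, keep applying known rules until nothing new appears, and observe that anything requiring a premise outside the starting set never shows up. The sketch below is a toy illustration of that fixed-point idea with an invented knowledge base; it is not the formal argument from the speaker's work.

```python
def inductive_closure(facts, rules):
    """Apply rules until no new facts appear (a fixed point).

    facts: iterable of strings that are already known.
    rules: list of (premises, conclusion); a rule fires once all premises are known.
    """
    known = set(facts)
    changed = True
    while changed:
        changed = False
        for premises, conclusion in rules:
            if conclusion not in known and set(premises) <= known:
                known.add(conclusion)
                changed = True
    return known


# Invented toy knowledge base.
facts = {"A", "B"}
rules = [
    (("A", "B"), "C"),  # connects two known dots
    (("C",), "D"),      # chains one step further
    (("E",), "F"),      # never fires: "E" is outside what is known
]

closure = inductive_closure(facts, rules)
print(sorted(closure))  # ['A', 'B', 'C', 'D']

# Feeding the output back in as input adds nothing new.
print(inductive_closure(closure, rules) == closure)  # True
```

In this analogy, "connecting the known dots" corresponds to deriving C and D, while reaching F would require the new premise E to come from somewhere outside the system.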

Starting point is 00:32:46 You're sort of using what you've been trained on to arrive at that answer. So those things LLMs can do, and they'll get better at connecting the known dots. Yeah. But for creating new dots, I think we need an architectural advance.

Starting point is 00:33:06 Yeah. So Martin was talking earlier about how the discourse, you know, was either stochastic parrots or, you know, AGI, recursive self-improvement, or something. How do you conceive

Starting point is 00:33:16 of sort of the AGI discourse, or even the concept? What does it mean, to the extent that it's useful? How do you think about that? The way I think about it, the way we have tried to formulate it in our papers, it's beyond a stochastic parrot, but it's not AGI. It's doing Bayesian reasoning over what it has been trained on. So it's a lot more sophisticated than just a stochastic parrot.

Starting point is 00:33:43 How do you define AGI? Okay, so AGI. So how do I define AGI? So the way I would say it is that LLMs currently navigate through this known Bayesian manifold,

Starting point is 00:34:01 AGI will create new manifolds. So right now, these models navigate, they do not create. AGI will be when we're able to create new science, new results, new math. When an AGI comes up with a theory of relativity, I mean, it's an extremely high bar, but you get what I'm saying. It has to go beyond

Starting point is 00:34:21 what it has been trained on to come up with new paradigms, new science, and that's, by definition, AGI. Vishal, do you think that, based on the work you've done, you can bound the amount of data,

Starting point is 00:34:37 or data or compute, that would be needed in order for it to evolve? So one of the problems, if you just take LLMs as they exist, is there was so much data used to create them that to create a new manifold will need a lot more data just because of the

Starting point is 00:34:57 basic mechanisms, right? Otherwise, it'll just kind of like, you know, get kind of consumed into the existing set of data. Have you found any bounds of what would be needed to actually evolve the manifold in a useful way? Or do you think we just need a new architecture? I personally think that we need a new architecture. The more data that we have, the more compute we have, we'll get maybe smoother manifold. So it's like a map. Yeah, because, I mean, there's this view that people have. They're like, well, Vichal, this is all, this is all, you know, good and well.

Starting point is 00:35:31 But, you know, I could just take an LLM and I can give it eyes and I can give it ears and I can put it in the world and it'll gain information. And based on that intervention, it'll improve itself. And therefore, it can learn new things. But the counterpoint that I've always just intuitively thought to that is the amount of data used to train these things is so large, how much can you actually evolve that manifold given an incremental, I mean, almost none at all, right? There has to be some other way to generate new manifolds that aren't evolving the existing

Starting point is 00:36:02 one. I completely agree. There has to be a new sort of architectural leap that is needed to go from the current, you know, just throwing more data and more compute, you know, it's going to plateau. It's, you know, the iPhone 15, 16, 17. And are there any research directions that are promising in your mind that might help us, you know, go beyond LLM limitations? So, I mean, again, I love LLMs. They are fantastic. They are going to increase productivity like nobody's business. But I don't think they are the answer.

Starting point is 00:36:39 So, you know, Yann LeCun famously says that LLMs are a distraction on the road to AGI. They're a dead end. They're a dead end to AGI. I don't think, I'm not quite in that camp. But I think we need a new architecture to sit on top of LLMs to reach AGI. You know, a very basic thing. You know what Martin just said, you give them eyes and you give them ears. You make them multimodal.

Starting point is 00:37:04 Of course, they'll become more powerful. But you need a little bit more than that. You know, the way human brains learn with very few examples, that's not the way transformers learn. And, you know, I'm not saying that we need to create an Einstein or a Gödel, but there has to be an architectural leap that is able to create these manifolds. And just throwing new data will not do it. It'll just smoothen out the already existing manifolds.

Starting point is 00:37:32 Is that something? So is your goal to actually help, like, think through new architectures, or are you primarily focused on putting formal bounds on existing architectures? A bit of both. I mean, the former goal is the more ambitious one that everybody is chasing, and yeah, I think about that constantly. Are there any new even like sort of hints at a new architecture, or like have we started to make any progress

Starting point is 00:38:00 on new architectures or is it? You know, Yann has been pushing this JEPA architecture, energy-based architectures. They seem promising. The way I have been sort of thinking about it is, you know, there's this benchmark, the ARC Prize. Yeah.

Starting point is 00:38:33 Right? Mike Knoop and François Chollet have. And if you understand why the LLMs are failing on this test, maybe you can sort of reverse engineer a new architecture that will help you succeed in that, right? And I agree with a lot of what several people say, that, you know, language is great, but language is not the answer.

Starting point is 00:39:01 You know, when I'm looking at catching a ball that is coming to me, I'm mentally doing that simulation in my head. I'm not translating it to language to figure out where it will land. I do that simulation in my head. So, you know, one of the new architectures, architectural things is, how do we do, how do we get these models to do approximate simulations to test out that idea and whether to proceed or not. So, so, yeah, we have, you know,

Starting point is 00:39:34 another thing that I've always wondered about is, did we develop as humans, did we develop language because we were intelligent, or because we developed language, we accelerated our intelligence? So I don't know which side of the camp you fall on that question. I mean, what's interesting is, like, you have these anecdotal examples of humans developing languages de novo that have been recorded, right?

Starting point is 00:40:02 Like it's either the Guatemalan or Nicaraguan sign language, right? Where there is these students that develop their own language without being taught. And so that would suggest that language follows intelligence. The problem is they're all anecdotal, right? Like who knows if somebody didn't teach them sign language? Like nobody really knows. There is no controls. So this is all these observational studies.

Starting point is 00:40:25 And there's so few of them, you have to wonder if it's just kind of sloppy observation. And so I think that the question is still outstanding. Yeah. So, I mean, language definitely accelerated our intelligence. There's no question about that. But which followed which we don't know. I view it as a networking problem naturally, which is once you have languages, you can communicate. And when you can communicate, you can store, you can replicate, yeah.

Starting point is 00:40:53 Yeah, exactly. Exactly right. Cool. Again, this is kind of a wonky question, but, you know, I think one thing that you've brought to the discourse, and for those that are listening to this, I really think that you should look up Vishal's work and read it. I just think it'll give you a really, really, especially if you have a systems background, like a networking systems background, it'll give you a really, really good understanding of kind of the bounds on these. But, like, the toolkit that you draw from is, like, information theory and, like, more formal. Have you found that the AI community is receptive to this?

Starting point is 00:41:27 Or is it like two different cultures, two different planets trying to communicate and not a lot of common ground? How have you found bringing the networking view of the world to the AI realm? Some of them are receptive to it, definitely. But, you know, these large conferences and their reviewing process, it's so random. And the kind of questions they ask,

Starting point is 00:41:55 you know, I'm a modeling person. I like to model things. And, you know, I submitted one version of this work to one very famous machine learning or AI conference. And the reviewer said, okay, this is a model, so what? So there is... That's absolutely remarkable. So, like, you're actually taking a system that nobody understands, we have no models for.

Starting point is 00:42:25 You actually provided some model that we can use to analyze it. And that alone wasn't sufficient. They're asking, so where are the large-scale experiments to prove this? I do, listen, I honestly, I mean, I find there's so much empiricism in, like, the current, you know, AI community. Exactly because we don't understand the systems. You know, it kind of reminds me. I feel like systems went the other way, right? It's like we had all of these models,

Starting point is 00:42:51 but then we didn't understand how the systems worked, and then we just actually did measurement. It feels like the AI stuff is the opposite, which is like, we know we don't understand them, and so we just measure them, but now we're trying to come up with the models. Yeah, exactly. So it was so easy in some sense to build these artifacts

Starting point is 00:43:11 and then just measure them that people have been going around trying to do that. And one term I really dislike is prompt engineering. Why? Engineering used to mean sending a man to the moon or providing five-nines reliability. Prompt engineering is prompt twiddling. You fiddle with a prompt and the model changes and the inference, the output changes.

Starting point is 00:43:39 And you have hundreds of papers just doing one thing or the other, changing a prompt this way, that way, and writing their observations. And as a result, you know, lots of these papers are being written, are being submitted for review. Reviewers get busy looking at all this kind of empirical work. And my personal taste is to first try to understand and model it. Yeah. And then you can do the other things. So like a true theory guy. I don't know about this bit twiddling.

Starting point is 00:44:15 Let me ask one more LLM question, which is, are there any benchmarks or real-world tasks that, if they occurred, you'd sort of reevaluate and say, hey, maybe LLMs are, you know, closer to the path to AGI than I thought? If there were any real-world tasks. Good question.

Starting point is 00:44:47 You know, which for LLMs or these models, the one domain where you have the most training data is probably coding.

Starting point is 00:45:01 And coding is where you can also have the most structure. And yet, anyone who's used these tools, whether it's Cursor or whatever, or Claude Code, LLMs continue to hallucinate, continue to generate unreasonable code.

Starting point is 00:45:24 You know, you have to, you have to constantly babysit these models. So the day an LLM can create a large software project without any babysitting is the day I'll be a little bit convinced

Starting point is 00:45:44 that it's towards AGI. But again, I don't think it'll be able to create new science. If it does, that's when I'll be convinced. I think that you can almost

Starting point is 00:45:56 take a definitional approach to answer this question Vishal like the problem with these types of questions is if you have billions of dollars and you can collect whatever data you want. You can make a model do anything you want, right? And so like, you know what I'm saying? Like, at some level, you've got this entire capital structure, machinery behind these models. So you're like, oh, it can be good at science. Well, sure, you put a billion dollars at solving materials science and collect all this data, you'll be good at material science or whatever it is. And so, but there is a definitional answer,

Starting point is 00:46:26 which is, and I'm going to draw from your work, which is there is a manifold that's there based on the data it's been trained on. And then the question is, if it ever produces something that's off, like a new manifold, so considering the existing training data, if it ever does that, if it does something that's outside of that distribution, then clearly we're on a path to learning new things. And if not, then everything is just a computational step from what's already known. Yeah. And I guess the counter would be maybe all humans do is work on their own manifold and Einstein,

Starting point is 00:47:02 was lucky or something, I guess would be the counter to that. But there are several, many such examples. And yeah, it's creating this new manifold. I didn't want to use that definitional answer. I thought it might sound too. Yeah. Too wonky, too mathematical.

Starting point is 00:47:20 But essentially, if LLMs really created this new manifold, then I would be convinced. But so far, they have just gotten better at navigating the existing manifold, the existing training set. Which is hugely powerful and is going to change the world. Which is hugely powerful.

Starting point is 00:47:37 I'm not denying that. I think they are extremely, extremely good at what they can do. But there's a limit to what they can do. So I've one quick question. What's next for you? I mean, you've tackled in-context learning. You've got a model for LLMs

Starting point is 00:47:49 and you've got a generalized model for, like, their solution space. What are you thinking about tackling next? In terms of modeling or? Academically, on LLMs. Academically, I'm, you know, I'm thinking of this. What is the architectural leap that is needed?

Starting point is 00:48:13 Oh, that's exciting. To create this new manifold. And how do we use, you know, multimodal data? Awesome. To expand. When you figure that out, come back and talk to us. That's right. We'd love that.

Starting point is 00:48:27 So, I mean, you know, even with LLMs, you know, in the paper, we say that you can improve the inference by following this low or minimum entropy path. So that's a very sort of small step that we are building and training models that will do inference based on the entropy path. Yeah. By the way, is model probe still up?

Starting point is 00:48:55 Token probe. Yeah, yeah. Token probe is still up. And you can see actually the, you know, token probe is a software that we built and thanks to Martin and A16Z's generosity is running on your servers and anyone can go and test.

Starting point is 00:49:10 And what we have done there is we actually show the entropy. Yeah. It is so enlightening. I recommend anybody listening to this who's interested. Actually, check out token probe. It only shows you the confidence. Yeah, as you go along.

Starting point is 00:49:23 It's remarkable. So in-context learning, you know, you create your new DSL and you give it to the prompt and you can see the confidence rising with each new example, the entropy reducing. And that sort of is a validation of the model.
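
The effect described here, confidence rising and entropy falling with each in-context example, can be sketched with a toy Bayesian model. This is a minimal sketch under stated assumptions, not the speakers' model or the token probe implementation: the three candidate rules in `HYPOTHESES`, their probabilities, and the function names are all hypothetical and chosen only for illustration.

```python
import math

# Hypothetical candidate "rules" for a made-up DSL keyword. Each rule is a
# distribution over the next token. These numbers are illustrative only.
HYPOTHESES = {
    "rule_a": {"X": 0.90, "Y": 0.05, "Z": 0.05},
    "rule_b": {"X": 0.05, "Y": 0.90, "Z": 0.05},
    "rule_c": {"X": 0.05, "Y": 0.05, "Z": 0.90},
}
TOKENS = ["X", "Y", "Z"]


def entropy(dist):
    """Shannon entropy, in bits, of a dict mapping token -> probability."""
    return -sum(p * math.log2(p) for p in dist.values() if p > 0)


def update(posterior, observed):
    """One Bayes step: reweight each rule by how well it explains the example."""
    unnorm = {h: posterior[h] * HYPOTHESES[h][observed] for h in posterior}
    total = sum(unnorm.values())
    return {h: w / total for h, w in unnorm.items()}


def predictive(posterior):
    """Next-token distribution as a posterior-weighted mixture of the rules."""
    return {
        t: sum(posterior[h] * HYPOTHESES[h][t] for h in posterior)
        for t in TOKENS
    }


def entropy_trace(examples):
    """Entropy of the predictive distribution before and after each example."""
    posterior = {h: 1.0 / len(HYPOTHESES) for h in HYPOTHESES}  # uniform prior
    trace = [entropy(predictive(posterior))]
    for obs in examples:
        posterior = update(posterior, obs)
        trace.append(entropy(predictive(posterior)))
    return trace


# Four in-context examples that all agree with "rule_a".
trace = entropy_trace(["X", "X", "X", "X"])
for step, bits in enumerate(trace):
    print(f"after {step} examples: {bits:.3f} bits")
```

In this toy setup the entropy starts at log2(3) bits under the uniform prior and shrinks with each consistent example, leveling off near the entropy of the best-fitting rule itself rather than at zero.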

Starting point is 00:49:38 You can see it sort of unfurling right in front of your eyes. The token probe is chaotic. Thanks, thanks again. Vishal, thanks so much for coming on the podcast. It's a great conversation. It was great fun. Thank you.

Starting point is 00:49:50 Thank you so much again. Thanks for listening to this episode of the A16Z podcast. If you like this episode, be sure to like, comment, subscribe, leave us a rating or review, and share it with your friends and family. For more episodes, go to YouTube, Apple Podcasts, and Spotify. Follow us on X at A16Z, and subscribe to our Substack at A16Z.com.

Starting point is 00:50:15 Thanks again for listening, and I'll see you in the next episode. As a reminder, the content here is for informational purposes only. It should not be taken as legal, business, tax, or investment advice, or be used to evaluate any investment or security, and is not directed at any investors or potential investors in any A16Z fund. Please note that A16Z and its affiliates may also maintain investments in the companies discussed in this podcast. For more details, including a link to our investments, please see A16Z.com forward slash disclosures.
