a16z Podcast - Columbia CS Professor: Why LLMs Can’t Discover New Science
Episode Date: October 13, 2025

From GPT-1 to GPT-5, LLMs have made tremendous progress in modeling human language. But can they go beyond that to make new discoveries and move the needle on scientific progress? We sat down with distinguished Columbia CS professor Vishal Misra to discuss this, plus why chain-of-thought reasoning works so well, what real AGI would look like, and what actually causes hallucinations.

Resources:
Follow Dr. Misra on X: https://x.com/vishalmisra
Follow Martin on X: https://x.com/martin_casado

Stay Updated:
If you enjoyed this episode, be sure to like, subscribe, and share with your friends!
Find a16z on X: https://x.com/a16z
Find a16z on LinkedIn: https://www.linkedin.com/company/a16z
Listen to the a16z Podcast on Spotify: https://open.spotify.com/show/5bC65RDvs3oxnLyqqvkUYX
Listen to the a16z Podcast on Apple Podcasts: https://podcasts.apple.com/us/podcast/a16z-podcast/id842818711
Follow our host: https://x.com/eriktorenberg

Please note that the content here is for informational purposes only; should NOT be taken as legal, business, tax, or investment advice or be used to evaluate any investment or security; and is not directed at any investors or potential investors in any a16z fund. a16z and its affiliates may maintain investments in the companies discussed. For more details please see a16z.com/disclosures.
Transcript
Any LLM that was trained on pre-1915 physics would never have come up with a theory of relativity.
Einstein had to sort of reject the Newtonian physics and come up with this space-time continuum.
He completely rewrote the rules.
AGI will be when we're able to create new science, new results, new math.
When an AGI comes up with a theory of relativity, it has to go beyond what it has been trained on.
To come up with new paradigms, new science.
And that's, by definition, AGI.
Vishal Misra was trying to fix a broken cricket stats page
and accidentally helped spark one of AI's biggest breakthroughs.
On this episode of the A16Z podcast,
I talk with Vishal and A16Z's Martin Casado
about how that moment led to retrieval-augmented generation
and how Vishal's formal models explain what large language models can and can't do.
We discussed why LLMs might be hitting their limits,
what real reasoning looks like,
and what it would take to go beyond them.
Let's get into it.
Martin, I knew you wanted to have Vishal on.
What do you find so remarkable about him and his contributions that inspired this?
Vishal and I actually have very similar backgrounds.
We both come from networking.
He's a much more accomplished networking guy than I am.
That's a high bar given your field.
And so we actually view the world in an information theoretic way.
It is actually part of networking.
And with all this AI stuff, there's so much work trying to create models
that can help us understand how these LLMs work.
And in my experience over the last three years, the ones that have most impacted my understanding,
and I think have been the most predictive, are the ones that Vishal has come up with.
He did a previous one that we're going to talk about, called Matrix, is it?
Beyond the black box, but yeah.
Beyond the black box.
Actually, we should put this in the notes for this, but the single best talk I've ever seen
on trying to understand how LLMs work is one that Vishal did at MIT,
which Hari Balakrishnan pointed me to, and I watched that.
So he did that work, and then he's doing more.
recent work that's actually trying to scope out not only how LLMs reason, but it has
some reflections on how humans reason too. And so I just think he's doing some of the more
profound work in trying to understand and come up with models, formal models, for how LLMs
reason. On that note, you said his most recent work helped change how you think about how humans reason.
Can you flesh that out a little bit?
Well, okay, so can I just try to take a rough sketch at it and then you just tell me how wrong I am?
Go right ahead. You're trying to describe how LLMs work. And one thing that you've found is
that they reduce a very, very complex multidimensional space
into basically a geometric manifold
that's a reduced state space.
So it's reduced degrees of freedom.
But you can actually predict where in the manifolds
the reasoning can move to, roughly.
So you've reduced the dimensionality of the problem
to a geometric manifold,
and then you can actually formally specify
kind of how far you can reason within that manifold.
And the articulation is that we,
or one of the intuitions is that we as humans do the same thing,
is we take this very complex, heavy-tailed, stochastic universe,
and we reduce it to kind of this geometric manifold,
and then when we reason, we just move along that manifold.
Yeah, I think you captured it accurately.
That's kind of the spirit of the work.
Wait, wait, can I just hear it in your words?
Because I'm a VC, so.
You know, VC with an H-index of what, 60?
True.
Yeah, so ultimately what all these LLMs are doing,
whether the early LLMs or the LLMs that we have today
with all sorts of post-training, RLHF,
whatever you do, at the end of the day,
what they do is they create a distribution for the next token.
So given a prompt, these LLMs create a distribution,
for the next token or the next word
and then they pick
something from that distribution
using some kind of algorithm
to predict the next token,
pick it and then keep going.
Now, what happens
because of the way we train these LLMs,
the architecture of the transformers
and the loss function,
the way you put it is right,
it sort of reduces the world into these Bayesian
manifolds.
Yeah.
And as long as the LLM
is going
in sort of traversing through these manifolds,
it is confident.
And it can produce something which makes sense.
The moment it sort of veers away from the manifold,
then it starts hallucinating and spouting nonsense.
Confident nonsense, but nonsense.
So it creates these manifolds.
And the trick is the distribution that is produced,
you can measure the entropy of the distribution.
Right?
Entropy in the Shannon sense, Shannon entropy, not thermodynamic entropy.
So suppose you have a vocabulary of, let's say, 50,000 different tokens, and you have a next-token distribution over these 50,000 tokens.
So let's say, the cat sat on the, right? If that is the prompt, then the distribution will have a high probability for mat, or hat, or table, and a very low probability for, let's say, ship, or whale, or something like that, right?
So because of the way it's trained, it has these distributions.
Now, these distributions can be low entropy or high entropy.
A high-entropy distribution means that there are many different ways that the LLM can go, with high enough probability for all those paths.
Low entropy means that there are only a small set of choices for the next token.
And the prompts you can also categorize into two kinds of prompts: one prompt is, as you can say, high information entropy, and one prompt is low information entropy.
So the way these manifolds work, the LLMs start paying attention to prompts that have high information entropy and low prediction entropy.
So what do I mean by that?
So when I say, I'm going out for dinner.
Yeah.
Right.
So when I say, I'm going out for dinner, that phrase, the LLMs have been trained.
They've seen it a lot.
And there are many different directions I can go with it.
I can say, I'm going for dinner tonight.
I'm going for dinner to McDonald's or I'm going to dinner, blah, blah, blah.
There are many different.
But when I say, I'm going to dinner with Martin Casado, you know, for the LLM, now this is
information rich.
This is sort of a rare phrase.
And now the sort of realm of possibilities reduces,
because Martin is only going to take me
to Michelin-star restaurants.
I'm not going to go to McDonald's.
You get what I'm saying.
The moment you add more context,
you make the prompt information rich,
the prediction entropy reduces.
Yep, yep, yep, yep.
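To make the entropy idea concrete, here is a small illustrative sketch in Python; the vocabularies and probabilities are invented for illustration, not taken from any actual model.

```python
import math

def shannon_entropy(dist):
    """Shannon entropy in bits of a next-token distribution {token: probability}."""
    return -sum(p * math.log2(p) for p in dist.values() if p > 0)

# Hypothetical next-token distributions (numbers are made up for illustration).
vague = {"mat": 0.25, "hat": 0.2, "table": 0.2, "floor": 0.2, "chair": 0.15}   # "The cat sat on the ..."
specific = {"mat": 0.9, "rug": 0.07, "floor": 0.03}                            # same prompt with more context added

print(shannon_entropy(vague))     # ~2.30 bits: many plausible continuations
print(shannon_entropy(specific))  # ~0.56 bits: entropy collapses as context narrows the options
```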
And another example that I often say...
But just quickly,
what is your takeaway?
What is your implication from that?
Which is, of course, as you're interested... so, yeah, sorry, I forgot how you described it, but the more precise you are, the more tokens you use, I presume, the fewer options you have for the next token.
Is that correct or not correct?
Yeah, yeah, essentially.
So you're reducing it.
You're reducing it to a very specific state space when it comes to
confidence in an answer,
and this is kind of a manifold that you can go
on. And then,
I mean, do you,
do you have
kind of a conclusion of what that means
for systems or what that means for
reasoning, or is it just a nice way
to articulate the bounds of LLMs?
No,
there is something,
I don't know,
I don't know if I should say profound, but there is something
about it which tells you what these LLMs
can or cannot do.
right so one of the
examples that I often tell is, suppose I ask you what is 769 times 1,025?
You have no idea.
You can have some vague idea given the two numbers, right?
And so in your mind, the next token distribution of the answer is going to be diffuse, right?
You don't know.
You have maybe a vague guess.
If you are mathematically very good, maybe your guess is more precise, but it is going
to be diffuse.
and it's not going to be the correct answer.
But if you say, can I write it down and do it
the way we have learned multiplication tables,
now you know exactly what to do the next step.
You write 769 and then 1025 and then you know exactly.
So at each stage of that process,
your prediction entropy is very low.
You know exactly what to do.
Because you have been taught this algorithm.
And by invoking this algorithm, saying,
okay, I'm not going to just guess the answer,
but I'm going to do it step by step,
then your prediction entropy reduces.
And you can arrive at an answer
which you're confident of and which is correct.
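As an aside, the step-by-step idea from the multiplication example can be written out directly; the decomposition below is just one of many ways to break the problem into familiar, low-entropy steps.

```python
# A worked version of the 769 x 1,025 example: each step is a small,
# well-known operation (low prediction entropy), and chaining the steps
# yields an answer you can be confident in, instead of a diffuse guess.
a, b = 769, 1025
partial_1000 = a * 1000        # 769,000
partial_25   = a * 25          #  19,225
answer = partial_1000 + partial_25
print(answer)                  # 788,225
assert answer == a * b         # same result as a direct multiply, but every step was certain
```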
And the LLMs are pretty much the same way.
That's why chain of thought works.
What happens with chain of thought is
you ask the LLM to do something with
chain of thought.
It starts breaking the problem into
small steps. These steps it has seen in the past; it has been trained on them, maybe with some different
numbers, but the concept it has been trained on. And once it breaks it down, then it's confident: okay, now
I need to do A, B, C, D, and then I arrive at this answer, whatever it is.
Let's zoom back. I want to zoom back
into LLMs, but first, Vishal, maybe you can give more context on your background and how
that informs your work here.
Okay, so...
Yeah, as Martin said, my background is very similar to his.
We, you know, we come from doing networking.
So my PhD thesis, my sort of early work at Columbia has all been in networking.
But there's another side of me, another hat that I wear, which is both an entrepreneur and a cricket fan.
I was going to say, don't you own a cricket team or something?
I'm a minority owner of your local cricket team, the San Francisco Unicorns.
Yeah, that's right.
I was very proud to have you.
But, so in the 90s, I was one of the people who started this portal called Cricinfo.
And Cricinfo, at one point, was the most popular website in the world.
It had more hits than Yahoo.
That was before India came online.
And so, you know, we built... cricket is a very stats-driven sport,
you think baseball multiplied by 1,000.
And we had built this free searchable stats database on cricket called Statsguru.
And this has been available on Cricinfo since 2000.
But because you can search for anything, everything was made available on Stats Guru.
And you know, you can't expect people to write SQL queries to query everything.
So how do you, how would we do it?
well, it was a web form,
you know, where you could formulate your query using that form,
and in the back end, that was translated into SQL query,
got the results and got it back.
But as a result, because you could do everything,
everything was made available,
the web form had like 25 different checkboxes,
15 text fields, 18 different drop downs.
The interface was a mess.
It was very daunting.
So,
And ESPN acquired
Cricinfo in mid-2006, I think,
but they still kept the same interface.
And that has always sort of nagged me.
And so I still know the people...
Wait, wait, what nagged you?
Is it that Cricinfo did not have natural language
and had a web form for doing queries?
That web form was terrible.
Because of that, only the real nerds use...
Of all the things of the world to bother you.
The fact that an old website
used a web form.
I appreciate your commitment to aesthetics.
So I'm still friendly with the people who run ESPNcricinfo.
The editor-in-chief, whenever he comes to New York, we meet up, we go out for a drink.
And so he was here in 2020.
So now the story shifts to how LLMs and I sort of met.
So January 2020, right before the pandemic, he was here,
and I again said, why don't you do something about Statsguru?
And he looks at me and says, why don't you do something about Statsguru?
He was kind of joking, but he thought maybe, you know, I had some ways to fix the interface.
So anyway, then the pandemic hit, the world stopped.
But in July of 2020, the first version of GPT-3 was released.
And I saw someone use GPT-3 to write a SQL query
for their own database using natural language.
And I thought, can I use this to fix Stats Guru?
So I got early access to GPT-3, you know, getting access those days was difficult,
but somehow I got it.
But soon I realized that, you know, no, I cannot really do it.
Because for Statsguru, the backend databases were so complex.
And if you remember, GPT-3 had only a 2,048-token context window.
There was no way in hell
I could fit the complexities
of that database in that context window.
And
GPT-3 also did not do
instruction following at that time.
But then,
in trying to solve this problem,
I accidentally invented
what's now called RAG,
where, based on
the natural language query,
I created a database of natural
language queries and
the corresponding structured queries.
Like, I created a DSL,
which then translated into a REST
call to Statsguru.
So based on the new query,
I would look through my set of
natural language queries, I had about
1,500 examples, and I would pick
the six or seven most relevant ones,
and then those, and the structured
queries, I would send as a prefix, along with the new
query, and GPT-3 magically
completed it, and the accuracy was very high.
So that had been running in production
since September 2021.
You know, about 15 months before ChatGPT came
and, you know, the whole revolution in some sense started,
and RAG became very popular.
I didn't call it RAG,
but this is something sort of I accidentally did
in trying to solve that problem for Cricinfo.
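A minimal sketch of the retrieval-augmented pattern described here, assuming a toy relevance score and a hypothetical example bank; the actual Statsguru system presumably differed in its details.

```python
# Illustrative only -- the example bank, scoring function, and prompt format are
# placeholders, not the actual Cricinfo/Statsguru implementation.

def score(query: str, example_question: str) -> float:
    """Toy relevance score based on word overlap; a real system might use embeddings."""
    q, e = set(query.lower().split()), set(example_question.lower().split())
    return len(q & e) / max(len(q | e), 1)

def build_prompt(query: str, bank: list[tuple[str, str]], k: int = 6) -> str:
    """Pick the k most relevant (natural language question, DSL query) pairs
    and prefix them to the new query, so the model can complete the DSL."""
    top = sorted(bank, key=lambda ex: score(query, ex[0]), reverse=True)[:k]
    shots = "\n".join(f"Q: {q}\nDSL: {d}" for q, d in top)
    return f"{shots}\nQ: {query}\nDSL:"

# `bank` would hold on the order of 1,500 hand-written (question, DSL) pairs;
# the completed DSL is then translated into a REST call against the stats backend.
```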
Now, once I built it, you know,
I was thrilled at this work,
but I had no idea why it worked.
You know, I stared at that transformer architecture diagram.
I read those papers, but I couldn't understand how or why it worked.
So then I started in this journey of developing a mathematical model,
trying to understand how it worked.
So that's been sort of my journey through this world of AI and LLMs
because I was trying to solve this cricket problem.
Yeah, amazing.
And so maybe reflecting back since the
release of GPT-3, what has most surprised you about how LLMs have developed?
So what has most surprised me is the pace of development.
So GPT-3 was, you know, it was a nice parlor trick, and you had to jump through hoops
to get it to do something useful.
But starting with, you know, ChatGPT, which was an advance over GPT-3.
And then you had all these things like chain of thought, instruction following.
GPT-4 really made it polished.
And, you know, the pace of development has really surprised me.
Now, you know, when I started working with GPT-3,
I could sort of see what its limitations were,
what I could make it do, what I couldn't make it do.
But I never thought of it as, you know,
what these LLMs have become for me now
and what they have become for millions of people around the world.
We treat these models as our co-workers,
almost like an intern,
that, you know, you're constantly
chatting with them,
brainstorming,
making them do all sorts of work,
which we could not have imagined,
you know,
when ChatGPT was just released.
It was nice.
It could write poems.
It could write limericks.
It could answer some questions, sometimes with hallucinations.
But the capabilities that have emerged now,
that pace has been very sort of surprising to me.
Do you see progress plateauing?
Or how do you, either now or in the near future,
how do you see it going?
I, yes, in some sense, progress is plateauing.
It's like the iPhone, you know, when the iPhone came out, wow, what is this thing?
And the early iterations, you know, constantly we were amazed by new capabilities.
But the last, you know, seven, eight, nine years, it's maybe the camera got a little bit better or, you know, one thing changed here or memory is more.
but there has been no fundamental advance
in what it's capable of, right?
You can sort of see a similar thing
happening with these LLMs.
And this is not true for just one company
and one model, right?
You look at what OpenAI is coming up with,
or what Anthropic or Google,
or all these open-source
models, or Mistral.
The capabilities of LLMs
have not fundamentally
changed. They've become better, right?
They've improved. But they
have not crossed into
a different
realm. So this is something
that I really appreciate about
your work. And so
the thing that really struck
me is as soon
as these things showed up, you actually
got busy
trying to have a formal model of what
they're capable of, which
was in stark contrast to what everybody else was doing.
Everybody else was like,
AGI.
These things are going to, you know, recursively self-improve, like, or they'll say, oh, these are just stochastic parrots, which doesn't mean anything.
So everybody had rhetoric, and sometimes this rhetoric was fanciful, and sometimes this rhetoric was almost reductionist, like, oh, it's just a database, which is clearly not true.
And the thing that really struck me about your work is you're like, no, let's figure out exactly what's going on, let's come up with a formal model.
And once we have a formal model, we could reason about what that means.
And then, you know, in my reading of your work, I kind of break it up in two pieces.
There's the first one where you basically, you came up with this, you know, matrix abstraction.
I think it's worth you talking through.
And then you took in-context learning as an example and you mapped it to Bayesian reasoning,
which to me was incredibly powerful because at the time, nobody knew why in-context learning worked.
So I think it would be great for you to discuss that because, again, I think it was the first real kind of formal take on how these things are working.
And then the more recent work that you're working on now
is a kind of more generalized version of what is the state space
that these models output when it comes to confidence,
which is the manifold that we're talking about before.
So I think it would be great if you just described your matrix model
and then how you use that to provide some bounds
on what in-context learning is doing.
What's happening?
Okay, so, so yeah, let's start with that matrix abstraction.
So the idea with the matrix is you have this gigantic matrix
where every row corresponds to a prompt.
And then the number of columns of this matrix
is the vocabulary of the LLM,
the number of tokens it has that it can emit.
So for every prompt,
this matrix contains the distribution over this vocabulary.
So when you say the cat sat on the,
you know, the column that corresponds to mat
will have a high probability.
Most of them will be zero.
But, you know, reasonable continuations will have a non-zero probability.
And so you can imagine that there's this gigantic matrix.
Now, the size of this matrix is, you know,
if we just take just the old
first-generation GPT-3 model,
which had a context window of 2,048 tokens
and a vocabulary of
50,000 tokens,
then the size of it, the number of rows
in this matrix, is more than the number of atoms
across all galaxies that we know of.
So clearly we cannot represent it exactly.
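A quick back-of-the-envelope check of that scale, using the numbers quoted above (2,048-token context, roughly 50,000-token vocabulary):

```python
import math

# Counting every possible prompt of 2,048 tokens over a 50,000-token vocabulary
# gives on the order of 50,000^2048 rows in the abstract matrix.
vocab, context = 50_000, 2_048
log10_rows = context * math.log10(vocab)   # ~9,623
print(f"~10^{log10_rows:.0f} rows")        # vs. roughly 10^80 atoms in the observable universe
```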
Now, fortunately, a lot of these rows do not appear in real life.
An arbitrary collection of tokens, you are not going to use that as a prompt.
Similarly, a lot of these rows are absent and a lot of the column values are also zero.
When you say the cat sat on the, it's unlikely to be followed by the token corresponding
to, let's say, numbers,
or an arbitrary collection of tokens.
There will only be a very small subset of tokens
that can follow a particular prompt.
So this matrix is very, very sparse.
But even after that sparsity
and even after removing the sort of gibberish prompts,
the size of this matrix is too much for these models to represent,
even with a trillion parameters.
So what, in an abstract sense,
what is happening is,
the models get trained on certain, you know, data from the training set.
And for a certain subset, a small subset of these rows, you have reasonable values
for the next-token distribution.
Whenever you give the prompt something new, then it will try to interpolate between what it has learned
and what's there in the new prompt,
and come up with a new distribution.
But it's basically, so it's more than a stochastic parrot.
It is sort of Bayesian on this subset of the matrix
that it has been trained on.
So when I say, you know, I'm going out for dinner with Martin tonight.
Now, I'm reasonably sure that it has never encountered that phrase
in its training data, right?
but it has encountered variants of this phrase,
and given that I'm going out with Martin
it can produce a Bayesian posterior
it uses that evidence that Martin is the one
that I'm going for dinner with
and it will produce a next token distribution
that will focus on the likely places that we are going
So this matrix,
because it's represented in a compressed way,
yet the models respond to everything,
every prompt. How do they do it? Well, they go back to what they've been trained on,
interpolate there, and use the prompt as sort of some evidence to compute a new
distribution, right?
Right, so the context of the prompt impacts the posterior distribution.
Exactly, yeah.
Right, and you mapped it to Bayesian learning, where the context
is the new evidence.
New evidence, exactly.
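A toy sketch of the "context as evidence" idea as a Bayesian update; the priors and likelihoods below are invented purely for illustration.

```python
# Prior over dinner destinations before any context is given (made-up numbers).
prior = {"McDonald's": 0.4, "pizza place": 0.35, "Michelin-star": 0.25}

# Likelihood of each destination given the added evidence "with Martin" (also made up).
likelihood_given_martin = {"McDonald's": 0.05, "pizza place": 0.15, "Michelin-star": 0.8}

# Posterior is proportional to prior times likelihood, then normalized.
unnorm = {k: prior[k] * likelihood_given_martin[k] for k in prior}
z = sum(unnorm.values())
posterior = {k: v / z for k, v in unnorm.items()}

print(posterior)  # probability mass shifts toward "Michelin-star" once the evidence is in the prompt
```

The mass shifts toward the continuations the evidence supports, which is the sense in which the added context acts as evidence for the posterior next-token distribution.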
So I'll give you, so, so for instance,
the cricket example that I spoke about earlier.
Yeah.
So I created my own DSL,
which, you know,
mapped a natural language query in cricket to this DSL,
which then I can translate into a SQL query
or a Rest API, whatever.
But getting the DSL is important.
Now, these LLMs have never seen that DSL.
I designed it.
Yeah.
But yet, after showing a few examples, it learned it. How did it learn it?
And this is, this is in the
prompt? You didn't do any training? It's in the prompt, right? So, like, the weights are
static? Yeah, yeah, yeah. This was happening in October 2020, right? I had no access to the
internals of OpenAI. I could just, you know, access their API. OpenAI had no access to the internal structure of
Statsguru or the
DSL that I cooked up in my head.
Yet, after showing it only a few examples, it learned it right away.
So that's an example where it has seen DSLs or structures in the past.
And now using this evidence that I show, okay, this is what my DSL looks like.
Now, a new natural language query, it is able to create the right posterior distribution for
the tokens that map to the examples I've shown it.
Now, the other beautiful thing about this is
this is an example of few-shot learning or in-context learning, right?
But when I give that prompt, along with these examples, to this LLM,
I'm not saying to the LLM, okay, this is an example of few-shot learning,
so learn from these examples, right?
You just pass this to the LLM as a prompt
and it processes it exactly the way it would process any other prompt,
that is not an example of in-context learning.
So that really means that the underlying mechanism is the same.
Whether you give a set of examples
and then ask it to complete a task,
like in in-context learning,
or just give it some kind of prompt for continuation,
like I'm going out for dinner with Martin tonight.
There's no in-context learning there.
But the process with which it's generating
or doing this inferencing is exactly the same.
And that's what I have been trying to model
and come up with a formal model of.
What I've found very impressive is
you've used this basic model
to show a number of things, right,
to describe context learning into maps to page and learning.
But you did it for another one
where you kind of, you've sketched out
this almost glib argument on Twitter on X
where you made this
you made a rough argument
for why recursive self-improvement
can't happen
without additional information
and so maybe
just walk through very quickly
how like this same model
you can just very quickly show
that a model can never
recursively self-improve.
So
you know, another phrase
that
we have been using
recently
is, you know, the output of the
LLM is the inductive closure of
what it has been trained on.
So when you say that
it can recursively self-improve,
it could mean
one of two things.
So let's get back to the...
Well, actually, you know what's kind of interesting
is like often the...
Most people agree that if you have one LLM
and you just feed the output and the input,
like it's not going to do anything.
But then often people will say,
well, what if you have two LLMs,
you have no external
information, but you have two LLMs talking to each other, maybe they can improve each other
and then you can have like, you know, a takeoff scenario. But again, you even address this,
even in the case of like N number of LLMs using kind of the matrix model to show that like
you just aren't gaining any information. Yeah. Yeah. So you can represent the sort of information
contained in these models. And let's go back to that matrix analogy that we have, the matrix
abstraction. So like I said, you know, these models represent a subset of the rows,
right? So a subset of the rows are represented, but some of these rows are able to
help fill out some of the missing rows. For instance, you know, if the model knows how to do
multiplication by doing it step by step, then every row that is corresponding to, let's say,
769 times 1,025 or whatever.
It can fill out the answer, yeah.
It can fill out the answer because it has those algorithms sort of embedded in them.
You just need to unroll them.
So it can sort of self-improve up to a point.
But beyond a point, these models can only sort of generate what they have been trained on.
So let me give you, I'll give it three examples.
Yeah.
So any model, any LLM that was trained on pre-1915 physics
would never have come up with the theory of relativity.
Einstein had to sort of reject the Newtonian physics
and come up with this space-time continuum.
He completely rewrote the rules, right?
So that is an example of, you know, AGI,
where you are generating new knowledge
about the universe, right?
It's not simply unrolling,
it's not computing something.
It's actually discovering something
fundamental about the universe.
And for that, you have to go outside your training set.
Similarly, you know, any LLM
that was trained on Newton would not have come up
with quantum mechanics.
Right?
The wave-particle duality, or this whole
probabilistic notion, or that, you know,
energy is not continuous, but it is quantized.
You had to reject Newtonian physics.
Yeah.
Or Gödel's incompleteness theorem.
He had to go outside the axioms to say that, okay, it is incomplete.
So those are examples where you're creating new science or fundamentally new results.
That kind of self-improvement is not possible with these architectures.
They can refine these, they can fill out these rows where the answer already exists.
Another example, you know, which has received a lot of press these days, is these IMO results, the International Math Olympiad.
You know, whether it's a human solving it or the LLM solving it,
they are not inventing new kinds of math
they are able to connect known results
in a sequence of steps to come up with the answer
so even the LLMs what they are doing is they are exploring all sorts of solutions
and in some of these solutions, they start going on this path
where their next token entropy is low.
So that's where I say they are in that Bayesian manifold,
where you have this entropy collapse.
And by doing those steps, you arrive at the answer.
But you're not inventing new math.
You're not inventing new axioms
or new branches of mathematics.
You're sort of using what you've been trained on
to arrive at that answer.
So those things LLMs can do,
you know, they'll get better at it,
at connecting the known dots.
But creating new dots,
I think we need
an architectural
advance.
yeah
So Martin was talking earlier
about how the discourse,
you know, was either
stochastic parrots or,
you know,
AGI and recursive self-improvement or something.
How are you,
how do you conceive of sort of the AGI
discourse, or even
the concept?
What does it mean, to the extent that it's
useful? How do you think about that?
So the way, you know, I think about it, the way we have tried to formulate in our papers,
it's beyond a stochastic parrot, but it's not AGI.
It's doing Bayesian reasoning over what it has been trained on.
So it's a lot more sophisticated than just a stochastic parrot.
How do you define AGI?
Okay, so AGI.
So how do I define AGI?
So the way I would say that LLMs currently navigate through this known Bayesian manifold, AGI will create new manifolds.
So right now these models navigate, they do not create.
AGI will be when we are able to create new science, new results, new math.
When an AGI comes up with a theory of relativity, I mean, it's an extremely high bar, but you get what I'm saying.
It has to go beyond what it has been trained on to come up with new paradigms,
new science. And that's, by definition, AGI.
Vishal, do you think that, based on the work you've done,
can you bound the amount of data or compute that would be needed
in order for it to evolve?
So one of the problems, if you just take LLMs as they exist,
is there was so much data used to create them.
To create a new manifold will need a lot more data
just because of the basic mechanisms, right?
Otherwise, it'll just kind of like, you know,
get kind of consumed into the existing set of data.
Like, have you found any bounds of what would be needed
to actually evolve the manifold in a useful way?
Or do you think we just need a new architecture?
I personally think that we need a new architecture.
the more data that we have the more compute we have
we'll get maybe smoother manifolds
so it's like a map
yeah because I mean there's this view that people have
they're like well
Vishal this is all this is all
good and well but you know I could just take an
LLM and I can give it eyes and I can give it ears
and I can put it in the world and it'll gain information
and based on that intervention it'll improve itself
and therefore it can learn new things
but the counterpoint that I've always just intuitively
thought to that is
the amount of data used to train these things
is so large, how much can you actually
evolve that manifold given an incremental amount of data?
Almost none at all, right?
There has to be some other way
to generate new manifolds that aren't evolving
the existing one.
I completely agree.
There has to be a new sort of architectural leap
that is needed to go from the current,
you know, just throwing more data and more compute.
You know, it's going to plateau.
It's, you know, the iPhone 15, 16, 17.
And are there any research directions that are promising in your mind
that might help us, you know, go beyond LLM limitations?
So, I mean, again, I love LLMs.
They are fantastic.
They are going to increase productivity like nobody's business.
But I don't think they are the answer.
So, you know, Yann LeCun famously says that LLMs are a distraction on the road to AGI.
A dead end.
They're a dead end to AGI.
I don't think, I'm not quite in that camp,
but I think we need a new architecture
to sit on top of LLMs to reach AGI.
You know, a very basic thing,
you know what Martin just said.
You give them eyes and you give them ears.
You make them multimodal.
Of course, they'll become more powerful.
But you need a little bit more than that.
You know, the way the human brain learns
with very few examples,
that's not the way transformers learn.
And, you know, I'm not saying that we need to create an Einstein or a Gödel,
but there has to be an architectural leap that is able to create these manifolds,
and just throwing new data will not do it.
It'll just smoothen out the already existing manifolds.
Is that something?
So is your goal to actually help, like, think through new architectures,
or are you primarily focused on putting formal bounds on existing architectures?
a bit of both
I mean the former goal
is the more ambitious one that
everybody is chasing
and yeah I think about that constantly
Are there any new,
even like,
sort of hints of a new architecture,
or, like, have we started to make any progress
on new architectures,
or is it,
you know...
Yann has been
pushing this JEPA architecture,
energy-based architectures.
They seem promising.
the way I
have been sort of thinking about
it is
you know
there's this
benchmark,
the ARC Prize,
right, that Mike Knoop
and François Chollet have,
and if
you understand why
the LLMs are failing on this test
maybe you can sort of reverse engineer a new architecture
that will help you succeed in that, right?
And I agree with a lot of what several people say that,
you know, language is great, but language is not the answer.
You know, when I'm looking at catching a ball that is coming to me,
I'm mentally doing that simulation in my head.
I'm not translating it to language to figure out where it will land.
I do that simulation in my head.
So, you know, one of the new architectural things is how do we,
how do we get these models to do approximate simulations, to test out an idea and whether
to proceed or not?
So, so, yeah, you know, another thing that I've always wondered about is, as humans,
Did we develop language because we were intelligent?
or because we developed language,
we accelerated our intelligence.
So I don't know which side of the camp
you follow on that question.
What's interesting is,
like you have these anecdotal examples
of humans developing languages de novo
that have been recorded, right?
Like it's either the Guatemalan or Nicaraguan sign language, right?
Where there are these students
that developed their own language without being taught.
And so that would suggest that language just follows intelligence.
The problem is, they're all anecdotal, right?
Like, who knows if somebody didn't teach them sign language?
Like, nobody really knows.
There are no controls.
So these are all observational studies.
And there's so few of them, you have to wonder if it's just kind of sloppy observation.
And so I think that the question is still outstanding.
Yeah.
So, I mean, language definitely accelerated our intelligence.
there's no question about that.
But which followed which we don't know.
I view it as a networking problem naturally,
which is once you have languages,
you can communicate.
And when you can communicate,
you can communicate, you can replicate, yeah.
Yeah, exactly.
Exactly, right.
Cool.
Again, this is kind of a wonky question,
but, you know, I think one thing that you've brought to the discourse,
and for those that are listening to this,
I really think that you should look up Vishal's work and read it.
I just think it'll give you a really, really,
especially if you have a systems background,
like a networking or systems background,
a really, really good understanding
of kind of the bounds on these models.
But like the toolkit that you draw from
is like information theory
and like more formal.
Have you found that the AI community
is receptive to this?
Or is it like two different cultures,
two different planets trying to communicate
and not a lot of common ground?
Like how have you found like bringing
like the networking view of the world
to the AI realm?
Some of them are receptive to it, definitely.
But, you know, these large conferences and their reviewing process, it's so random.
And the kind of questions they ask, you know, I'm a modeling person.
I like to model things.
And, you know, I submitted one version of this work to one very famous machine learning or AI conference.
And the reviewer said, okay, this is
the model, so what?
So there is...
That's absolutely remarkable.
So, like, you actually took a system that nobody understands, that we have no models for,
you actually provided some model that we can use to analyze it, and that alone wasn't sufficient.
They're asking, so where are the large-scale experiments to prove this?
I do, listen, I honestly, I mean, I find there's so much empiricism.
in, like, the current, you know, AI community
exactly because we don't understand the systems.
You know, it kind of reminds me.
I feel like systems went the other way, right?
It's like we had all of these models,
but then we didn't understand how the systems worked,
and then we just, like, actually did measurement.
It feels like ML, or the AI stuff, is the opposite,
which is like, we know we don't understand them,
and so we just measure them,
but now we're trying to, like, come up with the models.
Yeah, exactly.
So it was so easy in some
sense to build these
artifacts and then just measure them
that people have been going around
trying to do that
and
You know, one term
I really dislike is
prompt engineering.
Why? You know, engineering used to mean
sending a man to the moon
or providing five-nines reliability.
Prompt engineering is
prompt twiddling.
You fiddle with a prompt,
and the model's output changes,
the inference, the output changes.
And, you know, you have like hundreds of papers just, just, you know, doing one experiment or another,
changing a prompt this way, that way, and writing their observations.
And as a result, you know, lots of these papers are being written, are being submitted for a review.
Reviews get busy looking at all this kind of empirical work.
And my personal taste is to first try to understand, model it.
Yeah.
And then you can do the other things.
Spoken like a true theory guy.
I don't know about this bit twiddling.
Let me ask one more LLM question, which is,
are there any benchmarks or real-world tasks
that if they occurred,
you'd sort of reevaluate and say,
hey, maybe LLMs are closer to the path to AGI
than I thought.
In terms of real-world tasks?
Good question.
You know, for LLMs, or these models,
the one domain where you have the most training data is probably coding.
And coding is where you can also have
the most structure. And yet anyone who's used these tools, whether it's Cursor or whatever,
Claude Code, LLMs continue to hallucinate, continue to generate unreasonable code. You know, you have to
constantly babysit these models. So the day an LLM can create a large software project without any
babysitting is the day
I'll be a little bit convinced that it's
getting towards AGI. But again,
I don't think
it'll be able to create new science. If it does,
that's when I'll be convinced.
I think that you can almost take a definitional approach
to answer this question, Vishal.
The problem with these types of questions
is if you have billions of dollars
and you can collect whatever data you want,
you can make a model do anything you want, right?
And so like, you know what I'm saying?
at some level, you've got this entire capital structure,
machinery behind these models.
So you're like, oh, it can be good at science.
Well, sure, you put a billion dollars
into solving materials science and collect all this data,
it'll be good at materials science or whatever it is.
And so, but there is a definitional answer, which is,
and I'm going to draw from your work,
which is there is a manifold that's in there
based on the data's been training on.
And then the question is, if it ever produces something
that's off, like a new manifold.
So considering the existing
training data, if it ever does that, if it does
something that's outside of that distribution,
then clearly we're on a path
to learning new things. And if not,
then everything is just a computational step from what's
already known.
Yeah. And I guess
I guess the counter to that would be
maybe all humans do is
work on their own manifold and Einstein
you know
was lucky or something, I guess
would be the counter to that. But I just
So, you know, there are several mini-Einstein examples,
and, yeah, it's creating this new manifold.
I didn't want to use that definitional answer.
I thought it might sound too wonky, too mathematical.
But essentially, if LLMs really created this new manifold,
then I would be convinced.
But so far, they have just gotten better at navigating the existing manifold,
the existing training set.
Which is hugely powerful, and it is going to change the world.
Which is hugely powerful.
I'm not denying that.
I think they are extremely, extremely good at what they can do.
But there's a limit to what they can do.
So, one quick question: what's next for you?
I mean, you've tackled in context learning.
You've got a model for LLMs, and you've got a generalized model for, like, you know,
like their solution space.
What are you thinking about tackling next?
In terms of modeling or?
Academically, on LLMs.
Academically, yeah, academically,
you know, I'm thinking about
what is the architectural leap that is needed.
Oh, that's exciting.
To create this, you know, manifold.
And how do we use, you know, multimodal data?
Awesome.
To expand.
When you figure that out, come back and talk to us.
That's right.
We'd love that.
So, I mean, you know, even with LLMs, you know,
in the paper, we say that you can improve the inference
by following this
low or minimum entropy path.
So that's a very sort of small step,
that we are building
and training models
that will do inference based on
the entropy path.
Yeah.
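A rough sketch of what decoding along a minimum-entropy path could look like, with a placeholder model interface; the actual method described in the paper may differ.

```python
import math

def entropy(dist):
    """Shannon entropy in bits of a next-token distribution {token: probability}."""
    return -sum(p * math.log2(p) for p in dist.values() if p > 0)

def min_entropy_step(prompt, candidates, next_token_dist):
    """Among candidate continuations, prefer the one whose resulting next-token
    distribution has the lowest entropy, i.e. keeps the model on a confident path.
    `next_token_dist(prompt)` is a placeholder for a real model call that returns
    a {token: probability} dictionary."""
    return min(candidates, key=lambda tok: entropy(next_token_dist(prompt + tok)))
```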
By the way, is model probe still up?
Token probe.
Yeah, token probe is still up,
and you can see actually,
the, you know,
token probe is a software that we built,
and thanks to Martin and A16Z's
generosity, it's running
on your servers, and anyone
can go and test it. And what we
have done there is we actually show
the entropy. Yeah.
It is so enlightening. I recommend anybody
listening to this who's interested. Actually, check out
token probe. It literally shows you the
confidence. Yeah, as you go
along. It's remarkable. So in context
learning, you know, you create your new
DSL and you give it to
the prompt and you can see
the confidence rising with each new
example. The entropy is reducing.
and that sort of is a validation of the model.
You can see it sort of unfolding right in front of your eyes.
The token probe is there, running. Thanks, thanks again.
Vishal, thanks so much for coming on the podcast.
It was a great conversation.
It was great fun.
Thank you.
Thank you so much again.
Thanks for listening to this episode of the A16Z podcast.
If you like this episode, be sure to like, comment, subscribe,
leave us a rating or review, and share it with your friends and family.
For more episodes, go to YouTube, Apple Podcasts, and Spotify. Follow us on X, A16Z, and subscribe to our
substack at A16Z.substack.com. Thanks again for listening, and I'll see you in the next episode.
As a reminder, the content here is for informational purposes only. Should not be taken as legal
business, tax, or investment advice, or be used to evaluate any investment or security, and is not
directed at any investors or potential investors in any A16Z fund. Please note that A16Z and its
affiliates may also maintain investments in the companies discussed in this podcast. For more
details, including a link to our investments, please see A16Z.com forward slash disclosures.
