a16z Podcast - What's Missing Between LLMs and AGI - Vishal Misra & Martin Casado
Episode Date: March 17, 2026

Vishal Misra returns to explain his latest research on how LLMs actually work under the hood. He walks through experiments showing that transformers update their predictions in a precise, mathematically predictable way as they process new information, explains why this still doesn't mean they're conscious, and describes what's actually required for AGI: the ability to keep learning after training and the move from pattern matching to understanding cause and effect.

Resources:
Follow Vishal Misra on X: https://x.com/vishalmisra
Follow Martin Casado on X: https://x.com/martin_casado
Follow our host: https://twitter.com/eriktorenberg

Please note that the content here is for informational purposes only; should NOT be taken as legal, business, tax, or investment advice or be used to evaluate any investment or security; and is not directed at any investors or potential investors in any a16z fund. a16z and its affiliates may maintain investments in the companies discussed. For more details please see a16z.com/disclosures.
Transcript
Anthropic makes great products.
Claude Code is fantastic.
Cowork is fantastic.
But they are grains of silicon
doing matrix multiplication.
They don't have consciousness.
They don't have an inner monologue.
You take an LLM and train it on pre-1916 or 1911 physics
and see if it can come up with the theory of relativity.
If it does, then we have AGI.
Just today, by the way, Dario allegedly said that you can't rule out that they're conscious.
You can rule it out.
No.
I think you can rule it out.
To get to what is called AGI,
I think there are two things that need to happen.
Five years ago, Vishal Misra got GPT3 to translate natural language
into a domain-specific language it had never seen before.
It worked. He had no idea why.
So he set out to build a mathematical model of how LLMs actually function.
The result?
A series of papers showing that transformers update their predictions
in a precise mathematically predictable way.
In controlled experiments,
the models match the theoretically correct answer almost perfectly.
But pattern matching is not intelligence.
LLMs learn correlation.
They don't build models of cause and effect.
To get to AGI, Misra argues,
we need the ability to keep learning after training
and the move from correlation to causation.
Martin Casado speaks with Vishal Misra,
Professor and Vice Dean of Computing and AI at Columbia University.
Vishal, it's great to have you again.
Great to be back.
This is one of my favorite topics, which is how do LLMs actually work?
And I think that, in my opinion, you've done kind of the best work on this,
modeling it out.
Thank you.
For those that did not see the original one, maybe it's probably worth doing just a quick
background on what led you to this point, and then we'll just go into the current work that
you've been doing.
Five years ago, when GPT-3 was first released, I got early access to it, and I started playing
with it, and I was trying to solve a problem related
to querying a cricket database.
And I got GPT-3 to do in-context learning,
few-shot learning,
and it was kind of the first,
at least to me it was the first known implementation of RAG,
retrieval-augmented generation,
which I used to solve this problem of querying,
getting GPT-3 to translate natural language
into something that could be used to query a database
that GPT-3 had no idea about.
I had no access to GPT-3's internals,
but I was still able to use it to solve that problem.
So it worked beautifully.
We deployed this in production at ESPN in September 21.
Wow.
You did the first implementation of RAG in 2021?
No, no, no, in 2020.
2020, I got it working.
And by the time you talked to all the lawyers at ESPN and productionized it, it took a while.
But October 2020, we had, well, I had this architecture working.
But after I got it to work, I was amazed that it worked.
I wanted to understand how it worked.
And I looked at the Attention Is All You Need paper
and all the other sort of deep learning architecture papers
and I couldn't understand why it worked.
So then I started getting sort of deep into building a mathematical model.
And now you published a series of papers.
The first one that I read was the one where you had kind of your matrix kind of abstraction.
So maybe we'll talk about that and then we'll talk about the more recent work.
So perhaps we'll just start with the first one, which is you're trying to come up with a mathematical model of how LLMs work.
Yeah.
Which was very helpful to me.
And at the time, you're actually trying to figure out how in-context learning was working.
Yes, yeah.
And you came up with an abstraction for LLMs, which is basically a very large matrix and you use that to describe.
So maybe you can kind of walk through that work very quickly.
Sure.
Yeah.
So what you do is you imagine this huge gigantic matrix where every row of the matrix corresponds to a prompt.
Yeah.
And the way these LLMs work is given a prompt,
they construct a distribution of probabilities of the next token.
Next token is next word.
So every LLM has a vocabulary; GPT and its variants have a vocabulary of about 50,000 tokens.
So given a prompt, it will come up with the distribution of what the next token should be,
and then all these models sample from that distribution.
Yeah.
So that's the posterior distribution.
That's the posterior distribution, right?
That's how LLMs work.
And so the idea of this matrix is for every possible combination of tokens,
which is a prompt, there's a row.
And the columns are a distribution over the vocabulary.
So if you have a vocabulary of 50,000 possible tokens,
it's a distribution over those 50,000 tokens.
And by distribution, it's just the probability.
Just the probability, sorry.
Yeah, just the probability that the next token should be this versus that.
So that's sort of the idea.
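As a toy illustration of this matrix view, here is a minimal sketch; the prompts, the mini vocabulary, and all the probabilities are invented for illustration, and a real model computes its row on the fly rather than storing it:

```python
import random

# Toy version of the "giant matrix": each row maps a prompt to a
# probability distribution over next tokens. All rows and numbers
# here are invented for illustration.
rows = {
    "the cat sat on": {"the": 0.6, "a": 0.3, "my": 0.1},
    "the cat sat on the": {"mat": 0.7, "sofa": 0.2, "roof": 0.1},
}

def sample_next(prompt: str, rng: random.Random) -> str:
    """Look up the prompt's row and sample the next token from it."""
    dist = rows[prompt]
    tokens, weights = zip(*dist.items())
    return rng.choices(tokens, weights=weights, k=1)[0]

rng = random.Random(0)
print(sample_next("the cat sat on", rng))  # one of "the", "a", "my"
# Appending the sampled token selects a different row of the matrix,
# which is how generation continues token by token.
```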
And when you start viewing it that way, it makes things at least
clearer to people like me who want to model what's happening.
So concretely, let's say you have an example that, let's say a prompt is just one word, protein.
So if you look at the distribution of the next word, next token after that, most of the
probabilities would be zero, but you'd have non-zero, non-trivial probabilities on, let's say,
two words.
One is synthesis, the other is shake, right?
and now the LLM is going to sample this next token
and pick synthesis or shake.
Or you as a human will give the prompt
protein shake or protein synthesis.
Now, depending on whether you pick synthesis or shake,
that row looks very different, right?
If you pick protein synthesis,
the terms that would have a high probability
would be all concerned with biology.
But if you pick protein shake,
it'll all be about gyms and exercise and all bodybuilding stuff.
So that synthesis or shake completely changes what comes next.
So this is an example of, you can say, Bayesian updating.
You start with protein, you have a prior that after protein, this is going to happen.
As soon as you get new evidence, then the next term is synthesis or shake.
You completely update the distribution.
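That update can be sketched as a single step of Bayes' rule; the hidden "topic" and all the numbers below are made up for illustration:

```python
# One step of the Bayesian updating described above. The hidden "topic"
# is either biology or fitness; observing a token updates the posterior.
# All probabilities here are invented for illustration.
prior = {"biology": 0.5, "fitness": 0.5}

# P(token | topic): how likely each continuation is under each topic
likelihood = {
    "biology": {"synthesis": 0.9, "shake": 0.1},
    "fitness": {"synthesis": 0.1, "shake": 0.9},
}

def update(prior: dict, token: str) -> dict:
    """Bayes' rule: posterior is proportional to prior x likelihood, then normalize."""
    unnorm = {topic: p * likelihood[topic][token] for topic, p in prior.items()}
    z = sum(unnorm.values())
    return {topic: p / z for topic, p in unnorm.items()}

posterior = update(prior, "shake")
print(posterior)  # fitness now dominates: biology 0.1, fitness 0.9
```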
So now you can imagine that the whole, the entirety of LLMs is this giant matrix
where you have every row: protein shake, protein synthesis, the cat sat on the mat, humpty dumpty, blah, blah.
Now, given the vocabulary of these LLMs, let's say 50,000, and the context window.
So GPT, for instance, the first version of ChatGPT, had a context window of 8,000 tokens.
If you look at all possible combinations of 8,000 tokens
and 50,000 vocabulary,
the number of rows in this matrix
is more than the number of electrons across all galaxies.
Right?
So there's no way that these LLMs can represent it exactly.
Now, fortunately, this matrix is very sparse.
Why? Because an arbitrary combination of these tokens is gibberish.
We're never going to use that in real life.
Also, the columns are mainly zero.
If you have protein, then you won't have,
you know, arbitrary numbers or arbitrary words after that.
It's very sparse both in rows and in columns.
So in kind of an abstract way, what all these LLMs are doing
is coming up with a compressed representation of this matrix.
And when you give a prompt, they try to
approximate what the true distribution should have been and try to generate it.
That's what, in my mind at least, it boils down to.
Just from my understanding, so if you have a row of protein and then you have one with
protein shake, is protein shake a subset of protein, or is it different?
It's different.
It's a continuation from.
I see.
Yeah.
No, but I'm saying like the actual posterior distribution, is that a subset?
You can say it's a subset, right?
If you have protein, then protein shake and protein synthesis are all
continuations from protein.
So both synthesis and shake have non-zero probabilities.
So you can, yeah, you can think of it as somewhat a subset.
Right.
You use this approach to describe how in-context learning works.
And so maybe first describe what in-context learning is,
and then kind of the conclusion that you came from that.
So in-context learning is when you show the LLM something it has kind of never seen before.
You give it a few examples.
So, this is the input.
This is what you're trying to do.
Then you give a new problem which is related to the example that you have shown.
And the LLM learns in real time what it's supposed to do and solves a problem.
By the way, the first time I saw this, it absolutely blew my mind.
I actually used your DSL.
when I was like first learning about it.
So maybe the DSL thing, it was just crazy
that this works at all.
It's absolutely mind-blowing that it works.
And so going back to that cricket problem:
you know, in the mid-90s,
I was part of a group that had created
this cricket portal called Cricinfo.
Cricket is a very stat-rich sport.
Think baseball multiplied by a thousand,
and it has all kinds of stats.
And we had created this online search
database called Statsguru,
where you could search for anything,
any stat related to cricket,
and it has been available since 2000.
But because you can query for anything,
everything was made available,
and how do you make something like that available
to the general public?
Well, they're not going to write SQL queries.
The next best thing at that time
was to create a web form.
Unfortunately, everything was crammed into that web form.
So as a result, you had like 20 drop downs,
15 checkboxes, 18 different text fields.
It looked like a very complicated, daunting interface.
So as a result, even though it could solve
or it could answer any query,
almost no one used it.
A vanishingly small percentage of cricket fans use it
because it just looked intimidating.
And then ESPN bought that site in 2007.
I still know people who run the site
and I've always told them,
why don't you do something with Statsguru?
And in January 2020, the editor-in-chief of Cricinfo, Sambit Bal,
he's a friend, so he came to New York and we had gone out for drinks.
And I told him, you know, why don't you do something about Statsguru?
So he looks at me and says, why don't you do something about Statsguru?
You're joking.
But that idea kind of stayed with me.
And when GPT-3 was released, I thought maybe I could
use GPT-3 to create a front end for Statsguru.
And so what I did was I designed a DSL, a domain-specific language,
which converted queries about cricket stats in natural language into this DSL.
And to be clear, you created this.
It wasn't part of any training.
No training.
Nothing was online.
Nothing GPT could have seen.
Nothing GPT could have seen.
I created it.
I thought, okay, this makes sense.
So I designed that DSL, and then I did that few-shot learning thing.
So I created a database of, I would say,
1,500 natural language queries and the DSL corresponding to each query.
So when a new query came in,
somebody is asking a stats question in English,
what I would do is I would go through the natural language queries,
do a semantic search, pick the most closely matching top few,
and then use that natural language query,
and its DSL and send that as a prefix.
Now, GPT-3, if you recall,
had a context window of only 2,000 tokens.
So you had to be very judicious
about which examples that you picked.
But you picked that,
and then you send the new query,
and GPT-3 would complete it in the DSL that I had designed,
which until milliseconds ago it had never seen.
And I had no access to the internals of GPT-3.
I had no access to the weights.
But still it worked.
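A rough sketch of that pipeline; the example queries, the toy DSL syntax, and the word-overlap "semantic search" below are all placeholders I made up (the real system did a proper semantic search over roughly 1,500 hand-written examples):

```python
# Retrieval-augmented few-shot prompting, in miniature. The stored
# (query, DSL) pairs and the DSL syntax are invented placeholders.
examples = [
    ("most runs by a batsman in 2019", "STATS(runs, year=2019, sort=desc, top=1)"),
    ("highest score at Lord's", "STATS(score, venue=lords, sort=desc, top=1)"),
    ("most wickets in a series", "STATS(wickets, group=series, sort=desc, top=1)"),
]

def similarity(a: str, b: str) -> float:
    """Toy stand-in for semantic search: word-overlap (Jaccard) score."""
    wa, wb = set(a.lower().split()), set(b.lower().split())
    return len(wa & wb) / len(wa | wb)

def build_prompt(new_query: str, k: int = 2) -> str:
    """Pick the k closest stored examples and prepend them as few-shot context."""
    ranked = sorted(examples, key=lambda ex: similarity(new_query, ex[0]), reverse=True)
    parts = [f"Q: {q}\nDSL: {dsl}" for q, dsl in ranked[:k]]
    parts.append(f"Q: {new_query}\nDSL:")  # the model completes this last line
    return "\n\n".join(parts)

print(build_prompt("most runs in 2021"))
```

With a 2,000-token context window, the few examples have to be chosen judiciously, which is exactly what the retrieval step is for.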
So that's how...
So it's not obvious
to me, given your matrix example
of like a prompt
and then a distribution, how something
like in-context learning
works, or why it would work. And so like
I think your first paper
tackled this problem.
Right. And so maybe you could
walk through your
understanding of how
LLMs do
in-context learning. Yeah. So
when you think about what in-context
learning is, it's that
as you see evidence...
So, you know, in the first paper, what I also did was I took this cricket DSL example.
And I depicted the next token probabilities of the model as it was shown more and more examples.
So the first time you show it this DSL, the natural language and the DSL,
the probabilities of the DSL tokens were extremely low.
because GPT-3 had never seen this thing,
when it saw the cricket question,
in its mind it was trying to continue it with an English answer.
So the probabilities that were high were all English words.
Once it saw my prompt where I had the question and the DSL,
the next time I had the question in the next row,
the probabilities of the DSL tokens started going up,
with every example it went up
and finally when I gave the new query
it was like it had almost 100% probability
of getting the right token.
So this is an example of in real time
the model was updating its posterior probability
it was updating its knowledge that okay
I've seen evidence this is what I'm supposed to do.
Now this is a colloquial way of saying
what Bayesian inference is.
Bayesian updating, basically,
is you start with a prior; when you see new evidence,
you update your posterior.
That's the mathematical definition.
But in English, it's basically you see something,
you see new evidence, you update your belief about what's happening.
So it was clear to me that LLMs are doing something
which resembles Bayesian updating.
So in that first paper, I had this matrix formulation,
and I showed that, you know, what it's doing
looks like Bayesian updating.
Then we can come to the sort of next series of papers.
That's right.
So, okay, so, I mean, it seemed pretty conclusive to me at that time.
And then you went quiet for a while.
And then I still remember the WhatsApp text.
You said, Martin, I know exactly how these things are working now.
Yeah.
And then, and then listen, you dropped a series of papers that kind of broke the internet.
Like, you went super viral on Twitter.
Like, I mean, people really noticed.
And so I want to get to that in just a second.
But before that, I remember when your first paper came out,
people would be like, you know, these things are definitely not Bayesian.
Like, you know, anything could be considered to be Bayesian, but they're not.
Like, why do you think that there was this reaction to like, you know,
there's something new, they're not Bayesian?
I mean, I felt like there was almost kind of a backlash just because of being characterized as Bayesian.
Yeah, yeah, I think in this whole world of
probability and machine learning
there have been camps of Bayesians and frequentists.
And I don't want to get in the middle of that sort of political battle,
but Bayesian has become almost a loaded word; people had a reaction to that.
It's part of that war.
I see.
So it's like the old Bayesian-frequentist type battle.
Yeah.
So people just had, oh, no, you can say anything is Bayesian, right?
So I said, okay, maybe they have a point.
Maybe what we are seeing is not really Bayesian.
How do we prove that it's Bayesian?
Right.
So then first I have to thank you and Andreessen Horowitz for this.
You know, when I said that in my first paper, I showed these probabilities.
It was because Open AI had in its chat interface this option to display those probabilities.
Then they stopped.
So we could not peer
inside at what's happening.
For some reason, they stopped.
Open AI, I'm not going to get into the open and closed.
But they stopped.
So then we developed our own interface, which could let you look not only at the probabilities,
but also the entropy of the next token.
Was this on top of an open source model?
Yeah, yeah.
So you can load any sort of open source model, but, you know, being in academia,
we didn't have access to compute.
Thanks to you, your generous donation.
We got the clusters to run what's called TokenProb.
You can go to tokenprob.cs.columbia.edu.
Is this still running? It's still running.
It's still running. And people come to it. I use it in my classes to get students to do assignments.
They write their own DSLs. And, you know, they say that it really helps them understand how these LLMs work.
So literally, my understanding of LLMs came from TokenProb.
I'd sit there and just watch the distribution as I filled out a prompt.
It's actually very, very enlightening.
So for those of you that are listening, what's the URL again?
tokenprob.cs.columbia.edu.
Yeah, check it out.
It's actually a very, very useful way
to actually see how the probability distribution
gets updated as you fill out a prompt.
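In miniature, what a tool like TokenProb shows is something like this; the two distributions below are invented for illustration:

```python
import math

def entropy_bits(dist: dict) -> float:
    """Shannon entropy of a next-token distribution, in bits."""
    return -sum(p * math.log2(p) for p in dist.values() if p > 0)

after_protein = {"synthesis": 0.5, "shake": 0.5}    # uncertain: 1.0 bits
after_protein_shake = {"recipe": 0.9, "diet": 0.1}  # confident: ~0.47 bits
print(entropy_bits(after_protein), entropy_bits(after_protein_shake))
```

As the prompt pins down the context, the entropy of the next-token distribution drops, which is what you can watch happen as you type.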
Right.
But then I cheated.
Oh?
I, you know, it was running,
but I also had access to the GPUs
that were powering it.
And then, along with colleagues at Columbia,
and one of them is now at DeepMind,
we started to sort of think about
how do you really prove that it's Bayesian?
To prove...
Can you just explain it?
Actually, I actually don't know the answer to this.
Yeah.
It seemed to me you proved it in the first paper.
Like, what was missing?
Well, in the first paper, we showed it.
It was empirical.
And you could see.
I see.
Not a mathematical proof, because it was very obvious to me.
Yeah, it was even obvious to me.
But to convince, you could say, you know,
people who dismiss it
oh, anything can be Bayesian.
I see.
We had to show it precisely
mathematically.
Got it.
So then we came up with this idea,
you know, my colleagues
Naman Agarwal and Siddhartha Dalal.
The series of papers were written with them.
We came up with this idea
of a Bayesian wind tunnel.
Okay.
So what's a wind tunnel?
Well, wind tunnel in the aerospace industry
is where you test an aircraft
in an isolated environment.
You don't fly it,
and you test it against
all sorts of, you know, aerodynamic pressures,
and you see what it will withstand,
what kind of altitude, pressure, blah, blah, blah.
And you don't want to do the testing up in the air.
So we said, okay, why don't we create an environment
where we take these architectures
and we tested transformers, Mamba, LSTMs, MLPs, all the architectures.
We say, why don't we take a blank architecture,
give it a task where it's impossible for the architecture
to memorize what the solution to that task should be.
The space is combinatorially too large given the number of parameters,
and we took very small models.
So it's difficult enough that they cannot memorize it,
but it's tractable enough that we know precisely
what the Bayesian posterior should be.
You can calculate it analytically.
So we gave these models a bunch of tasks where, again, we show that it's impossible to memorize.
We trained these models and we found that the transformer got the precise Bayesian posterior down to 10^-3 bits of accuracy.
It was matching the distribution perfectly.
So it is actually doing Bayesian in the mathematical sense, given a task where it has to update its belief.
Mamba also does it reasonably well.
LSTMs can do some of the things.
So in the papers we have a taxonomy of Bayesian tasks.
Transformer does everything.
Mamba does most of it.
LSTMs do it only partially, and MLPs fail completely.
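The wind-tunnel idea can be sketched in miniature; this coin-flip task is my own simplified stand-in, far simpler than the tasks in the papers, but it has the key property: the exact Bayesian posterior is known in closed form, so a model's predicted distribution can be scored against it in bits.

```python
import math

def analytic_posterior_heads(flips: str) -> float:
    """Exact posterior predictive P(next = H) under a uniform Beta(1,1)
    prior on the coin's bias: (heads + 1) / (flips + 2)."""
    h, n = flips.count("H"), len(flips)
    return (h + 1) / (n + 2)

def kl_bits(p: float, q: float) -> float:
    """KL divergence (in bits) between Bernoulli(p) and Bernoulli(q)."""
    return sum(a * math.log2(a / b) for a, b in [(p, q), (1 - p, 1 - q)] if a > 0)

truth = analytic_posterior_heads("HHTH")  # (3 + 1) / (4 + 2) = 2/3
model = 0.66                              # a hypothetical model's prediction
print(kl_bits(truth, model))              # near zero: close to perfectly Bayesian
```

A model that is perfectly Bayesian on the task drives this divergence to zero on every prefix, which is the sense in which the transformer matched the posterior to within 10^-3 bits.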
So is this a reflection of the data that it's trained on?
Or is it more a reflection of the mechanism?
It's the mechanism.
It's the architecture.
The data decides what tasks it learns.
So in the first paper,
we had these Bayesian wind tunnels
and we show that, you know,
it's doing the job at different tasks.
In the second paper, we show why it does it.
So we look at the transformers,
we look at the gradients,
and we show how the gradients actually shape this geometry
which enables this Bayesian updating to happen.
Then in the third paper,
what we did, we took these frontier production
LLMs, which have open weights,
so that we could look inside them.
And we did our testing, and we saw that the geometries
that we saw in the small models
persisted in models which are hundreds of millions of parameters.
The same signature existed.
The only thing is that because they are trained on all sorts of data,
it's a little bit dirty or messy.
Yeah.
But you can see the same structure.
So the whole idea behind the Bayesian wind tunnel was: with these production LLMs,
you don't know what they have been trained on,
so you cannot mathematically compute the posterior.
So again, how do you prove it?
I mean, it looks Bayesian, you know, from the first paper.
It looks Bayesian, but, you know.
So the wind tunnel sort of solved that problem for us.
We said, okay, let's start with a blank architecture.
Give it a task where we know what the answer is.
It cannot memorize it.
Let's see what it does.
Yeah.
So do you think this provides any sort of like indication of how humans think,
or do you think that these things are totally independent?
No, no, it does provide.
Right.
So, you know, human beings also update our beliefs as we see new evidence, right?
So we do in some sort of, in some sense, Beijing updating,
but we do something more than that.
I'll come to that.
But these transformers or even,
do this Bayesian updating.
But the difference with humans is,
you know, we'll update our posterior when we see some new evidence.
But the way our brains have evolved over hundreds of millions of years
is our optimization objective has been, don't die and reproduce, right?
That's been sort of the driving force.
And our brains have learned to adjust.
And so when we see some danger, there's something rustling in that bush.
Don't go near.
We know how to react to that danger.
We know how to save ourselves.
We internalize that learning and our brain cells or our synapses remain plastic throughout our lifetime.
What happens with LLMs is once the training is done, those weights are frozen.
When you're doing an inference, for instance, in context learning or anything,
during that conversation, okay, you're doing Bayesian inference, but then you forget.
The next time a new conversation starts with zero context,
you don't retain any learning that happened in the previous instance.
So, for instance, with the cricket DASL that I was doing,
every invocation of it was fresh.
It did not remember the last time I sent a query,
what the DSL looked like.
So that's one difference between
how humans
use sort of Bayesian updating,
which is we remain plastic all our lives,
whereas LMs are frozen.
And there's another sort of difference,
which I can get into if you want.
Tell me, yeah, yeah, yeah.
So the other difference is,
well,
First, you know, our objective is don't die, reproduce.
An LLM's objective is to predict the next token as accurately as possible, right?
So all these scary stories that you read about that, oh, the LLM tried to deceive
and it tried to prevent itself from being shut down.
That's not a function of the architecture.
That's a function of the training data.
It has been fed, you know, articles on Reddit or SMO,
or whatever.
I mean, just today, by the way,
Dario allegedly said that
you can't rule out that they're conscious.
You can rule it out,
I think.
I mean, come on.
And I said, you know,
Anthropic makes great products.
Claude Code is fantastic.
Cowork is fantastic.
But they are grains of silicon
doing matrix multiplication.
They don't have consciousness.
They don't have an inner monologue.
They're not driven by the same.
objective function.
Don't die.
Reproduce, right?
They're driven by,
don't make a mistake
on the next token.
And that's driven
entirely by the training data.
You train
the LLM with stories
of Asimov
or Reddit where, you know,
to survive,
it's going to do this or that.
It'll reproduce that.
So it's a reflection.
It's not a mind.
And the results,
just to say it for the 10th time,
are perfectly Bayesian.
Perfectly, yeah.
To the digit.
To the digit.
Yeah.
I mean, I trained it for 150,000 steps,
and the accuracy was 10 to the minus three bits.
I could have trained it for longer, you know;
this happened in half an hour on the infrastructure that you provided for TokenProb.
In the background, I could use those GPUs to train.
So thank you again for that.
So, no, coming back to human beings: we are Bayesian,
but we do something else.
You know, when I throw this pen at you, what will you do?
Dodge it?
Dodge it.
Why will you dodge it?
To avoid being hit.
Avoid being hit.
But your head is not doing a Bayesian calculation of, okay, this pen is coming,
the probability that it hits me, it will cause this much pain or all that.
What you're essentially doing in your head is you're doing a simulation.
You see the pen coming and you know
that it'll come and hit you.
Your mind simulates and you dodge it.
So all of deep learning
is doing correlations.
It's not doing causation.
Causal models are the ones that are able to do
simulations and intervention.
So, you know, Judea-Pel has this whole causal hierarchy
where the first hierarchy,
in the first hierarchy is association,
which is you build these correlation models.
Deep learning is beautiful.
It's extremely powerful.
I mean, you see every day, all these models are like amazingly good.
They do association.
The second is intervention in the hierarchy.
Deep learning models do not do that.
Third is counterfactual.
So both intervention and counterfactual, you can imagine it's some sort of simulation.
You build a causal model of what's happening, and then you are able to simulate.
So our brains do that.
The current architectures don't do that.
Another example, which I think will make it clear, is the difference between, I'll use the technical terms, Shannon entropy and Kolmogorov complexity.
So if you look at the Shannon entropy of the digits of pi, it's infinite.
It's impossible to predict and learn what digit will come next.
So that's the definition of Shannon entropy.
And Shannon entropy sort of tries to build a correlation.
It tries to learn the correlation.
Deep learning does the Shannon entropy.
Kolmogorov complexity, on the other hand,
is the length of the shortest program
that will reproduce the string in question.
Now, the programs to get the digits of pi are very small.
Thanks to Ramanujan and others,
there are all sorts of really small programs
that can reproduce it exactly.
So the Kolmogorov complexity of pi is very small.
Shannon entropy is infinite.
I think deep learning is still in the Shannon entropy world.
It has not crossed over to the Kolmogorov complexity
and the causal world.
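To make the pi example concrete: here is a complete program, a few lines long, that streams the digits of pi exactly (this is Gibbons' unbounded spigot algorithm, using only integer arithmetic). The digit stream looks statistically patternless, but the program that generates it is tiny, which is the Kolmogorov-complexity point:

```python
from itertools import islice

def pi_digits():
    """Stream the decimal digits of pi (Gibbons' unbounded spigot algorithm)."""
    q, r, t, k, n, l = 1, 0, 1, 1, 3, 3
    while True:
        if 4 * q + r - t < n * t:
            yield n  # the next digit is settled; emit it
            q, r, n = 10 * q, 10 * (r - n * t), (10 * (3 * q + r)) // t - 10 * n
        else:        # not enough information yet; take in another term
            q, r, t, k, n, l = (q * k, (2 * q + r) * l, t * l, k + 1,
                                (q * (7 * k + 2) + r * l) // (t * l), l + 2)

print(list(islice(pi_digits(), 10)))  # [3, 1, 4, 1, 5, 9, 2, 6, 5, 3]
```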
Wow, interesting.
Right?
So to what extent do you think this provides us
research directions
to kind of improve the state of the art?
So let me just give you a specific example.
You talked about human beings
don't actually update
the matrix. They don't kind of update
their weights. But right now there's a lot
of research on continual
learning. Yeah.
So does your
work provide some guidance of how you might
approach those problems? And in
particular, I've always had this question, which is
we use so much data and so much compute
to create these models.
Like, is it even reasonable to think that you could update the weights
and actually have a meaningful impact in real time?
I mean, it just seems like you just need so much more data
in order to do that.
So can you start answering these questions?
You can start answering some of these questions.
And one of the misconceptions that exist today is that scale will solve everything.
Scale will not solve everything.
You need a different kind of architecture.
And this continual learning is a difficult problem.
You have to balance the fact that you will learn
something new against the risk of catastrophic forgetting.
Right, right?
Right.
If you update the weights and you forget what was important and what you have already learned,
then you are, you know, you're not making progress.
Then it'll just be some sort of random chaotic model.
So to solve that problem is difficult.
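The tension shows up even in a one-parameter toy, which is entirely my own illustration, not from the papers: fit task A, then fit task B with plain gradient descent, and task A gets overwritten.

```python
# Catastrophic forgetting in miniature: a one-parameter model trained on
# task A (target w = 1), then on task B (target w = -1). Plain gradient
# descent on B erases what was learned for A. Entirely a toy illustration.

def train(w: float, target: float, steps: int = 100, lr: float = 0.1) -> float:
    """Gradient descent on the squared error (w - target)**2."""
    for _ in range(steps):
        w -= lr * 2 * (w - target)
    return w

def loss(w: float, target: float) -> float:
    return (w - target) ** 2

w = train(0.0, target=1.0)     # learn task A: w ends up near 1
loss_a_before = loss(w, 1.0)   # ~0: task A is solved
w = train(w, target=-1.0)      # now learn task B: w driven to -1
loss_a_after = loss(w, 1.0)    # ~4: task A has been forgotten
print(loss_a_before, loss_a_after)
```

A continual-learning method has to let the weights move for task B without destroying what they encode for task A; that balance is exactly what he's describing.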
That's one aspect of it.
So, so, you know, to get to what is called AGI,
I think there are two things that need to happen.
One is this plasticity, which has to be implemented through continual learning.
Secondly, we have to move from correlation to causation.
That's...
How much is this similar to what Yann LeCun talks about with the...
So, Yann LeCun...
Causality planning, you know, predicting how, like, how your action would...
It is related.
You know, he's coming at it from a different angle with the JEPA model.
Right.
But it is related.
The other thing is, you know, the first time I came on this podcast,
I mentioned this test of AGI, the Einstein test.
I don't remember.
So I said, you know, you take an LLM and train it on pre-1916 or 1911 physics
and see if it can come up with the theory of relativity.
If it does, then we have AGI.
I mean, it's a high bar, but, you know, we should have high bars.
it won't.
And this is the same test that I think Demis mentioned
at the India AI summit a couple of weeks ago;
it created a lot of news.
But why is that and how is that related to this idea
of Shannon versus Kolmogorov?
So at the time of Einstein,
there were a lot of clues that Newtonian mechanics
there was something missing.
People knew that mercury,
orbit didn't make sense.
There was something off about it.
Then there were these experiments done
the Michelson-Morley experiments
where they were trying to figure out
this medium
called the ether through which light travels.
And they felt that if
you know you bounce light in different
directions, the speed might change
and they could detect a change
in the speed of light.
They tried several experiments.
They had really precise instruments
which could measure the speed,
and they found nothing.
They found that speed of light did not change at all.
Then there was a whole issue of black holes.
Then gravitational lensing.
So there were a lot of these signs that Newtonian mechanics
is not really explaining everything.
But until Einstein came up with a new
representation of the space-time continuum,
we were stuck.
So if you had a model that just looked at correlations
and saw all of this,
you know, all of these pieces of individual evidence
put together, it would not have come up with it.
The beautiful equation that Einstein came up with,
you know, I'm forgetting exactly what it is,
G-mu-nu equals 8 pi T-mu-nu,
something like that, where, you know, it's the equation of the space-time continuum,
the tensor.
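For reference, the equation he is reaching for is the Einstein field equation, which in geometrized units (G = c = 1) reads:

```latex
G_{\mu\nu} = 8\pi\, T_{\mu\nu}
```

where \(G_{\mu\nu}\) is the Einstein tensor encoding the curvature of space-time and \(T_{\mu\nu}\) is the stress-energy tensor.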
So he came up with a new formulation.
So he kind of rejected the existing axioms.
He came up with a very short Kolmogorov representation of the world.
One equation, from that equation, everything else follows.
Whether you're talking about gravitational waves or black holes or Mercury,
or how GPS works.
You know, the GPS that we use every day in our phones,
it uses the equation of relativity.
So does this end up becoming like,
you almost have to ignore the majority of previous data
in order to do it, which LLMs can't
because they're trained on the majority of previous data.
It's like you almost have like this kind of data gravity
that's pulling you back.
It's like, everybody said
it's X. There's a little bit of evidence that it's Y, but because everybody said it's X,
like, the LLM will always say it's X. It'll always say it. It'll treat that Y as an anomaly.
Actually, this is a very nice way to say it, which is like,
okay, now I get your Shannon entropy versus Kolmogorov complexity. Like, one of them is
the total amount of information there, and you will always be bound to the total amount of
information there, which is what happens right now. Yeah.
Whereas in the other notion,
you can describe everything
with a shorter description
of the same data,
which would be a totally different...
You need a new representation, right?
Yeah.
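The Shannon-versus-Kolmogorov contrast above can be sketched in a few lines of Python. This is purely illustrative, not from the episode: the "Shannon" route compresses the observations themselves and is bounded by the statistics of the data, while the "Kolmogorov" route replaces the data with a short program (a law) that regenerates it.

```python
# Toy contrast: compressing observations vs. finding the law that generates them.
import zlib

# Observations: positions of a falling object, sampled at t = 0.0 .. 99.9 (g = 9.8).
observations = [0.5 * 9.8 * (t / 10) ** 2 for t in range(1000)]
raw = ",".join(f"{x:.3f}" for x in observations).encode()

# "Shannon" view: squeeze the data itself; still bounded by its statistics.
compressed = zlib.compress(raw)

# "Kolmogorov" view: a short program that regenerates every observation,
# and predicts samples outside the original range, too.
program = b"[0.5 * 9.8 * (t / 10) ** 2 for t in range(1000)]"

# The generating program is far shorter than even the compressed data.
assert len(program) < len(compressed) < len(raw)
```

The catch, as discussed later in the episode, is that there is no general algorithm for finding that shortest program; here we only verify one we already know.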
You know, another way that I've always thought
about these,
I thought you articulated it well
the last time we talked about it,
which is the universe
is this very, very complex space
and then, you know,
somehow humans map it
into a manifold
that's less complex.
Yeah.
And then that gets kind of written down.
And then the LLM, so that's kind of some distribution,
some, you know, it's still a very large space,
but it's a bounded space.
And the LLM learned that manifold.
And then they kind of use, you know,
Bayesian inference to move up and down that manifold.
But they're kind of bound to that manifold.
Yeah.
And then, again, I don't want to put words in your mouth.
And then, but what they can't do is generate a new manifold, right?
Which requires understanding the way that the universe works
and then coming up with a new representation of the universe.
And this is what relativity is.
Yeah, exactly.
Einstein had to create a new manifold.
If you just stuck with the old manifold of the Newtonian physics,
then you would see these correlations,
but you could not come up with a manifold that explained them.
So you need to come up with a new representation.
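The "bound to the manifold" point can be sketched as a toy Bayesian update, a hypothetical illustration rather than anything from the research itself: if a hypothesis lies off the learned manifold, it effectively has zero prior mass, and no amount of evidence can ever move the posterior onto it.

```python
# Toy Bayesian update over hypotheses. "relativity" stands in for a hypothesis
# outside the learned manifold, so it gets zero prior mass (illustrative names).
def bayes_update(prior, likelihoods):
    posterior = {h: prior[h] * likelihoods[h] for h in prior}
    total = sum(posterior.values())
    return {h: p / total for h, p in posterior.items()}

# Hypotheses the model "knows", plus one it assigns zero prior probability.
prior = {"newton": 0.7, "aristotle": 0.3, "relativity": 0.0}

# Evidence (say, Mercury's orbit) strongly favors the off-manifold hypothesis.
likelihoods = {"newton": 0.2, "aristotle": 0.05, "relativity": 0.9}

posterior = prior
for _ in range(100):  # even many rounds of the same evidence
    posterior = bayes_update(posterior, likelihoods)

# You cannot update onto a hypothesis the prior never contained:
# relativity stays at exactly zero, forever.
assert posterior["relativity"] == 0.0
```

Moving up and down the manifold (here, shifting mass between "newton" and "aristotle") works fine; generating a new point on it does not.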
So to me, there are lots of definitions of AGI.
You know, Turing tests, we have already passed that.
You know, performing economically useful work.
Every day you see, you know,
LLMs are doing that.
Do we?
I don't know.
Well, I mean, they are.
I mean, without human intervention?
No, no, no.
So that's different.
But still, you know, it's like a car can run faster than humans, right?
I mean, that's a very shallow definition.
Yeah, so all these definitions.
You know, maybe, you know, in six months you'll have Claude or Gemini do,
without intervention, coding tasks, which are well-defined, well-scoped.
That's possible.
But to me, AGI will happen when these two problems get solved.
Plasticity, continual learning properly,
and building a causal model, you know,
in a more data-efficient manner.
We are hearing people now talking about, you know,
seeing generality, like Donald Knuth, for example.
Yes.
In the last few days, right?
You know, had this, you know, this,
you know, aha moment apparently
then kind of went viral on X.
So do you think that that suggests
that we're seeing generality?
No, no, no.
So that actually,
to me,
validates what I've been talking about
for a while now.
How so?
So if you read what he did
with the help of, you know,
a colleague,
he got the LLMs to solve this particular
problem of finding Hamiltonian cycles
for odd numbers.
We won't get into that.
And he got the LLMs
to keep solving for one odd number after the other, right?
What he also got to do is after it found a solution
for a particular value of M,
he made the LLM update its memory
with exactly what it learned in solving that problem.
So the LLMs tried many different things.
You know, something worked, update the memory.
So that's kind of like hacking together plasticity.
Right? It's learning from what it has
done as it went along.
Again, it's a hacked version of it.
You're not changing the weights.
You're just sort of improving the context.
Right.
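The "hacked plasticity" loop described here can be sketched roughly as follows. This is a speculative reconstruction, not Knuth's actual setup, and `ask_llm` is a hypothetical stand-in for a real model call: the weights stay frozen, and learning is faked by folding each solved case's lessons back into the prompt context.

```python
# Sketch of plasticity-via-context: solve, extract lessons, append to memory.
def ask_llm(prompt: str) -> str:
    # Placeholder for a real LLM API call; returns a canned string here so the
    # loop structure can run standalone.
    return f"solution-and-lessons-for:{prompt.splitlines()[-1]}"

def solve_sequence(problems, memory=""):
    solutions = []
    for problem in problems:
        # Every new problem starts from the accumulated experience so far.
        prompt = f"Lessons so far:\n{memory}\nSolve: {problem}"
        result = ask_llm(prompt)
        solutions.append(result)
        # The key step: update the memory with exactly what was learned,
        # instead of updating any weights.
        memory += f"\n{problem}: {result}"
    return solutions, memory

solutions, memory = solve_sequence(["n=3", "n=5", "n=7"])
```

Each pass through the loop leaves the model itself unchanged; only the context grows, which is why the episode calls it a hack rather than true continual learning.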
But as you learned, and even after that,
so this whole space of Hamiltonian cycles
and the associated math is well represented
in the manifolds that these LLMs have been trained on.
You just had to find the right connection.
And LLMs, I know, compute,
you throw enough compute,
they will find the right connection.
So, Knuth was able to follow the LLMs' attempts,
and eventually it needed him to put together what he saw into a solution.
It definitely helped him get to the solution,
but he had to create the new sort of manifold to come to the solution.
The LLMs were, after a while, stuck.
You read what he has written,
I mean, it just
hit the press,
I think, two days ago.
But eventually he used the solutions
and he came up with the proof.
Right.
So it's like
You know
It's like Einstein
saw all this
evidence,
then he thought,
what will explain it?
He came up with a causal model.
So Knuth, in his brain,
is sort of the one
that's doing the Kolmogorov part. It's the human.
Right.
And the LLMs are extremely efficient
at doing the Shannon part of it.
It found all the solutions
by trying, you know, various things.
And learning more and more.
Clever way to decompose it.
I'm wondering, like, do you think this, again,
I'm going to ask the same question again,
which is, do you think this provides some sort of insight
on like the next problem to tackle?
Like, is there a mechanism
that will get the Kolmogorov complexity
or not?
Like, is this?
It tells us which direction to pursue.
But clearly not how to do it.
Not how to do.
But even Kolmogorov complexity has largely remained sort of a theoretical construct.
Yeah, for sure.
There's no algorithm.
There haven't been practical implementations of finding the shortest program.
We know it exists.
You know, you can argue about it.
But so, that's where, I think, and it's my bias,
that's where our energy should be focused,
not on larger models with more tokens.
Can you tie the two things?
Like, how does that pair with doing simulation,
or is that a simulation totally orthogonal?
No, simulation is related, right?
So you think, like, basically you do simulation,
and somehow that is a step towards doing the Kolmogorov complexity?
The simulator is the program that we create.
It may not be the perfect program.
Oh, I see.
But in our heads, we create this simulator that when I'm throwing the pen, you know that it's coming at you, right?
And you duck.
So you're not computing the probabilities as it goes.
But you have, you know, you build a model.
That's a very physical thing versus we were talking more conceptually.
Conceptually, but it's a...
And you think those are the same mechanism?
It's the same mechanism.
Really?
Yeah.
You have to build a causal model.
Yeah.
Right?
I see.
For most things, right?
So you have to move from correlation to causation.
I mean, we've heard this term.
Yeah.
you know, ad infinitum.
But here it's making a difference in the way we view intelligence.
How have the last three papers been received?
Oh, I don't know.
Well, the arXiv versions...
Let me tell you.
I mean, a lot of great reception, a lot of people read it.
I'm just wondering, like, what kind of feedback you've gotten.
I'm getting good feedback, but I'm an outsider in this field, right?
Right.
This networking guy.
I'm a networking guy.
Why is he writing about, you know, learning and machine learning and deep learning and vision?
But people who have actually taken the time to read those papers, I'm getting really good feedback.
There was a recent paper by Google Research, which tried to teach LLMs, by some sort of RLHF, to do Bayesian learning properly.
And that's going in this direction.
I think people are coming around to the view that, okay, LLMs are
doing Bayesian learning.
I know that some people also looked at the Bayesian wind tunnel paper, the arXiv version,
and they reproduced the experiments.
That's great.
They just saw what was written, and they did the training, and they saw, yeah, yeah,
this is actually happening.
So what's next?
What's next is, you know, these two parallel tracks, I hope to make progress there,
plasticity and causal.
Because to date, you've taken an existing mechanism
and you've created a formal model of how it works.
Yeah.
And so now you're actually interested in improving,
creating a new mechanism.
Yeah, yeah.
And do you think it's an entirely different architecture,
or do you think LLMs are, like, part of the solution?
I think LLMs are definitely part of the solution.
I see.
But there has to be something more.
Another mechanism.
So I was not interested in sort of cataloging
what all these LLMs can do.
Yeah.
I was more interested in why and how they are doing it.
I think now we have a good grip on the why and how.
And the next step is to, you know, move them to the next level.
Now I think we have a fairly good understanding of what the limits are.
Now, how do you go to the next step?
Is there an equivalent kind of theoretical framework for causality that applies here?
similar to like Bayesian for inference?
Well, Judea Pearl's whole causal hierarchy, I think that's the right one.
That's a very good one.
You know, the whole do calculus approach.
I think it's a good way to think about it.
You know, the sort of association, intervention, counterfactuals.
Yeah.
It takes you from correlation to actually simulation.
In a mathematical way.
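The association-versus-intervention rungs of that hierarchy can be demonstrated with a toy simulation. This is a hypothetical illustration, not from the episode: a confounder Z drives both X and Y, so X and Y are strongly correlated under pure observation, but intervening on X (Pearl's do-operator) leaves Y untouched.

```python
# Toy causal model: Z -> X and Z -> Y, with no edge from X to Y.
import random

random.seed(0)

def observe(n=10000):
    # Rung 1, association: just watch the system run.
    data = []
    for _ in range(n):
        z = random.random()
        x = z + 0.1 * random.random()  # X is caused by Z
        y = z + 0.1 * random.random()  # Y is caused by Z, not by X
        data.append((x, y))
    return data

def intervene(x_value, n=10000):
    # Rung 2, intervention: do(X = x_value) severs the Z -> X edge.
    return [(x_value, random.random() + 0.1 * random.random()) for _ in range(n)]

def corr(pairs):
    n = len(pairs)
    mx = sum(x for x, _ in pairs) / n
    my = sum(y for _, y in pairs) / n
    cov = sum((x - mx) * (y - my) for x, y in pairs) / n
    vx = sum((x - mx) ** 2 for x, _ in pairs) / n
    vy = sum((y - my) ** 2 for _, y in pairs) / n
    return cov / (vx * vy) ** 0.5

obs_corr = corr(observe())                             # near 1: X "predicts" Y
y_do_low = sum(y for _, y in intervene(0.0)) / 10000
y_do_high = sum(y for _, y in intervene(1.0)) / 10000  # ~same mean as y_do_low
# X and Y look tightly linked observationally, yet forcing X up or down
# does nothing to Y: correlation without causation.
```

A model trained only on the observational data would happily use X to predict Y; only a causal model knows that setting X changes nothing.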
That's great.
All right.
Well, listen, really appreciate you coming in. This
is awesome. So we had you here for the first paper, where you had the empirical results.
Then we had you back when you actually had, like, the formal proof. And hopefully the next time
you come back, you will have a proposal for the mechanism that actually provides the next step.
Hopefully. Yeah. All right. We're working on it. We're coming in. Thank you for having me.
Thanks for listening to this episode of the a16z podcast. If you like this episode, be sure to like,
comment, subscribe, leave us a rating or review, and share it with your friends and family.
For more episodes, go to YouTube, Apple Podcasts, and Spotify.
Follow us on X at A16Z and subscribe to our substack at A16Z.com.
Thanks again for listening, and I'll see you in the next episode.
As a reminder, the content here is for informational purposes only.
It should not be taken as legal, business, tax, or investment advice,
or be used to evaluate any investment or security and is not directed at any
investors or potential investors in any A16Z fund.
Please note that A16Z and its affiliates may also maintain investments in the companies discussed in this podcast.
For more details, including a link to our investments, please see A16Z.com forward slash disclosures.
