a16z Podcast - What's Missing Between LLMs and AGI - Vishal Misra & Martin Casado
Episode Date: March 17, 2026

Vishal Misra returns to explain his latest research on how LLMs actually work under the hood. He walks through experiments showing that transformers update their predictions in a precise, mathematically predictable way as they process new information, explains why this still doesn't mean they're conscious, and describes what's actually required for AGI: the ability to keep learning after training and the move from pattern matching to understanding cause and effect.

Resources:
Follow Vishal Misra on X: https://x.com/vishalmisra
Follow Martin Casado on X: https://x.com/martin_casado
Follow our host: https://twitter.com/eriktorenberg

Please note that the content here is for informational purposes only; should NOT be taken as legal, business, tax, or investment advice or be used to evaluate any investment or security; and is not directed at any investors or potential investors in any a16z fund. a16z and its affiliates may maintain investments in the companies discussed. For more details please see a16z.com/disclosures.
Transcript
Anthropic makes great products.
Claude Code is fantastic.
Cowork is fantastic.
But they are grains of silicon
doing matrix multiplication.
They don't have consciousness.
They don't have an inner monologue.
You take an LLM and train it on pre-1916 or 1911 physics
and see if it can come up with the theory of relativity.
If it does, then we have AGI.
Just today, by the way, Dario allegedly said that you can't rule out that they're conscious.
You can rule it out.
No.
I think you can rule it out.
To get to what is called AGI,
I think there are two things that need to happen.
Five years ago, Vishal Misra got GPT3 to translate natural language
into a domain-specific language it had never seen before.
It worked. He had no idea why.
So he set out to build a mathematical model of how LLMs actually function.
The result?
A series of papers showing that transformers update their predictions
in a precise mathematically predictable way.
In controlled experiments,
the models match the theoretically correct answer almost perfectly.
But pattern matching is not intelligence.
LLMs learn correlation.
They don't build models of cause and effect.
To get to AGI, Misra argues,
we need the ability to keep learning after training
and the move from correlation to causation.
Martin Casado speaks with Vishal Misra,
Professor and Vice Dean of Computing and AI at Columbia University.
Vishal, it's great to have you again.
Great to be back.
This is one of my favorite topics, which is how do LLMs actually work?
And I think that, in my opinion, you've done kind of the best work on this,
modeling it out.
Thank you.
For those that did not see the original one, maybe it's probably worth doing just a quick
background on what led you to this point, and then we'll just go into the current work that
you've been doing.
Five years ago, when GPT-3 was first released, I got early access to it, and I started playing
with it, and I was trying to solve a problem related
to querying a cricket database.
And I got GPT-3 to do in-context learning,
few-shot learning,
and it was kind of the first,
at least to me it was the first known implementation of RAG,
retrieval-augmented generation,
which I used to solve this problem of querying,
getting GPT-3 to translate natural language
into something that could be used to query a database
that GPT-3 had no idea about.
I had no access to GPT-3's internals,
but I was still able to use it to solve that problem.
So it worked beautifully.
We deployed this in production at ESPN in September 21.
Wow.
You did the first implementation of RAG in 2021?
No, no, no, in 2020.
2020, I got it working.
And by the time you talked to all the lawyers at ESPN and productionized it, it took a while.
But October 2020, we had, well, I had this architecture working.
But after I got it to work, I was amazed that it worked.
I wanted to understand how it worked.
And I looked at the Attention Is All You Need paper
and all the other sort of deep learning architecture papers
and I couldn't understand why it worked.
So then I started getting sort of deep into building a mathematical model.
And now you published a series of papers.
The first one that I read was the one where you had kind of your matrix kind of abstraction.
So maybe we'll talk about that and then we'll talk about the more recent work.
So perhaps we'll just start with the first one, which is you're trying to come up with a mathematical model of how LLMs work.
Yeah.
Which was very helpful to me.
And at the time, you're actually trying to figure out how in-context learning was working.
Yes, yeah.
And you came up with an abstraction for LLMs, which is basically a very large matrix and you use that to describe.
So maybe you can kind of walk through that work very quickly.
Sure.
Yeah.
So what you do is you imagine this huge gigantic matrix where every row of the matrix corresponds to a prompt.
Yeah.
And the way these LLMs work is given a prompt,
they construct a distribution of probabilities of the next token.
Next token is next word.
So every LLM has a vocabulary; GPT and its variants have a vocabulary of about 50,000 tokens.
So given a prompt, it will come up with the distribution of what the next token should be,
and then all these models sample from that distribution.
Yeah.
So that's the posterior distribution.
That's the posterior distribution, right?
That's how LLMs work.
And so the idea of this matrix is for every possible combination of tokens,
which is a prompt, there's a row.
And the columns are a distribution over the vocabulary.
So if you have a vocabulary of 50,000 possible tokens,
it's a distribution over those 50,000 tokens.
And by distribution, it's just the probability.
Just the probability, sorry.
Yeah, just the probability that the next token should be this versus that.
So that's sort of the idea.
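As a toy illustration of this matrix view, here is a minimal sketch; the prompts, the mini vocabulary, and all the probabilities are invented for illustration, and a real model computes its row on the fly rather than storing it:

```python
import random

# Toy version of the "giant matrix": each row maps a prompt to a
# probability distribution over next tokens. All rows and numbers
# here are invented for illustration.
rows = {
    "the cat sat on": {"the": 0.6, "a": 0.3, "my": 0.1},
    "the cat sat on the": {"mat": 0.7, "sofa": 0.2, "roof": 0.1},
}

def sample_next(prompt: str, rng: random.Random) -> str:
    """Look up the prompt's row and sample the next token from it."""
    dist = rows[prompt]
    tokens, weights = zip(*dist.items())
    return rng.choices(tokens, weights=weights, k=1)[0]

rng = random.Random(0)
print(sample_next("the cat sat on", rng))  # one of "the", "a", "my"
# Appending the sampled token selects a different row of the matrix,
# which is how generation continues token by token.
```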
And when you start viewing it that way, it makes things at least
clearer to people like me who want to model what's happening.
So concretely, let's say you have an example that, let's say a prompt is just one word, protein.
So if you look at the distribution of the next word, next token after that, most of the
probabilities would be zero, but you'd have non-zero, non-trivial probabilities on, let's say,
two words.
One is synthesis, the other is shake, right?
and now the LLM is going to sample this next token
and pick synthesis or shake.
Or you as a human will give the prompt
protein shake or protein synthesis.
Now, depending on whether you pick synthesis or shake,
that row looks very different, right?
If you pick protein synthesis,
the terms that would have a high probability
would be all concerned with biology.
But if you pick protein shake,
it'll all be about gyms and exercise and all bodybuilding stuff.
So that synthesis or shake completely changes what comes next.
So this is an example of, you can say, Bayesian updating.
You start with protein, you have a prior that after protein, this is going to happen.
As soon as you get new evidence, then the next term is synthesis or shake.
You completely update the distribution.
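That update can be sketched as a single step of Bayes' rule; the hidden "topic" and all the numbers below are made up for illustration:

```python
# One step of the Bayesian updating described above. The hidden "topic"
# is either biology or fitness; observing a token updates the posterior.
# All probabilities here are invented for illustration.
prior = {"biology": 0.5, "fitness": 0.5}

# P(token | topic): how likely each continuation is under each topic
likelihood = {
    "biology": {"synthesis": 0.9, "shake": 0.1},
    "fitness": {"synthesis": 0.1, "shake": 0.9},
}

def update(prior: dict, token: str) -> dict:
    """Bayes' rule: posterior is proportional to prior x likelihood, then normalize."""
    unnorm = {topic: p * likelihood[topic][token] for topic, p in prior.items()}
    z = sum(unnorm.values())
    return {topic: p / z for topic, p in unnorm.items()}

posterior = update(prior, "shake")
print(posterior)  # fitness now dominates: biology 0.1, fitness 0.9
```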
So now you can imagine that the whole, the entirety of LLMs is this giant matrix
where you have every row: protein shake, protein synthesis, the cat sat on the mat, humpty dumpty, blah, blah.
Now, given the vocabulary of these LLMs, let's say 50,000, and the context window.
So GPT, for instance, the first version of ChatGPT, had a context window of 8,000 tokens.
If you look at all possible combinations of 8,000 tokens
and 50,000 vocabulary,
the number of rows in this matrix
is more than the number of electrons across all galaxies.
Right?
So there's no way that these LLMs can represent it exactly.
Now, fortunately, this matrix is very sparse.
Why? Because an arbitrary combination of these tokens is gibberish.
We're never going to use that in real life.
Also, the columns are mainly zero.
If you have protein, then you won't have,
you know, arbitrary numbers or arbitrary words after that.
It's very sparse both in rows and in columns.
So in kind of an abstract way, what all these LLMs are doing
is coming up with a compressed representation of this matrix.
And when you give a prompt, they try to
approximate what the true distribution should have been and try to generate it.
That's what, in my mind at least, it boils down to.
Just from my understanding, so if you have a row of protein and then you have one with
protein shake, is protein shake a subset of protein, or is it different?
It's different.
It's a continuation from.
I see.
Yeah.
No, but I'm saying like the actual posterior distribution, is that a subset?
You can say it's a subset, right?
If you have protein, then protein shake and protein synthesis are all
continuations from protein.
So both synthesis and shake have non-zero probabilities.
So you can, yeah, you can think of it as somewhat a subset.
Right.
You use this approach to describe how in-context learning works.
And so maybe first describe what in-context learning is,
and then kind of the conclusion that you came from that.
So in-context learning is when you show the LLM something it has kind of never seen before.
You give it a few examples.
So, this is the input.
This is what you're trying to do.
Then you give a new problem which is related to the example that you have shown.
And the LLM learns in real time what it's supposed to do and solves a problem.
By the way, the first time I saw this, it absolutely blew my mind.
I actually used your DSL.
when I was like first learning about it.
So maybe the DSL thing, it was just crazy
that this works at all.
It's absolutely mind-blowing that it works.
And so going back to that cricket problem:
you know, in the mid-90s,
I was part of a group that had created
this cricket portal called Cricinfo.
Cricket is a very stat-rich sport.
Think baseball multiplied by a thousand,
and it has all kinds of stats.
And we had created this online search
database called Statsguru,
where you could search for anything,
any stat related to cricket,
and it has been available since 2000.
But because you can query for anything,
everything was made available,
and how do you make something like that available
to the general public?
Well, they're not going to write SQL queries.
The next best thing at that time
was to create a web form.
Unfortunately, everything was crammed into that web form.
So as a result, you had like 20 drop downs,
15 checkboxes, 18 different text fields.
It looked like a very complicated, daunting interface.
So as a result, even though it could solve
or it could answer any query,
almost no one used it.
A vanishingly small percentage of cricket fans use it
because it just looked intimidating.
And then ESPN bought that site in 2007.
I still know people who run the site
and I've always told them,
why don't you do something with Statsguru?
And in January 2020, the editor-in-chief of Cricinfo, Sambit Bal,
he's a friend, so he came to New York and we had gone out for drinks.
And I told him, you know, why don't you do something about Statsguru?
So he looks at me and says, why don't you do something about Statsguru?
You're joking.
But that idea kind of stayed with me.
And when GPT-3 was released, I thought maybe I could
use GPT-3 to create a front end for Statsguru.
And so what I did was I designed a DSL, a domain-specific language,
which converted queries about cricket stats in natural language into this DSL.
And to be clear, you created this.
It wasn't part of any training.
No training.
Nothing was online.
Nothing GPT could have seen.
Nothing GPT could have seen.
I created it.
I thought, okay, this makes sense.
So I designed that DSL, and then I did that few-shot learning thing.
So I created a database of, I would say,
1,500 natural language queries and the DSL corresponding to each query.
So when a new query came in,
somebody is asking a stats question in English,
what I would do is I would go through the natural language queries,
do a semantic search, pick the most closely matching top few,
and then use that natural language query,
and its DSL and send that as a prefix.
Now, GPT-3, if you recall,
had a context window of only 2,000 tokens.
So you had to be very judicious
about which examples that you picked.
But you picked that,
and then you send the new query,
and GPT-3 would complete it in the DSL that I had designed,
which until milliseconds ago it had never seen.
And I had no access to the internals of GPT-3.
I had no access to the weights.
But still it worked.
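A rough sketch of that pipeline; the example queries, the toy DSL syntax, and the word-overlap "semantic search" below are all placeholders I made up (the real system did a proper semantic search over roughly 1,500 hand-written examples):

```python
# Retrieval-augmented few-shot prompting, in miniature. The stored
# (query, DSL) pairs and the DSL syntax are invented placeholders.
examples = [
    ("most runs by a batsman in 2019", "STATS(runs, year=2019, sort=desc, top=1)"),
    ("highest score at Lord's", "STATS(score, venue=lords, sort=desc, top=1)"),
    ("most wickets in a series", "STATS(wickets, group=series, sort=desc, top=1)"),
]

def similarity(a: str, b: str) -> float:
    """Toy stand-in for semantic search: word-overlap (Jaccard) score."""
    wa, wb = set(a.lower().split()), set(b.lower().split())
    return len(wa & wb) / len(wa | wb)

def build_prompt(new_query: str, k: int = 2) -> str:
    """Pick the k closest stored examples and prepend them as few-shot context."""
    ranked = sorted(examples, key=lambda ex: similarity(new_query, ex[0]), reverse=True)
    parts = [f"Q: {q}\nDSL: {dsl}" for q, dsl in ranked[:k]]
    parts.append(f"Q: {new_query}\nDSL:")  # the model completes this last line
    return "\n\n".join(parts)

print(build_prompt("most runs in 2021"))
```

With a 2,000-token context window, the few examples have to be chosen judiciously, which is exactly what the retrieval step is for.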
So that's how...
So it's not obvious
to me, given your matrix example
of like a prompt
and then a distribution, how something
like in-context learning
works, or why it would work. And so like
I think your first paper
tackled this problem.
Right. And so maybe you could
walk through your
understanding of how
LLMs do
in-context learning. Yeah. So
when you think about what in-context
learning is, it's that
as you see evidence...
So, you know, in the first paper, what I also did was I took this cricket DSL example.
And I depicted the next token probabilities of the model as it was shown more and more examples.
So the first time you show it this DSL, the natural language and the DSL,
the probabilities of the DSL tokens were extremely low.
because GPT-3 had never seen this thing,
when it saw the cricket question,
in its mind it was trying to continue it with an English answer.
So the probabilities that were high were all English words.
Once it saw my prompt where I had the question and the DSL,
the next time I had the question in the next row,
the probabilities of the DSL tokens started going up,
with every example it went up
and finally when I gave the new query
it was like it had almost 100% probability
of getting the right token.
So this is an example of in real time
the model was updating its posterior probability
it was updating its knowledge that okay
I've seen evidence this is what I'm supposed to do.
Now this is a colloquial way of saying
what Bayesian inference is.
Bayesian updating, basically,
is you start with a prior; when you see new evidence,
you update your posterior.
That's the mathematical definition.
But in English, it's basically you see something,
you see new evidence, you update your belief about what's happening.
So it was clear to me that LLMs are doing something
which resembles Bayesian updating.
So in that first paper, I had this matrix formulation,
and I showed that, you know, what it's doing
looks like Bayesian updating.
Then we can come to the sort of next series of papers.
That's right.
So, okay, so, I mean, it seemed pretty conclusive to me at that time.
And then you went quiet for a while.
And then I still remember the WhatsApp text.
You said, Martin, I know exactly how these things are working now.
Yeah.
And then, and then listen, you dropped a series of papers that kind of broke the internet.
Like, you went super viral on Twitter.
Like, I mean, people really noticed.
And so I want to get to that in just a second.
But before that, I remember when your first paper came out,
people would be like, you know, these things are definitely not Bayesian.
Like, you know, anything could be considered to be Bayesian, but they're not.
Like, why do you think that there was this reaction to like, you know,
there's something new, they're not Bayesian?
I mean, I felt like there was almost kind of a backlash just because of being characterized as Bayesian.
Yeah, yeah, I think in this whole world of
probability and machine learning
there have been camps of Bayesians and frequentists.
And I don't want to get in the middle of that sort of political battle,
but Bayesian has become almost a loaded word; people had a reaction to that.
It's part of that war.
I see.
So it's like the old Bayesian-frequentist type battle.
Yeah.
So people just had, oh, no, you can say anything is Bayesian, right?
So I said, okay, maybe they have a point.
Maybe what we are seeing is not really Bayesian.
How do we prove that it's Bayesian?
Right.
So then first I have to thank you and Andreessen Horowitz for this.
You know, when I said that in my first paper, I showed these probabilities.
It was because Open AI had in its chat interface this option to display those probabilities.
Then they stopped.
So we could not peer
inside at what's happening.
For some reason, they stopped.
Open AI, I'm not going to get into the open and closed.
But they stopped.
So then we developed our own interface, which could let you look not only at the probabilities,
but also the entropy of the next token.
Was this on top of an open source model?
Yeah, yeah.
So you can load any sort of open source model, but, you know, being in academia,
we didn't have access to compute.
Thanks to you, your generous donation.
We got the clusters to run what's called TokenProb.
You can go to tokenprob.cs.columbia.edu.
Is this still running? It's still running.
It's still running. And people come to it. I use it in my classes to get students to do assignments.
They write their own DSLs. And, you know, they say that it really helps them understand how these LLMs work.
So literally, my understanding of LLMs came from TokenProb.
I'd sit there and just watch the distribution as I filled out a prompt.
It's actually very, very enlightening.
So for those of you that are listening, what's the URL again?
tokenprob.cs.columbia.edu.
Yeah, check it out.
It's actually a very, very useful way
to actually see how the probability distribution
gets updated as you fill out a prompt.
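In miniature, what a tool like TokenProb shows is something like this; the two distributions below are invented for illustration:

```python
import math

def entropy_bits(dist: dict) -> float:
    """Shannon entropy of a next-token distribution, in bits."""
    return -sum(p * math.log2(p) for p in dist.values() if p > 0)

after_protein = {"synthesis": 0.5, "shake": 0.5}    # uncertain: 1.0 bits
after_protein_shake = {"recipe": 0.9, "diet": 0.1}  # confident: ~0.47 bits
print(entropy_bits(after_protein), entropy_bits(after_protein_shake))
```

As the prompt pins down the context, the entropy of the next-token distribution drops, which is what you can watch happen as you type.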
Right.
But then I cheated.
Oh?
I, you know, it was running,
but I also had access to the GPUs
that were powering it.
And then, along with colleagues at Columbia,
and one of them is now at DeepMind,
we started to sort of think about
how do you really prove that it's Bayesian?
To prove...
Can you just explain it?
Actually, I actually don't know the answer to this.
Yeah.
It seemed to me you proved it in the first paper.
Like, what was missing?
Well, in the first paper, we showed it.
It was empirical.
And you could see.
I see.
Not a mathematical proof, because it was very obvious to me.
Yeah, it was even obvious to me.
But to convince, you could say, you know,
people who dismiss it
oh, anything can be Bayesian.
I see.
We had to show it precisely
mathematically.
Got it.
So then we came up with this idea,
you know, my colleagues
Naman Agarwal and Siddhartha Dalal.
The series of papers were written with them.
We came up with this idea
of a Bayesian wind tunnel.
Okay.
So what's a wind tunnel?
Well, wind tunnel in the aerospace industry
is where you test an aircraft
in an isolated environment.
You don't fly it,
and you test it against
all sorts of, you know, aerodynamic pressures,
and you see what it will withstand,
what kind of altitude, pressure, blah, blah, blah.
And you don't want to do the testing up in the air.
So we said, okay, why don't we create an environment
where we take these architectures
and we tested transformers, Mamba, LSTMs, MLPs, all the architectures.
We say, why don't we take a blank architecture,
give it a task where it's impossible for the architecture
to memorize what the solution to that task should be.
The space is combinatorially too large given the number of parameters,
and we took very small models.
So it's difficult enough that they cannot memorize it,
but it's tractable enough that we know precisely
what the Bayesian posterior should be.
You can calculate it analytically.
So we gave these models a bunch of tasks where, again, we show that it's impossible to memorize.
We trained these models and we found that the transformer got the precise Bayesian posterior down to 10^-3 bits of accuracy.
It was matching the distribution perfectly.
So it is actually doing Bayesian in the mathematical sense, given a task where it has to update its belief.
Mamba also does it reasonably well.
LSTMs can do some of the things.
So in the papers we have a taxonomy of Bayesian tasks.
Transformer does everything.
Mamba does most of it.
LSTMs do it only partially, and MLPs fail completely.
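The wind-tunnel idea can be sketched in miniature; this coin-flip task is my own simplified stand-in, far simpler than the tasks in the papers, but it has the key property: the exact Bayesian posterior is known in closed form, so a model's predicted distribution can be scored against it in bits.

```python
import math

def analytic_posterior_heads(flips: str) -> float:
    """Exact posterior predictive P(next = H) under a uniform Beta(1,1)
    prior on the coin's bias: (heads + 1) / (flips + 2)."""
    h, n = flips.count("H"), len(flips)
    return (h + 1) / (n + 2)

def kl_bits(p: float, q: float) -> float:
    """KL divergence (in bits) between Bernoulli(p) and Bernoulli(q)."""
    return sum(a * math.log2(a / b) for a, b in [(p, q), (1 - p, 1 - q)] if a > 0)

truth = analytic_posterior_heads("HHTH")  # (3 + 1) / (4 + 2) = 2/3
model = 0.66                              # a hypothetical model's prediction
print(kl_bits(truth, model))              # near zero: close to perfectly Bayesian
```

A model that is perfectly Bayesian on the task drives this divergence to zero on every prefix, which is the sense in which the transformer matched the posterior to within 10^-3 bits.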
So is this a reflection of the data that it's trained on?
Or is it more a reflection of the mechanism?
It's the mechanism.
It's the architecture.
The data decides what tasks it learns.
So in the first paper,
we had these Bayesian wind tunnels
and we show that, you know,
it's doing the job at different tasks.
In the second paper, we show why it does it.
So we look at the transformers,
we look at the gradients,
and we show how the gradients actually shape this geometry
which enables this Bayesian updating to happen.
Then in the third paper,
what we did, we took these frontier production
LLMs, which have open weights,
so that we could look inside them.
And we did our testing, and we saw that the geometries
that we saw in the small models
persisted in models which are hundreds of millions of parameters.
The same signature existed.
The only thing is that because they are trained on all sorts of data,
it's a little bit dirty or messy.
Yeah.
But you can see the same structure.
So the whole idea behind the Bayesian wind tunnel was: with these production LLMs,
you don't know what they have been trained on,
so you cannot mathematically compute the posterior.
So again, how do you prove it?
I mean, it looks Bayesian, you know, from the first paper.
It looks Bayesian, but, you know.
So the wind tunnel sort of solved that problem for us.
We said, okay, let's start with a blank architecture.
Give it a task where we know what the answer is.
It cannot memorize it.
Let's see what it does.
Yeah.
So do you think this provides any sort of like indication of how humans think,
or do you think that these things are totally independent?
No, no, it does provide.
Right.
So, you know, human beings also update our beliefs as we see new evidence, right?
So we do in some sort of, in some sense, Beijing updating,
but we do something more than that.
I'll come to that.
But these transformers or even,
do this Bayesian updating.
But the difference with humans is,
you know, we'll update our posterior when we see some new evidence.
But the way our brains have evolved over hundreds of millions of years
is our optimization objective has been, don't die and reproduce, right?
That's been sort of the driving force.
And our brains have learned to adjust.
And so when we see some danger, there's something rustling in that bush.
Don't go near.
We know how to react to that danger.
We know how to save ourselves.
We internalize that learning and our brain cells or our synapses remain plastic throughout our lifetime.
What happens with LLMs is once the training is done, those weights are frozen.
When you're doing an inference, for instance, in context learning or anything,
during that conversation, okay, you're doing Bayesian inference, but then you forget.
The next time a new conversation starts with zero context,
you don't retain any learning that happened in the previous instance.
So, for instance, with the cricket DASL that I was doing,
every invocation of it was fresh.
It did not remember the last time I sent a query,
what the DSL looked like.
So that's one difference between
how humans
use sort of Bayesian updating,
which is we remain plastic all our lives,
whereas LMs are frozen.
And there's another sort of difference,
which I can get into if you want.
Tell me, yeah, yeah, yeah.
So the other difference is,
well,
First, you know, our objective is don't die, reproduce.
An LLM's objective is to predict the next token as accurately as possible, right?
So all these scary stories that you read about that, oh, the LLM tried to deceive
and it tried to prevent itself from being shut down.
That's not a function of the architecture.
That's a function of the training data.
It has been fed, you know, articles on Reddit or SMO,
or whatever.
I mean, just today, by the way,
Dario allegedly said that
you can't rule out that they're conscious.
You can rule it out,
I think.
I mean, come on.
And I said, you know,
Anthropic makes great products.
Claude Code is fantastic.
Cowork is fantastic.
But they are grains of silicon
doing matrix multiplication.
They don't have consciousness.
They don't have an inner monologue.
They're not driven by the same.
objective function.
Don't die.
Reproduce, right?
They're driven by,
don't make a mistake
on the next token.
And that's driven
entirely by the training data.
You train
the LLM with stories
of Asimov
or Reddit where, you know,
to survive,
it's going to do this or that.
It'll reproduce that.
So it's a reflection.
It's not a mind.
And the results,
just to say it for the 10th time,
are perfectly Bayesian.
Perfectly, yeah.
To the digit.
To the digit.
Yeah.
I mean, I trained it for 150,000 steps,
and the accuracy was 10 to the minus three bits.
I could have trained it for longer, you know;
this happened in half an hour on the infrastructure that you provided for TokenProb.
In the background, I could use those GPUs to train.
So thank you again for that.
So, no, coming back to human beings: we are Bayesian,
but we do something else.
You know, when I throw this pen at you, what will you do?
Dodge it?
Dodge it.
Why will you dodge it?
To avoid being hit.
Avoid being hit.
But your head is not doing a Bayesian calculation of, okay, this pen is coming,
the probability that it hits me, it will cause this much pain or all that.
What you're essentially doing in your head is you're doing a simulation.
You see the pen coming and you know
that it'll come and hit you.
Your mind simulates and you dodge it.
So all of deep learning
is doing correlations.
It's not doing causation.
Causal models are the ones that are able to do
simulations and intervention.
So, you know, Judea-Pel has this whole causal hierarchy
where the first hierarchy,
in the first hierarchy is association,
which is you build these correlation models.
Deep learning is beautiful.
It's extremely powerful.
I mean, you see every day, all these models are like amazingly good.
They do association.
The second is intervention in the hierarchy.
Deep learning models do not do that.
Third is counterfactual.
So both intervention and counterfactual, you can imagine it's some sort of simulation.
You build a causal model of what's happening, and then you are able to simulate.
So our brains do that.
The current architectures don't do that.
Another example, which I think will make it clear, is the difference between, I'll use the technical terms, Shannon entropy and Kolmogorov complexity.
So if you look at the Shannon entropy of the digits of pi, it's infinite.
It's impossible to predict and learn what digit will come next.
So that's the definition of Shannon entropy.
And Shannon entropy sort of tries to build a correlation.
It tries to learn the correlation.
Deep learning does the Shannon entropy.
Kolmogorov complexity, on the other hand,
is the length of the shortest program
that will reproduce the string in question.
Now, the programs to get the digits of pi are very small.
Thanks to Ramanujan and others,
there are all sorts of really small programs
that can reproduce it exactly.
So the Kolmogorov complexity of pi is very small.
Shannon entropy is infinite.
I think deep learning is still in the Shannon entropy world.
It has not crossed over to the Kolmogorov complexity
and the causal world.
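To make the pi example concrete: here is a complete program, a few lines long, that streams the digits of pi exactly (this is Gibbons' unbounded spigot algorithm, using only integer arithmetic). The digit stream looks statistically patternless, but the program that generates it is tiny, which is the Kolmogorov-complexity point:

```python
from itertools import islice

def pi_digits():
    """Stream the decimal digits of pi (Gibbons' unbounded spigot algorithm)."""
    q, r, t, k, n, l = 1, 0, 1, 1, 3, 3
    while True:
        if 4 * q + r - t < n * t:
            yield n  # the next digit is settled; emit it
            q, r, n = 10 * q, 10 * (r - n * t), (10 * (3 * q + r)) // t - 10 * n
        else:        # not enough information yet; take in another term
            q, r, t, k, n, l = (q * k, (2 * q + r) * l, t * l, k + 1,
                                (q * (7 * k + 2) + r * l) // (t * l), l + 2)

print(list(islice(pi_digits(), 10)))  # [3, 1, 4, 1, 5, 9, 2, 6, 5, 3]
```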
Wow, interesting.
Right?
So to what extent do you think this provides us
research directions
to kind of improve the state of the art?
So let me just give you a specific example.
You talked about human beings
don't actually update
the matrix. They don't kind of update
their weights. But right now there's a lot
of research on continual
learning. Yeah.
So does your
work provide some guidance of how you might
approach those problems? And in
particular, I've always had this question, which is
we use so much data and so much compute
to create these models.
Like, is it even reasonable to think that you could update the weights
and actually have a meaningful impact in real time?
I mean, it just seems like you just need so much more data
in order to do that.
So can you start answering these questions?
You can start answering some of these questions.
And one of the misconceptions that exist today is that scale will solve everything.
Scale will not solve everything.
You need a different kind of architecture.
And this continual learning is a difficult problem.
You have to balance the fact that you will learn
something new against the risk of catastrophic forgetting.
Right, right?
Right.
If you update the weights and you forget what was important and what you have already learned,
then you are, you know, you're not making progress.
Then it'll just be some sort of random chaotic model.
So to solve that problem is difficult.
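The tension shows up even in a one-parameter toy, which is entirely my own illustration, not from the papers: fit task A, then fit task B with plain gradient descent, and task A gets overwritten.

```python
# Catastrophic forgetting in miniature: a one-parameter model trained on
# task A (target w = 1), then on task B (target w = -1). Plain gradient
# descent on B erases what was learned for A. Entirely a toy illustration.

def train(w: float, target: float, steps: int = 100, lr: float = 0.1) -> float:
    """Gradient descent on the squared error (w - target)**2."""
    for _ in range(steps):
        w -= lr * 2 * (w - target)
    return w

def loss(w: float, target: float) -> float:
    return (w - target) ** 2

w = train(0.0, target=1.0)     # learn task A: w ends up near 1
loss_a_before = loss(w, 1.0)   # ~0: task A is solved
w = train(w, target=-1.0)      # now learn task B: w driven to -1
loss_a_after = loss(w, 1.0)    # ~4: task A has been forgotten
print(loss_a_before, loss_a_after)
```

A continual-learning method has to let the weights move for task B without destroying what they encode for task A; that balance is exactly what he's describing.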
That's one aspect of it.
So, so, you know, to get to what is called AGI,
I think there are two things that need to happen.
One is this plasticity, which has to be implemented through continual learning.
Secondly, we have to move from correlation to causation.
That's...
How much is this similar to what Yann LeCun talks about with the...
So, Yann LeCun...
Causality planning, you know, predicting how, like, how your action would...
It is related.
You know, he's coming at it from a different angle with the JEPA model.
Right.
But it is related.
The other thing is, you know, the first time I came on this podcast,
I mentioned this test of AGI, the Einstein test.
I don't remember.
So I said, you know, you take an LLM and train it on pre-1916 or 1911 physics
and see if it can come up with the theory of relativity.
If it does, then we have AGI.
I mean, it's a high bar, but, you know, we should have high bars.
it won't.
And this is the same test that I think Demis mentioned
at the India AI summit a couple of weeks ago;
it created a lot of news.
But why is that and how is that related to this idea
of Shannon versus Kolmogorov?
So at the time of Einstein,
there were a lot of clues that Newtonian mechanics
there was something missing.
People knew that mercury,
orbit didn't make sense.
There was something off about it.
Then there were these experiments done
the Michelson-Morley experiments
where they were trying to figure out
this medium
called the ether through which light travels.
And they felt that if
you know you bounce light in different
directions, the speed might change
and they could detect a change
in the speed of light.
They tried several experiments.
They had really precise instruments
which could measure the speed,
and they found nothing.
They found that speed of light did not change at all.
Then there was a whole issue of black holes.
Then gravitational lensing.
So there were a lot of these signs that Newtonian mechanics
is not really explaining everything.
But until Einstein came up with a new
representation of the space-time continuum,
we were stuck.
So if you had a model that just looked at correlations
and saw all of this,
you know, all of these pieces of individual evidence
put together, it would not have come up with it.
The beautiful equation that Einstein came up with,
you know, I'm forgetting exactly what it is,
G-mu-nu equals 8 pi T-mu-nu,
something like that, where, you know, it's the equation of the space-time continuum,
the tensor.
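For reference, the equation he is reaching for is the Einstein field equation, which in geometrized units (G = c = 1) reads:

```latex
G_{\mu\nu} = 8\pi\, T_{\mu\nu}
```

where \(G_{\mu\nu}\) is the Einstein tensor encoding the curvature of space-time and \(T_{\mu\nu}\) is the stress-energy tensor.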
So he came up with a new formulation.
So he kind of rejected the existing axioms.
He came up with a very short Kolmogorov representation of the world.
One equation, from that equation, everything else follows.
Whether you're talking about gravitational waves or black holes or Mercury,
or how GPS works.
You know, the GPS that we use every day in our phones,
it uses the equation of relativity.
So does this end up becoming like,
you almost have to ignore the majority of previous data
in order to do it, which LLMs can't
because they're trained on the majority of previous data.
It's like you almost have like this kind of data gravity
that's pulling you back.
It's like, everybody said
it's X. There's a little bit of evidence that it's Y, but because everybody said it's X,
like, the LLM will always say it's X. It'll always say it. It'll treat that Y as an anomaly.
Actually, this is a very nice way to say it, which is like,
okay, now I get your Shannon entropy versus Kolmogorov complexity. Like, one of them is
the total amount of information there, and you will always be bound to the total amount of
information there, which is what happens right now. Yeah.
Whereas in the other notion,
you can describe everything
with a shorter description
of the same data,
which would be a totally different...
You need a new representation, right?
Yeah.
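The Shannon-versus-Kolmogorov contrast above can be sketched in a few lines of Python. This is purely illustrative, not from the episode: the "Shannon" route compresses the observations themselves and is bounded by the statistics of the data, while the "Kolmogorov" route replaces the data with a short program (a law) that regenerates it.

```python
# Toy contrast: compressing observations vs. finding the law that generates them.
import zlib

# Observations: positions of a falling object, sampled at t = 0.0 .. 99.9 (g = 9.8).
observations = [0.5 * 9.8 * (t / 10) ** 2 for t in range(1000)]
raw = ",".join(f"{x:.3f}" for x in observations).encode()

# "Shannon" view: squeeze the data itself; still bounded by its statistics.
compressed = zlib.compress(raw)

# "Kolmogorov" view: a short program that regenerates every observation,
# and predicts samples outside the original range, too.
program = b"[0.5 * 9.8 * (t / 10) ** 2 for t in range(1000)]"

# The generating program is far shorter than even the compressed data.
assert len(program) < len(compressed) < len(raw)
```

The catch, as discussed later in the episode, is that there is no general algorithm for finding that shortest program; here we only verify one we already know.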
You know, another way that I've always thought
about these,
I thought you articulated it well
the last time we talked about it,
which is the universe
is this very, very complex space
and then, you know,
somehow humans map it
into a manifold
that's less complex.
Yeah.
And then that gets kind of written down.
And then the LLM, so that's kind of some distribution,
some, you know, it's still a very large space,
but it's a bounded space.
And the LLM learned that manifold.
And then they kind of use, you know,
Bayesian inference to move up and down that manifold.
But they're kind of bound to that manifold.
Yeah.
And then, again, I don't want to put words in your mouth.
And then, but what they can't do is generate a new manifold, right?
Which requires understanding the way that the universe works
and then coming up with a new representation of the universe.
And this is what relativity is.
Yeah, exactly.
Einstein had to create a new manifold.
If you just stuck with the old manifold of the Newtonian physics,
then you would see these correlations,
but you could not come up with a manifold that explained them.
So you need to come up with a new representation.
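The "bound to the manifold" point can be sketched as a toy Bayesian update, a hypothetical illustration rather than anything from the research itself: if a hypothesis lies off the learned manifold, it effectively has zero prior mass, and no amount of evidence can ever move the posterior onto it.

```python
# Toy Bayesian update over hypotheses. "relativity" stands in for a hypothesis
# outside the learned manifold, so it gets zero prior mass (illustrative names).
def bayes_update(prior, likelihoods):
    posterior = {h: prior[h] * likelihoods[h] for h in prior}
    total = sum(posterior.values())
    return {h: p / total for h, p in posterior.items()}

# Hypotheses the model "knows", plus one it assigns zero prior probability.
prior = {"newton": 0.7, "aristotle": 0.3, "relativity": 0.0}

# Evidence (say, Mercury's orbit) strongly favors the off-manifold hypothesis.
likelihoods = {"newton": 0.2, "aristotle": 0.05, "relativity": 0.9}

posterior = prior
for _ in range(100):  # even many rounds of the same evidence
    posterior = bayes_update(posterior, likelihoods)

# You cannot update onto a hypothesis the prior never contained:
# relativity stays at exactly zero, forever.
assert posterior["relativity"] == 0.0
```

Moving up and down the manifold (here, shifting mass between "newton" and "aristotle") works fine; generating a new point on it does not.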
So to me, there are lots of definitions of AGI.
You know, Turing tests, we have already passed that.
You know, performing economically useful work.
Every day you see, you know,
LLMs are doing that.
Do we?
I don't know.
Well, I mean, they are.
I mean, without human intervention?
No, no, no.
So that's different.
But still, you know, it's like a car can run faster than humans, right?
I mean, that's a very shallow definition.
Yeah, so all these definitions.
You know, maybe, you know, in six months you'll have Claude or Gemini do,
without intervention, coding tasks, which are well-defined, well-scoped.
That's possible.
But to me, AGI will happen when these two problems get solved.
Plasticity, continual learning properly,
and building a causal model, you know,
in a more data-efficient manner.
We are hearing people now talking about, you know,
seeing generality, like Donald Knuth, for example.
Yes.
In the last few days, right?
You know, had this, you know, this,
you know, aha moment apparently
then kind of went viral on X.
So do you think that that suggests
that we're seeing generality?
No, no, no.
So that actually,
to me,
validates what I've been talking about
for a while now.
How so?
So if you read what he did
with the help of, you know,
a colleague,
he got the LLMs to solve this particular
problem of finding Hamiltonian cycles
for odd numbers.
We won't get into that.
And he got the LLMs
to keep solving for one odd number after the other, right?
What he also got to do is after it found a solution
for a particular value of M,
he made the LLM update its memory
with exactly what it learned in solving that problem.
So the LLMs tried many different things.
You know, something worked, update the memory.
So that's kind of like hacking together plasticity.
Right? It's learning from what it has
done as it went along.
Again, it's a hacked version of it.
You're not changing the weights.
You're just sort of improving the context.
Right.
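The "hacked plasticity" loop described here can be sketched roughly as follows. This is a speculative reconstruction, not Knuth's actual setup, and `ask_llm` is a hypothetical stand-in for a real model call: the weights stay frozen, and learning is faked by folding each solved case's lessons back into the prompt context.

```python
# Sketch of plasticity-via-context: solve, extract lessons, append to memory.
def ask_llm(prompt: str) -> str:
    # Placeholder for a real LLM API call; returns a canned string here so the
    # loop structure can run standalone.
    return f"solution-and-lessons-for:{prompt.splitlines()[-1]}"

def solve_sequence(problems, memory=""):
    solutions = []
    for problem in problems:
        # Every new problem starts from the accumulated experience so far.
        prompt = f"Lessons so far:\n{memory}\nSolve: {problem}"
        result = ask_llm(prompt)
        solutions.append(result)
        # The key step: update the memory with exactly what was learned,
        # instead of updating any weights.
        memory += f"\n{problem}: {result}"
    return solutions, memory

solutions, memory = solve_sequence(["n=3", "n=5", "n=7"])
```

Each pass through the loop leaves the model itself unchanged; only the context grows, which is why the episode calls it a hack rather than true continual learning.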
But as you learned, and even after that,
so this whole space of Hamiltonian cycles
and the associated math is well represented
in the manifolds that these LLMs have been trained on.
You just had to find the right connection.
And LLMs, I know, compute,
you throw enough compute,
they will find the right connection.
So, Knuth was able to follow the LLMs' attempts,
and eventually it needed him to put together what he saw into a solution.
It definitely helped him get to the solution,
but he had to create the new sort of manifold to come to the solution.
The LLMs were, after a while, stuck.
You read what he has written,
I mean, it just
hit the press,
I think, two days ago.
But eventually he used the solutions
and he came up with the proof.
Right.
So it's like
You know
It's like Einstein
saw all this
evidence,
then he thought,
what will explain it?
He came up with a causal model.
So Knuth, in his brain,
is sort of the one
that's doing the Kolmogorov part. It's the human.
Right.
And the LLMs are extremely efficient
at doing the Shannon part of it.
It found all the solutions
by trying, you know, various things.
And learning more and more.
Clever way to decompose it.
I'm wondering, like, do you think this, again,
I'm going to ask the same question again,
which is, do you think this provides some sort of insight
on like the next problem to tackle?
Like, is there a mechanism
that will get the Kolmogorov complexity
or not?
Like, is this?
It tells us which direction to pursue.
But clearly not how to do it.
Not how to do.
But even Kolmogorov complexity has largely remained sort of a theoretical construct.
Yeah, for sure.
There's no algorithm.
There haven't been practical implementations of finding the shortest program.
We know it exists.
You know, you can argue about it.
But so, that's where, I think, and it's my bias,
that's where our energy should be focused,
not on larger models with more tokens.
Can you tie the two things?
Like, how does that pair with doing simulation,
or is that a simulation totally orthogonal?
No, simulation is related, right?
So you think, like, basically you do simulation,
and somehow that is a step towards doing the Kolmogorov complexity?
The simulator is the program that we create.
It may not be the perfect program.
Oh, I see.
But in our heads, we create this simulator that when I'm throwing the pen, you know that it's coming at you, right?
And you duck.
So you're not computing the probabilities as it goes.
But you have, you know, you build a model.
That's a very physical thing versus we were talking more conceptually.
Conceptually, but it's a...
And you think those are the same mechanism?
It's the same mechanism.
Really?
Yeah.
You have to build a causal model.
Yeah.
Right?
I see.
For most things, right?
So you have to move from correlation to causation.
I mean, we've heard this term.
Yeah.
you know, ad infinitum.
But here it's making a difference in the way we view intelligence.
How have the last three papers been received?
Oh, I don't know.
Well, the arXiv versions...
Let me tell you.
I mean, a lot of great reception, a lot of people read it.
I'm just wondering, like, what kind of feedback you've gotten.
I'm getting good feedback, but I'm an outsider in this field, right?
Right.
This networking guy.
I'm a networking guy.
Why is he writing about, you know, learning and machine learning and deep learning and vision?
But people who have actually taken the time to read those papers, I'm getting really good feedback.
There was a recent paper by Google Research, which tried to teach LLMs, by some sort of RLHF, to do Bayesian learning properly.
And that's going in this direction.
I think people are coming around to the view that, okay, LLMs are
doing Bayesian learning.
I know that some people also looked at the Bayesian wind tunnel paper, the arXiv version,
and they reproduced the experiments.
That's great.
They just saw what was written, and they did the training, and they saw, yeah, yeah,
this is actually happening.
So what's next?
What's next is, you know, these two parallel tracks, I hope to make progress there,
plasticity and causal.
Because to date, you've taken an existing mechanism
and you've created a formal model of how it works.
Yeah.
And so now you're actually interested in improving,
creating a new mechanism.
Yeah, yeah.
And do you think it's an entirely different architecture,
or do you think LLMs are, like, part of the solution?
I think LLMs are definitely part of the solution.
I see.
But there has to be something more.
Another mechanism.
So I was not interested in sort of cataloging
what all these LLMs can do.
Yeah.
I was more interested in why and how they are doing it.
I think now we have a good grip on the why and how.
And the next step is to, you know, move them to the next level.
Now I think we have a fairly good understanding of what the limits are.
Now, how do you go to the next step?
Is there an equivalent kind of theoretical framework for causality that applies here?
similar to like Bayesian for inference?
Well, Judea Pearl's whole causal hierarchy, I think that's the right one.
That's a very good one.
You know, the whole do calculus approach.
I think it's a good way to think about it.
You know, the sort of association, intervention, counterfactuals.
Yeah.
It takes you from correlation to actually simulation.
In a mathematical way.
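The association-versus-intervention rungs of that hierarchy can be demonstrated with a toy simulation. This is a hypothetical illustration, not from the episode: a confounder Z drives both X and Y, so X and Y are strongly correlated under pure observation, but intervening on X (Pearl's do-operator) leaves Y untouched.

```python
# Toy causal model: Z -> X and Z -> Y, with no edge from X to Y.
import random

random.seed(0)

def observe(n=10000):
    # Rung 1, association: just watch the system run.
    data = []
    for _ in range(n):
        z = random.random()
        x = z + 0.1 * random.random()  # X is caused by Z
        y = z + 0.1 * random.random()  # Y is caused by Z, not by X
        data.append((x, y))
    return data

def intervene(x_value, n=10000):
    # Rung 2, intervention: do(X = x_value) severs the Z -> X edge.
    return [(x_value, random.random() + 0.1 * random.random()) for _ in range(n)]

def corr(pairs):
    n = len(pairs)
    mx = sum(x for x, _ in pairs) / n
    my = sum(y for _, y in pairs) / n
    cov = sum((x - mx) * (y - my) for x, y in pairs) / n
    vx = sum((x - mx) ** 2 for x, _ in pairs) / n
    vy = sum((y - my) ** 2 for _, y in pairs) / n
    return cov / (vx * vy) ** 0.5

obs_corr = corr(observe())                             # near 1: X "predicts" Y
y_do_low = sum(y for _, y in intervene(0.0)) / 10000
y_do_high = sum(y for _, y in intervene(1.0)) / 10000  # ~same mean as y_do_low
# X and Y look tightly linked observationally, yet forcing X up or down
# does nothing to Y: correlation without causation.
```

A model trained only on the observational data would happily use X to predict Y; only a causal model knows that setting X changes nothing.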
That's great.
All right.
Well, listen, really appreciate you coming in. This
is awesome. So we had you here for the first paper, where you had the empirical results.
Then we had you back when you actually had, like, the formal proof. And hopefully the next time
you come back, you will have a proposal for the mechanism that actually provides the next step.
Hopefully. Yeah. All right. We're working on it. We're coming in. Thank you for having me.
Thanks for listening to this episode of the a16z podcast. If you like this episode, be sure to like,
comment, subscribe, leave us a rating or review, and share it with your friends and family.
For more episodes, go to YouTube, Apple Podcasts, and Spotify.
Follow us on X at A16Z and subscribe to our substack at A16Z.com.
Thanks again for listening, and I'll see you in the next episode.
As a reminder, the content here is for informational purposes only.
It should not be taken as legal, business, tax, or investment advice,
or be used to evaluate any investment or security and is not directed at any
investors or potential investors in any A16Z fund.
Please note that A16Z and its affiliates may also maintain investments in the companies discussed in this podcast.
For more details, including a link to our investments, please see A16Z.com forward slash disclosures.
