The a16z Show - Columbia CS Professor: Why LLMs Can’t Discover New Science
Episode Date: October 13, 2025

From GPT-1 to GPT-5, LLMs have made tremendous progress in modeling human language. But can they go beyond that to make new discoveries and move the needle on scientific progress?

We sat down with distinguished Columbia CS professor Vishal Misra to discuss this, plus why chain-of-thought reasoning works so well, what real AGI would look like, and what actually causes hallucinations.

Resources:
Follow Dr. Misra on X: https://x.com/vishalmisra
Follow Martin on X: https://x.com/martin_casado

Stay Updated:
If you enjoyed this episode, be sure to like, subscribe, and share with your friends!
Find a16z on X: https://x.com/a16z
Find a16z on LinkedIn: https://www.linkedin.com/company/a16z
Listen to the a16z Podcast on Spotify: https://open.spotify.com/show/5bC65RDvs3oxnLyqqvkUYX
Listen to the a16z Podcast on Apple Podcasts: https://podcasts.apple.com/us/podcast/a16z-podcast/id842818711
Follow our host: https://x.com/eriktorenberg

Please note that the content here is for informational purposes only; should NOT be taken as legal, business, tax, or investment advice or be used to evaluate any investment or security; and is not directed at any investors or potential investors in any a16z fund. a16z and its affiliates may maintain investments in the companies discussed. For more details please see a16z.com/disclosures.
Transcript
Any LLM that was trained on pre-1915 physics would never have come up with a theory of relativity.
Einstein had to sort of reject the Newtonian physics and come up with this space-time continuum.
He completely rewrote the rules.
AGI will be when we're able to create new science, new results, new math.
When an AGI comes up with a theory of relativity, it has to go beyond what it has been trained on.
To come up with new paradigms, new science.
That's by definition AGI.
Yeah.
Vishal Misra was trying to fix a broken cricket stats page
and accidentally helped spark one of AI's biggest breakthroughs.
On this episode of the a16z podcast,
I talk with Vishal and a16z's Martin Casado
about how that moment led to retrieval-augmented generation
and how Vishal's formal models explain what large language models can and can't do.
We discuss why LLMs might be hitting their limits,
what real reasoning looks like,
and what it would take to go beyond them.
Let's get into it.
Martin, I know you wanted to have Vishal on. What do you find so remarkable about him and his contributions that inspired this?
Vishal and I actually have very similar backgrounds. We both come from networking. He's a much more accomplished
networking guy than I am. That's a high bar, given your field. And so we actually view the world
in an information-theoretic way. It is actually part of networking. And with all this AI stuff,
there's so much work trying to create models that can help us understand how these LLMs work.
And in my experience over the last three years, the ones that have most impacted
my understanding, and I think have been the most predictive,
are the ones that Vishal has come up with.
He did a previous one that we're going to talk about,
called Matrix, is it?
Beyond the black box, but yeah.
Beyond the black box.
Actually, we should put this in the notes for this,
but the single best talk I've ever seen
on trying to understand how LLMs work
is one that Vishal did at MIT,
which Hari Balakrishnan pointed me to, and I watched that.
So he did that work, and then he's doing more recent work
that's actually trying to scope out
not only how LLMs reason,
but it has some reflections on how humans reason too.
And so I just think he's doing some of the more profound work
in trying to understand and come up with models,
formal models for how LLMs reason.
On that note, you said his most recent work
helped change how you think about how humans think.
Can you flesh that out a little bit?
How did it sort of?
Well, okay, so can I just try to take a rough sketch at it
and then you just tell me how wrong I am?
Go right ahead.
You're trying to describe how LLMs work.
And one thing that you've found is that they reduce
a very, very complex
multidimensional space
into basically a geometric manifold
that's a reduced state space.
So it's reduced degrees of freedom,
but you can actually predict
where in the manifolds
the reasoning can move to, roughly.
So you've reduced the dimensionality
of the problem to a geometric manifold,
and then you can actually formally specify
kind of how far you can reason within that manifold.
And the articulation is that we,
or one of the intuitions is that we as humans do the same thing,
is we take this very complex, heavy-tailed stochastic universe
and we reduce it to kind of this geometric manifold,
and then when we reason, we just move along that manifold.
Yeah, I think you captured it accurately.
That's kind of the spirit of the work.
Wait, wait, can I just hear it in your words?
Because I'm a VC, so.
You know, VC with an H index of what, 60?
True.
Yeah, so ultimately what all these LLMs are doing,
whether the early LLMs or the LLMs that we have today
with all sorts of post-training, RLHF,
whatever you do, at the end of the day,
what they do is they create a distribution for the next token.
So given a prompt, these LLMs create a distribution for the next token,
or the next word,
and then they pick something from that distribution
using some kind of algorithm
to predict the next token, pick it,
and then keep going.
Now, what happens because of the way
we train these LLMs,
the architecture of the transformers,
and the loss function,
the way you put it is right,
it sort of reduces the world
into these Bayesian manifolds.
Yeah.
And as long as the LLM is going in,
sort of traversing through these manifolds,
it is confident
and it can produce something
which makes sense.
The moment it sort of veers away
from the manifold,
then it starts hallucinating and
spouting nonsense.
Confident nonsense, but nonsense.
So it creates these manifolds
and the trick is the distribution
that is produced.
You can measure the entropy
of the distribution.
Right?
Entropy in the Shannon sense.
Shannon entropy.
It's Shannon entropy, not thermodynamic entropy.
So suppose you have a vocabulary of, let's say, 50,000 different tokens,
and you have a next-token distribution over these 50,000 tokens.
So let's say "the cat sat on the," right?
If that is the prompt, then the distribution will have a high probability for "mat" or "hat" or "table"
and a very low probability for, let's say, "ship" or "whale"
or something like that, right?
So because of the way it's trained,
it has these distributions.
Now, these distributions can be low entropy or high entropy.
A high entropy distribution means that there are many different ways
that the LLM can go with high enough probability for all those paths.
Low entropy means that there are only a small set of choices for the next token.
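As a rough illustration of the entropy idea described here, the sketch below computes Shannon entropy (in bits) for two toy next-token distributions. The probabilities are made up for illustration and do not come from any real model; `shannon_entropy` is a hypothetical helper name.

```python
import math

def shannon_entropy(dist):
    """Shannon entropy, in bits, of a distribution given as {token: probability}."""
    return -sum(p * math.log2(p) for p in dist.values() if p > 0)

# Illustrative numbers only (assumed, not measured from a model).
# A peaked distribution: the model is fairly sure what comes next.
low_entropy = {"mat": 0.90, "hat": 0.05, "table": 0.04, "ship": 0.01}
# A flat distribution: many continuations are about equally plausible.
high_entropy = {"tonight": 0.25, "with": 0.25, "at": 0.25, "to": 0.25}

print(f"peaked distribution: {shannon_entropy(low_entropy):.2f} bits")
print(f"flat distribution:   {shannon_entropy(high_entropy):.2f} bits")
```

A uniform distribution over four tokens carries exactly two bits, and any more peaked distribution over the same four tokens carries less.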
And the prompts also, I think, you
can categorize into two kinds. One kind of prompt has, you could say, high information entropy.
Yeah. And one kind has low information entropy. So the way these manifolds work, the LLMs start paying attention
to prompts that have high information entropy and low prediction entropy. So what do I mean by that?
So when I say I'm going out for dinner.
Yeah.
Right.
So when I say I'm going out for dinner, that phrase, the LLMs have been trained.
They've seen it a lot.
And there are many different directions I can go with it.
I can say, I'm going for dinner tonight.
I'm going for dinner to McDonald's or I'm going to dinner, blah, blah, blah.
There are many different.
But when I say I'm going to dinner with Martin Casado, you know, for the LLM, now this is
information rich.
This is sort of a rare phrase.
And now the sort of realm of possibilities
reduces, because Martin is only going to take me
to Michelin-star restaurants.
I'm not going to go to McDonald's.
You get what I'm saying.
The moment you add more context,
you make the prompt information rich,
the prediction entropy reduces.
Yep, yep, yep.
And another example that I often
cite.
But just quickly,
what is your takeaway?
What is the implication of that?
So, sorry, I forgot how you described it, but the more precise you are, the more tokens you have, I presume, the fewer options you have for the next token.
Is that correct or not correct?
Yeah, yeah, essentially.
So you're reducing it.
You're reducing it to a very specific state space when it comes to
confidence in an answer.
And this is kind of a manifold that you can go on.
And then, I mean, do you have kind of a conclusion of what that means for systems or what
that means for reasoning?
Or is it just a nice way to articulate the bounds of LLMs?
No, there is something, I don't know if I should say profound, but there is something
about it which tells what these LLMs can or cannot do.
Right. So one of the examples that I often give is, suppose I ask you, what is 769 times
1025? You have no idea. You can have some vague idea, given the two numbers, right? And so in your
mind, the next-token distribution of the answer is going to be diffuse, right? You don't know.
You have maybe a vague guess. If you are mathematically very good, maybe your guess is more
precise, but it is going to be diffuse, and it's not going to be the correct answer. But if you say, can I
write it down and do it the way we learned with multiplication tables, now you know exactly what to do
at the next step, right? You write 769 and then 1025, and then you know exactly what to do. So at each stage of
that process, your prediction entropy is very low. You know exactly what to do because you have been
taught this algorithm.
And by invoking this algorithm, saying,
okay, I'm not going to just guess the answer,
but I'm going to do it step by step,
then your prediction entropy reduces.
And you can arrive at an answer
which you're confident of and which is correct.
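To make the step-by-step point concrete, here is a minimal sketch of schoolbook multiplication for the 769 times 1025 example. Each step is a small, fully determined operation (one digit times a number, shifted by its place value), which is the sense in which every step is "low entropy." The function name `long_multiply` is a hypothetical choice for this illustration.

```python
def long_multiply(a, b):
    """Multiply a by b the schoolbook way.

    Returns the list of partial products (one per digit of b, least
    significant digit first) and their sum.
    """
    partials = []
    for position, digit_char in enumerate(reversed(str(b))):
        digit = int(digit_char)
        # One digit of b times a, shifted to that digit's place value.
        partials.append(a * digit * 10 ** position)
    return partials, sum(partials)

partials, total = long_multiply(769, 1025)
for step, value in enumerate(partials, start=1):
    print(f"step {step}: {value}")
print("answer:", total)  # 788225
```

Guessing the final answer in one shot is hard, while each partial product is easy, and adding them up gives 788225.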
And the LLMs are pretty much the same way.
That's why chain of thought works.
What happens with chain of thought
is you ask the LLM to do something,
and with chain of thought
it starts breaking the problem into small steps.
These steps, it has seen in the past.
It has been trained on.
Maybe with some different numbers,
but the concept it has been trained on.
And once it breaks it down, then it's confident.
Okay, now I need to do A, B, C, D,
and then I arrive at this answer.
Whatever the answer is.
Let's zoom back out.
I want to get into LLMs,
but first, Vishal,
maybe you can give more context on your background
and how that informs your work here.
Okay. So yeah, as Martin said, my background is very similar to his. We, you know, we come from doing networking. So my PhD thesis, my sort of early work at Columbia has all been in networking. But there's another side of me, another hat that I wear, which is both an entrepreneur and a cricket fan.
I was going to say, don't you own a cricket team or something?
I'm a minority owner of your local cricket team, the San Francisco Unicorns.
That's right.
I'm very proud to have you.
But, so in the 90s, I was one of the people who started this portal called Cricinfo.
And Cricinfo, at one point, was the most popular website in the world.
It had more hits than Yahoo.
That was before India came online.
And so, you know, cricket is a very stats-rich sport.
Think baseball multiplied by 1,000.
And we had built this free searchable stats database on cricket called Statsguru.
And this has been available on Cricinfo since 2000.
But because you could search for anything, everything was made available on Statsguru.
And, you know, you can't expect people to write SQL queries to query everything. So how
would we do it? Well, it was a web form, you know, where you could formulate your query using that form,
and in the back end that was translated into a SQL query, got the results, and sent them back.
But as a result, because you could do everything and everything was made available, the web form
had like 25 different checkboxes, 15 text fields, 18 different drop-downs. The interface was a mess.
It was very daunting.
So, and ESPN acquired Cricinfo
in the mid-2000s, I think,
but they still kept the same interface.
And that has always sort of nagged me.
And so I still know the people...
Wait, wait, what nagged you?
Is it that Cricinfo did not have informal language
and had a web form for doing queries?
That web form was terrible.
Because of that, only the real nerds.
Of all the things of the world to bother you,
the fact that an old website had a web form.
I appreciate your commitment to aesthetic.
So I'm still friendly with the people who run ESPNcricinfo.
The editor-in-chief, whenever he comes to New York,
you know, we meet up, we go out for a drink.
And so he was here in 2020.
So now the story shifts to how LLMs and I sort of met.
So January 2020, right before the pandemic, he was here,
and I again said, why don't you do something about Statsguru?
And he looks at me and says, why don't you do something about Statsguru?
He was kind of joking, but he thought maybe, you know,
I had some ways to fix the interface.
So anyway, then the pandemic hit, the world stopped.
But in July of 2020, the first version of GPT-3 was released.
And I saw someone
use GPT-3
to write a SQL query
for their own database
using natural language.
And I thought,
can I use this to fix Statsguru?
So I got early access to
GPT-3. You know, getting access
in those days was difficult, but somehow I got it.
But soon I realized that, you know, no, I cannot really do it.
Because for Statsguru, the backend databases
were so complex, and if you remember, GPT-3 had only a
2048-token context window.
There was no way in hell I could fit the complexities
of that database in that context window.
And GPT-3 also did not do instruction following at that time.
But then in trying to solve this problem,
I accidentally invented what's now called RAG.
Where, based on the natural language query,
I created a database of
natural language queries and the corresponding structured queries.
I created a DSL, which then translated into a REST call to Statsguru.
So based on the new query, I would look through my set of natural language queries.
I had about 1,500 examples, and I would pick the six or seven most relevant ones.
And then those and their structured queries I would send as a prefix, along with the new query, and
GPT-3 just magically completed it,
and the accuracy was very high.
So that has been running in production
since September 2021,
you know, about 15 months before
ChatGPT came
and, you know, the whole
revolution in some sense started
and RAG became very popular.
I didn't call it RAG,
but this is something I sort of
accidentally did in trying to solve that problem
for Cricinfo.
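The retrieve-then-prefix approach described here can be sketched roughly as follows. This is only an illustration under stated assumptions: the transcript does not say how relevance was measured, so simple word overlap (Jaccard similarity) stands in for it; the three example pairs and the `key=value` DSL syntax are invented; and `build_prompt` is a hypothetical helper. The real system is described as drawing on about 1,500 examples and picking six or seven.

```python
def tokenize(text):
    """Lowercased set of whitespace-separated words."""
    return set(text.lower().split())

def similarity(a, b):
    """Jaccard word overlap; an assumed stand-in for the real relevance measure."""
    ta, tb = tokenize(a), tokenize(b)
    union = ta | tb
    return len(ta & tb) / len(union) if union else 0.0

def build_prompt(new_query, examples, k=2):
    """Put the k most similar (question, DSL) pairs before the new query."""
    ranked = sorted(examples, key=lambda ex: similarity(new_query, ex[0]), reverse=True)
    blocks = [f"Q: {question}\nDSL: {structured}" for question, structured in ranked[:k]]
    # The prompt ends with an open "DSL:" slot for the model to complete.
    blocks.append(f"Q: {new_query}\nDSL:")
    return "\n\n".join(blocks)

# Hypothetical example store; the DSL shown here is invented for illustration.
EXAMPLES = [
    ("most runs by a batsman in test matches", "stat=runs; format=test; order=desc"),
    ("most wickets by a bowler in one day matches", "stat=wickets; format=odi; order=desc"),
    ("highest score at a ground in england", "stat=high_score; country=england"),
]

prompt = build_prompt("most runs by a batsman in one day matches", EXAMPLES, k=2)
print(prompt)
```

The resulting string would be sent to the model as an ordinary prompt; the model's completion after the final `DSL:` is the structured query.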
Now, once I,
once I built it,
you know, I was
thrilled that this worked, but I had no idea why it worked.
You know, I stared at that transformer architecture diagram.
I read those papers, but I couldn't understand how or why it worked.
So then I started on this journey of developing a mathematical model, trying to understand
how it worked.
So that's been sort of my journey through this world of AI and LLMs, because I was trying
to solve this cricket problem.
Yeah. Amazing.
And so maybe reflecting back since the release of GPT-3,
what has most surprised you about how LLMs have developed?
So what has most surprised me is the pace of development.
So GPT-3 was, you know, it was a nice parlor trick
and you had to jump through hoops to get it to do something useful.
But starting with ChatGPT, which was an advance over GPT-3,
and then you had all these things
like chain of thought, instruction following.
GPT-4 really made it polished
and, you know,
the pace of development has really surprised me.
Now, you know, when I started working with GPD3,
I could sort of see what its limitations were,
what I could make it do, what I couldn't make it do.
But I never thought of it as, you know,
what these LLMs have become for me now
and what they have become for millions of people around the world.
We treat these models as our co-workers,
almost like an intern,
that you're constantly chatting with them,
brainstorming, making them do all sorts of work,
which we couldn't imagine, you know,
just when ChatGPT was released,
it was nice, it could write poems,
it could write limericks,
it could answer some hallucinated questions,
but the capabilities that have emerged now,
that pace has been very sort of,
surprising to me.
Do you see progress plateauing?
Or how do you, either now or in the near future,
how do you see it going?
Yes.
In some sense, progress is plateauing.
It's like the iPhone.
You know, when the iPhone came out,
wow, what is this thing?
And the early iterations, you know,
constantly we were amazed by new capabilities.
But the last, you know, seven, eight, nine years,
it's maybe the camera
got a little bit better
or, you know,
one thing changed here
or there is more memory,
but there has been
no fundamental advance
in what it's capable of.
You can sort of
see a similar thing
happening with these
LLMs.
And this is not true
for just one
company or one model,
right?
You look at what
OpenAI is coming up with,
or what
Anthropic,
Google,
or all these open-source
Chinese
models or Mistral, the capabilities of LLMs have not fundamentally changed. They've become better,
right? They've improved. But they have not crossed into a different realm. So this is something
that I really appreciate about your work. And so the thing that really struck me is as soon as
these things showed up, you actually got busy trying to have a formal model of what they're
capable of, which was in stark contrast to what everybody else was doing. Everybody else was like,
AGI, these things are going to, you know, recursively self-improve, or they'll say,
oh, these are just stochastic parrots, which doesn't mean anything. So everybody had rhetoric,
and sometimes this rhetoric was fanciful. And sometimes this rhetoric was almost reductionist,
like, oh, it's just a database, which is clearly not true. And the thing that really struck me
about your work is you're like, no, let's figure out exactly what's going on. Let's come up
with a formal model, and once we have a formal model, we can reason about what that means.
And then, you know, in my reading of your work, I kind of break it up in two pieces.
There's the first one where you basically, you came up with this, you know, matrix abstraction.
I think it's worth you talking through.
And then you took in-context learning as an example, and you mapped it to Bayesian reasoning,
which to me was incredibly powerful because at the time, nobody knew why in-context learning worked.
So I think it would be great for you to discuss that
because again, I think it was the first real kind of formal effort
on, like, how are these things working.
And then the more recent work that you're working on now
is a kind of more generalized version of what is the state space
that these models output when it comes to confidence,
which is the manifold that we were talking about before.
So I think it would be great if you just described
your matrix model and then how you used that to provide some bounds on what in-context learning
is doing.
What's happening?
Okay.
So, so yeah, let's start with that matrix abstraction.
So the idea behind the matrix is, you have this gigantic matrix where every row corresponds
to a prompt.
And then the number of columns of this matrix is the vocabulary of the LLM.
the number of tokens it has that it can emit.
So for every prompt,
this matrix contains the distribution over this vocabulary.
So when you say the cat sat on the,
you know, the column that corresponds to mat
will have a high probability.
Most of them will be zero.
But, you know, reasonable continuations
will have a non-zero probability.
And so you can imagine that there's this gigantic matrix.
Now, the size of this matrix is, you know, if we take just the old first-generation GPT-3 model,
which had a context window of 2,000 tokens and a vocabulary of 50,000 tokens,
then the number of rows in this matrix is more than the number of atoms across all the galaxies
that we know of.
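A quick back-of-the-envelope check of that claim, under stated assumptions: the row count is approximated by the number of distinct full-length prompts (vocabulary size raised to the context length), and the atom count uses the commonly quoted rough estimate of about 10^80 atoms in the observable universe, a figure that is not from the transcript. Working in base-10 logarithms avoids printing a number with thousands of digits.

```python
import math

VOCAB_SIZE = 50_000      # tokens in the vocabulary, as stated in the conversation
CONTEXT_LENGTH = 2_048   # GPT-3 context window mentioned earlier in the conversation
ATOMS_LOG10 = 80         # assumption: ~10^80 atoms, a commonly quoted rough estimate

# log10 of VOCAB_SIZE ** CONTEXT_LENGTH, i.e. the count of full-length prompts.
rows_log10 = CONTEXT_LENGTH * math.log10(VOCAB_SIZE)

print(f"possible full-length prompts: about 10^{rows_log10:.0f}")
print(f"larger than the atom estimate by a factor of about 10^{rows_log10 - ATOMS_LOG10:.0f}")
```

Under these assumptions the exponent comes out in the thousands, so the comparison with the number of atoms is not close.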
So clearly we cannot represent it exactly.
Now, fortunately, a lot of these rows do not appear in real life.
Right?
An arbitrary collection of tokens, you are not going to use that as a prompt.
So a lot of these rows are absent, and similarly a lot of the column values are also zero.
Right?
When you say "the cat sat on the," it's unlikely to be
followed by the token corresponding to, let's say, numbers,
or, you know, an arbitrary collection of tokens.
There will only be a very small subset of tokens that can follow a particular prompt.
So this matrix is very, very sparse.
But even after that sparsity and even after removing the sort of gibberish prompts,
the size of this matrix is too much for these models to represent,
even with a trillion parameters.
So, in an abstract sense, what is happening is the models get trained on certain, you know, data from the training set.
And for a small subset of these rows, you have reasonable values for the next-token distribution.
Whenever you give it a new prompt, it will try to interpolate between what
it has learned and what's there in the new prompt, and come up with a new distribution.
But it's basically, so it's more than a stochastic parrot.
It is sort of Bayesian on this subset of the matrix that it has been trained on.
So when I say, you know, I'm going out for dinner with Martin tonight.
Now, I'm reasonably sure that it has never encountered that phrase in its training data, right?
But it has encountered variants of this phrase.
And given that I'm going out with Martin, it can produce a Bayesian posterior.
It uses that evidence that Martin is the one that I'm going for dinner with, and it will produce a next token distribution that will focus on the likely places that we are going.
So this matrix is represented in a compressed way, and yet the models respond to everything, every prompt.
How do they do it?
Well, they go back to what they've been trained on, interpolate there, and use the prompt as sort of some evidence to compute a new distribution.
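The "prompt as evidence" idea can be illustrated with a toy Bayesian update. This is not the formal model from the papers, only a sketch of Bayes' rule on three made-up dinner categories; the prior and likelihood numbers are invented, and `posterior` and `entropy_bits` are hypothetical helper names.

```python
import math

def posterior(prior, likelihood):
    """Bayes' rule over a discrete set of hypotheses."""
    unnormalized = {h: prior[h] * likelihood[h] for h in prior}
    total = sum(unnormalized.values())
    return {h: value / total for h, value in unnormalized.items()}

def entropy_bits(dist):
    """Shannon entropy in bits."""
    return -sum(p * math.log2(p) for p in dist.values() if p > 0)

# Made-up prior for "I'm going out for dinner ..." with no further context.
prior = {"fast food": 0.4, "casual": 0.4, "fine dining": 0.2}
# Made-up likelihood of seeing the evidence "with Martin" under each hypothesis.
likelihood = {"fast food": 0.05, "casual": 0.25, "fine dining": 0.70}

post = posterior(prior, likelihood)
for place, p in post.items():
    print(f"{place}: {p:.2f}")
print(f"entropy before: {entropy_bits(prior):.2f} bits, after: {entropy_bits(post):.2f} bits")
```

With these numbers the extra context shifts the mass toward fine dining and lowers the entropy of the distribution, which mirrors the earlier point that an information-rich prompt reduces prediction entropy.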
Right, so right.
So the context of the prompt impacts the posterior distribution.
Exactly, yeah.
Right.
And you mapped it to Bayesian learning, where the context is the new evidence.
New evidence, exactly.
So I'll give you, so for instance, the cricket example that I spoke about earlier.
So I created my own DSL, which, you know, mapped a natural language query in cricket to this DSL,
which then I can translate into a SQL query or a Rest API, whatever.
But getting the DSL is important.
Now, these LLMs have never seen that DSL.
I designed it.
Yeah.
Right?
But yet after showing a few examples, it learned it.
How did it learn it?
And this is in the prompt.
You did no training.
It's in the prompt, right?
So, like, the weights are the same.
Yeah, yeah.
Yeah, yeah.
This was happening in October 2020.
Right.
I had no access to the internals of OpenAI.
I could just access their API.
OpenAI had no access to the internal structure of Statsguru
or the DSL that I cooked up in my head.
Yet, after showing it only a few examples, it learned it right away.
So that's an example where it has seen DSLs or structures in the past.
And now, using this evidence that I show, okay, this is what my DSL looks like,
given a new natural language query,
it is able to create the right posterior distribution
for the tokens that map to the examples it has seen.
Now, the other beautiful thing about this is,
this is an example of few-shot learning or in-context learning, right?
But when I give that prompt,
along with these examples, to this LLM,
I'm not saying to the LLM,
okay, this is an example of few-shot learning,
so learn from these examples.
You just pass this to the LLM as a prompt
and it processes it exactly the way it would process any other prompt
which is not an example of in-context learning.
So that really means that the underlying mechanism is the same.
Whether you give a set of examples
and then ask it to complete
a task, as in in-context learning,
or just give it some prompt for continuation,
like "I'm going out for dinner with Martin tonight."
There's no in-context learning there.
But the process with which it's generating
or doing this inferencing is exactly the same.
And that's what I have been trying to model
and come up with a formal model of.
What I've found very impressive is
you've used this basic model to show a number of things, right,
to describe in-context learning
and to map it to Bayesian learning.
But you did it for another one,
where you sketched out this almost glib argument on Twitter, on X,
where you made
a rough argument for why recursive self-improvement can't happen without additional information.
And so maybe,
maybe just walk through very quickly how like this same model you can just very quickly show
that a model can never recursively self-improve.
So,
uh,
You know, another phrase that we've been using recently is, you know,
the output of the LLM is the inductive closure of what it has been trained on.
So when you say that it can recursively self-improve, it could mean one of two things.
So let's get back to the...
Well, actually, you know what's kind of interesting is like often the...
Most people agree that if you have one LLM and you just feed the output into the input,
like it's not going to do anything.
But then often people will say,
well, what if you have two LLMs,
you have no external information,
but you have two LLMs talking to each other.
Maybe they can improve each other
and then you can have like, you know,
a takeoff scenario.
But again, you even address this,
even in the case of like N number of LLMs
using kind of the matrix model to show that
like you just aren't gaining any information.
Yeah.
Yeah.
So you can represent the sort of information contained in these models.
and let's go back to that matrix analogy that we have,
the matrix abstraction.
So like I said, you know,
these models represent a subset of the rows.
So a subset of the rows are represented.
But some of these rows are able to help fill out some of the missing rows.
For instance, you know, if the model knows
how to do multiplication step by step,
then for every row corresponding to, let's say, 769 times 1025 or whatever,
all those multiplications.
It can fill out the answer.
Because it has those algorithms sort of embedded in them,
you just need to unroll them.
So it can sort of self-improve up to a point.
But beyond a point, these models can only sort of generate what they have been trained on.
So let me give you,
I'll give you three examples.
So any model, any LLM that was trained on pre-1915 physics
would never have come up with a theory of relativity.
Einstein had to sort of reject the Newtonian physics
and come up with this space-time continuum.
He completely rewrote the rules, right?
So that is an example of, you know, AGI,
where you are generating new knowledge.
It's not simply unrolling what's already known about the universe, right?
It's not computing something.
It's actually discovering something fundamental about the universe.
And for that, you have to go outside your training set.
Similarly, you know, any LLM that was trained on classical physics would not have come up with quantum mechanics.
Right?
That wave-particle duality, or this whole probabilistic notion, or that, you know, energy is not continuous but quantized:
you had to reject Newtonian physics.
Or Gödel's incompleteness theorem.
He had to go outside the axioms to say that,
okay, it is incomplete.
So those are examples where you're creating new science
or fundamentally new results.
That kind of self-improvement is not possible with these architectures.
They can refine these, they can fill out these rows
where the answer already exists.
Another example,
which has received a lot of press these days,
is these IMO results,
the International Math Olympiad.
You know, whether it's a human
solving it or the LLM solving it,
they are not inventing new kinds of math.
They are able to connect known results
in a sequence of steps
to come up with the answer.
So even the LLMs, what they are doing
is they are exploring all sorts of solutions.
In some of these solutions, they start going on this path
where their next token entropy is low.
So that's where I say they are in that Bayesian manifold.
Where you have this entropy collapse.
And by doing those steps, you arrive at the answer.
But you're not inventing new math.
You're not inventing new axioms or new branches of mathematics.
You're sort of using what you've been trained on
to arrive at that answer.
So those things LLMs can do,
they'll get better at it,
of connecting the known dots.
Yeah.
But creating new dots,
I think we need an architectural advance.
Yeah.
So Martin was talking earlier
about how the discourse, you know,
was either stochastic parrots
or, you know,
AGI, recursive self-improvement, or something.
How are you,
how do you conceive
of sort of the AGI discourse
or even,
the concept, what does it mean to the extent that it's useful?
How do you think about that?
The way I think about it, the way we have tried to formulate in our papers,
it's beyond a stochastic parrot, but it's not AGI.
It's doing Bayesian reasoning over what it has been trained on.
So it's a lot more sophisticated than just a stochastic parrot.
How do you define AGI?
Okay, so AGI.
So how do I define
AGI? So
the way I would say that
LLMs currently
navigate through this known
Bayesian manifold,
AGI will create new manifolds.
So right now, these models
navigate, they do not create.
AGI will be when we're able to create
new science, new results, new
math. When an AGI comes up with a theory of
relativity, I mean, it's an extremely high bar, but you get
what I'm saying. It has to go beyond
what it has been trained on
to come up with
new paradigms,
new science,
and that's by definition AGI.
Vishal, can you,
do you think that, based on the work you've done,
can you bound the amount of data
or compute
that would be needed
in order for it to evolve?
So one of the problems,
if you just take LLMs as they exist,
is there was so much
data used to create them, to create a new manifold will need a lot more data just because of the
basic mechanisms, right? Otherwise, it'll just kind of like, you know, get kind of consumed into the
existing set of data. Have you found any bounds of what would be needed to actually evolve
the manifold in a useful way? Or do you think we just need a new architecture? I personally think that
we need a new architecture. The more data that we have, the more compute we have,
we'll get maybe smoother manifolds.
So it's like a map.
Yeah, because, I mean, there's this view that people have.
They're like, well, Vishal, this is all, you know, well and good.
But, you know, I could just take an LLM and I can give it eyes and I can give it ears
and I can put it in the world and it'll gain information.
And based on that intervention, it'll improve itself.
And therefore, it can learn new things.
But the counterpoint that I've always just intuitively thought to that is
the amount of data used to train these things is so
large, how much can you actually evolve that manifold given an incremental, I mean, almost none at all,
right? There has to be some other way to generate new manifolds that aren't evolving the existing
one. I completely agree. There has to be a new sort of architectural leap that is needed
to go from the current, you know, just throwing more data and more compute, you know, it's going
to plateau. It's, you know, the iPhone 15, 16, 17. And are there any research
directions that are promising in your mind that might help us, you know, go beyond LLM limitations?
So, I mean, again, I love LLMs.
They are fantastic.
They are going to increase productivity like nobody's business.
But I don't think they are the answer.
So, you know, Yann LeCun famously says that LLMs are a distraction on the road to AGI.
They're a dead end.
They're a dead end to AGI.
I don't think, I'm not quite in that camp.
But I think we need a new architecture to sit on top of LLMs to reach AGI.
You know, a very basic thing.
You know what Martin just said, you give them eyes and you give them ears.
You make them multimodal.
Of course, they'll become more powerful.
But you need a little bit more than that.
You know, the way human brains learn with very few examples,
that's not the way transformers learn.
And, you know, I'm not saying that we need to create an Einstein
or a gator, but there has to be an architectural leap that is able to create these manifolds.
And just throwing new data will not do it.
It'll just smoothen out the already existing manifolds.
Is that something?
So is your goal to actually help, like, think through new architectures,
or are you primarily focused on putting formal bounds on existing architectures?
A bit of both.
I mean, the former goal is the more ambitious one that everybody is chasing,
and yeah, I think about that constantly.
Are there any new even like sort of hints
at a new architecture, or like have we started to make any progress
on new architectures or is it?
You know, Yann has been pushing this JEPA architecture,
energy-based architectures.
They seem promising.
The way I have been sort of thinking about it
is, you know, there's this sort of
benchmark, the ARC Prize.
Yeah.
Right?
Mike Knoop and François Chollet have.
And if you understand why the LLMs are failing on this test,
maybe you can sort of reverse engineer a new architecture
that will help you succeed in that, right?
And I agree with a lot of,
what several people say that, you know, language is great,
but language is not the answer.
You know, when I'm looking at catching a ball that is coming to me,
I'm mentally doing that simulation in my head.
I'm not translating it to language to figure out where it will land.
I do that simulation in my head.
So, you know, one of the new architectural things is,
how do we get these models to do approximate simulations
to test out an idea and decide whether to proceed or not?
So, so, yeah, we have, you know,
another thing that I've always wondered about is,
did we develop as humans,
did we develop language because we were intelligent,
or because we developed language,
we accelerated our intelligence?
So I don't know which camp you fall into on that question.
I mean, what's interesting is, like,
you have these anecdotal examples of humans developing languages de novo that have been recorded, right?
Like it's either the Guatemalan or Nicaraguan sign language, right?
Where there are these students that developed their own language without being taught.
And so that would suggest that language follows intelligence.
The problem is they're all anecdotal, right?
Like who knows if somebody didn't teach them sign language?
Like nobody really knows.
There are no controls.
So these are all observational studies.
And there's so few of them, you have to wonder if it's just kind of sloppy observation.
And so I think that the question is still outstanding.
Yeah.
So, I mean, language definitely accelerated our intelligence.
There's no question about that.
But which followed which we don't know.
I view it as a networking problem naturally, which is once you have languages, you can communicate.
And when you can communicate, you can store, you can replicate, yeah.
Yeah, exactly.
Exactly right. Cool.
Again, this is kind of a wonky question, but, you know, I think one thing that you've brought to the discourse,
and for those that are listening to this, I really think that you should look up Vishal's work and read it.
I just think it'll give you a really, really, especially if you have a systems background,
like a networking systems background, it'll give you a really, really good understanding of kind of the bounds on these.
But, like, the toolkit that you draw from is, like, information theory and, like, more formal.
Have you found that the AI community is receptive to this?
Or is it like two different cultures, two different planets
trying to communicate and not a lot of common ground?
How have you found bringing the networking view of the world
to the AI realm?
Some of them are receptive to it, definitely.
But, you know, these large conferences
and their reviewing process, it's so random.
And the kind of questions they ask,
you know, I'm a modeling person.
I like to model things.
And, you know, I submitted one version of this work
to one very famous machine learning or AI conference.
And the reviewer said, okay, this is a model, so what?
So there is...
That's absolutely remarkable.
So, like, you're actually taking a system that nobody understands, we have no models for.
You actually provided some model that we can use to analyze it.
And that alone wasn't sufficient.
They're asking, so where are the large-scale experiments to prove this?
I do, listen, I honestly, I mean, I find there's so much empiricism in, like, the current, you know, AI community.
Exactly because we don't understand the systems.
You know, it kind of reminds me.
I feel like systems went the other way, right?
It's like we had all of these models,
but then we didn't understand how the systems worked,
and then we just actually did measurement.
It feels like the AI stuff is the opposite,
which is like, we know we don't understand them,
and so we just measure them,
but now we're trying to come up with the models.
Yeah, exactly.
So it was so easy in some sense to build these artifacts
and then just measure them
that people have been going around trying to do that.
And one term I really dislike
is prompt engineering.
Why?
Engineering used to mean sending a man to the moon or providing five-nines reliability.
Prompt engineering is prompt twiddling.
You fiddle with a prompt and the model's inference, the output changes.
And you have hundreds of papers just doing one thing or the other, changing a prompt this way, that way, and writing their observations.
And as a result, you know, lots of these papers are being written, are being submitted for review.
Reviewers get busy looking at all this kind of empirical work.
And my personal taste is to first try to understand and model it.
Yeah.
And then you can do the other things.
So like a true theory guy.
I don't know about this bit twiddling.
Let me ask one more LLM question, which is, are there any
benchmarks or real world tasks
that if they occurred,
you'd sort of reevaluate and say,
hey, maybe LLMs are, you know,
closer to the path to AGI than I thought.
If there were any real world
tasks. Good question.
You know, for LLMs or these models,
the one domain where you have the most training data is probably coding,
and coding is where you can also have the most structure.
And yet anyone who's used these tools, whether it's Cursor or Claude Code or whatever,
LLMs continue to hallucinate, continue to generate unreasonable code.
You know, you have to constantly babysit these models.
So the day an LLM can create a large software project without any babysitting
is the day I'll be a little bit convinced that it's closer to AGI.
But again, I don't think it'll be able to create new science.
If it does, that's when I'll be convinced.
I think that you can almost
take a definitional approach to answer this question
Vishal like the problem with these types of questions
is if you have
billions of dollars and you can collect whatever data you want. You can make a model do anything
you want, right? And so like, you know what I'm saying? Like, at some level, you've got this
entire capital structure, machinery behind these models. So you're like, oh, it can be good at
science. Well, sure, you put a billion dollars at solving materials science and collect all this data,
you'll be good at material science or whatever it is. And so, but there is a definitional answer,
which is, and I'm going to draw from your work, which is there is a manifold that's
there based on the data it's been trained on.
And then the question is, if it ever produces something that's off, like a new manifold,
so considering the existing training data, if it ever does that, if it does something that's outside of that distribution,
then clearly we're on a path to learning new things.
And if not, then everything is just a computational step from what's already known.
Yeah.
And I guess the counter would be maybe all humans do is work on their own manifold and Einstein
was lucky or something,
I guess would be the counter to that.
But there's several many answer and examples.
And yeah, it's creating this new manifold.
I didn't want to use that definitional answer.
I thought it might sound too...
Yeah.
Too wonky, too mathematical.
But essentially, if LLMs really created this new manifold,
then I would be convinced.
But so far, they have just gotten better
at navigating the existing manifold,
the existing training set.
Which is hugely powerful
and is going to change the world.
Which is hugely powerful.
I'm not denying that.
I think they are extremely, extremely good
at what they can do.
But there's a limit to what they can do.
So I've one quick question.
What's next for you?
I mean, you've tackled in-context learning.
You've got a model for LLMs
and you've got a generalized model
for like their solution space.
What are you thinking about tackling next?
In terms of modeling
or?
Academically, on LLMs.
Academically, I'm, you know, I'm thinking of this.
What is the architecture leap that is needed?
Oh, that's exciting.
To create this new manifold.
And how do we use, you know, multimodal data?
Awesome.
To expand.
When you figure that out, come back and talk to us.
That's right.
We'd love that.
So, I mean, you know, even with LLMs, you know, in the paper,
we say that you can improve the inference
by following this low or minimum entropy path.
So that's a very sort of small step
that we are building and training models
that will do inference based on the entropy path.
Yeah.
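To make the idea of a low-entropy path concrete, here is a minimal Python sketch. The conversation does not describe the paper's actual procedure, so everything here is an assumption for illustration: the toy lookup-table "model" (`TOY_MODEL`), the one-step lookahead rule, and the function names. It only shows the general flavor of preferring a continuation that leaves the model less uncertain about what comes next.

```python
import math

def entropy_bits(dist):
    """Shannon entropy, in bits, of a {token: probability} mapping."""
    return -sum(p * math.log2(p) for p in dist.values() if p > 0)

# Hypothetical toy "model": prefix (tuple of tokens) -> next-token distribution.
# A real LLM would compute these distributions; a lookup table stands in for it.
TOY_MODEL = {
    (): {"A": 0.5, "B": 0.5},
    ("A",): {"x": 0.9, "y": 0.1},   # after "A" the model is fairly confident
    ("B",): {"x": 0.5, "y": 0.5},   # after "B" the model is maximally unsure
}

def pick_low_entropy_token(prefix, model):
    """Among candidate next tokens, pick the one whose follow-up
    distribution has the lowest entropy (one-step lookahead)."""
    candidates = model[prefix]
    return min(candidates, key=lambda tok: entropy_bits(model[prefix + (tok,)]))

if __name__ == "__main__":
    print(pick_low_entropy_token((), TOY_MODEL))
```

In this toy table both first tokens are equally likely, so ordinary greedy decoding has no preference, while the entropy-based rule picks the branch where the model is more confident about the following step.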
By the way, is model probe still up?
Token probe.
Yeah, yeah.
Token probe is still up.
And you can see actually the, you know,
token probe is a software that we built
and thanks to Martin and a16z's generosity
is running on your servers
and anyone can go and test.
And what we have done there is we actually show
the entropy.
Yeah.
It is so enlightening.
I recommend anybody listening to this who's interested.
Actually, check out token probe.
It only shows you the confidence.
Yeah, as you go along.
It's remarkable.
So in-context learning, you know,
you create your new DSL
and you give it to the prompt
and you can see the confidence rising
with each new example,
the entropy reducing.
And that sort of is a validation of the model.
You can see it sort of unfurling
right in front of your eyes.
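The effect described here, entropy falling as each new in-context example arrives, can be reproduced in a toy Bayesian setting. The sketch below is not Token Probe's code and not the model from the papers. The three candidate coin biases, the uniform prior, and the function names are assumptions chosen to keep the example small. Each observed example sharpens the posterior over hypotheses, and the entropy of the prediction for the next symbol drops.

```python
import math

def entropy_bits(probs):
    """Shannon entropy, in bits, of a list of probabilities."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

def predictive_entropies(observations, biases=(0.1, 0.5, 0.9)):
    """Entropy of the next-symbol prediction before any example and
    after each example, under a uniform prior over candidate biases."""
    posterior = [1.0 / len(biases)] * len(biases)
    entropies = []
    for step in range(len(observations) + 1):
        # Posterior-weighted probability that the next symbol is 1.
        p_one = sum(w * b for w, b in zip(posterior, biases))
        entropies.append(entropy_bits([p_one, 1.0 - p_one]))
        if step < len(observations):
            x = observations[step]
            # Bayes update: reweight each hypothesis by its likelihood of x.
            likelihoods = [b if x == 1 else 1.0 - b for b in biases]
            unnorm = [w * l for w, l in zip(posterior, likelihoods)]
            total = sum(unnorm)
            posterior = [u / total for u in unnorm]
    return entropies

if __name__ == "__main__":
    for n, h in enumerate(predictive_entropies([1, 1, 1, 1, 1])):
        print(f"after {n} examples: {h:.3f} bits")
```

With five consistent examples the printed entropy starts at 1 bit and decreases at every step, which is the same qualitative picture as confidence rising with each new example in the prompt.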
The token probe is chaotic.
Thanks, thanks again.
Vishal, thanks so much for coming on the podcast.
It's a great conversation.
It was great fun.
Thank you.
Thank you so much again.
Thanks for listening to this episode
of the a16z podcast.
If you liked this episode,
be sure to like, comment,
subscribe,
rate or review, and share it with your friends and family. For more episodes, go to YouTube,
Apple Podcasts, and Spotify. Follow us on X at a16z, and subscribe to our Substack at a16z.com.
Thanks again for listening, and I'll see you in the next episode. As a reminder, the content
here is for informational purposes only, should not be taken as legal, business, tax, or investment
advice, or be used to evaluate any investment or security, and is not directed at any
investors or potential investors in any a16z fund.
Please note that a16z and its affiliates may also maintain investments in the companies discussed
in this podcast. For more details, including a link to our investments, please see
a16z.com/disclosures.
