a16z Podcast - Columbia CS Professor: Why LLMs Can’t Discover New Science
Episode Date: October 13, 2025

From GPT-1 to GPT-5, LLMs have made tremendous progress in modeling human language. But can they go beyond that to make new discoveries and move the needle on scientific progress? We sat down with distinguished Columbia CS professor Vishal Misra to discuss this, plus why chain-of-thought reasoning works so well, what real AGI would look like, and what actually causes hallucinations.

Resources:
Follow Dr. Misra on X: https://x.com/vishalmisra
Follow Martin on X: https://x.com/martin_casado

Stay Updated:
If you enjoyed this episode, be sure to like, subscribe, and share with your friends!
Find a16z on X: https://x.com/a16z
Find a16z on LinkedIn: https://www.linkedin.com/company/a16z
Listen to the a16z Podcast on Spotify: https://open.spotify.com/show/5bC65RDvs3oxnLyqqvkUYX
Listen to the a16z Podcast on Apple Podcasts: https://podcasts.apple.com/us/podcast/a16z-podcast/id842818711
Follow our host: https://x.com/eriktorenberg

Please note that the content here is for informational purposes only; should NOT be taken as legal, business, tax, or investment advice or be used to evaluate any investment or security; and is not directed at any investors or potential investors in any a16z fund. a16z and its affiliates may maintain investments in the companies discussed. For more details please see a16z.com/disclosures.
Transcript
Any LLM that was trained on pre-1915 physics would never have come up with a theory of relativity.
Einstein had to sort of reject the Newtonian physics and come up with this space-time continuum.
He completely rewrote the rules.
AGI will be when we're able to create new science, new results, new math.
When an AGI comes up with a theory of relativity, it has to go beyond what it has been trained on.
To come up with new paradigms, new science.
And that's, by definition, AGI.
Vishal Misra was trying to fix a broken cricket stats page
and accidentally helped spark one of AI's biggest breakthroughs.
On this episode of the A16Z podcast,
I talk with Vishal and A16Z's Martin Casado
about how that moment led to retrieval-augmented generation
and how Vishal's formal models explain what large language models can and can't do.
We discussed why LLMs might be hitting their limits,
what real reasoning looks like,
and what it would take to go beyond them.
Let's get into it.
Martin, I knew you wanted to have Vishal on.
What do you find so remarkable about him and his contributions that inspired this?
Vishal and I actually have very similar backgrounds.
We both come from networking.
He's a much more accomplished networking guy than I am.
That's a high bar given your field.
And so we actually view the world in an information theoretic way.
It is actually part of networking.
And with all this AI stuff, there's so much work trying to create models
that can help us understand how these LLMs work.
And in my experience over the last three years, the ones that have most impacted my understanding,
and I think have been the most predictive, are the ones that Vishal has come up with.
He did a previous one that we're going to talk about, called Matrix, is it?
Beyond the black box, but yeah.
Beyond the black box.
Actually, we should put this in the notes for this, but the single best talk I've ever seen
on trying to understand how LLMs work is one that Vishal did at MIT,
which Hari Balakrishnan pointed me to, and I watched that.
So he did that work, and then he's doing more.
recent work that's actually trying to scope out not only how LLMs reason, but it has
some reflections on how humans reason too. And so I just think he's doing some of the more
profound work in trying to understand and come up with models, formal models, for how LLMs
reason. On that note, you said his most recent work helped change how you think about how humans reason.
Can you flesh that out a little bit?
Well, okay, so can I just try to take a rough sketch at it and then you just tell me how wrong I am?
Go right ahead. You're trying to describe how LLMs work. And one thing that you've found is
that they reduce a very, very complex multidimensional space
into basically a geometric manifold
that's a reduced state space.
So it's reduced degrees of freedom.
But you can actually predict where in the manifolds
the reasoning can move to, roughly.
So you've reduced the dimensionality of the problem
to a geometric manifold,
and then you can actually formally specify
kind of how far you can reason within that manifold.
And the articulation is that we,
or one of the intuitions is that we as humans do the same thing,
is we take this very complex, heavy-tailed, stochastic universe,
and we reduce it to kind of this geometric manifold,
and then when we reason, we just move along that manifold.
Yeah, I think you captured it accurately.
That's kind of the spirit of the work.
Wait, wait, can I just hear it in your words?
Because I'm a VC, so.
You know, VC with an H-index of what, 60?
True.
Yeah, so ultimately what all these LLMs are doing,
whether the early LLMs or the LLMs that we have today
with all sorts of post-training, RLHF,
whatever you do, at the end of the day,
what they do is they create a distribution for the next token.
So given a prompt, these LLMs create a distribution,
for the next token or the next word
and then they pick
something from that distribution
using some kind of algorithm
to predict the next token,
pick it and then keep going.
Now, what happens
because of the way we train these LLMs,
the architecture of the transformers
and the loss function,
the way you put it is right,
it sort of reduces the world into these Bayesian
manifolds.
Yeah.
And as long as the LLM
is going
in sort of traversing through these manifolds,
it is confident.
And it can produce something which makes sense.
The moment it sort of veers away from the manifold,
then it starts hallucinating and spouting nonsense.
Confident nonsense, but nonsense.
So it creates these manifolds.
And the trick is the distribution that is produced,
you can measure the entropy of the distribution.
Right?
Entropy in the Shannon sense, Shannon entropy, not thermodynamic entropy.
So suppose you have a vocabulary of, let's say, 50,000 different tokens, and you have a next-token distribution over these 50,000 tokens.
So let's say, the cat sat on the, right? If that is the prompt, then the distribution will have a high probability for mat, or hat, or table, and a very low probability for, let's say, ship, or whale, or something like that, right?
So because of the way it's trained, it has these distributions.
Now, these distributions can be low entropy or high entropy.
A high-entropy distribution means that there are many different ways that the LLM can go, with high enough probability for all those paths.
Low entropy means that there are only a small set of choices for the next token.
And the prompts you can also categorize into two kinds of prompts: one prompt is, as you can say, high information entropy, and one prompt is low information entropy.
So the way these manifolds work, the LLMs start paying attention to prompts that have high information entropy and low prediction entropy.
So what do I mean by that?
So when I say, I'm going out for dinner.
Yeah.
Right.
So when I say, I'm going out for dinner, that phrase, the LLMs have been trained.
They've seen it a lot.
And there are many different directions I can go with it.
I can say, I'm going for dinner tonight.
I'm going for dinner to McDonald's or I'm going to dinner, blah, blah, blah.
There are many different.
But when I say, I'm going to dinner with Martin Casado, you know, for the LLM, now this is
information rich.
This is sort of a rare phrase.
And now the sort of realm of possibilities reduces,
because Martin is only going to take me
to Michelin-star restaurants.
I'm not going to go to McDonald's.
You get what I'm saying.
The moment you add more context,
you make the prompt information rich,
the prediction entropy reduces.
Yep, yep, yep, yep.
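To make the entropy idea concrete, here is a small illustrative sketch in Python; the vocabularies and probabilities are invented for illustration, not taken from any actual model.

```python
import math

def shannon_entropy(dist):
    """Shannon entropy in bits of a next-token distribution {token: probability}."""
    return -sum(p * math.log2(p) for p in dist.values() if p > 0)

# Hypothetical next-token distributions (numbers are made up for illustration).
vague = {"mat": 0.25, "hat": 0.2, "table": 0.2, "floor": 0.2, "chair": 0.15}   # "The cat sat on the ..."
specific = {"mat": 0.9, "rug": 0.07, "floor": 0.03}                            # same prompt with more context added

print(shannon_entropy(vague))     # ~2.30 bits: many plausible continuations
print(shannon_entropy(specific))  # ~0.56 bits: entropy collapses as context narrows the options
```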
And another example that I often say...
But just quickly,
what is your takeaway?
What is your implication from that?
Which is, of course, as you're interested... so, yeah, sorry, I forgot how you described it, but the more precise you are, the more tokens you use, I presume, the fewer options you have for the next token.
Is that correct or not correct?
Yeah, yeah, essentially.
So you're reducing it.
You're reducing it to a very specific state space when it comes to
confidence in an answer,
and this is kind of a manifold that you can go
on. And then,
I mean, do you,
do you have
kind of a conclusion of what that means
for systems or what that means for
reasoning, or is it just a nice way
to articulate the bounds of LLMs?
No,
there is something,
I don't know,
I don't know if I should say profound, but there is something
about it which tells you what these LLMs
can or cannot do.
right so one of the
examples that I often tell is, suppose I ask you what is 769 times 1,025?
You have no idea.
You can have some vague idea given the two numbers, right?
And so in your mind, the next token distribution of the answer is going to be diffuse, right?
You don't know.
You have maybe a vague guess.
If you are mathematically very good, maybe your guess is more precise, but it is going
to be diffuse.
and it's not going to be the correct answer.
But if you say, can I write it down and do it
the way we have learned multiplication tables,
now you know exactly what to do the next step.
You write 769 and then 1025 and then you know exactly.
So at each stage of that process,
your prediction entropy is very low.
You know exactly what to do.
Because you have been taught this algorithm.
And by invoking this algorithm, saying,
okay, I'm not going to just guess the answer,
but I'm going to do it step by step,
then your prediction entropy reduces.
And you can arrive at an answer
which you're confident of and which is correct.
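As an aside, the step-by-step idea from the multiplication example can be written out directly; the decomposition below is just one of many ways to break the problem into familiar, low-entropy steps.

```python
# A worked version of the 769 x 1,025 example: each step is a small,
# well-known operation (low prediction entropy), and chaining the steps
# yields an answer you can be confident in, instead of a diffuse guess.
a, b = 769, 1025
partial_1000 = a * 1000        # 769,000
partial_25   = a * 25          #  19,225
answer = partial_1000 + partial_25
print(answer)                  # 788,225
assert answer == a * b         # same result as a direct multiply, but every step was certain
```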
And the LLMs are pretty much the same way.
That's why chain of thought works.
What happens with chain of thought is
you ask the LLM to do something with
chain of thought.
It starts breaking the problem into
small steps. These steps it has seen in the past; it has been trained on them, maybe with some different
numbers, but the concept it has been trained on. And once it breaks it down, then it's confident: okay, now
I need to do A, B, C, D, and then I arrive at this answer, whatever it is.
Let's zoom back. I want to zoom back
into LLMs, but first, Vishal, maybe you can give more context on your background and how
that informs your work here.
Okay, so...
Yeah, as Martin said, my background is very similar to his.
We, you know, we come from doing networking.
So my PhD thesis, my sort of early work at Columbia has all been in networking.
But there's another side of me, another hat that I wear, which is both an entrepreneur and a cricket fan.
I was going to say, don't you own a cricket team or something?
I'm a minority owner of your local cricket team, the San Francisco Unicorns.
Yeah, that's right.
I was very proud to have you.
But, so in the 90s, I was one of the people who started this portal called Cricinfo.
And Cricinfo, at one point, was the most popular website in the world.
It had more hits than Yahoo.
That was before India came online.
And so, you know, we built... cricket is a very stats-driven sport,
you think baseball multiplied by 1,000.
And we had built this free searchable stats database on cricket called Statsguru.
And this has been available on Cricinfo since 2000.
But because you can search for anything, everything was made available on Stats Guru.
And you know, you can't expect people to write SQL queries to query everything.
So how do you, how would we do it?
well, it was a web form,
you know, where you could formulate your query using that form,
and in the back end, that was translated into SQL query,
got the results and got it back.
But as a result, because you could do everything,
everything was made available,
the web form had like 25 different checkboxes,
15 text fields, 18 different drop downs.
The interface was a mess.
It was very daunting.
So,
And ESPN acquired
Cricinfo in mid-2006, I think,
but they still kept the same interface.
And that has always sort of nagged me.
And so I still know the people...
Wait, wait, what nagged you?
Is it that Cricinfo did not have natural language
and had a web form for doing queries?
That web form was terrible.
Because of that, only the real nerds use...
Of all the things of the world to bother you.
The fact that an old website
used a web form.
I appreciate your commitment to aesthetics.
So I'm still friendly with the people who run ESPNcricinfo.
The editor-in-chief, whenever he comes to New York, we meet up, we go out for a drink.
And so he was here in 2020.
So now the story shifts to how LLMs and I sort of met.
So January 2020, right before the pandemic, he was here,
and I again said, why don't you do something about Statsguru?
And he looks at me and says, why don't you do something about Statsguru?
He was kind of joking, but he thought maybe, you know, I had some ways to fix the interface.
So anyway, then the pandemic hit, the world stopped.
But in July of 2020, the first version of GPT-3 was released.
And I saw someone use GPT-3 to write a SQL query
for their own database using natural language.
And I thought, can I use this to fix Stats Guru?
So I got early access to GPT-3, you know, getting access those days was difficult,
but somehow I got it.
But soon I realized that, you know, no, I cannot really do it.
Because for Statsguru, the backend databases were so complex.
And if you remember, GPT-3 had only a 2,048-token context window.
There was no way in hell
I could fit the complexities
of that database in that context window.
And
GPT-3 also did not do
instruction following at that time.
But then,
in trying to solve this problem,
I accidentally invented
what's now called RAG,
where, based on
the natural language query,
I created a database of natural
language queries and
the corresponding structured queries.
Like, I created a DSL,
which then translated into a REST
call to Statsguru.
So based on the new query,
I would look through my set of
natural language queries, I had about
1,500 examples, and I would pick
the six or seven most relevant ones,
and then those, and the structured
queries, I would send as a prefix, along with the new
query, and GPT-3 magically
completed it, and the accuracy was very high.
So that had been running in production
since September 2021.
You know, about 15 months before ChatGPT came
and, you know, the whole revolution in some sense started,
and RAG became very popular.
I didn't call it RAG,
but this is something sort of I accidentally did
in trying to solve that problem for Cricinfo.
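A minimal sketch of the retrieval-augmented pattern described here, assuming a toy relevance score and a hypothetical example bank; the actual Statsguru system presumably differed in its details.

```python
# Illustrative only -- the example bank, scoring function, and prompt format are
# placeholders, not the actual Cricinfo/Statsguru implementation.

def score(query: str, example_question: str) -> float:
    """Toy relevance score based on word overlap; a real system might use embeddings."""
    q, e = set(query.lower().split()), set(example_question.lower().split())
    return len(q & e) / max(len(q | e), 1)

def build_prompt(query: str, bank: list[tuple[str, str]], k: int = 6) -> str:
    """Pick the k most relevant (natural language question, DSL query) pairs
    and prefix them to the new query, so the model can complete the DSL."""
    top = sorted(bank, key=lambda ex: score(query, ex[0]), reverse=True)[:k]
    shots = "\n".join(f"Q: {q}\nDSL: {d}" for q, d in top)
    return f"{shots}\nQ: {query}\nDSL:"

# `bank` would hold on the order of 1,500 hand-written (question, DSL) pairs;
# the completed DSL is then translated into a REST call against the stats backend.
```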
Now, once I built it, you know,
I was thrilled at this work,
but I had no idea why it worked.
You know, I stared at that transformer architecture diagram.
I read those papers, but I couldn't understand how or why it worked.
So then I started in this journey of developing a mathematical model,
trying to understand how it worked.
So that's been sort of my journey through this world of AI and LLMs
because I was trying to solve this cricket problem.
Yeah, amazing.
And so maybe reflecting back since the
release of GPT-3, what has most surprised you about how LLMs have developed?
So what has most surprised me is the pace of development.
So GPT-3 was, you know, it was a nice parlor trick, and you had to jump through hoops
to get it to do something useful.
But starting with, you know, ChatGPT, which was an advance over GPT-3.
And then you had all these things like chain of thought, instruction following.
GPT-4 really made it polished.
And, you know, the pace of development has really surprised me.
Now, you know, when I started working with GPT-3,
I could sort of see what its limitations were,
what I could make it do, what I couldn't make it do.
But I never thought of it as, you know,
what these LLMs have become for me now
and what they have become for millions of people around the world.
We treat these models as our co-workers,
almost like an intern,
that, you know, you're constantly
chatting with them,
brainstorming,
making them do all sorts of work,
which we could not have imagined,
you know,
when ChatGPT was just released.
It was nice.
It could write poems.
It could write limericks.
It could answer some questions, sometimes with hallucinations.
But the capabilities that have emerged now,
that pace has been very sort of surprising to me.
Do you see progress plateauing?
Or how do you, either now or in the near future,
how do you see it going?
I, yes, in some sense, progress is plateauing.
It's like the iPhone, you know, when the iPhone came out, wow, what is this thing?
And the early iterations, you know, constantly we were amazed by new capabilities.
But the last, you know, seven, eight, nine years, it's maybe the camera got a little bit better or, you know, one thing changed here or memory is more.
but there has been no fundamental advance
in what it's capable of, right?
You can sort of see a similar thing
happening with these LLMs.
And this is not true for just one company
and one model, right?
You look at what OpenAI is coming up with,
or what Anthropic or Google,
or all these open-source
models, or Mistral.
The capabilities of LLMs
have not fundamentally
changed. They've become better, right?
They've improved. But they
have not crossed into
a different
realm. So this is something
that I really appreciate about
your work. And so
the thing that really struck
me is as soon
as these things showed up, you actually
got busy
trying to have a formal model of what
they're capable of, which
was in stark contrast to what everybody else was doing.
Everybody else was like,
AGI.
These things are going to, you know, recursively self-improve, like, or they'll say, oh, these are just stochastic parrots, which doesn't mean anything.
So everybody had rhetoric, and sometimes this rhetoric was fanciful, and sometimes this rhetoric was almost reductionist, like, oh, it's just a database, which is clearly not true.
And the thing that really struck me about your work is you're like, no, let's figure out exactly what's going on, let's come up with a formal model.
And once we have a formal model, we could reason about what that means.
And then, you know, in my reading of your work, I kind of break it up in two pieces.
There's the first one where you basically, you came up with this, you know, matrix abstraction.
I think it's worth you talking through.
And then you took in-context learning as an example and you mapped it to Bayesian reasoning,
which to me was incredibly powerful because at the time, nobody knew why in-context learning worked.
So I think it would be great for you to discuss that because, again, I think it was the first real kind of formal take on how these things are working.
And then the more recent work that you're working on now
is a kind of more generalized version of what is the state space
that these models output when it comes to confidence,
which is the manifold that we're talking about before.
So I think it would be great if you just described your matrix model
and then how you use that to provide some bounds
on what in-context learning is doing.
What's happening?
Okay, so, so yeah, let's start with that matrix abstraction.
So the idea with the matrix is you have this gigantic matrix
where every row corresponds to a prompt.
And then the number of columns of this matrix
is the vocabulary of the LLM,
the number of tokens it has that it can emit.
So for every prompt,
this matrix contains the distribution over this vocabulary.
So when you say the cat sat on the,
you know, the column that corresponds to mat
will have a high probability.
Most of them will be zero.
But, you know, reasonable continuations will have a non-zero probability.
And so you can imagine that there's this gigantic matrix.
Now, the size of this matrix is, you know,
if we just take just the old
first-generation GPT-3 model,
which had a context window of 2,048 tokens
and a vocabulary of
50,000 tokens,
then the size of it, the number of rows
in this matrix, is more than the number of atoms
across all galaxies that we know of.
So clearly we cannot represent it exactly.
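A quick back-of-the-envelope check of that scale, using the numbers quoted above (2,048-token context, roughly 50,000-token vocabulary):

```python
import math

# Counting every possible prompt of 2,048 tokens over a 50,000-token vocabulary
# gives on the order of 50,000^2048 rows in the abstract matrix.
vocab, context = 50_000, 2_048
log10_rows = context * math.log10(vocab)   # ~9,623
print(f"~10^{log10_rows:.0f} rows")        # vs. roughly 10^80 atoms in the observable universe
```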
Now, fortunately, a lot of these rows do not appear in real life.
An arbitrary collection of tokens, you are not going to use that as a prompt.
Similarly, a lot of these rows are absent and a lot of the column values are also zero.
When you say the cat sat on the, it's unlikely to be followed by the token corresponding
to, let's say, numbers,
or an arbitrary collection of tokens.
There will only be a very small subset of tokens
that can follow a particular prompt.
So this matrix is very, very sparse.
But even after that sparsity
and even after removing the sort of gibberish prompts,
the size of this matrix is too much for these models to represent,
even with a trillion parameters.
So what, in an abstract sense,
what is happening is,
the models get trained on certain, you know, data from the training set.
And for a certain subset, a small subset of these rows, you have reasonable values
for the next-token distribution.
Whenever you give the prompt something new, then it will try to interpolate between what it has learned
and what's there in the new prompt,
and come up with a new distribution.
But it's basically, so it's more than a stochastic parrot.
It is sort of Bayesian on this subset of the matrix
that it has been trained on.
So when I say, you know, I'm going out for dinner with Martin tonight.
Now, I'm reasonably sure that it has never encountered that phrase
in its training data, right?
but it has encountered variants of this phrase,
and given that I'm going out with Martin
it can produce a Bayesian posterior
it uses that evidence that Martin is the one
that I'm going for dinner with
and it will produce a next token distribution
that will focus on the likely places that we are going
So this matrix,
because it's represented in a compressed way,
yet the models respond to everything,
every prompt. How do they do it? Well, they go back to what they've been trained on,
interpolate there, and use the prompt as sort of some evidence to compute a new
distribution, right?
Right, so the context of the prompt impacts the posterior distribution.
Exactly, yeah.
Right, and you mapped it to Bayesian learning, where the context
is the new evidence.
New evidence, exactly.
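A toy sketch of the "context as evidence" idea as a Bayesian update; the priors and likelihoods below are invented purely for illustration.

```python
# Prior over dinner destinations before any context is given (made-up numbers).
prior = {"McDonald's": 0.4, "pizza place": 0.35, "Michelin-star": 0.25}

# Likelihood of each destination given the added evidence "with Martin" (also made up).
likelihood_given_martin = {"McDonald's": 0.05, "pizza place": 0.15, "Michelin-star": 0.8}

# Posterior is proportional to prior times likelihood, then normalized.
unnorm = {k: prior[k] * likelihood_given_martin[k] for k in prior}
z = sum(unnorm.values())
posterior = {k: v / z for k, v in unnorm.items()}

print(posterior)  # probability mass shifts toward "Michelin-star" once the evidence is in the prompt
```

The mass shifts toward the continuations the evidence supports, which is the sense in which the added context acts as evidence for the posterior next-token distribution.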
So I'll give you, so, so for instance,
the cricket example that I spoke about earlier.
Yeah.
So I created my own DSL,
which, you know,
mapped a natural language query in cricket to this DSL,
which then I can translate into a SQL query
or a Rest API, whatever.
But getting the DSL is important.
Now, these LLMs have never seen that DSL.
I designed it.
Yeah.
But yet, after showing a few examples, it learned it. How did it learn it?
And this is, this is in the
prompt? You didn't do any training? It's in the prompt, right? So, like, the weights are
static? Yeah, yeah, yeah. This was happening in October 2020, right? I had no access to the
internals of OpenAI. I could just, you know, access their API. OpenAI had no access to the internal structure of
Statsguru or the
DSL that I cooked up in my head.
Yet, after showing it only a few examples, it learned it right away.
So that's an example where it has seen DSLs or structures in the past.
And now using this evidence that I show, okay, this is what my DSL looks like.
Now, a new natural language query, it is able to create the right posterior distribution for
the tokens that map to the examples I've shown it.
Now, the other beautiful thing about this is
this is an example of few-shot learning or in-context learning, right?
But when I give that prompt, along with these examples, to this LLM,
I'm not saying to the LLM, okay, this is an example of few-shot learning,
so learn from these examples, right?
You just pass this to the LLM as a prompt
and it processes it exactly the way it would process any other prompt,
that is not an example of in-context learning.
So that really means that the underlying mechanism is the same.
Whether you give a set of examples
and then ask it to complete a task,
like in in-context learning,
or just give it some kind of prompt for continuation,
like I'm going out for dinner with Martin tonight.
There's no in-context learning there.
But the process with which it's generating
or doing this inferencing is exactly the same.
And that's what I have been trying to model
and come up with a formal model of.
What I've found very impressive is
you've used this basic model
to show a number of things, right,
to describe context learning into maps to page and learning.
But you did it for another one
where you kind of, you've sketched out
this almost glib argument on Twitter on X
where you made this
you made a rough argument
for why recursive self-improvement
can't happen
without additional information
and so maybe
just walk through very quickly
how like this same model
you can just very quickly show
that a model can never
recursively self-improve.
So
you know, another phrase
that
we have been using
recently
is, you know, the output of the
LLM is the inductive closure of
what it has been trained on.
So when you say that
it can recursively self-improve,
it could mean
one of two things.
So let's get back to the...
Well, actually, you know what's kind of interesting
is like often the...
Most people agree that if you have one LLM
and you just feed the output and the input,
like it's not going to do anything.
But then often people will say,
well, what if you have two LLMs,
you have no external
information, but you have two LLMs talking to each other, maybe they can improve each other
and then you can have like, you know, a takeoff scenario. But again, you even address this,
even in the case of like N number of LLMs using kind of the matrix model to show that like
you just aren't gaining any information. Yeah. Yeah. So you can represent the sort of information
contained in these models. And let's go back to that matrix analogy that we have, the matrix
abstraction. So like I said, you know, these models represent a subset of the rows,
right? So a subset of the rows are represented, but some of these rows are able to
help fill out some of the missing rows. For instance, you know, if the model knows how to do
multiplication by doing it step by step, then every row that is corresponding to, let's say,
769 times 1,025 or whatever.
It can fill out the answer, yeah.
It can fill out the answer because it has those algorithms sort of embedded in them.
You just need to unroll them.
So it can sort of self-improve up to a point.
But beyond a point, these models can only sort of generate what they have been trained on.
So let me give you, I'll give it three examples.
Yeah.
So any model, any LLM that was trained on pre-1915 physics
would never have come up with the theory of relativity.
Einstein had to sort of reject the Newtonian physics
and come up with this space-time continuum.
He completely rewrote the rules, right?
So that is an example of, you know, AGI,
where you are generating new knowledge
about the universe, right?
It's not simply unrolling,
it's not computing something.
It's actually discovering something
fundamental about the universe.
And for that, you have to go outside your training set.
Similarly, you know, any LLM
that was trained on Newton would not have come up
with quantum mechanics.
Right?
The wave-particle duality, or this whole
probabilistic notion, or that, you know,
energy is not continuous, but it is quantized.
You had to reject Newtonian physics.
Yeah.
Or Gödel's incompleteness theorem.
He had to go outside the axioms to say that, okay, it is incomplete.
So those are examples where you're creating new science or fundamentally new results.
That kind of self-improvement is not possible with these architectures.
They can refine these, they can fill out these rows where the answer already exists.
Another example, you know, which has received a lot of press these days, is these IMO results, the International Math Olympiad.
You know, whether it's a human solving it or the LLM solving it,
they are not inventing new kinds of math
they are able to connect known results
in a sequence of steps to come up with the answer
so even the LLMs what they are doing is they are exploring all sorts of solutions
and in some of these solutions, they start going on this path
where their next token entropy is low.
So that's where I say they are in that Bayesian manifold,
where you have this entropy collapse.
And by doing those steps, you arrive at the answer.
But you're not inventing new math.
You're not inventing new axioms
or new branches of mathematics.
You're sort of using what you've been trained on
to arrive at that answer.
So those things LLMs can do,
you know, they'll get better at it,
at connecting the known dots.
But creating new dots,
I think we need
an architectural
advance.
yeah
So Martin was talking earlier
about how the discourse,
you know, was either
stochastic parrots or,
you know,
AGI and recursive self-improvement or something.
How are you,
how do you conceive of sort of the AGI
discourse, or even
the concept?
What does it mean, to the extent that it's
useful? How do you think about that?
So the way, you know, I think about it, the way we have tried to formulate in our papers,
it's beyond a stochastic parrot, but it's not AGI.
It's doing Bayesian reasoning over what it has been trained on.
So it's a lot more sophisticated than just a stochastic parrot.
How do you define AGI?
Okay, so AGI.
So how do I define AGI?
So the way I would say that LLMs currently navigate through this known Bayesian manifold, AGI will create new manifolds.
So right now these models navigate, they do not create.
AGI will be when we are able to create new science, new results, new math.
When an AGI comes up with a theory of relativity, I mean, it's an extremely high bar, but you get what I'm saying.
It has to go beyond what it has been trained on to come up with new paradigms,
new science. And that's, by definition, AGI.
Vishal, do you think that, based on the work you've done,
can you bound the amount of data or compute that would be needed
in order for it to evolve?
So one of the problems, if you just take LLMs as they exist,
is there was so much data used to create them.
To create a new manifold will need a lot more data
just because of the basic mechanisms, right?
Otherwise, it'll just kind of like, you know,
get kind of consumed into the existing set of data.
Like, have you found any bounds of what would be needed
to actually evolve the manifold in a useful way?
Or do you think we just need a new architecture?
I personally think that we need a new architecture.
the more data that we have the more compute we have
we'll get maybe smoother manifolds
so it's like a map
yeah because I mean there's this view that people have
they're like well
Vishal this is all this is all
good and well but you know I could just take an
LLM and I can give it eyes and I can give it ears
and I can put it in the world and it'll gain information
and based on that intervention it'll improve itself
and therefore it can learn new things
but the counterpoint that I've always just intuitively
thought to that is
the amount of data used to train these things
is so large, how much can you actually
evolve that manifold given an incremental amount of data?
Almost none at all, right?
There has to be some other way
to generate new manifolds that aren't evolving
the existing one.
I completely agree.
There has to be a new sort of architectural leap
that is needed to go from the current,
you know, just throwing more data and more compute.
You know, it's going to plateau.
It's, you know, the iPhone 15, 16, 17.
And are there any research directions that are promising in your mind
that might help us, you know, go beyond LLM limitations?
So, I mean, again, I love LLMs.
They are fantastic.
They are going to increase productivity like nobody's business.
But I don't think they are the answer.
So, you know, Yann LeCun famously says that LLMs are a distraction on the road to AGI.
A dead end.
They're a dead end to AGI.
I don't think, I'm not quite in that camp,
but I think we need a new architecture
to sit on top of LLMs to reach AGI.
You know, a very basic thing,
you know what Martin just said.
You give them eyes and you give them ears.
You make them multimodal.
Of course, they'll become more powerful.
But you need a little bit more than that.
You know, the way the human brain learns
with very few examples,
that's not the way transformers learn.
And, you know, I'm not saying that we need to create an Einstein or a Gödel,
but there has to be an architectural leap that is able to create these manifolds,
and just throwing new data will not do it.
It'll just smoothen out the already existing manifolds.
Is that something?
So is your goal to actually help, like, think through new architectures,
or are you primarily focused on putting formal bounds on existing architectures?
a bit of both
I mean the former goal
is the more ambitious one that
everybody is chasing
and yeah I think about that constantly
Are there any new,
even like,
sort of hints of a new architecture,
or, like, have we started to make any progress
on new architectures,
or is it,
you know...
Yann has been
pushing this JEPA architecture,
energy-based architectures.
They seem promising.
the way I
have been sort of thinking about
it is
you know
there's this
benchmark,
the ARC Prize,
right, that Mike Knoop
and François Chollet have,
and if
you understand why
the LLMs are failing on this test
maybe you can sort of reverse engineer a new architecture
that will help you succeed in that, right?
And I agree with a lot of what several people say that,
you know, language is great, but language is not the answer.
You know, when I'm looking at catching a ball that is coming to me,
I'm mentally doing that simulation in my head.
I'm not translating it to language to figure out where it will land.
I do that simulation in my head.
So, you know, one of the new architectural things is how do we,
how do we get these models to do approximate simulations, to test out an idea and whether
to proceed or not?
So, so, yeah, you know, another thing that I've always wondered about is, as humans,
Did we develop language because we were intelligent?
or because we developed language,
we accelerated our intelligence.
So I don't know which side of the camp
you follow on that question.
What's interesting is,
like you have these anecdotal examples
of humans developing languages de novo
that have been recorded, right?
Like it's either the Guatemalan or Nicaraguan sign language, right?
Where there are these students
that developed their own language without being taught.
And so that would suggest that language just follows intelligence.
The problem is, they're all anecdotal, right?
Like, who knows if somebody didn't teach them sign language?
Like, nobody really knows.
There are no controls.
So these are all observational studies.
And there's so few of them, you have to wonder if it's just kind of sloppy observation.
And so I think that the question is still outstanding.
Yeah.
So, I mean, language definitely accelerated our intelligence.
there's no question about that.
But which followed which we don't know.
I view it as a networking problem naturally,
which is once you have languages,
you can communicate.
And when you can communicate,
you can communicate, you can replicate, yeah.
Yeah, exactly.
Exactly, right.
Cool.
Again, this is kind of a wonky question,
but, you know, I think one thing that you've brought to the discourse,
and for those that are listening to this,
I really think that you should look up Vishal's work and read it.
I just think it'll give you a really, really,
especially if you have a systems background,
like a networking or systems background,
a really, really good understanding
of kind of the bounds on these models.
But like the toolkit that you draw from
is like information theory
and like more formal.
Have you found that the AI community
is receptive to this?
Or is it like two different cultures,
two different planets trying to communicate
and not a lot of common ground?
Like how have you found like bringing
like the networking view of the world
to the AI realm?
Some of them are receptive to it, definitely.
But, you know, these large conferences and their reviewing process, it's so random.
And the kind of questions they ask, you know, I'm a modeling person.
I like to model things.
And, you know, I submitted one version of this work to one very famous machine learning or AI conference.
And the reviewer said, okay, this is
the model, so what?
So there is...
That's absolutely remarkable.
So, like, you actually took a system that nobody understands, that we have no models for,
you actually provided some model that we can use to analyze it, and that alone wasn't sufficient.
They're asking, so where are the large-scale experiments to prove this?
I do, listen, I honestly, I mean, I find there's so much empiricism.
in, like, the current, you know, AI community
exactly because we don't understand the systems.
You know, it kind of reminds me.
I feel like systems went the other way, right?
It's like we had all of these models,
but then we didn't understand how the systems worked,
and then we just, like, actually did measurement.
It feels like ML, or the AI stuff, is the opposite,
which is like, we know we don't understand them,
and so we just measure them,
but now we're trying to, like, come up with the models.
Yeah, exactly.
So it was so easy in some
sense to build these
artifacts and then just measure them
that people have been going around
trying to do that
and
You know, one term
I really dislike is
prompt engineering.
Why? You know, engineering used to mean
sending a man to the moon
or providing five-nines reliability.
Prompt engineering is
prompt twiddling.
You fiddle with a prompt,
and the model's output changes,
the inference, the output changes.
And, you know, you have like hundreds of papers just, just, you know, doing one experiment or another,
changing a prompt this way, that way, and writing their observations.
And as a result, you know, lots of these papers are being written, are being submitted for a review.
Reviews get busy looking at all this kind of empirical work.
And my personal taste is to first try to understand, model it.
Yeah.
And then you can do the other things.
Spoken like a true theory guy.
I don't know about this bit twiddling.
Let me ask one more LLM question, which is,
are there any benchmarks or real-world tasks
that if they occurred,
you'd sort of reevaluate and say,
hey, maybe LLMs are closer to the path to AGI
than I thought.
In terms of real-world tasks?
Good question.
You know, for LLMs, or these models,
the one domain where you have the most training data is probably coding.
And coding is where you can also have
the most structure. And yet anyone who's used these tools, whether it's Cursor or whatever,
Claude Code, LLMs continue to hallucinate, continue to generate unreasonable code. You know, you have to
constantly babysit these models. So the day an LLM can create a large software project without any
babysitting is the day
I'll be a little bit convinced that it's
getting towards AGI. But again,
I don't think
it'll be able to create new science. If it does,
that's when I'll be convinced.
I think that you can almost take a definitional approach
to answer this question, Vishal.
The problem with these types of questions
is if you have billions of dollars
and you can collect whatever data you want,
you can make a model do anything you want, right?
And so like, you know what I'm saying?
at some level, you've got this entire capital structure,
machinery behind these models.
So you're like, oh, it can be good at science.
Well, sure, you put a billion dollars
into solving materials science and collect all this data,
it'll be good at materials science or whatever it is.
And so, but there is a definitional answer, which is,
and I'm going to draw from your work,
which is there is a manifold that's in there
based on the data's been training on.
And then the question is, if it ever produces something
that's off, like a new manifold.
So considering the existing
training data, if it ever does that, if it does
something that's outside of that distribution,
then clearly we're on a path
to learning new things. And if not,
then everything is just a computational step from what's
already known.
Yeah. And I guess
I guess the counter to that would be
maybe all humans do is
work on their own manifold and Einstein
you know
was lucky or something, I guess
would be the counter to that. But I just
So, you know, there are several mini-Einstein examples,
and, yeah, it's creating this new manifold.
I didn't want to use that definitional answer.
I thought it might sound too wonky, too mathematical.
But essentially, if LLMs really created this new manifold,
then I would be convinced.
But so far, they have just gotten better at navigating the existing manifold,
the existing training set.
Which is hugely powerful, and it is going to change the world.
Which is hugely powerful.
I'm not denying that.
I think they are extremely, extremely good at what they can do.
But there's a limit to what they can do.
So, one quick question: what's next for you?
I mean, you've tackled in context learning.
You've got a model for LLMs, and you've got a generalized model for, like, you know,
like their solution space.
What are you thinking about tackling next?
In terms of modeling or?
Academically, on LLMs.
Academically, yeah, academically,
you know, I'm thinking about
what is the architectural leap that is needed.
Oh, that's exciting.
To create this, you know, manifold.
And how do we use, you know, multimodal data?
Awesome.
To expand.
When you figure that out, come back and talk to us.
That's right.
We'd love that.
So, I mean, you know, even with LLMs, you know,
in the paper, we say that you can improve the inference
by following this
low or minimum entropy path.
So that's a very sort of small step,
that we are building
and training models
that will do inference based on
the entropy path.
Yeah.
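A rough sketch of what decoding along a minimum-entropy path could look like, with a placeholder model interface; the actual method described in the paper may differ.

```python
import math

def entropy(dist):
    """Shannon entropy in bits of a next-token distribution {token: probability}."""
    return -sum(p * math.log2(p) for p in dist.values() if p > 0)

def min_entropy_step(prompt, candidates, next_token_dist):
    """Among candidate continuations, prefer the one whose resulting next-token
    distribution has the lowest entropy, i.e. keeps the model on a confident path.
    `next_token_dist(prompt)` is a placeholder for a real model call that returns
    a {token: probability} dictionary."""
    return min(candidates, key=lambda tok: entropy(next_token_dist(prompt + tok)))
```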
By the way, is model probe still up?
Token probe.
Yeah, token probe is still up,
and you can see actually,
the, you know,
token probe is a software that we built,
and thanks to Martin and A16Z's
generosity, it's running
on your servers, and anyone
can go and test it. And what we
have done there is we actually show
the entropy. Yeah.
It is so enlightening. I recommend anybody
listening to this who's interested. Actually, check out
token probe. It literally shows you the
confidence. Yeah, as you go
along. It's remarkable. So in context
learning, you know, you create your new
DSL and you give it to
the prompt and you can see
the confidence rising with each new
example. The entropy is reducing.
and that sort of is a validation of the model.
You can see it sort of unfolding right in front of your eyes.
The token probe is there, running. Thanks, thanks again.
Vishal, thanks so much for coming on the podcast.
It was a great conversation.
It was great fun.
Thank you.
Thank you so much again.
Thanks for listening to this episode of the A16Z podcast.
If you like this episode, be sure to like, comment, subscribe,
leave us a rating or review, and share it with your friends and family.
For more episodes, go to YouTube, Apple Podcasts, and Spotify. Follow us on X, A16Z, and subscribe to our
substack at A16Z.substack.com. Thanks again for listening, and I'll see you in the next episode.
As a reminder, the content here is for informational purposes only. Should not be taken as legal
business, tax, or investment advice, or be used to evaluate any investment or security, and is not
directed at any investors or potential investors in any A16Z fund. Please note that A16Z and its
affiliates may also maintain investments in the companies discussed in this podcast. For more
details, including a link to our investments, please see A16Z.com forward slash disclosures.
