Dwarkesh Podcast - Francois Chollet — Why the biggest AI models can't solve simple puzzles
Episode Date: June 11, 2024

Here is my conversation with Francois Chollet and Mike Knoop on the $1 million ARC-AGI Prize they're launching today.

I did a bunch of socratic grilling throughout, but Francois’s arguments about why LLMs won’t lead to AGI are very interesting and worth thinking through.

It was really fun discussing/debating the cruxes. Enjoy!

Timestamps
(00:00:00) – The ARC benchmark
(00:11:10) – Why LLMs struggle with ARC
(00:19:00) – Skill vs intelligence
(00:27:55) – Do we need “AGI” to automate most jobs?
(00:48:28) – Future of AI progress: deep learning + program synthesis
(01:00:40) – How Mike Knoop got nerd-sniped by ARC
(01:08:37) – Million $ ARC Prize
(01:10:33) – Resisting benchmark saturation
(01:18:08) – ARC scores on frontier vs open source models
(01:26:19) – Possible solutions to ARC Prize
Transcript
Okay, today I have the pleasure to speak with Francois Chollet, who is an AI researcher at Google and creator of Keras.
And he's launching a prize in collaboration with Mike Knoop, the co-founder of Zapier, who we'll also be talking to in a second, a million-dollar prize to solve the ARC benchmark that he created.
So first question, what is the ARC benchmark and why do you even need this prize?
Why won't the biggest LLM we have in a year be able to just saturate it?
Sure. So, ARC is intended as a kind of IQ test for machine intelligence.
And what makes it different from most LLM benchmarks out there is that it's designed to be
resistant to memorization. So if you look at the way LLMs work, they're basically this big
interpolative memory. And the way you scale up their capabilities is by trying to cram as
much knowledge and patterns as possible into them. And by contrast, ARC does not require a lot of
knowledge at all. It's designed to only require what's known as core knowledge,
which is basic knowledge about things like elementary physics, objectness,
counting, that sort of thing. The sort of knowledge that any four-year-old or five-year-old
possesses, right? But what's interesting is that each puzzle in ARC is novel.
It's something that you've probably not encountered before, even if you've
memorized the entire internet. And that's what
makes ARC challenging
for LLMs. And so far,
LLMs have not been doing very well
on it. In fact, the approaches that are working well
are more towards
discrete program search, program synthesis.
So first of all, I'll make a comment that
I'm glad that, as a skeptic of LLMs,
you have put out yourself a benchmark.
Is it accurate to say that,
suppose the biggest model we have in a year
is able to get 80% on this,
then your view would be that we are on track to AGI with LLMs? How would you think about that?
Right.
I'm pretty skeptical that we're going to see an LLM do 80% in a year.
That said, if we do see it, you would also have to look at how this was achieved.
If you just train the model on millions or billions of puzzles similar to ARC, so that you're
relying on the ability to have some overlap between the tasks that you train on and the
tasks that you're going to see at test time, then you're still using memory.
Right? And maybe it can work. Hopefully ARC is going to be good enough that
it's going to be resistant to this sort of attempt at brute forcing. But you never know.
Maybe it could happen. I'm not saying it's not going to happen. ARC is not a perfect
benchmark. Maybe it has flaws. Maybe it could be hacked in that way.
So I guess I'm curious about what GPT-5 would have to do for you to be very confident that it's on the
path to AGI.
What would make me change my mind about LLMs is basically if I start seeing
a critical mass of cases where you show the model something it has not seen before,
a task that's actually novel from the perspective of its training data, something that's not
in the training data, and it can actually adapt on the fly. And this is true for LLMs, but really
this would catch my attention for any AI technique out there. If I can see the ability to
adapt to novelty on the fly, to pick up new skills efficiently, then I would be extremely
interested. I would think this is on the path to AGI.
So the advantage they have is that they do get to see everything. Maybe I'll take issue
with how much they are relying on that. But let's suppose that they are relying, obviously
they're relying on that more than humans do. To the extent that they do have so much in distribution,
to the extent that we have trouble distinguishing whether an example is in distribution
or not. Well, if they have everything in distribution, then they can do everything that we can do,
maybe it's not in distribution for us. Why is it so crucial that it has to be out of distribution
for them? You know, why can't we just leverage the fact that they do get to see everything?
Right. You're asking basically what's the difference between actual intelligence, which is the
ability to adapt to things you've not been prepared for, and pure memorization, like reciting what you've
seen before. And it's not just some semantic difference. The big difference
is that you can never pre-train on everything that you might see at test time, right?
Because the world changes all the time.
So it's not just the fact that the space of possible tasks is infinite.
And even if you're trained on millions of them, you've only seen zero percent of the total space.
It's also the fact that the world is changing every day, right?
This is why the human species has developed intelligence in the first place.
If there was such a
thing as a distribution
for the world, for the universe, for our lives,
then we would not need intelligence at all.
In fact, many creatures, many insects, for instance,
do not have intelligence.
Instead, what they have in their connectome,
in their genes, are
hard-coded programs, behavioral programs
that map some stimuli to an appropriate response.
And they can actually navigate their lives,
their environment,
in a very evolutionarily fit way, without needing to learn anything.
And well, if our environment was static enough, predictable enough,
what would have happened is that evolution would have found the perfect behavioral program,
a hard-coded static behavioral program.
We would have written it into our genes.
We would have a hard-coded brain connectome, and that's what we would be running on.
But no, that's not what happened.
Instead, we have general intelligence.
So we are born with extremely little knowledge about the
world, but we are born with the ability to learn very efficiently and to adapt in the face of
things that we've never seen before. And that's what makes us unique. And that's what is really,
really challenging to recreate in machines. I want to rabbit hole on that a little bit. But before I do
that, maybe I'm going to overlay some examples of what an ARC-like challenge looks like for the YouTube
audience, but maybe for people listening on audio, can you just describe what a sample
ARC challenge would look like?
Sure. So one ARC puzzle looks
kind of like an IQ test puzzle. You've got a number of demonstration input-output pairs.
So one pair is made of two grids. One grid shows you an input and the second grid shows
you what you should produce as a response to that input. And you get a couple of pairs like
this to demonstrate the nature of the task, to demonstrate what you're supposed to do with
your inputs, and then you get a new test input. And your
job is to produce the corresponding test output.
You look at the demonstration pairs and from that you figure out what you're supposed to
do, and you show that you've understood it on this new test pair.
And importantly, the sort of knowledge basis that you need in order to
approach these challenges is just core knowledge.
And core knowledge is basically the knowledge of what makes an object, basic counting,
basic geometry, topology, symmetries, that sort of thing.
So extremely basic knowledge, LLMs for sure possess such knowledge.
Any child possesses such knowledge.
And what's really interesting is that each puzzle is new.
So it's not something that you're going to find elsewhere on the Internet, for instance.
And that means that whether it's as a human or as a machine, every puzzle, you have to approach it from scratch.
You have to actually reason your way through it.
You cannot just fetch the response from your memory.
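The puzzle format described above can be sketched in a few lines of Python. This is a minimal illustration, not an actual ARC task: the grids, the `train`/`test` layout, and the `mirror_left_right` rule are all made up for the example, and a real solver would have to discover the rule rather than be handed it.

```python
# Hypothetical ARC-style task (not from the real dataset): each grid is a small
# 2D list of integers 0-9, and each demonstration pair maps an input to an output.
task = {
    "train": [
        {"input": [[1, 0, 0], [0, 2, 0]], "output": [[0, 0, 1], [0, 2, 0]]},
        {"input": [[3, 3, 0], [0, 0, 4]], "output": [[0, 3, 3], [4, 0, 0]]},
    ],
    "test": [{"input": [[5, 0, 0], [0, 0, 6]]}],
}


def mirror_left_right(grid):
    """Candidate rule (assumed for illustration): reverse every row."""
    return [list(reversed(row)) for row in grid]


def fits_demonstrations(rule, task):
    """A rule is acceptable only if it reproduces every demonstration output."""
    return all(rule(pair["input"]) == pair["output"] for pair in task["train"])


# Only apply the rule to the test input once it explains all demonstrations.
if fits_demonstrations(mirror_left_right, task):
    prediction = mirror_left_right(task["test"][0]["input"])
    print(prediction)
```

The hard part of ARC is not checking a candidate rule like this one; it is coming up with the right candidate for a task that has never been seen before.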
So the core knowledge, one contention here is we are only now getting multimodal models
which, because of the data they are trained on, are trained to do spatial reasoning.
Whereas obviously, not only humans, but for billions of years of evolution, we've had,
our ancestors have had to learn how to understand abstract physical and spatial properties
and recognize the patterns there.
And so one view would be that in the next year, as we gain models that are natively multimodal, where that isn't just a sort of second-class add-on but the multimodal capability is a priority, they will understand these kinds of patterns because that's something they'd see natively.
Whereas right now, what the model sees for ARC is some JSON string of 100-1-0, and it's supposed to recognize a pattern there.
And even if you showed a human
just a sequence of these kinds of numbers,
they would have a challenge making sense
of what kind of question you're asking them.
So why wouldn't it be the case that as soon as we get
multimodal models, which we're on the path to unlock right now,
they're going to be so much better at ARC-type spatial reasoning?
That's an empirical question,
so I guess we're going to see the answer within a few months.
But my answer to that is, you know, ARC grids,
they're just discrete 2D grids of symbols.
They're pretty small.
If you flatten an image as a sequence of pixels, for instance,
then you get something that's actually very, very difficult to parse.
But that's not true for ARC, because the grids are very small and
you only have 10 possible symbols.
So they are 2D grids that are actually very easy to flatten
as sequences. And transformers, LLMs, they are very good at processing sequences.
In fact, you can show that LLMs do fine with processing ARC-like data
by simply fine-tuning an LLM on some subset of the tasks
and then trying to test it on small variations of these tasks.
And you see that, yeah, the LLM can encode just fine solution programs
for tasks that it has seen before.
So it does not really have a problem parsing the input
or figuring out the program.
The reason why LLMs don't do well on ARC is really just
the unfamiliarity aspect, the fact that each new task is different from every
other task. Basically, you cannot memorize the solution programs in advance.
You have to synthesize a new solution program on the fly for each new task.
And that's really what they're struggling with.
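To make the point about flattening concrete, here is a small sketch of one way a grid with 10 possible symbols could be turned into a short text sequence and back. The encoding (one digit per cell, one line per row) and the function names are assumptions chosen for illustration, not the format any particular ARC entry uses.

```python
def flatten_grid(grid):
    """Serialize a 2D grid of digits 0-9 as text: one character per cell, one line per row."""
    return "\n".join("".join(str(cell) for cell in row) for row in grid)


def unflatten_grid(text):
    """Inverse of flatten_grid: rebuild the 2D list of integers from the text form."""
    return [[int(ch) for ch in line] for line in text.split("\n")]


grid = [[0, 0, 7], [0, 7, 0], [7, 0, 0]]
encoded = flatten_grid(grid)
print(encoded)

# The round trip is lossless, so nothing about the puzzle is hidden by the encoding.
assert unflatten_grid(encoded) == grid
```

Because a single digit is enough for each cell, a 3x3 grid becomes an 11-character string, which is far shorter than a flattened image.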
So before I do more devil's advocate, I just want to step back and explain why I'm
especially interested in having this conversation.
And obviously, the million-dollar ARC Prize, I'm excited to actually play with it myself.
The Vesuvius Challenge,
which was Nat Friedman's prize for
decoding the scrolls
that were buried by the volcano in the Herculaneum library,
was solved by a 22-year-old
who was listening to the podcast, Luke Farritor.
So hopefully somebody listening
will find this challenge intriguing
and find a solution.
And I've had on recently
a lot of people who are bullish on LLMs,
and I've had discussions with them before interviewing you about how we explain the fact that LLMs don't seem to be natively performing that well on ARC.
And I found their explanations somewhat contrived and I'll try out some of the reasons on you.
But it is actually an intriguing fact that some of these problems are relatively straightforward for humans to understand.
And they do struggle with them if you just input them natively.
All of them are very easy for humans.
Like any smart human should be able to do 90%, 95%
on ARC.
Smart human.
A smart human.
But even a five-year-old, so with very, very little knowledge, they could definitely do over 50%.
So let's talk about that because you, I agree that smart humans will do very well on
this test.
But the average human will probably do, you know, mediocre.
Not really average.
So we actually tried with average humans.
They score about 85.
That was with Amazon Mechanical Turk workers, right?
I honestly don't know the demographic profile of Amazon Mechanical Turk workers,
but I imagine that someone interacting with the platform Amazon has set up to do remote work
is not the median human across the planet, I'm guessing.
I mean, the broader point here being that we see the spectrum in humans, where humans obviously have AGI.
But even within humans, you see a spectrum where some people are relatively dumber and they'll perform worse on IQ-like tests.
For example, Raven's Progressive Matrices.
If you look at how the average person performs on that,
and you look at the kind of questions that are sort of hit-or-miss,
that half of people will get right
and half of people will get wrong,
some of them are pretty trivial.
For us, we might think like this is kind of trivial.
And so humans have AGI, but from relatively small tweaks,
you can go from somebody who misses these kinds of basic IQ test questions
to somebody who gets them all right,
which suggests that actually if these models are doing natively,
we'll talk about some of the previous performances
that people have tried with these models,
but somebody like Jack Cole, with a 240 million parameter model, got 35%.
Doesn't that suggest that they're on this spectrum that clearly exists within humans,
and they're going to saturate it pretty soon?
Yeah, so there's a bunch of interesting points here.
So there is indeed a branch of LLM approaches spearheaded by Jack Cole that are doing quite well,
that are in fact state of the art.
But you have to look at what's going on there.
So there are two things.
The first thing is that to get these numbers, you need to pre-train your LLM on millions of generated
arc tasks.
And of course, if you compare that to a five-year-old child looking at ARC for the first time,
the child has never done an IQ test before, has never seen something like an ARC task
before.
The only overlap between what they know and what they have to do in the test is core knowledge,
is knowing about like counting and objects and symmetries and things like that.
And still, they're going to do really well.
and they're going to do much better than the LLM trained on millions of similar tasks.
And the second thing to note about the Jack Cole approach, one thing that's
really critical to making the model work at all, is test-time fine-tuning.
And that's something that's really missing, by the way, from LLM approaches right now.
You know, most of the time when you're using an LLM, it's just doing static inference.
The model is frozen and you're just
prompting it and then you're getting an answer. So the model is not actually learning anything
on the fly. Its state is not adapting to the task at hand. And what Jack Cole is actually doing is that
for every test problem, he's fine-tuning on the fly a version of the LLM for that task. And that's
really what's unlocking performance. If you don't do that, you get like 1%, 2%. So basically something
completely negligible. And if you do
test-time fine-tuning and you add a bunch of tricks on top, then you end up with interesting
performance numbers. So I think what he's doing is trying to address one of the key limitations
of LLMs today, which is the lack of active inference. It's actually adding active inference to
LLMs. And that's working extremely well, actually. So that's fascinating to me.
There's so many interesting rabbit holes there. Should I take them in sequence or deal with
them all at once? Let me just start. So the point you made about the fact that you need to unlock
the adaptive compute slash test-time compute: a lot of the scale maximalists, and I think this will be
an interesting rabbit hole to explore with you, because a lot of the scaling maximalists have your
broader perspective in the sense that they think that in addition to scaling, you need these
kinds of things like unlocking adaptive compute or doing some sort of RL to get the System 2 working.
And their perspective is that this is a relatively straightforward thing that will be added
atop the representations that a scaled-up model
has greater access to.
No, it's not just a technical detail.
It's not a straightforward thing.
It is everything.
It is the important part.
And the scale maximalist argument, really it boils down to,
you know, these people, they refer to scaling laws,
which is this empirical relationship that you can draw between
how much compute you spend on training a model and the performance
you're getting on benchmarks, right?
And the key question here, of course,
is, well, how do you measure performance?
What is it that you're actually
improving by adding more compute
and more data? And, well,
it's benchmark performance, right?
And the thing is, the way you measure
performance is not a technical
detail. It's not an afterthought because
it's going to narrow down
the sort of questions that you're asking.
And so accordingly, it's going to
narrow down the sort of answers
that you're looking for. If you look
at the benchmarks
we're using for LLMs,
they're all memorization-based benchmarks.
Like sometimes they are literally just knowledge-based,
like a school test.
And even if you look at the ones that are, you know,
explicitly about reasoning,
you realize if you look closely that it's,
in order to solve them,
it's enough to memorize a finite set of reasoning patterns.
And then you just reapply them.
They're like static programs.
LLMs are very good at memorizing
small static programs.
And they've got this sort of like bank of solution programs.
And when you give them a new puzzle, they can just fetch the appropriate program
and apply it.
And it's looking like it's reasoning.
But really, it's not doing any sort of on-the-fly program synthesis.
All it's doing is program fetching.
So you can actually solve all these benchmarks with memorization.
And so what you're scaling up here, like if you look at the models, they are big parametric
curves fitted to a data distribution with gradient descent.
So they are basically these big interpolative databases, interpolative memories.
And of course, if you scale up the size of your database and you cram into it more knowledge,
more patterns and so on, you are going to be increasing its performance as measured by a
memorization benchmark. That's kind of obvious. But as you're doing it, you are not increasing
the intelligence of the system one bit.
You are increasing the skill of the system.
You are increasing its usefulness, its scope of applicability, but not its intelligence, because
skill is not intelligence.
And that's the fundamental confusion that people run into is that they're confusing skill
and intelligence.
Yeah, there's a lot of fascinating things to talk about here.
So skill, intelligence, interpolation.
I mean, okay, so the thing about how they're fitting some manifold
that maps the input data.
There's a reductionist way to talk about what happens
in the human brain that says that it's just axons
firing at each other.
But we don't care about the reductionist explanation
of what's happening.
We care about what happens at the macroscopic level
when these things combine.
As far as the interpolation goes, so okay,
let's look at one of the benchmarks here.
There's one benchmark that does grade school math.
And these are problems that, like, a smart high schooler would be able to solve.
It's called GSM8K.
And these models get 95% on these.
Like, basically, they always nail it.
That's a memorization benchmark.
Okay, let's talk about what that means.
So here's one question from that benchmark.
So 30 students are in a class.
One-fifth of them are 12-year-olds.
One-third are 13-year-olds.
One-tenth are 11-year-olds.
How many of them are not 11, 12, or 13-year-olds?
So I agree, like, this is not rocket science, right?
You can write down on paper how you go through this problem.
and a high school kid,
at least a smart high school kid,
should be able to solve it.
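Written out, the problem quoted above is three fraction multiplications and a subtraction. The sketch below just carries out that arithmetic with exact fractions; the variable names are made up for the example.

```python
from fractions import Fraction

class_size = 30

# Share of the class at each age, as stated in the question.
shares = {12: Fraction(1, 5), 13: Fraction(1, 3), 11: Fraction(1, 10)}

# Number of students at each age: 6 twelve-year-olds, 10 thirteen-year-olds, 3 eleven-year-olds.
counts = {age: int(share * class_size) for age, share in shares.items()}

# Everyone else is not 11, 12, or 13.
others = class_size - sum(counts.values())
print(counts, others)
```

So 19 of the 30 students are 11, 12, or 13, and the answer is 11.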
Now, when you say memorization,
it still has to reason through
how to think about fractions
and what is the context of the whole problem
and then combine the different calculations it's doing.
It depends how you want to define reasoning.
But there are two definitions you can use.
So one is, I have available a set of program templates.
It's like the structure of the puzzle,
which can also generate its solution.
And I'm just going to identify the right template, which is in my memory.
I'm going to input the new values into the template, run the program, get the solution.
And you could say this is reasoning.
And I say, yeah, sure, okay.
But another definition you can use is reasoning is the ability to, when you're faced with a puzzle,
given that you don't have already a program in memory to solve it,
you must synthesize on the fly a new program based on bits and pieces of existing programs that you have.
You have to do on-the-fly program synthesis.
And it's actually dramatically harder
than just fetching the right memorized program
and reapplying it.
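The two definitions of reasoning given above can be contrasted in a toy setting. The sketch below is only an illustration under invented assumptions: the four list primitives, the `MEMORIZED` bank, and the brute-force search are hypothetical stand-ins, and nothing here is meant to describe how an LLM or any actual ARC solver works internally.

```python
from itertools import product

# Hypothetical building blocks: small operations on lists of integers.
PRIMITIVES = {
    "reverse": lambda xs: xs[::-1],
    "sort": lambda xs: sorted(xs),
    "double": lambda xs: [2 * x for x in xs],
    "drop_first": lambda xs: xs[1:],
}

# Definition 1: a bank of memorized programs, looked up by a familiar task name.
MEMORIZED = {"sort_descending": ("sort", "reverse")}


def run(program, xs):
    """Apply a sequence of primitive names to a list, left to right."""
    for name in program:
        xs = PRIMITIVES[name](xs)
    return xs


def fetch(task_name):
    """Return a stored program if the task is already known, else None."""
    return MEMORIZED.get(task_name)


def synthesize(examples, max_length=3):
    """Definition 2: search compositions of primitives until one fits every example."""
    for length in range(1, max_length + 1):
        for program in product(PRIMITIVES, repeat=length):
            if all(run(program, list(inp)) == out for inp, out in examples):
                return program
    return None


# An unfamiliar task, described only by input/output demonstrations.
examples = [([3, 1, 2], [4, 6]), ([5, 4, 9, 1], [8, 10, 18])]

print(fetch("unfamiliar_task"))   # nothing stored for this task
program = synthesize(examples)    # found by search instead
print(program, run(program, [10, 7, 8]))
```

Fetching is a single dictionary lookup, while the search has to try up to 4 + 16 + 64 candidate programs even in this tiny space, which is one way to see why synthesis is the harder of the two.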
So I think maybe we are overestimating
the extent to which humans are so sample efficient
that they also don't need training in this way
where they have to drill in these kinds of pathways
of reasoning through certain kinds of problems.
So let's take math, for example.
Yeah.
It's not like you can just show a baby
the axioms of set theory.
And now they know math, right?
So when they're growing up, you have to do years of teaching them pre-algebra.
Then you've got to do a year of teaching them doing drills and going through the same kind of problem in algebra,
then geometry, pre-calculus, calculus.
Absolutely.
So training?
Yeah, but isn't that like the same kind of thing where you can't just see one example and now you have the program or whatever?
You actually had to drill it.
These models also had to drill with a bunch of pre-training data.
Sure.
I mean, in order to do on-the-fly program synthesis, you actually need building blocks to work from.
So knowledge and memory are actually tremendously important in the process.
I'm not saying it's memory versus reasoning. In order to do effective reasoning,
you need memory.
But it sounds like it's compatible with your story that through seeing a lot of different kinds of examples,
these things can learn to reason within the context of those examples.
And we can also see within bigger and bigger models.
So that was an example of a high school level of math problem.
let's say a model that's like smaller than GPT-3 couldn't do that at all.
As these models get bigger, they seem to be able to pick up bigger and bigger...
It's not really a size issue.
It's more like a training data issue in this case.
Well, bigger models can pick up these kinds of circuits, which smaller models apparently
don't do a good job of doing this, even if you were to train them on this kind of data.
Doesn't that just suggest that as you have bigger and bigger models, they can pick up bigger and bigger
pathways or more general ways of reasoning?
Absolutely.
But then isn't that intelligence?
No, no, it's not.
If you scale up your database and you keep adding to it more knowledge, more program templates,
then sure, it becomes more and more skillful.
You can apply to more and more tasks.
But general intelligence is not task-specific skill scaled up to many skills.
Because there is an infinite space of possible skills.
General intelligence is the ability to approach any problem, any skill, and very quickly master it using very little data.
Because this is what makes you able to face anything.
you might ever encounter.
This is what makes,
this is the definition of generality.
Like,
generality is not specificity scaled up.
It is the ability to apply your mind to anything at all,
to arbitrary things.
And this requires, fundamentally,
this requires the ability to adapt,
to learn on the fly efficiently.
So my claim is that by doing this pre-training
on bigger and bigger models,
you are gaining that capacity
to then generalize very efficiently.
Let me give you an example.
So your own company, Google, in their paper on Gemini 1.5, they had this very interesting
example where they would give, in context, they would give the model the grammar book and the dictionary
of a language that has less than 200 living speakers.
So it's not in the pre-training data.
And you just give them the dictionary.
And it basically is able to speak this language and translate to it, including the complex
and organic ways in which languages are structured.
So a human, if you showed me a dictionary from English to Spanish,
I'm not going to be able to pick up how to structure sentences
and how to say things in Spanish.
The fact that, because of the representations that it has gained
through this pre-training, it is able to now extremely efficiently
learn a new language,
doesn't that show that this kind of pre-training actually does increase your ability
to learn new tasks?
If you were right, LLMs would do really well on ARC puzzles,
because ARC puzzles are not complex.
Each one of them requires very little knowledge.
Each one of them is very low on complexity.
You don't need to think very hard about it.
They're actually extremely obvious for humans,
like even children can do them.
But LLMs cannot,
even LLMs that have, you know,
100,000 times more knowledge than you do.
They still cannot.
And the only thing that makes ARC special
is that it was designed with this intent
to resist memorization.
This is the only thing.
And this is the huge
blocker for LLM performance.
Right.
And so, you know,
I think if you look at LLMs closely,
it's pretty obvious that they're not really like
synthesizing new programs on the fly to solve the tasks that they're faced with.
They're very much reapplying things that they've stored in memory.
For instance, one thing that's very striking is LLMs can solve a Caesar cipher,
you know, like a Caesar cipher, like transposing
letters to code a message.
And well, that's a very complex algorithm, right?
But it comes up quite a bit on the internet.
So they've basically memorized it.
And what's really interesting is that they can do it
for a transposition length of like three or five
because those are very, very common numbers in examples
provided on the internet.
But if you try to do it with an arbitrary number, like nine,
it's going to fail.
Because it does not encode the generalized form
of the algorithm, but only specific cases.
It has memorized specific cases
of the algorithm. And if
it could actually synthesize on the
fly, the solver algorithm,
then the value of
N would not matter at all,
because it does not increase the problem complexity.
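The claim that the shift value should not matter is easy to see once the cipher is written in its general form. The sketch below is a plain Python implementation with a made-up function name; it treats shifts of 3, 5, and 9 with exactly the same code.

```python
import string


def caesar_shift(text, n):
    """Shift every ASCII letter by n positions, wrapping around the alphabet.

    Non-letters are left unchanged. A negative n shifts backwards, so
    caesar_shift(caesar_shift(s, n), -n) returns the original string.
    """
    lower = string.ascii_lowercase
    k = n % 26  # Python's modulo keeps this in 0..25 even for negative n
    shifted = lower[k:] + lower[:k]
    table = str.maketrans(lower + lower.upper(), shifted + shifted.upper())
    return text.translate(table)


message = "Hello, World"
for n in (3, 5, 9):
    secret = caesar_shift(message, n)
    # Decoding is the same algorithm with the opposite shift.
    assert caesar_shift(secret, -n) == message
    print(n, secret)
```

Nothing in the function is specific to any particular shift, which is the sense in which a system that had the generalized algorithm would handle 9 as easily as 3.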
I think this is true of humans as well,
where what was the study?
Humans use memorization pattern matching
all the time, of course, but humans
are not limited to memorization
and pattern matching. They have this very
unique ability to adapt to
new situations on the fly. This is exactly what enables you to navigate every new day in your life.
I'm forgetting the details, but there was some study that chess grandmasters will perform
very well within the context of the moves that... Excellent example, because chess at the highest
level is all about memorization. Chess memorization gauge. Okay, sure. We can leave that aside.
What is your explanation for the original question of why, in context,
Gemini 1.5 was able to learn a language, including the complex
grammar structure? Doesn't that show that they can pick up new knowledge?
I would assume that it has simply mined from its extremely extensive, unimaginably vast,
training data. It has mined the required template and then it's just reusing it. We know that
LLMs have a very poor ability to synthesize new program templates like this on the fly or even
adapt existing ones. They're very much limited to fetching.
Suppose there's a programmer at Google. They go into the office in the morning. At what point
are they doing something that 100% cannot be due to fetching some template,
something that, suppose they were an LLM, they could not do if they had only fetched
some template from their program?
At what point do they have to use this so-called extreme generalization capability?
Forget about Google software developers.
Every human, every day of their lives, is full of novel things that they've not been prepared
for.
You cannot navigate your life based on memorization alone.
It's impossible.
I'm sort of denying the premise that you are also agreed they're not doing like, quote,
numeralization, it seems like you're saying they're less capable of generalization, but I'm just
curious of like the kind of generalization they do, if you get into the office and you try to
do this kind of generalization, you're going to fail at your job. What is the first point,
if you're a programmer, when you try to do that generalization and
you would lose your job because you can't do the extreme generalization?
I don't have any specific examples, but literally, like, take this situation, for instance,
you've never been here in this room.
Maybe you've been in this city a few times.
I don't know, but there's a fair amount of novelty.
You've never been interviewing me.
There's a fair amount of novelty in every hour of every day in your life.
It's in fact, by and large, more novelty than any LLM could handle.
Like if you just put LLM in a robot, it could not be doing all the things that you've been doing today.
Or take, like, self-driving cars,
for instance. You take a self-driving car operating in the Bay Area. Do you think you could just
drop it in New York City or drop it in London where people drive on the left? No, it's going to fail.
So not only can you not like make it generalize to a change of rules, of driving rules, but you
cannot even make it generalize to a new city. It needs to be trained on each specific environment.
I mean, I agree that self-driving cars aren't AGI.
But it's the same type of model.
They are transformers as well.
I mean, I don't know.
Apes also have brains with neurons in them, but they're less intelligent because they're small.
It's not the same architecture.
We can get into that.
So I still don't understand a concrete thing of we also need training.
That's why education exists.
That's why we had to spend the first 18 years of our life doing drills.
We have a memory, but we are not a memory.
We are not limited to just a memory.
But I'm denying the premise that that's necessarily the only thing these models are doing.
And I'm still not sure what is the task that a remote worker would be doing,
suppose you just swapped out a remote worker with an LLM, and they're a programmer.
What is the first point at which you realize this is not a human, this is an LLM?
What about I just send them an ARC puzzle and see how they do?
No, like part of their job, you know?
But you have to deal with novelty all the time.
Okay, so is there a world in which all the programmers are replaced,
and then we're still saying,
but they're only doing memorization-laden
programming tasks, but they're still producing
a trillion dollars'
worth of, you know, output in the form of code?
Software development is actually a pretty good example
of a job where you're dealing with novelty
all the time. Or if you're not,
well, I'm not sure what you're doing. So
I personally use
generative AI very little in my
software development job. And
before LLMs were a thing,
I was also using Stack Overflow very little.
You know, some people
are just copy-pasting stuff from Stack Overflow, or nowadays
copy-pasting stuff from an LLM.
Personally, I try to focus on problem solving.
The syntax is just a technical detail.
What's really important is the problem solving.
Like the essence of programming is engineering mental models,
like mental representations of the problem you're trying to solve.
But, you know,
people can interact with these systems themselves,
and you can go to ChatGPT and say,
here's a specification of the kind of program I want, and it'll build it for you.
As long as there are many examples of this program on, like, GitHub and Stack Overflow and so on, sure,
they will fetch the program for you from their memory.
But you can change arbitrary details.
No, it doesn't work.
I need it to work on this different kind of server.
If that were true, there would be no software engineers today.
I agree we're not at full AGI yet in the sense that these models have, let's say,
less than a trillion parameters.
A human brain has somewhere on the order of 10 to 30 trillion synapses.
I mean, if you were just doing some naive math, you're like at least 10x underparameterized.
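As a rough check on the "naive math" mentioned here, the ratio can be worked out directly. This is only a sketch using the figures quoted in the conversation (treating "less than a trillion parameters" as an upper bound of one trillion, and equating one parameter with one synapse, which is itself a loose assumption).

```python
# Figures as quoted in the conversation; they are not measured values.
model_params = 1e12                          # "less than a trillion parameters" (upper bound)
synapses_low, synapses_high = 10e12, 30e12   # "10 to 30 trillion synapses"

# How many times larger the brain's synapse count is than the model's parameter count.
ratio_low = synapses_low / model_params
ratio_high = synapses_high / model_params

print(f"Brain is roughly {ratio_low:.0f}x to {ratio_high:.0f}x larger by this count")
```

Under these assumptions the gap is between 10x and 30x, which is where the "at least 10x" figure comes from.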
So I agree we're not there yet.
But I'm sort of confused on why we're not on the spectrum where, yes, I agree that there's
many kinds of generalization they can't do.
But it seems like they're on this kind of smooth spectrum that we see even within humans,
where some humans would have a hard time doing an ARC-type test.
We see that based on the performance on Raven's Progressive Matrices-type IQ tests.
I'm not a fan of IQ tests because, for the most part, you can train
on IQ tests and get better at them.
So they are very much memorization-based.
And this is actually the main pitfall
that ARC tries not to fall for.
I'm still on computer.
So if all remote jobs are automated
in the next five years, let's say,
at least that don't require you to be like sort of a service.
It's not like a salesperson
where you want the human to be talking,
but like programming whatever.
In that world,
would you say that that's not possible
because a lot of what a programmer needs to do
definitely requires things
that would not be in any pre-training corpus?
Sure. I mean, in five years, there will be more
software engineers than there are today, not fewer.
But I just want to understand. So
I'm still not sure. I mean, I know
how to, I studied computer science. If I had
become a code monkey out of college, like, what would I be doing?
I go to my job.
What is the first thing, my boss tells me
something to do? When does he realize
I'm an LLM if I was an LLM?
Probably on the first day, you know.
Again, if it were true that LLMs could generalize to novel problems like this
and actually develop software to solve a problem they've never seen before,
you would not need software engineers anymore.
In practice, if I look at how people are using LLMs in their software engineering job today,
they are using it as a Stack Overflow replacement.
So they are using it as a way to copy-paste code snippets
to perform very common actions.
And what they actually need is a database of code snippets.
They don't actually need any of the abilities
that actually make them software engineers.
I mean, when we talk about interpolating
between the Stack Overflow databases,
if you look at the kinds of math problems or coding problems,
maybe to say that they're...
Maybe let's step back on interpolation
and let me ask the question this way.
Why can't creativity, why isn't creativity
just interpolation in a higher dimension
where if,
A bigger model can learn a more complex manifold.
If we're going to use the ML language.
And if you read a biography of a scientist, right?
They're not zero-shotting new scientific theories.
They're playing with existing ideas.
They're trying to juxtapose them in their head.
They try out some, like, slightly different version; in the tree of intellectual descendants,
they try out a different evolutionary path.
You sort of run the experiment there in terms of publishing the paper, whatever.
It seems like a similar kind of thing humans are doing.
It's just at a higher level of generalization.
And what you see across bigger and bigger models
is that they seem to be approaching higher
and higher levels of generalization,
where GPT-2 couldn't do a grade-school-level math problem
that requires more generalization
than it has capability for, even for that skill.
Then GPT-3 and GPT-4 can.
So not quite.
So GPT4 has a higher degree of skill
and higher range of skills.
I don't want to get into semantics here,
but the question is, why can't creativity be
just interpolation in a higher dimension?
I think interpolation can be creative, absolutely.
And to your point, I do think that on some level,
humans also do a lot of memorization,
a lot of reciting, a lot of pattern matching,
and a lot of interpolation as well.
So it's very much a spectrum between pattern matching
and true reasoning, it's a spectrum.
And humans are never really at one
end of the spectrum.
They're never really doing pure pattern
matching or pure reasoning. They're usually doing some mixture of both. Even if you're doing
something that seems very reasoning heavy, like proving a mathematical theorem, as you're doing it,
sure, you're doing quite a bit of discrete search in your mind, quite a bit of actual reasoning,
but you're also very much guided by intuition, guided by the shape of proofs that you've
seen before, by your knowledge of mathematics. So it's never really, you know, all of our thoughts,
everything we do is a mixture of this sort of
interpolative, memorization-based thinking,
this sort of type 1 thinking,
and type 2 thinking.
Why are bigger models more sample efficient?
Because they have more reusable building blocks
that they can lean on
to pick up new patterns in the training data.
And does that pattern keep continuing
as you keep getting bigger and bigger?
To the extent
that the new patterns you're giving the model to learn
are a good match for what it has learned before.
If you present something that's actually novel,
that is not in the data distribution,
like an ARC puzzle, for instance, it will fail.
Let me make this claim.
Program synthesis, I think, is a very,
very useful intuition pump.
Why can't it be the case that what's happening
in the transformer is that the early layers
are figuring out
how to represent the input tokens?
And what the middle layers do
is this kind of program search,
program synthesis,
where they combine the inputs
to all the circuits in the model
where they go from the low level representation
to a higher level representation
near the middle of the model, they use these programs,
they combine these concepts,
then what comes out at the other end
is the reasoning based on that high level intelligence.
Possibly, why not?
But you know, if these models
were actually capable of synthesizing novel programs,
however simple, they should be able to do arc.
because for any arc task, if you write down the solution program in Python, it's not a complex program.
It's extremely simple.
And humans can figure it out.
So why can LLMs not do it?
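To make the claim about solution programs concrete, here is a minimal sketch of what such a program can look like. The task is hypothetical (it is not taken from the actual ARC data set): each grid is a list of rows of color codes, and the hidden rule is "mirror every row left to right". The function name `solve` and the example grids are made up for illustration.

```python
from typing import List

Grid = List[List[int]]  # a grid is a list of rows; each cell is a color code


def solve(grid: Grid) -> Grid:
    """Hypothetical ARC-style rule: reflect every row left-to-right."""
    return [list(reversed(row)) for row in grid]


# A couple of made-up demonstration pairs (input grid, expected output grid).
train_pairs = [
    ([[1, 0, 0], [0, 2, 0]], [[0, 0, 1], [0, 2, 0]]),
    ([[3, 3, 0], [0, 0, 4]], [[0, 3, 3], [4, 0, 0]]),
]

# The program reproduces every demonstration pair...
assert all(solve(x) == y for x, y in train_pairs)

# ...and can then be applied to an unseen test grid.
test_input = [[5, 0], [0, 6]]
print(solve(test_input))
```

The program itself is a one-liner; the hard part, on this view, is discovering the rule from two or three demonstrations.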
Okay.
I think that's a fair point.
And if I turn the question around to you: suppose that in a year, a multimodal model can solve ARC, let's say it gets 80%, whatever the average human would get. Then AGI?
Quite possibly, yes.
I think if you start,
so honestly what I would like to see is an LLM type model,
solving arc at like 80%,
but after having only been trained on core knowledge-related stuff.
But human kids, I don't think we're necessarily just trained on that.
It's not just that we have object permanence in our genes.
Let me rephrase that.
Only trained on information that is not explicitly trying to
anticipate what's going to be in the ARC test set.
But isn't the whole point of ARC that you can't sort of,
it's a new type of intelligence test every single time?
Yes, that is the point.
So if Arc were a perfect, flawless benchmark,
it would be impossible to anticipate what's in the test set.
And, you know, Arc was released more than four years ago,
and so far it's been resistant to memorization.
So I think it has, to some extent,
passed the test of time.
But I don't think it's perfect.
I think if you try to make by hand
hundreds of thousands of arc tasks, and then you try to multiply them by programmatically generating
variations, and then you end up with maybe hundreds of millions of tasks. Just by brute forcing
the task space, there will be enough overlap between what you're trained on and what's in
the test set that you can actually score very highly. So, you know, with enough scale, you can always
cheat. If you can do this for every single thing that supposedly requires intelligence,
then what good is intelligence? Apparently you can just brute force intelligence.
If the world, if your life were a static distribution, then sure, you could just brute force the space of possible behaviors.
Like, you know, the way I would think about intelligence, there are several metaphors I like to use, but one of them is that you can think of intelligence as a pathfinding algorithm in future situation space.
Like, I don't know if you're familiar with game development, like RTS game development, but you have a map, right?
And it's like a 2D map.
And you have partial information about it.
Like there is some fog of war on your map.
There are areas that you haven't explored yet.
You know nothing about them.
And then there are areas that you've explored, but you only know how they were like in the past.
You don't know how they're like today.
And now instead of thinking about a 2D map, think about the space of possible future situations that you might encounter and how they're connected to each other.
Intelligence is a pathfinding algorithm.
So once you set a goal,
it will tell you how to get there optimally.
But of course, it's constrained by the information you have.
It cannot pathfind in an area that you know nothing about.
It also cannot anticipate changes.
And the thing is, if you had complete information about the map,
then you could solve the pathfinding problem by simply
memorizing every possible path, every mapping from point A to point B, you could solve the problem
with pure memory. But the reason you cannot do that in real life is because you don't actually
know what's going to happen in the future. Life is ever changing. I feel like you're using
words like memorization, which we would never use for human children. If
your kid learns to do algebra and then learns to do calculus, you wouldn't say they
memorized calculus. If they can just solve any arbitrary algebraic problem, you wouldn't say
they've memorized algebra.
You'd say they've learned algebra.
Humans are never really doing pure memorization or pure reasoning.
But that's only because you're semantically labeling it: when the human does it, it's a skill,
but it's memorization when the exact same skill is done by the LLM,
as you can measure by these benchmarks.
And you can just, like, plug in any sort of math problem.
Sometimes humans are doing the exact same thing the LLM is doing,
which is just, for instance, I don't know, if you learn to add numbers,
you're memorizing an algorithm, you're memorizing a program,
and then you can reapply it.
You are not synthesizing the addition program on the fly.
So obviously at some point, some human had to figure out how to do addition.
But the way a kid learns it is not that they sort of figure out, from the axioms of set theory, how to do addition.
I think what you learn in school is mostly memorization.
Right.
So my claim is that, listen, these models are vastly underparameterized relative to how many flops or how many parameters you have in the human brain.
And so, yeah, they're not going to be like coming up with new theorems like the smartest humans can.
But most humans can't do that either.
What most humans do, it sounds like, is similar to what you are calling memorization,
which is memorizing skills or memorizing, you know, techniques that you've learned.
And so it sounds like it's compatible in your, tell me if this is wrong.
Is it compatible in your world if like all the remote workers are gone, but they're doing
skills which we can potentially make synthetic data of?
So we record everybody's screen and every single remote worker's screen.
We sort of understand the skills they're performing there.
And now we've trained a model that can do all this.
All the remote workers are unemployed,
we're generating trillions of dollars
of economic activity from AI remote workers.
In that world, are we still in the memorization regime?
So, sure, with memorization, you can automate almost anything.
As long as it's a static distribution,
as long as you don't have to deal with change.
Are most jobs part of such a static distribution?
Potentially, there are lots of things that you can automate.
And LLMs are an excellent tool for automation.
And I think that's really, but you have to understand that
automation is not the same as intelligence.
I'm not saying that LLMs are useless.
I've been a huge proponent of deep learning
for many years. And
you know, for many years, I've been saying two things. I've been saying
that if you keep scaling up
deep learning, it will keep paying off.
And at the same time, I've been saying, if you keep
scaling up deep learning, this will not lead to
AGI. So we can automate
more and more things. And yes,
this is economically valuable. And yes, potentially
there are many jobs you could automate away
like this, and that would be economically valuable.
But you're still not going to have
intelligence. So you can ask, you know, okay, so what does it matter if we can generate all this
economic value? Maybe you don't need intelligence after all. Well, you need intelligence the moment
you have to deal with change, with novelty, with uncertainty. As long as you're in a space that can
be exactly described in advance, you can just rely on pure memorization, right?
In fact, you can always solve any problem. You can always display arbitrary levels of skill
on any task
without leveraging any intelligence
whatsoever as long as
it is possible to describe the problem
and its solution very, very precisely.
But when they do deal with novelty,
then you just call it interpolation, right?
No, no, interpolation is not enough
to deal with all kinds of novelty
if it were, then LLMs would be AGI.
Well, I agree they're not AGI.
I'm just trying to figure out
how do we figure out
we're on the path to AGI.
And I think the crux here is maybe that it seems to me that these things are on a spectrum
and we're clearly covering the earliest part of the spectrum with LLMs.
I think so.
And oh, okay, interesting.
But here's another sort of thing that I think is evidence for this.
Grokking, right?
So clearly even within deep learning, there's a difference between the memorization regime
and the generalization regime, where at first they'll just memorize the data set of, you know,
if you're doing modular addition, how to add digits.
And then at some point, if you keep training on that, they'll learn the skill.
So the fact that there is that distinction suggests that, for the generalized circuit that deep learning can learn,
there is a regime it enters where it generalizes if you have an over-parameterized model,
which you don't have in comparison to all the tasks we want these models to do right now.
Grokking is a very, very old phenomenon.
We've been observing it for decades.
It's basically an instance of the minimum description length principle,
where, sure, given a problem, you can just memorize a pointwise input-to-output
mapping, which is completely overfit. So it does not generalize at all, but it
solves the problem on the training data. And from there you can actually keep
pruning it, keep making your mapping simpler and simpler and more compressed,
and at some point it will start generalizing. And so that's something called
the minimum description length principle.
It's the idea that the program that will generalize best is the shortest.
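The contrast between a pointwise lookup table and a short program can be sketched with the modular addition example mentioned earlier. This is only a toy illustration of the minimum description length idea, not a reproduction of grokking in a neural network; the modulus `P`, the choice of training subset, and the function names are all arbitrary assumptions.

```python
P = 7  # modulus, chosen arbitrarily for the illustration

# Training set: only the input pairs whose sum is even (25 of the 49 possible pairs).
train = {(a, b): (a + b) % P
         for a in range(P) for b in range(P)
         if (a + b) % 2 == 0}


def memorized(a, b):
    """Hypothesis 1: a pointwise input-to-output table, one entry per training example."""
    return train.get((a, b))  # returns None for any pair not seen during training


def rule(a, b):
    """Hypothesis 2: a much shorter description of the same training data."""
    return (a + b) % P


# Both hypotheses fit the training data perfectly.
assert all(memorized(a, b) == y for (a, b), y in train.items())
assert all(rule(a, b) == y for (a, b), y in train.items())

# Only the short program is correct on the pairs that were held out.
held_out = [(a, b) for a in range(P) for b in range(P) if (a, b) not in train]
memorized_acc = sum(memorized(a, b) == (a + b) % P for a, b in held_out) / len(held_out)
rule_acc = sum(rule(a, b) == (a + b) % P for a, b in held_out) / len(held_out)

print(f"table entries: {len(train)}, held-out pairs: {len(held_out)}")
print(f"memorized accuracy: {memorized_acc:.2f}, rule accuracy: {rule_acc:.2f}")
```

The table needs one entry per example and says nothing about unseen inputs, while the one-line rule describes the same data more compactly and extends to every pair.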
Right.
And it doesn't mean that you're doing anything other than memorization,
but you're doing memorization plus regularization.
Right.
AKA generalization.
Yeah.
And that absolutely leads to generalization.
Right.
And then so you do that within one skill.
But then the pattern you see here of meta learning is that it's more efficient to store a program
that can perform many skills rather than one skill.
which is what we might call fluid intelligence.
And so as you get bigger and bigger models,
you would expect it to go up this hierarchy of generalization
where it generalizes to a skill,
then it generalizes across multiple skills.
That's correct. That's correct.
And, you know, LLMs, they're not infinitely large.
They have only a fixed number of parameters,
and so they have to compress their knowledge as much as possible.
And in practice, so LLMs are mostly storing reusable bits of programs,
like vector programs.
And because they have this need for compression,
it means that every time they're learning a new program,
they're going to try to express it
in terms of existing bits and pieces of programs
that they've already learned before.
Right?
Isn't this the generalization?
Absolutely.
Oh, wait.
This is why, you know,
clearly LLMs have some degree of generalization.
And this is precisely why.
It's because they have to compress.
And why is that intrinsically limited?
Why can't you just go, at some point,
it has to learn a higher level of generalization,
higher level and then the highest level is the fluid intelligence.
It's intrinsically limited because the substrate of your model is a big parametric curve.
And all you can do with this is local generalization.
If you want to go beyond this towards broader or even extreme generalization,
you have to move to a different type of model.
And my paradigm of choice is discrete program search, program synthesis.
So, and if you want to understand that, you can sort of, like,
compare and contrast it with deep learning.
So in deep learning, your model is a parametric curve,
a differentiable parametric curve.
In program synthesis, your model is a discrete graph of operators.
So you've got like a set of logical operators like a domain specific language.
You're picking instances of it.
You're structuring that into a graph.
That's a program.
And that's actually very similar to like a program you might write in Python or C++ and so on.
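A minimal sketch of this idea, under stated assumptions: the three grid operators below form a made-up domain-specific language, a "program" is just a sequence of operator names (a linear chain, the simplest kind of graph), and the learning engine is naive enumeration of programs from shortest to longest. None of the names come from an actual ARC solver.

```python
from itertools import product

# A tiny hypothetical DSL of grid operators (names are illustrative only).
def flip_h(g):
    return [row[::-1] for row in g]        # mirror each row left-to-right

def flip_v(g):
    return g[::-1]                         # reverse the order of the rows

def transpose(g):
    return [list(col) for col in zip(*g)]  # swap rows and columns

DSL = {"flip_h": flip_h, "flip_v": flip_v, "transpose": transpose}


def run(program, grid):
    """Apply the operators named in `program` one after another."""
    for name in program:
        grid = DSL[name](grid)
    return grid


def search(examples, max_len=3):
    """Enumerate programs shortest-first; return the first one that fits every example."""
    for length in range(1, max_len + 1):
        for program in product(DSL, repeat=length):
            if all(run(program, x) == y for x, y in examples):
                return program
    return None


# A single demonstration pair: the output is the input rotated 90 degrees clockwise.
examples = [([[1, 2], [3, 4]], [[3, 1], [4, 2]])]
program = search(examples)
print(program)                            # the operator sequence that was found
print(run(program, [[5, 6], [7, 8]]))     # apply it to a grid it has never seen
```

With only three operators the enumeration is instant, but the number of candidate programs grows as 3 to the power of the program length, which is why this kind of search becomes expensive as the DSL and the programs grow.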
And your learning engine, because we are doing machine learning here, like we're trying to automatically learn these models:
in deep learning, your learning engine is gradient descent, right?
And gradient descent is very compute efficient because you have this very strong, informative feedback signal about where the solution is,
so you can get to the solution very quickly.
But it is very data inefficient, meaning that in order to make it work, you need a dense sampling of the operating
space. You need a dense sampling of the data distribution. And then you're limited to
only generalizing within that data distribution. And the reason why you have this limitation
is because your model is a curve. And meanwhile, if you look at discrete program search,
the learning engine is combinatorial search. You're just trying a bunch of programs until you
find one that actually meets your spec. This process is extremely data efficient. You can learn
a generalizable program from just one example, two examples, which is why it
works so well on ARC, by the way. But the big limitation is that it's extremely
compute inefficient, because you're running into combinatorial explosion, of
course. And so you can sort of see here how deep
learning and discrete program search have very complementary
strengths and limitations as well. Like, every limitation of deep learning
has a corresponding strength in program synthesis, and
inversely. And I think the path forward is
going to be to merge the two, to basically start doing.
So another way you can think about it is, so these parametric curves trained with gradient
descent, they are a great fit for everything that's system one type thinking, like pattern
recognition, intuition, memorization, and so on.
And discrete program search is a great fit for type two thinking, system two thinking,
for instance, planning, reasoning,
quickly figuring out a generalizable model
that matches just one or two examples,
like for an ARC puzzle, for instance.
And I think humans are never doing pure system one
or pure system two.
They're always mixing and matching both.
And right now we have all the tools for system one.
We have almost nothing for system two.
The way forward is to create a hybrid system.
And I think the form it's going to take
is it's going to be mostly system two.
So the outer structure is going to be a discrete program search system.
But you're going to fix the fundamental limitation of discrete program search,
which is combinatorial explosion.
You're going to fix it with deep learning.
You're going to leverage deep learning to guide, to provide intuition in program space,
to guide the program search.
And I think that's very similar to what you see, for instance,
when you're playing chess
or when you're trying to prove a theorem
is that it's mostly
a reasoning thing, but you start
out with some intuition about the shape of the solution.
And that's very much something you can get
via a deep learning model.
Deep learning models, they are very much like
intuition machines. They're pattern matching
machines. So you start
from this shape of the solution
And then you're going to do actual explicit discrete program search.
But you're not going to do it via brute force.
You're not going to try things kind of like randomly.
You're actually going to ask another deep learning model for suggestions.
Like here's the most likely next step.
Here's where in the graph you should be going.
And you can also use yet another deep learning model for feedback.
Like, well, here's what I have so far.
Is it looking good?
Should I just backtrack and try something new?
So I think discrete program search is going to be the key,
but you want to make it dramatically better
orders of magnitude more efficient, by leveraging deep learning.
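The shape of such a hybrid can be sketched as a best-first search in which a scoring function decides which partial program to extend next. Everything here is an assumption made for illustration: the three integer operators are a made-up DSL, and the hand-written `guidance` function is only a stand-in for the learned model described above, not an actual neural network.

```python
import heapq
from itertools import count

# Made-up DSL of integer operators (illustrative only).
OPS = {
    "inc": lambda x: x + 1,
    "double": lambda x: x * 2,
    "square": lambda x: x * x,
}


def run(program, x):
    """Apply the operators named in `program` one after another."""
    for name in program:
        x = OPS[name](x)
    return x


def guidance(program, examples):
    """Stand-in for a learned model: lower score means a more promising program.

    Here it is simply the total distance between current outputs and targets.
    """
    return sum(abs(run(program, x) - y) for x, y in examples)


def guided_search(examples, max_len=6):
    """Best-first search: always extend the program the guidance function likes best."""
    tie = count()  # tie-breaker so the heap never has to compare programs directly
    frontier = [(guidance((), examples), next(tie), ())]
    expanded = 0
    while frontier:
        score, _, program = heapq.heappop(frontier)
        expanded += 1
        if score == 0:                      # fits every example exactly
            return program, expanded
        if len(program) < max_len:
            for name in OPS:                # propose every one-step extension
                child = program + (name,)
                heapq.heappush(frontier, (guidance(child, examples), next(tie), child))
    return None, expanded


# Two demonstration pairs consistent with "increment, then square, then double".
examples = [(1, 8), (2, 18)]
program, expanded = guided_search(examples)
print(program, "found after expanding", expanded, "candidates")
```

The search still backtracks when the guidance is misleading, since less promising candidates stay in the queue; a better-informed scoring model would mainly reduce how many candidates get expanded before a fitting program is found.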
And by the way, another thing that you can use deep learning for
is of course things like common sense knowledge
and knowledge in general.
And I think you're going to end up with this sort of system
where you have this on-the-fly synthesis engine
that can adapt to new situations.
But the way it adapts is that it's going to fetch from a bank of patterns,
modules that could be themselves curves, that could be differentiable modules, and some
others that could be algorithmic in nature. It's going to assemble them via this process that's
intuition guided. And for every new situation you might be faced with,
it's going to give you a generalizable model that was synthesized using very, very little data.
Something like this would solve ARC.
That's actually a really interesting
prompt, because I think
an interesting crux here is
when I talk to my friends who are extremely optimistic
about LLMs and
expect AGI within the next couple of years,
they also in some sense agree
that scaling is not all you need,
but that the rest of the progress is undergirded
and enabled by scaling
And but still, you need to add the system two and the test-time compute atop these models.
And their perspective is that it's relatively straightforward to do that because you have this
library of representations that you built up from pre-training.
But it's almost like it's just skimming through textbooks.
You need some more deliberate way in which it engages with the material it learns.
In-context learning is extremely sample efficient.
But to actually distill that into
the weights, you need the model to, like, talk through the things that it sees and then add it back to the
weights. As far as the system two goes, they talk about adding some kind of RL setup so that it is
encouraged to proceed on the reasoning traces that end up being correct. And they think this is
relatively straightforward stuff that will be added within the next couple of years. That's an empirical
question. So I think we'll see. Your intuition, I assume, is not that. I'm curious. My intuition is,
in fact, this whole system two architecture is the hard part, is the very hard and not
obvious part. Scaling up the interpolative memory is the easy part. All you need is, like, it's
literally just a big curve. All you need is more data. It's a representation of a data set,
an interpolative representation of a data set. That's the easy part. The hard part is the
architecture of intelligence. Memory and intelligence are separate components. We have the memory,
we don't have the intelligence yet. And I agree with you that, well, having the memory is
actually very useful. And if you just had the intelligence, but it was not hooked up to
an extensive memory, it would not be that useful because it would not have enough material
to work from. Yeah. The alternative hypothesis here that a former guest, Trenton Bricken,
advanced is that intelligence is just hierarchically associated memory, where you have higher-level
patterns. When Sherlock Holmes goes into a crime scene, he's extremely sample-efficient. He can
just look at a few clues and figure out who the murderer was. And the way he's able to do that is he
has learned higher level sort of associations.
It's memory in some fundamental sense.
But so here's one way to ask a question.
In the brain, supposedly we do program synthesis,
but it is just synapses connected to each other.
And so physically it's got to be that you just query the right circuit, right?
You are, yeah, yeah.
You know, it's a matter of degree.
But if you can learn it, if training in the environment
that human ancestors were trained in means you learn those
circuits, then training on the same kinds of outputs that humans produce, which to replicate
require these kinds of circuits, wouldn't that train the same kind of whatever humans have?
You know, it's a matter of degree. If you have a system that has a memory and is only
capable of doing local generalization from that, it's not going to be very adaptable. To be really
general, you need the memory plus the ability to search to quite some depth to achieve, you know,
broader or even extreme generalization.
You know, one of my favorite psychologists, Jean Piaget, was the founder of
developmental psychology.
He had a very good quote about intelligence.
He said, intelligence is what you use when you don't know what to do.
And it's like as a human living your life, in most situations, you already know what to do
because you've been in this situation before.
You already have the answer, right?
And you're only going to need to use intelligence
when you're faced with novelty,
with something you didn't expect,
with something that you weren't prepared for
either by your own experience,
your own life experience,
or by your evolutionary history.
Like this day that you're living right now
is different in some important ways
from every day you've lived before,
but it's also different from any day ever lived
by any of your ancestors.
And still, you're capable of being functional.
Right.
How is it possible?
I'm not denying that generalization is extremely important and is the basis for intelligence.
That's not the crux.
The crux is how much of that is happening in the models.
But, okay, let me ask a separate question.
We might keep going in circles here.
The differences in intelligence between humans.
Maybe the intelligence test because of reasons you mentioned are not measuring it well,
but clearly there's differences in intelligence between different humans.
Sure.
What is your explanation for what's going on there?
Because I think that's sort of compatible with my story that there's
the spectrum of generality and that these models are climbing up to a human level.
And even some humans haven't even climbed up to the Einstein level or the Francois level.
That's a great question.
You know, there is extensive evidence that differences in intelligence are mostly genetic in nature, right?
Meaning that if you take someone who is not very intelligent, there is no amount of training, of like training data you can expose that person to that would make them become Einstein.
And this kind of points to the fact that you really need a better architecture.
You need a better algorithm.
And more training data is not, in fact, all you need.
I think I agree with that.
I think maybe a way I might phrase it is that the people who are smarter have in ML language,
better initializations.
It's just that the neural wiring, if you look at it, is more efficient.
They have maybe greater density of firing.
And so some part of the story is scaling; there is some correlation between brain size and intelligence.
And we also see, within the context of the quote-unquote scaling that people talk about with LLMs, architectural improvements, where a model like Gemini 1.5 Flash performs as well as GPT-4 did when GPT-4 was released a year ago, but is 57 times cheaper on output.
So part of the scaling story is that, with the architectural improvements, we're in extremely low-hanging-fruit
territory when it comes to those.
Okay, we're back now with the co-founder of Zapier, Mike Knoop.
We had to restart a few times there.
And you're funding this prize and you're running this prize with Francois.
And so tell me about how this came together.
What prompted you guys to launch this prize?
Yeah.
I guess I've been sort of like AI curious for 13 years.
I co-founded Zapier, been running it for the last 13 years.
And I think I first got introduced to your work during COVID.
I kind of went down the rabbit hole.
I had a lot of free time.
And it was right after you published your "On the Measure of Intelligence" paper,
where you sort of introduced the concept of AGI,
that this efficiency of skill acquisition is like the right definition, and the ARC puzzles.
But I don't think the first Kaggle contest was done yet.
I think it was still running.
And so I kind of, it was interesting, but I just parked the idea.
And I had bigger fish to fry at Zapier.
We were in the middle of this big turnaround of trying to get to our second product.
And then it was January 2022 when the chain-of-thought paper came out that really, like,
awoke me to sort of the progress.
I gave a whole presentation to Zapier on, like, the GPT-3 paper even.
So I sort of felt like I had priced in everything that LLMs could do.
And that paper was really shocking to me in terms of, oh, these latent capabilities that
LLMs have that I didn't expect that they had.
And so I actually gave up my exec team role at Zapier.
I was running half the company at that point.
I went back to be an individual contributor and just to go do AI research alongside Brian, my co-founder.
And ultimately that led me back towards ARC.
I was looking into it again.
And I had sort of expected to see this saturation effect that MMLU has, that GSM8K has.
And when I looked at the scores and the progress over the last four years, I was really, again, shocked to see
that actually, we've made very little objective progress towards it.
And it felt like a really, really important eval.
And as I sort of spent the last year asking people, quizzing people about it in sort of my
network and community, very few people even knew it existed.
And that felt like, okay, if it's right that this is a really, really, like, globally,
singularly unique AGI eval, and it's different from every other eval that exists, that
more narrowly measure AI skill,
Like, more people should know about this thing.
I had my own ideas on how to beat ARC as well.
So I was working nights and weekends on that.
And I flew up to meet Francois earlier this year to quiz him, show him my ideas.
And ultimately, I was like, well, you know, why do you think more people don't know about ARC?
I think you should actually answer that.
I think it's a really interesting question.
Like, why do you think more people don't know about ARC?
Sure.
You know, I think benchmarks that gain traction in the research community are benchmarks that are already fairly tractable.
Because the dynamic that you see is that some research group
is going to make some initial breakthrough.
And then this is going to catch the attention of everyone else.
And so you're going to get follow-up papers with people trying to beat the first team and so on.
And for ARC, this has not really happened because ARC is actually very hard for existing AI techniques.
ARC kind of requires you to try new ideas.
And that's very much the point, by the way.
Like the point is not that, yeah, you should just be able to apply existing technology and solve ARC.
The point is that existing
technology has reached a plateau.
And if you want to go beyond that,
if you want to start being able to tackle
problems that you haven't memorized,
that you haven't seen before,
you need to try new ideas.
And ARC is not just meant to be
this sort of measure
of how close we are to AGI.
It's also meant to be a source of inspiration.
Like I want researchers to look at these puzzles
and be like, hey, it's really strange
that these puzzles
are so simple and most humans can just do them very quickly. Why is it so hard for existing
AI systems? Why is it so hard for LLMs and so on? And
this is true for LLMs, but ARC was actually released before LLMs were really a thing. And
the only thing that made it special at the time was that it was designed to be resistant
to memorization. And the fact that it has survived LLMs and gen AI in general so well kind
of shows that, yes, it is actually resistant to memorization.
This is what nerd-sniped me, because I went and took a bunch of the puzzles myself.
I've showed it to all my friends and family too, and they're all like, oh, yeah, this is like super easy.
Are you sure AI can't solve this?
Like, that's the reaction and the same one for me as well.
And the more you dig in, you're like, okay, yep, there's not just empirical evidence over the last four years that it's unbeaten, but there are theoretical concepts behind why.
And I completely agree at this point that new ideas basically are needed to beat ARC.
And there's a lot of current trends in the world that are actually, I think, working against that happening.
Basically, I think we're actually less likely to generate new ideas right now.
You know, I think one of the trends is the closing up of frontier research, right?
The GPT-4 paper from OpenAI had no technical detail shared.
The Gemini paper had no technical detail shared on, like, the longer context part of that work.
And yet that open innovation, that open progress and sharing is what got us to transformers in the first place.
That's what got us to LLMs in the first place.
So it's kind of disappointing, actually, that so much frontier work has gone closed.
It's really making a bet that these individual labs are going to have the breakthrough
and not the ecosystem.
And I think the internet and open source have shown that that's the most powerful
innovation ecosystem that's ever existed, probably in the entire world.
I think that's actually really sad that frontier research is no longer being published.
If you look back, you know, four years ago, well, everything was just openly shared,
like all the state-of-the-art results were published.
And this is no longer the case.
And it's very much, you know,
OpenAI single-handedly changed the game.
And I think OpenAI basically set back progress towards AGI
by quite a few years, probably like five to 10 years,
for two reasons.
One is that, well, they caused this complete closing down
of frontier research publishing.
But also they triggered this initial burst of
hype around LLMs.
And now LLMs have sucked the oxygen out of the room.
Like everyone is just doing LLMs.
And I see LLMs as more of an off-ramp on the path to AGI, actually.
And all these new resources, they're actually going to LLMs instead of everything else
they could be going to.
And, you know, if you look further into the past, to like 2015, 2016, there
were like a thousand times fewer people doing AI back then.
And yet I feel like the rate of progress was higher because people were exploring more
directions.
The world felt more open-ended.
Like you could just go and try, like have a cool idea, launch it, try it, and get some
interesting results.
So there was this energy.
And now everyone is very much doing some variation of the same thing.
And the big labs also tried their hand at ARC,
but because they got bad results, they didn't publish anything.
Like, you know, people only publish positive results.
I wonder how much effort people have put into trying to prompt or scaffold,
do some sort of maybe Devin-type approach, to get the frontier models,
and the frontier models of today, not just a year ago,
because a lot of post-training has gone into making them better,
so Claude 3 Opus or GPT-4o, to produce good solutions on ARC.
I hope that one of the things this episode does is get people to try out this open competition
where they have to put in an open source model to compete, but also to figure out if
maybe the capability is latent in Claude Opus, and just see if you can show that.
I think that would be super interesting.
So let's talk about the prize.
How much do you win if you solve it, you know, get whatever percent on ARC?
How much do you get if you get the best submission but don't crack it?
So we've got a million dollars, actually a little over a million dollars,
in the prize pool. We're running the contest on an annual basis. We're starting it today,
and it runs through the middle of November. And the goal is to get 85%. That's the lower bound of the human
average that you guys talked about earlier. And there's a $500,000 prize for the first team
that can get to the 85% benchmark. We don't expect that to happen
this year, actually. One of the early statisticians at Zapier gave me this line that has always
stuck with me: the longer it takes, the longer it takes. So my prior is that
ARC is going to take years to solve. And so we're also going to break it down
and do a progress prize this year. So there's a $100,000 progress prize, which we will pay out
to the top scores. So $50,000 is going to go to the top objective scores this year on the Kaggle leaderboard,
since we're hosting it on Kaggle. And then we're going to have a $50,000 pot set aside for a paper
award for the best paper that explains conceptually the scores that they were able to achieve.
And one of the, I think, interesting things we're also going to be doing is we're going to be requiring that in order to win the prize money that you put the solution or your paper out into public domain.
The reason for this is, you know, typically with contests, you see a lot of closed-up sharing. People are kind of private and secretive.
They want to hold their alpha to themselves during the contest period.
And because we expect it's going to be multiple years, we want to enter a game here.
So the plan is, you know, at the end of November, we will award the $100,000 in prize money to the top progress prize entries,
and then use the downtime between December, January, and February to share out all the knowledge
from the top scores and the approaches folks were taking in order to re-baseline the community
up to whatever the state of the art is, and then run the contest again next year.
And keep doing that on a yearly basis until we get 85%.
I'll give some people some context on why I think this prize is very interesting.
I was having conversations with my friends who are very much believers in models as they exist today.
And first of all, it was intriguing to me that they didn't know about ARC.
These are experienced ML researchers.
And so this happened a couple of nights ago.
We went to dinner and I showed them an example problem.
And they said, of course an LLM would be able to solve something like this.
And then we took a screenshot of it.
We just put it into our ChatGPT app.
And it doesn't get the pattern.
And so I think it's a very interesting, like,
it is a notable fact. I was sort of playing devil's advocate against you
on these kinds of questions, but this is a very intriguing fact.
And I think this prize is extremely interesting
because we're going to learn something
fascinating one way or another. So with regards to the 85%, separate from this
prize, I'd be very curious if somebody could replicate that result, because obviously
in psychology and other kinds of fields, which this result seems to be analogous to,
when you run tests on some small sample of people, the results are often hard to replicate. I'd be
very curious, if you try to replicate this, how the average human performs on
ARC. As for the difficulty, or how long it will take to crack this benchmark,
it's very interesting because the other benchmarks that are now fully saturated, like MMLU and MATH,
actually the people who made them, Dan Hendrycks and Collin Burns, who did MMLU and MATH,
I think they were grad students or college students when they made them.
And the goal when they made them, just a couple of years ago, was that this would be a test of AGI,
and of course they got totally saturated.
I know you'll argue that these are tests of memorization, but I think the pattern we've
seen, in fact, Epoch AI has a very interesting graph that I'll sort of overlay for the YouTube
version here, where you see this almost exponential, where it gets, you know, 5%, 10%, 30%, 40%
as you increase the compute across models, and then it just shoots up. And in the
GPT-4 technical report, they had this interesting graph of the HumanEval problem set, which was
22 coding problems, and they had to graph it on the mean log pass curve, basically because
early on in training, or even with smaller models, the model can have the
right idea of how to solve the problem, but it takes a lot of reliability to make sure it stays on
track to solve the whole problem. And so you really want to upweight the signal where they get it right
at least some of the time, maybe one in a hundred times or one in a thousand. And so they go from like
one in a thousand, to one in a hundred, to one in ten, and then they just totally saturate it. I guess the
question I have, which this is all leading up to, is why won't the same thing happen with ARC, where
people had to try really hard with bigger models, and now they've figured out these techniques, like what Jack Cole
has figured out with only a 240 million parameter language model that can get 35%?
Shouldn't we see the same pattern we saw across all these other benchmarks, where you
sort of eke out progress, and then once you get the general idea, you just go all the way to 100?
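As an aside on the metric mentioned in the question: the sketch below shows why a log-scale pass rate makes early progress visible. It assumes each problem has an estimated pass rate strictly between 0 and 1; the function name `mean_log_pass_rate` and the sample pass rates are illustrative and are not taken from the GPT-4 technical report.

```python
import math

def mean_log_pass_rate(pass_rates):
    """Average of log(pass rate) over problems (hypothetical helper).

    Every pass rate must be > 0, otherwise the log is undefined.
    """
    return sum(math.log(p) for p in pass_rates) / len(pass_rates)

# Illustrative model checkpoints: every problem solved 1 in 1000 times,
# then 1 in 100, then 1 in 10. The problem count (4) is arbitrary.
levels = [0.001, 0.01, 0.1]
scores = [mean_log_pass_rate([p] * 4) for p in levels]

# On a linear scale these rates all look close to zero, but on the log
# scale each tenfold improvement is the same-sized step.
steps = [b - a for a, b in zip(scores, scores[1:])]
print(scores, steps)
```

Under these assumptions, each step in `steps` equals `log(10)`, so the curve shows steady progress long before the raw pass rate becomes noticeable.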
That's an empirical question.
So we'll see in practice what happens.
But what Jack Cole is doing is actually very unique.
It's not just pre-training an LLM and then prompting it.
He's actually trying to do active inference.
He's doing test time, right?
He's doing like test time fine-tuning.
And this is actually trying to lift one of the key limitations of the LLMs,
which is that at inference time, they cannot learn anything new.
They cannot adapt on the fly to what they're seeing.
And it's actually trying to learn.
So what he's doing is effectively a form of program synthesis.
Because the LLM contains a lot of useful building blocks,
like programming building blocks.
And by fine-tuning it on the task at test time,
you are trying to assemble these building blocks into the right
pattern that matches the task.
This is exactly what program synthesis is about.
And the way I would contrast this approach with discrete program search is that in
discrete program search, where you're trying to assemble a program from a set of primitives,
you have very few primitives.
So people working on discrete program search on ARC, for instance, tend to work
with DSLs that have like 100 to 200 primitive programs.
So a very small DSL, but then they're trying to
combine these primitives into very complex programs.
So there's a very deep depth of search.
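To make the idea of discrete program search concrete, here is a minimal sketch, assuming a toy DSL of three grid primitives. The primitive names, the `search` function, and the example grids are all hypothetical; real ARC solvers of this kind use far larger DSLs and much smarter search than this brute-force enumeration.

```python
from itertools import product

# Hypothetical toy DSL: each primitive maps a grid (list of rows) to a grid.
def flip_h(grid):
    return [row[::-1] for row in grid]

def flip_v(grid):
    return [list(row) for row in grid[::-1]]

def transpose(grid):
    return [list(row) for row in zip(*grid)]

PRIMITIVES = {"flip_h": flip_h, "flip_v": flip_v, "transpose": transpose}

def run(program, grid):
    """Apply a sequence of primitive names to a grid, left to right."""
    for name in program:
        grid = PRIMITIVES[name](grid)
    return grid

def search(examples, max_depth=3):
    """Return the shortest program consistent with every (input, output) pair."""
    for depth in range(1, max_depth + 1):
        # The candidate count grows as len(PRIMITIVES) ** depth.
        for program in product(PRIMITIVES, repeat=depth):
            if all(run(program, inp) == out for inp, out in examples):
                return program
    return None

# One demonstration pair: the output is the input rotated 90 degrees clockwise.
examples = [([[1, 2], [3, 4]], [[3, 1], [4, 2]])]
program = search(examples)
held_out = run(program, [[5, 6], [7, 8]])
print(program, held_out)
```

No single primitive explains the example, so the search has to go to depth 2, and the program it finds also rotates the held-out grid. With 100 to 200 primitives instead of three, the same enumeration grows very quickly with depth, which is the trade-off described here.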
And on the other hand, if you look at what Jack Cole is doing with LLMs,
is that he's got this sort of like vector program database DSL of millions of building blocks
in the LLM that are mined by pre-training the LLM,
not just on a ton of programming problems,
but also on millions of generated ARC-like tasks.
So you have an extraordinarily large DSL,
and then the fine-tuning is very, very shallow recombination of these primitives.
So discrete program search very deep recombination,
very small set of primitive programs,
and the LLM approach is the same,
but on the complete opposite end of that spectrum,
where you scale up the memorization by a massive factor,
and you're doing very, very shallow,
search. But they are the same thing. Just different ends of the spectrum. And I think where you're
going to get the most value for your compute cycles is going to be somewhere in between. You want
to leverage memorization to build up a richer, more useful bank of primitive programs. And you don't
want them to be hard-coded, like what we saw for the typical audience. You want them to be learned
from examples. But then you also want to do some degree of deep search. As long as you're only
doing a very shallow search, you are limited to local generalization. If you want to generalize further,
more broadly, this depth of search is going to be critical. I might argue that the reason that
he had to rely so heavily on the synthetic data was because he used a 240 million parameter model,
because the Kaggle competition at the time
required him to use a P100 GPU,
which has like a tenth or something of the FLOPs
of an H100.
And so obviously he can't use a bigger model.
If you believe that scaling
will solve this kind of reasoning,
then you can just rely on the generalization,
whereas if you're using a much smaller model, you can't.
For context for the listeners, by the way,
the frontier models today are literally
a thousand times bigger than that.
And so for your competition,
from what I remember,
the submission can't make any API calls, can't go online,
and has to run on an Nvidia Tesla T4.
P100.
Oh, is it P100?
Yeah.
Okay.
So again, it's like significantly less powerful.
There's a 12-hour runtime limit, basically.
There's a forcing function of efficiency in the eval.
But here's the thing.
You only have 100 test tasks.
So the amount of compute available for each task is actually quite a bit,
especially if you contrast that with the simplicity of each task.
So it would be seven minutes per task, basically.
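The per-task budget follows directly from the two numbers quoted above, a 12-hour runtime limit and 100 private test tasks. A small sketch of the arithmetic, with variable names chosen for illustration:

```python
# Figures as stated in the conversation.
RUNTIME_LIMIT_HOURS = 12
NUM_TEST_TASKS = 100

# Total minutes divided evenly across tasks.
minutes_per_task = RUNTIME_LIMIT_HOURS * 60 / NUM_TEST_TASKS
seconds_per_task = minutes_per_task * 60

print(f"{minutes_per_task:.1f} minutes ({seconds_per_task:.0f} seconds) per task")
```

This assumes the time is split evenly; a submission is free to spend more on some tasks and less on others as long as the total stays under the limit.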
Which, you know, people have tried to do these estimates of how many FLOPs a human brain has.
And you can take them with a grain of salt, but as a sort of anchor, it's basically the amount of FLOPs an H100 has.
And I guess maybe you would argue that, well, a human brain can solve this question faster than 7.2 minutes.
So even with a tenth of the compute, you should be able to do it in seven minutes.
Obviously, we have less memory here: there's, you know, like petabytes of fast-access memory in the brain,
versus the 29 or whatever gigabytes in this H100.
Anyway, I guess the broader question I'm asking is,
I wish there were a way to also test this prize
with some sort of scaffolding on the biggest models
as a way to test whether scaling is the path to get to,
you know, solving ARC.
Absolutely.
So in the context of the competition,
we want to see how much progress we can make
with limited resources.
But you are entirely right that it's a super interesting
open question: what could the biggest model out there actually do on ARC?
So we want to actually also make available a private, sort of one-off track where you
can submit to us a VM.
And so you can put on it any model you want.
You can take one of the largest open source models out there, fine-tune it, do whatever
you want.
And just give us an image.
And then we run it on an H100 for like 24 hours or something and you see what you get.
I think it's worth pointing out that there's two different test sets.
There is a public test set that's in the public GitHub repository
that anyone can use to train, you know, put it in an OpenAI API call,
whatever you'd like to do.
And then there's the private test set, which is the 100,
that is actually measuring the state of the art.
So I think it is pretty open and interesting
to have folks attempt to at least use the public test set and go try it.
Now, there is an asterisk on any score that's reported on against the public test set
because it is public.
It could have leaked into the training data somewhere.
This is actually what people are already doing.
You can already try to prompt one of the best models,
like the latest Gemini, the latest GPT-4,
with tasks from the public evaluation set.
And, you know, again, the problem is that these tasks
are available as JSON files on GitHub.
These models are also trained on GitHub.
So they're actually trained on these tasks.
And, yeah, that kind of creates uncertainty about
if they can actually solve some of the tasks,
is that because they memorize the answer or not?
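For readers who have not opened one of those JSON files: each public ARC task has a "train" list and a "test" list of input/output grid pairs, where a grid is a list of rows of small integers. The sketch below parses a made-up task in that shape and scores a solver by exact match. The task content, `score`, and `invert_solver` are invented for illustration and do not come from the real data set.

```python
import json

# A made-up task in the same shape as the public ARC JSON files.
TASK_JSON = """
{"train": [{"input": [[0, 1], [1, 0]], "output": [[1, 0], [0, 1]]}],
 "test":  [{"input": [[1, 1], [0, 0]], "output": [[0, 0], [1, 1]]}]}
"""

def score(task, solver):
    """Fraction of test pairs where the predicted grid matches exactly."""
    tests = task["test"]
    correct = sum(
        solver(task["train"], pair["input"]) == pair["output"] for pair in tests
    )
    return correct / len(tests)

def invert_solver(train_pairs, grid):
    """Hypothetical solver: swap 0s and 1s. It ignores the train pairs."""
    return [[1 - cell for cell in row] for row in grid]

task = json.loads(TASK_JSON)
accuracy = score(task, invert_solver)
print(accuracy)
```

Because scoring is exact match on the whole output grid, a model that has memorized a public task gets full credit for it, which is why scores on the public set carry the asterisk discussed here.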
You know, maybe you would be better off trying to create your own private,
ARC-like, very novel test set.
Don't make the tasks difficult.
Don't make them complex, make them very obvious for humans.
But make sure to make them original as much as possible, make them unique, different.
And see how well your GPT-4 and so on, or GPT-5, does on them.
Well, there have been tests on whether these models are being overtrained on these benchmarks.
Scale recently did this with GSM8K.
It was really interesting.
They basically replicated the benchmark
but with different questions.
And so some of the models actually were extremely overfit
on the benchmark, like Mistral and so forth.
But the frontier models,
Claude and GPT, actually did as well
on their novel benchmark as they did on the specific questions
that were in the existing public benchmark.
So I would be relatively optimistic about them
just sort of training on the JSON.
I was joking with Mike that you should allow API access but sort of keep an even more private
validation set of these ARC questions.
And so allow API access, so people can sort of play with GPT-4 scaffolding to enter into this
contest.
And then maybe later on you run the validation set on the API.
And if it performs worse than the test set that you allowed the API access to originally,
that means that OpenAI is training on your API calls, and you, like, go public with this
and show them like, oh my God, they've, you know, they've like leaked your data.
We do want to make, we want to evolve the ARC data set.
Like, that is, that is a goal that we want to do.
I think, Francois, you mentioned, you know, it's not perfect.
Yeah, no, ARC is not a perfect benchmark.
I mean, I made it like four years ago, over four years ago, almost five now.
This was in a time before LLMs.
And I think we've learned a lot, actually, since then about what potential flaws
there might be. I think there is some redundancy in the set of tasks,
which is, of course, against the goals
of the benchmark. Every task is supposed to be unique. In practice, that's not quite true.
I think also every task is supposed to be very novel, but in practice, they might not be.
They might be structurally similar to something that you might find online somewhere.
So we want to keep iterating and release an ARC 2 version later this year.
And I think when we do that, we're going to want to make the old private test set available.
So maybe we won't be releasing it publicly, but what we could do,
is just create a test server
where you can query, get a task,
you submit a solution,
and of course you can use
whatever Frontier model you want there.
So that way, because you actually have to query this API,
you're making sure that no one is going to, by accident,
train on this data.
It's unlike the current public ARC data,
which is literally on GitHub.
So there's no question about whether the models
are actually trained on it.
Yes, they are because they're trained on GitHub.
So by sort of like gating access
to querying this API, we would avoid this issue.
And then we would see, you know, for people
who actually want to try whatever technique they have in mind
using whatever resources they want,
that would be a way for them to get an answer.
I wonder what might happen.
I'm not sure.
One answer is that they've come up with a whole new algorithm
for AI with some explicit program synthesis
that now we're on a new track.
And another is they did something hacky
with the existing models in a way that actually
is valid, which reveals that maybe intelligence is more about getting things to the right
part of the distribution, but then it can reason.
And in that world, I guess that will be interesting.
And maybe that'll indicate that, you know, you had to do something hacky with current models.
As they get better, you won't have to do something hacky.
I'm also going to be very curious to see how these multimodal models, if they will perform
natively much better at ARC-like tests.
If ARC survives three months from here, we'll up the prize.
I think we're about to make a really important moment of
contact with reality by blowing up the prize, putting a much bigger prize pool against it.
We're going to learn really quickly if there's low-hanging fruit of ideas.
Again, I think new ideas are needed.
I think anyone listening to this might have the idea in their head.
And I'd encourage everyone to give it a try.
And I think as time goes on, that adds strength to the argument that we've sort of
stalled out in progress and that new ideas are necessary to beat ARC.
Yeah.
That's the point of having a money prize: you attract more people.
You get them to try to solve it.
And if there's an easy way to hack the benchmark that reveals that the benchmark is flawed, then you're going to know about it.
In fact, that was the point of the original Kaggle competition back in 2020 for ARC.
I was running this competition because I had released this data set and I wanted to know if it was hackable, if you could cheat.
So there was a small money prize at the time.
It was like 20K.
And this was right around the same time as GPT-3 was released.
So people of course tried GPT-3 on the public data.
It scored zero. But I think what the first contest told us is that there is no
obvious shortcut. Right. And well, now there's more money. There's going to be more people
looking into it. Well, we're going to find out. We're going to see if the benchmark is going to
survive. And you know, if we end up with a solution that is not, like, trying to brute force
the space of possible
ARC tasks, that's just trained on core knowledge,
I don't think it's necessarily going to be,
in and of itself, AGI,
but it's probably going to be a huge milestone
on the way to AGI.
Because what it represents
is the ability
to synthesize
a problem-solving
program from just
two or three examples.
And that alone is a new way
to program.
It's an entirely new paradigm for software development,
where you can start programming potentially quite complex programs
that will generalize very well.
And instead of programming them by coming up
with the shape of the program in your mind
and then typing it up, you're actually just
showing the computer what output you want
and you let the computer figure it out.
Let's talk a little bit
about what kinds of solutions might be possible here,
and which you would consider sort of
defeating the purpose of ARC and which are sort of valid.
Here's one I'll mention, which is my friends, Ryan and Buck,
stayed up last night because I told them about this,
and they were like, oh, of course LLMs can solve this.
Of course LLMs can solve this.
And then so they were trying to prompt, I think, Claude Opus on this.
And they say they got 25% on the public ARC test.
And what they did was have other examples of some of the ARC tests,
and in context explain the reasoning of why you went from one output to another output,
and then now you have the current problem.
And I think also maybe expressing the JSON in a way that is more amenable to the tokenizer.
And another thing was using the code interpreter.
So I'm curious actually if you think the code interpreter,
which keeps getting better as these models get smarter,
is just the program synthesis right there
because what they were able to do was the actual output of the cells,
the JSON output, they got through the code interpreter,
like, write the Python program that gets the right output here.
Do you think that the program synthesis kind of research
you're talking about will look like just using the code interpreter
in large language models?
I think whatever solution we see that will score well
is going to probably need to leverage some aspects
from deep learning models and the LLMs in particular.
We've shown already that LLMs can do quite well.
That's basically the Jack Cole approach.
We've also shown that
pure discrete program search from a small DSL does very, very well.
Before Jack Cole, this was the state of the art.
In fact, it's still extremely close to the state of the art.
And there's no deep learning involved at all in these models.
So we have two approaches that have basically no overlap that are doing quite well.
And they're very much at two opposite ends of one spectrum, where on one end you have
these extremely large banks of millions of vector programs, but very, very shallow recombination,
like simplistic recombination.
And on the other end, you have very simplistic DSLs, very simple, like 100 or 200 primitives,
but very deep, very sophisticated program search, the solution is going to be somewhere in between.
Right.
So the people who are going to be winning the ARC competition, and who are going to be making
the most progress towards near-term AGI, are going to be those that manage to merge the deep learning
paradigm and the discrete program search paradigm into one elegant way.
You know, you ask like, what would be legitimate and what would be cheating, for instance?
So I think you want to add a code interpreter to the system.
I think that's great.
That's sort of legitimate.
The part that would be cheating is try to anticipate what might be in the test, like brute force
the space of possible tasks and then train a memorization system on it and then rely on the
fact that you're generating so many tasks like millions and millions and millions that,
inevitably there's going to be some overlap between what you're generating and what's
in the test set.
I think that's defeating the purpose of the benchmark, because then you can just solve it
without needing to adapt, just by fetching a memorized solution.
So hopefully ARC will resist that, but you know, no benchmark is necessarily perfect.
So maybe there's a way to hack it.
And I guess we're going to get an answer very soon.
Well, I think some amount of fine tuning is valid because these models don't natively think
in terms of, especially the language models alone, which are the open source models that they
would have to use to compete here.
They're, like, natively language,
so they'd need to be able to think in
this kind of, um, yes, the ARC-type way.
You want to input core knowledge,
like ARC-like core knowledge, into the model,
but surely you don't need tens of millions of tasks
to do this. Like, core knowledge is extremely basic.
If you look at some of these ARC-type questions,
I actually do think they rely a little bit
on things I have seen throughout my life.
And for the same, like for example,
like something bounces off a wall and comes back
and you see that pattern.
It's like I played arcade games and I've seen like Pong or something.
And I think for example, when you see the Flynn effect
and people's intelligence as measured on
Raven's Progressive Matrices increasing
on these kinds of questions, it's probably a similar story,
where now, since childhood,
we actually see these sorts of patterns in TV
and whatever, spatial patterns.
And so I don't think this is sort of core knowledge.
I think actually this is also part of the quote-unquote fine-tuning
that humans have as they grow up, of seeing different kinds of spatial patterns and trying to pattern match to them.
I would definitely file that under core knowledge. Like, core knowledge includes basic physics, for instance, bouncing or trajectories.
That would be included. But yeah, I think you're entirely right. The reason why, as a human, you're able to quickly figure out the solution is because you have this set of building blocks, this set of patterns in your mind that you can recombine.
Is core knowledge required to attain intelligence, any algorithm you have? Does the core knowledge have to be,
in some sense hard-coded or can even the core knowledge be learned through intelligence?
Core knowledge can be learned and I think in the case of humans some amount of
core knowledge is something that you're born with like we're actually born with a small amount
of knowledge about the world we're going to live in. We're not blank slates but most core
knowledge is acquired through experience but the thing with core knowledge is that it's not
going to be acquired like for instance in school it's actually acquired very very early in the first
like three to four years of your life. And by age four, you have all the core knowledge you're
going to need as an adult. Okay. Interesting. So, I mean, on the prize itself, I'm super excited to see
both the open source versions, maybe with a Llama 70B or something, and what people can score
in the competition itself. And then, to sort of test specifically the scaling hypothesis, I'm very
curious to see if you can prompt on the public version of ARC, which I guess you won't
be able to submit to this competition itself. But I'd be very curious to see
if people can sort of crack that and get ARC working there, and if that would update your
views on AGI. It's really going to be motivating. We're going to keep running the contest until
somebody puts a reproducible open source version into the public domain. So even if somebody privately
beats the eval, beats the ARC eval, we're going to still keep the prize money until someone can
reproduce it and put the public reproducible version out there. Yeah, exactly. Like the goal is
to accelerate progress towards AGI. And a key part of that is that any sort of meaningful bits of
progress needs to be shared, needs to be public. So everyone can know about it and can try to
iterate on it. If there's no sharing, there's no progress. What I'm especially curious about
is sort of disaggregating the bets of like, can we make an open version of this versus,
is this a thing that's just possible with scaling? And we can, I guess, test both of them
based on the public and the private version. We're making contact with reality as well with this, right?
We're going to learn a lot, I think, about what the actual limits of the compute were. If someone
showed up and said, hey, here's a closed source model that, like, I'm getting 50 plus percent on.
I think that would probably update us on like, okay, perhaps we should increase the amount
of compute that we give on the private test set in order to balance.
Some of the decisions initially are somewhat arbitrary in order to learn about, okay, what do people
want?
What does progress look like?
And I think both of us are sort of committed to evolving it over time in order to be the best,
or the closest to perfect as we can get it.
Awesome.
And where can people go to learn more about the prize and maybe give their hand at it?
arcprize.org.
Which goes live today.
So, live now.
One million dollars is on the line, people.
Good luck.
Thank you guys for coming on the podcast.
It's super fun to go through all the crux on intelligence and get a different perspective
and also to announce the prize here.
So this is awesome.
Thank you for helping break the news.
Thank you for having us.
Thank you for having us.
