Dwarkesh Podcast - Francois Chollet — Why the biggest AI models can't solve simple puzzles

Starting point is 00:00:00 Okay, today I have the pleasure to speak with Francois Chollet, who is an AI researcher at Google and creator of Keras. And he's launching a prize in collaboration with Mike Knoop, the co-founder of Zapier, who we'll also be talking to in a second, a million-dollar prize to solve the ARC benchmark that he created. So first question, what is the ARC benchmark and why do you even need this prize? Why won't the biggest LLM we have in a year be able to just saturate it? Sure. So, ARC is intended as a kind of IQ test for machine intelligence. And what makes it different from most LLM benchmarks out there is that it's designed to be resistant to memorization. So if you look at the way LLMs work, they're basically this big interpolative memory. And the way you scale up their capabilities is by trying to cram as

Starting point is 00:00:48 much knowledge and patterns as possible into them. And by contrast, ARC does not require a lot of knowledge at all. It's designed to only require what's known as core knowledge, which is basic knowledge about things like elementary physics, objectness, counting, that sort of thing. The sort of knowledge that any four-year-old or five-year-old possesses, right? But what's interesting is that each puzzle in ARC is novel. It's something that you've probably not encountered before, even if you've memorized the entire internet. And that's what

Starting point is 00:01:28 makes ARC challenging for LLMs. And so far LLMs have not been doing very well on it. In fact, the approaches that are working well are more towards discrete program search, program synthesis. So first of all, I'll make a comment that

Starting point is 00:01:44 I'm glad that, as a skeptic of LLMs, you have put out a benchmark yourself. Is it accurate to say that, suppose the biggest model we have in a year is able to get 80% on this, then your view would be that we are on track to AGI with LLMs? How would you think about that?

Starting point is 00:02:01 Right. I'm pretty skeptical that we're going to see an LLM do 80% in a year. That said, if we do see it, you would also have to look at how this was achieved. If you just train the model on millions or billions of puzzles similar to ARC, so that you're relying on the ability to have some overlap between the tasks that you train on and the tasks that you're going to see at test time, then you're still using memory, right? And maybe it can work. You know, hopefully ARC is going to be good enough that it's going to be resistant to this sort of attempt at brute forcing, but, you know, you never know.

Starting point is 00:02:39 Maybe it could happen. I'm not saying it's not going to happen. ARC is not a perfect benchmark. Maybe it has flaws. Maybe it could be hacked in that way. So I guess I'm curious about what GPT-5 would have to do for you to be very confident that it's on the path to AGI. What would make me change my mind about LLMs is basically if I start seeing a critical mass of cases where you show the model something it has not seen before, a task that's actually novel from the perspective of its training data, something that's not in the training data, and it can actually adapt on the fly. And this is true for LLMs, but really this would catch my attention for any AI technique out there. If I can see the ability to

Starting point is 00:03:27 adapt to novelty on the fly, to pick up new skills efficiently, then I would be extremely interested. I would think this is on the path to AGI. So the advantage they have is that they do get to see everything. Maybe I'll take issue with how much they are relying on that, but obviously they're relying on that more than humans do. To the extent that they do have so much in distribution, to the extent that we have trouble distinguishing whether an example is in distribution or not, well, if they have everything in distribution, then they can do everything that we can do, maybe it's not in distribution for us. Why is it so crucial that it has to be out of distribution

Starting point is 00:04:06 for them? You know, why can't we just leverage the fact that they do get to see everything? Right. You're asking basically what's the difference between actual intelligence, which is the ability to adapt to things you've not been prepared for, and pure memorization, like reciting what you've seen before. And it's not just some semantic difference. The big difference is that you can never pre-train on everything that you might see at test time, right? Because the world changes all the time. So it's not just the fact that the space of possible tasks is infinite, and even if you're trained on millions of them, you've only seen zero percent of the total space.

Starting point is 00:04:46 It's also the fact that the world is changing every day, right? This is why the human species has developed intelligence in the first place. If there was such a thing as a distribution for the world, for the universe, for our lives, then we would not need intelligence at all. In fact, many creatures, many insects, for instance, do not have intelligence.

Starting point is 00:05:09 Instead, what they have in their connectome, in their genes, are hard-coded programs, behavioral programs that map some stimuli to appropriate responses. And they can actually navigate their lives, their environment, in a way that's very evolutionarily fit, without needing to learn anything. And well, if our environment was static enough, predictable enough,

Starting point is 00:05:35 what would have happened is that evolution would have found the perfect behavioral program, a hard-coded static behavioral program. We would have written it into our genes. We would have a hard-coded brain connectome, and that's what we would be running on. But no, that's not what happened. Instead, we have general intelligence. So we are born with extremely little knowledge about the world, but we are born with the ability to learn very efficiently and to adapt in the face of

Starting point is 00:06:01 things that we've never seen before. And that's what makes us unique. And that's what is really, really challenging to recreate in machines. I want to rabbit hole on that a little bit. But before I do that, maybe I'm going to overlay some examples of what an ARC-like challenge looks like for the YouTube audience, but for people listening on audio, can you just describe what a sample ARC challenge would look like? Sure. So one ARC puzzle, it looks kind of like an IQ test puzzle. You've got a number of demonstration input-output pairs. So one pair is made of two grids. So one grid shows you an input and the second grid shows you what you should produce as a response to that input. And you get a couple pairs like

Starting point is 00:06:46 this to demonstrate the nature of the task, to demonstrate what you're supposed to do with your inputs, and then you get a new test input. And your job is to produce the corresponding test output. You look at the demonstration pairs and from that you figure out what you're supposed to do, and you show that you've understood it on this new test pair. And importantly, the sort of knowledge basis that you need in order to approach these challenges is just core knowledge. And core knowledge is basically the knowledge of what makes an object, basic counting,

Starting point is 00:07:25 basic geometry, topology, symmetries, that sort of thing. So extremely basic knowledge. LLMs for sure possess such knowledge. Any child possesses such knowledge. And what's really interesting is that each puzzle is new. So it's not something that you're going to find elsewhere on the internet, for instance. And that means that whether it's as a human or as a machine, every puzzle, you have to approach it from scratch. You have to actually reason your way through it. You cannot just fetch the response from your memory.
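As an illustration of the puzzle format just described, here is a minimal sketch in Python. The task below is made up for this note (it is not a real ARC puzzle), the `train`/`test` and `input`/`output` keys loosely follow the layout of the public ARC task files, and the four grid primitives are hypothetical stand-ins. The brute-force loop is only a toy version of the "discrete program search" idea mentioned earlier, not any actual competition entry.

```python
from typing import Callable, Dict, List, Optional

Grid = List[List[int]]  # each cell holds one of 10 symbols, 0-9

# Hypothetical task: the hidden rule is "mirror the grid left-to-right".
task = {
    "train": [  # demonstration input/output pairs
        {"input": [[1, 0, 0], [2, 0, 0]], "output": [[0, 0, 1], [0, 0, 2]]},
        {"input": [[0, 3], [4, 0]], "output": [[3, 0], [0, 4]]},
    ],
    "test": [{"input": [[5, 0, 0, 0], [0, 6, 0, 0]]}],
}

def identity(g: Grid) -> Grid:
    return [row[:] for row in g]

def flip_lr(g: Grid) -> Grid:
    return [row[::-1] for row in g]

def flip_ud(g: Grid) -> Grid:
    return [row[:] for row in g[::-1]]

def transpose(g: Grid) -> Grid:
    return [list(col) for col in zip(*g)]

# A tiny, assumed vocabulary of candidate programs to search over.
PRIMITIVES: Dict[str, Callable[[Grid], Grid]] = {
    "identity": identity,
    "flip_lr": flip_lr,
    "flip_ud": flip_ud,
    "transpose": transpose,
}

def search(task: dict) -> Optional[str]:
    """Return the first primitive consistent with every demonstration pair."""
    for name, fn in PRIMITIVES.items():
        if all(fn(pair["input"]) == pair["output"] for pair in task["train"]):
            return name
    return None

found = search(task)
prediction = PRIMITIVES[found](task["test"][0]["input"]) if found else None
print(found, prediction)
```

The point of the sketch is the shape of the problem: the solver sees only a couple of demonstrations, has to pick (or build) a program that explains all of them, and is then scored on a fresh input.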

Starting point is 00:07:59 So the core knowledge, one contention here is we are only now getting multimodal models who, because of the data they are trained on, are trained to do spatial reasoning. Whereas obviously, not only humans, but for billions of years of evolution, our ancestors have had to learn how to understand abstract physical and spatial properties and recognize the patterns there. And so one view would be that in the next year, as we gain models that are multimodal native, where it isn't just a sort of second-class add-on but the multimodal capability is a priority, they will understand these kinds of patterns because that's something they'd see natively. Whereas right now, what the model sees on ARC is some JSON string of digits, and it's supposed to recognize a pattern there. And even if you showed a human such a,

Starting point is 00:08:53 like, just a sequence of these kinds of numbers, they would have a challenge making sense of what kind of question you're asking them. So why wouldn't it be the case that as soon as we get multimodal models, which we're on the path to unlock right now, they're going to be so much better at ARC-type spatial reasoning? That's an empirical question, so I guess we're going to see the answer within a few months.

Starting point is 00:09:11 But my answer to that is, you know, ARC grids, they're just discrete 2D grids of symbols. They're pretty small. It's not like images: if you flatten an image as a sequence of pixels, for instance, then you get something that's actually very, very difficult to parse. But that's not true for ARC, because the grids are very small and you only have 10 possible symbols. So these are 2D grids that are actually very easy to flatten

Starting point is 00:09:36 as sequences. And transformers, LLMs, they are very good at processing sequences. In fact, you can show that LLMs do fine with processing ARC-like data by simply fine-tuning an LLM on some subsets of the tasks and then trying to test it on small variations of these tasks. And you see that, yeah, the LLM can encode just fine solution programs for tasks that it has seen before. So it does not really have a problem parsing the input or figuring out the program.
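To make the flattening point concrete, here is a small sketch of one way to turn a grid of 10 symbols into a short text sequence and back. The row-per-line encoding and the function names are assumptions chosen for illustration; the transcript does not say which serialization any particular model was given.

```python
from typing import List

Grid = List[List[int]]

def grid_to_text(grid: Grid) -> str:
    """Serialize a grid of digits 0-9 as one line of characters per row."""
    return "\n".join("".join(str(cell) for cell in row) for row in grid)

def text_to_grid(text: str) -> Grid:
    """Inverse of grid_to_text: parse each line back into a row of ints."""
    return [[int(ch) for ch in line] for line in text.split("\n")]

example = [[0, 1, 0], [1, 0, 0]]
encoded = grid_to_text(example)
print(encoded)
```

Because each cell is a single digit and the grids are small, the encoding is lossless and only a few dozen characters long, which is the contrast being drawn with flattening a full image into pixels.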

Starting point is 00:10:13 The reason why LLMs don't do well on ARC is really just the unfamiliarity aspect, the fact that each new task is different from every other task. You cannot, basically, you cannot memorize the solution programs in advance. You have to synthesize a new solution program on the fly for each new task. And that's really what they're struggling with. So before I play more devil's advocate, I just want to step back and explain why I'm especially interested in having this conversation. And obviously, the million-dollar ARC Prize, I'm excited to actually play with it myself

Starting point is 00:10:47 and hopefully, the Vesuvius Challenge, which was Nat Friedman's prize for decoding the scrolls that were buried by the volcano in the Herculaneum library, the winner of that was a 22-year-old who was listening to the podcast, Luke Farritor.

Starting point is 00:11:05 So hopefully somebody listening will find this challenge intriguing and find a solution. And the reason is, I've had on recently a lot of people who are bullish on LLMs, and I've had discussions with them before interviewing you about how we explain the fact that LLMs don't seem to be natively performing that well on ARC. And I found their explanations somewhat contrived, and I'll try out some of the reasons on you. But it is actually an intriguing fact that some of these problems are relatively straightforward for humans to understand.

Starting point is 00:11:38 And they do struggle with them if you just input them natively. All of them are very easy for humans. Like, any smart human should be able to do 90%, 95% on ARC. Smart human. A smart human. But even a five-year-old, so with very, very little knowledge, they could definitely do over 50%. So let's talk about that because you, I agree that smart humans will do very well on

Starting point is 00:12:03 this test. But the average human will probably do, you know, mediocre. Not really. So we actually tried with average humans. They score about 85. That was with Amazon Mechanical Turk workers, right? I honestly don't know the demographic profile of Amazon Mechanical Turk workers, but I imagine just interacting with the platform that Amazon has set up to do remote work.

Starting point is 00:12:25 That's not the median human across the planet, I'm guessing. I mean, the broader point here being that, so we see the spectrum in humans where humans obviously have AGI. But even within humans, you see a spectrum where some people are relatively dumber and they'll perform worse on IQ-like tests. For example, Raven's Progressive Matrices: if you look at how the average person performs on that, and you look at the kind of questions that are sort of hit or miss, half of people will get them right, half of people will get them wrong.

Starting point is 00:12:52 Some of them are pretty trivial. For us, we might think like this is kind of trivial. And so humans have AGI, but from relatively small tweaks, you can go from somebody who misses these kinds of basic IQ test questions to somebody who gets them all right, which suggests that actually if these models are doing natively, we'll talk about some of the previous performances that people have tried with these models,

Starting point is 00:13:13 but somebody like Jack Cole, with a 240 million parameter model, got 35%. Doesn't that suggest that they're on this spectrum that clearly exists within humans and they're going to saturate it pretty soon? Yeah, so there's a bunch of interesting points here. So there is indeed a branch of LLM approaches spearheaded by Jack Cole that are doing quite well, that are in fact state of the art. But you have to look at what's going on there. So there are two things.

Starting point is 00:13:42 The first thing is that to get these numbers, you need to pre-train your LLM on millions of generated ARC tasks. And of course, if you compare that to a five-year-old child looking at ARC for the first time, the child has never done an IQ test before, has never seen something like an ARC task before. The only overlap between what they know and what they have to do in the test is core knowledge, is knowing about like counting and objects and symmetries and things like that. And still, they're going to do really well.

Starting point is 00:14:11 And they're going to do much better than the LLM trained on millions of similar tasks. And the second thing that's worth noting about the Jack Cole approach is that one thing that's really critical to making the model work at all is test-time fine-tuning. And that's something that's really missing, by the way, from LLM approaches right now: most of the time when you're using an LLM, it's just doing static inference. The model is frozen and you're just prompting it and then you're getting an answer. So the model is not actually learning anything on the fly. Its state is not adapting to the task at hand. And what Jack Cole is actually doing is that

Starting point is 00:14:54 for every test problem, he is, on the fly, fine-tuning a version of the LLM for that task. And that's really what's unlocking performance. If you don't do that, you get like 1%, 2%. So basically something completely negligible. And if you do test-time fine-tuning and you add a bunch of tricks on top, then you end up with interesting performance numbers. So I think what he's doing is trying to address one of the key limitations of LLMs today, which is the lack of active inference. It's actually adding active inference to LLMs. And that's working extremely well, actually. So that's fascinating to me. There's so many interesting rabbit holes there. Should I take them in sequence or deal with

Starting point is 00:15:35 them all at once? Let me just start. So the point you made about the fact that you need to unlock the adaptive compute slash test-time compute: a lot of the scale maximalists, and I think this will be an interesting rabbit hole to explore with you, because a lot of the scaling maximalists share your broader perspective, in the sense that they think that in addition to scaling, you need these kinds of things, like unlocking adaptive compute or doing some sort of RL to get System 2 working. And their perspective is that this is a relatively straightforward thing that will be added atop the representations that a scaled-up model has greater access to.
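The contrast drawn above between static inference and test-time fine-tuning can be sketched with a deliberately tiny toy. This is not Jack Cole's method and not an LLM: the "model" is a single weight `w` in `y = w * x`, and the names `predict` and `fine_tune`, the learning rate, and the step count are all assumptions made up for this note. It only shows the mechanic of adapting parameters on a task's demonstration pairs before answering that task's test input.

```python
from typing import List, Tuple

def predict(w: float, x: float) -> float:
    """The toy 'model': a single multiplicative weight."""
    return w * x

def fine_tune(w: float, demos: List[Tuple[float, float]],
              lr: float = 0.05, steps: int = 200) -> float:
    """Gradient descent on mean squared error over the demonstration pairs."""
    for _ in range(steps):
        grad = sum(2 * (predict(w, x) - y) * x for x, y in demos) / len(demos)
        w -= lr * grad
    return w

base_w = 1.0                        # "pre-trained" weight, frozen at deployment
demos = [(1.0, 3.0), (2.0, 6.0)]    # this task's hidden rule is y = 3x
test_x = 4.0

frozen_answer = predict(base_w, test_x)        # static inference
adapted_w = fine_tune(base_w, demos)           # adapt on this task's demos only
adapted_answer = predict(adapted_w, test_x)    # inference after adaptation

print(frozen_answer, round(adapted_answer, 6))
```

In the frozen case the model answers with whatever it already encoded; in the adapted case its state changes per task, which is the property the discussion above says is usually missing when an LLM is simply prompted.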

Starting point is 00:16:14 No, it's not just a technical detail. It's not a straightforward thing. It is everything. It is the important part. And the scale maximalist argument, really it boils down to, you know, these people, they refer to scaling laws, which is this empirical relationship that you can draw between how much compute you spend on training a model and the performance

Starting point is 00:16:37 you're getting on benchmarks, right? And the key question here, of course, is, well, how do you measure performance? What is it that you're actually improving by adding more compute and more data? And, well, it's benchmark performance, right? And the thing is, the way you measure

Starting point is 00:16:54 performance is not a technical detail. It's not an afterthought, because it's going to narrow down the sort of questions that you're asking. And so accordingly, it's going to narrow down the sort of answers that you're looking for. If you look at the benchmarks

Starting point is 00:17:11 we're using for LLMs. They're all memorization-based benchmarks. Like sometimes they are literally just knowledge-based, like a school test. And even if you look at the ones that are, you know, explicitly about reasoning, you realize if you look closely that it's, in order to solve them,

Starting point is 00:17:29 it's enough to memorize a finite set of reasoning patterns. And then you just reapply them. They're like static programs. LLMs are very good at memorizing static programs, small static programs. And they've got this sort of bank of solution programs. And when you give them a new puzzle, they can just fetch the appropriate program and apply it.

Starting point is 00:17:53 And it's looking like it's reasoning. But really, it's not doing any sort of on-the-fly program synthesis. All it's doing is program fetching. So you can actually solve all these benchmarks with memorization. And so what you're scaling up here, like if you look at the models, they are big parametric curves fitted to the data distribution via gradient descent. So they are basically these big interpolative databases, interpolative memories. And of course, if you scale up the size of your database and you cram into it more knowledge,

Starting point is 00:18:27 more patterns and so on, you are going to be increasing its performance as measured by a memorization benchmark. That's kind of obvious. But as you're doing it, you are not increasing the intelligence of the system one bit. You are increasing the skill of the system. You are increasing its usefulness, its scope of applicability, but not its intelligence, because skill is not intelligence. And that's the fundamental confusion that people run into is that they're confusing skill and intelligence.

Starting point is 00:19:00 Yeah, there's a lot of fascinating things to talk about here. So skill, intelligence, interpolation. I mean, okay, so the thing about them fitting some manifold that maps the input data: there's a reductionist way to talk about what happens in the human brain that says that it's just axons firing at each other. But we don't care about the reductionist explanation

Starting point is 00:19:22 of what's happening. We care about what happens at the macroscopic level when these things combine. As far as the interpolation goes, so okay, let's look at one of the benchmarks here. There's one benchmark that does grade school math. And these are problems that, like, a smart high schooler would be able to solve. It's called GSM8K.

Starting point is 00:19:46 And these models get 95% on these. Like, basically, they always nail it. That's a memorization benchmark. Okay, let's talk about what that means. So here's one question from that benchmark. So 30 students are in a class. One fifth of them are 12-year-olds. One third are 13-year-olds.

Starting point is 00:19:59 One tenth are 11-year-olds. How many of them are not 11, 12, or 13 years old? So I agree, like, this is not rocket science, right? You can write down on paper how you go through this problem, and a high school kid, at least a smart high school kid, should be able to solve it. Now, when you say memorization,

Starting point is 00:20:15 it still has to reason through how to think about fractions and what is the context of the whole problem, and then combine the different calculations that it's doing. It depends how you want to define reasoning. But there are two definitions you can use. So one is, I have available a set of program templates. It's like the structure of the puzzle,

Starting point is 00:20:36 which can also generate its solution. And I'm just going to identify the right template, which is in my memory. I'm going to input the new values into the template, run the program, get the solution. And you could say this is reasoning. And I say, yeah, sure, okay. But another definition you can use is that reasoning is the ability, when you're faced with a puzzle, given that you don't already have a program in memory to solve it, to synthesize on the fly a new program based on bits and pieces of existing programs that you have.
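The first definition above, fetching a stored template and plugging in new values, can be sketched using the class-size question quoted a moment ago. The numbers come from the problem as read out in the conversation; the function name and its parameters are hypothetical, chosen only to show what a reusable "program template" might look like.

```python
from fractions import Fraction
from typing import List

def students_outside_groups(total: int, shares: List[Fraction]) -> Fraction:
    """Template: subtract each fractional group from the total head count."""
    in_groups = sum(total * share for share in shares)
    return total - in_groups

# 30 students: 1/5 are 12, 1/3 are 13, 1/10 are 11.
answer = students_outside_groups(
    30, [Fraction(1, 5), Fraction(1, 3), Fraction(1, 10)]
)
print(answer)  # 6 + 10 + 3 = 19 students are in the groups, leaving 11
```

Swapping in different totals or fractions reuses the same template unchanged. The second definition in the passage is about the case where no such template is already stored and a new one has to be assembled on the fly.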

Starting point is 00:21:08 You have to do on-the-fly program synthesis. And it's actually dramatically harder than just fetching the right memorized program and reapplying it. So I think maybe we are overestimating the extent to which humans are so sample efficient that they also don't need training in this way where they have to drill in these kinds of pathways

Starting point is 00:21:31 of reasoning through certain kinds of problems. So let's take math, for example. Yeah. It's not like you can just show a baby the axioms of set theory. And now they know math, right? So when they're growing up, you had to do years of teaching them pre-algebra. Then you've got to do a year of teaching them doing drills and going through the same kind of problem in algebra,

Starting point is 00:21:48 then geometry, pre-calculus, calculus. Absolutely. So training? Yeah, but isn't that like the same kind of thing where you can't just see one example and now you have the program or whatever? You actually had to drill it. These models also had to drill with a bunch of pre-training data. Sure. I mean, in order to do on-the-fly program synthesis, you actually need building blocks to work from.

Starting point is 00:22:07 So knowledge and memory are actually tremendously important in the process. I'm not saying it's memory versus reasoning. In order to do effective reasoning, you need memory. But it sounds like it's compatible with your story that through seeing a lot of different kinds of examples, these things can learn to reason within the context of those examples. And we can also see that within bigger and bigger models. So that was an example of a high school level math problem. Let's say a model that's smaller than GPT-3 couldn't do that at all.

Starting point is 00:22:39 As these models get bigger, they seem to be able to pick up bigger and bigger. It's not really a size issue. It's more like a training data issue in this case. Well, bigger models can pick up these kinds of circuits, which smaller models apparently don't do a good job of doing, even if you were to train them on this kind of data. Doesn't that just suggest that as you have bigger and bigger models, they can pick up bigger and bigger pathways or more general ways of reasoning? Absolutely.

Starting point is 00:23:01 But then isn't that intelligence? No, no, it's not. If you scale up your database and you keep adding to it more knowledge, more program templates, then sure, it becomes more and more skillful. You can apply it to more and more tasks. But general intelligence is not task-specific skill scaled up to many skills. Because there is an infinite space of possible skills. General intelligence is the ability to approach any problem, any skill, and very quickly master it using very little data.

Starting point is 00:23:29 Because this is what makes you able to face anything. you might ever encounter. This is what makes, this is the definition of generality. Like, generality is not specificity scaled up. It is the ability to apply your mind to anything at all, to arbitrary things.

Starting point is 00:23:47 And this requires, fundamentally, this requires the ability to adapt, to learn on the fly efficiently. So my claim is that by doing this pre-training on bigger and bigger models, you are gaining that capacity to then generalize very efficiently. Let me give you an example.

Starting point is 00:24:04 So your own company, Google, in their paper on Gemini 1.5, they had this very interesting example where they would give, in context, they would give the model the grammar book and the dictionary of a language that has less than 200 living speakers. So it's not in the pre-training data. And you just give them the dictionary. And it basically is able to speak this language and translate to it, including the complex

Starting point is 00:24:36 and organic ways in which languages are structured. So a human, if you showed me a dictionary from English to Spanish, I'm not going to be able to pick up how to structure sentences and how to say things in Spanish. The fact that, because of the representations that it has gained through this pre-training, it is able to now extremely efficiently learn a new language: doesn't that show that this kind of pre-training actually does increase your ability to learn new tasks?

Starting point is 00:24:57 If you were right, LLMs would do really well on ARC puzzles, because ARC puzzles are not complex. Each one of them requires very little knowledge. Each one of them is very low on complexity. You don't need to think very hard about it. They're actually extremely obvious for humans, like even children can do them. But LLMs cannot,

Starting point is 00:25:16 even LLMs that have, you know, 100,000 times more knowledge than you do. They still cannot. And the only thing that makes ARC special is that it was designed with this intent to resist memorization. This is the only thing. And this is the huge

Starting point is 00:25:31 blocker for LLM performance. Right. And so, you know, I think if you look at LLMs closely, it's pretty obvious that they're not really synthesizing new programs on the fly to solve the tasks that they're faced with. They're very much reapplying things that they've stored in memory. For instance, one thing that's very striking is LLMs can solve a Caesar cipher,

Starting point is 00:25:58 you know, like a Caesar cipher, like transposing letters to code a message. And well, that's a very complex algorithm, right? But it comes up quite a bit on the internet. So they've basically memorized it. And what's really interesting is that they can do it for a transposition length of like three or five, because those are very, very common numbers in examples

Starting point is 00:26:21 provided on the internet. But if you try to do it with an arbitrary number, like nine, it's going to fail. Because it does not encode the generalized form of the algorithm, but only specific cases. It has memorized specific cases of the algorithm. And if it could actually synthesize on the

Starting point is 00:26:38 fly, the solver algorithm, then the value of N would not matter at all, because it does not increase the problem complexity. I think this is true of humans as well, where what was the study? Humans use memorization and pattern matching all the time, of course, but humans

Starting point is 00:26:54 are not limited to memorization and pattern matching. They have this very unique ability to adapt to new situations on the fly. This is exactly what enables you to navigate every new day in your life. I'm forgetting the details, but there was some study that chess grandmasters will perform very well within the context of the moves that... Excellent example, because chess at the highest level is all about memorization. Chess is a memorization game. Okay, sure. We can leave that aside. What is your explanation for the original question of why, in context, the GPT one,

Starting point is 00:27:26 sorry, Gemini 1.5 was able to learn a language, including the complex grammar structure. Doesn't that show that they can pick up new knowledge? I would assume that it has simply mined from its extremely extensive, unimaginably vast training data. It has mined the required template and then it's just reusing it. We know that LLMs have a very poor ability to synthesize new program templates like this on the fly or even adapt existing ones. They're very much limited to fetching. Suppose there's a programmer at Google. They go into the office in the morning. At what point are they doing something that 100% cannot be due to fetching some template that

Starting point is 00:28:04 could, even if they, suppose they were an LLM, they could not do if they had fetched some template from their program. At what point do they have to use this so-called extreme generalization capability? Forget about Google software developers. Every human, every day of their lives, is full of novel things that they've not been prepared for. You cannot navigate your life based on memorization alone. It's impossible.
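Returning to the Caesar cipher example from a little earlier: a minimal sketch of the generalized form of the algorithm makes the point about the shift value concrete. The function name `caesar` and the choice to leave non-letters unchanged are assumptions of this sketch; the claim about which shifts LLMs handle comes from the conversation, not from this code.

```python
import string

def caesar(text: str, shift: int) -> str:
    """Shift each ASCII letter by `shift` positions, wrapping around the alphabet."""
    out = []
    for ch in text:
        if ch in string.ascii_lowercase:
            out.append(chr((ord(ch) - ord("a") + shift) % 26 + ord("a")))
        elif ch in string.ascii_uppercase:
            out.append(chr((ord(ch) - ord("A") + shift) % 26 + ord("A")))
        else:
            out.append(ch)  # spaces, digits, and punctuation pass through
    return "".join(out)

# The same code path handles a common shift (3) and an uncommon one (9).
print(caesar("abc", 3))
print(caesar("xyz", 9))
print(caesar(caesar("Hello, World", 9), -9))
```

In this form, the shift is just a parameter, so a solver that had the general algorithm would find 9 no harder than 3. That is the sense in which the value of N "would not matter at all."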

Starting point is 00:28:24 I'm sort of denying the premise that you are also agreed they're not doing, like, quote, memorization. It seems like you're saying they're less capable of generalization, but I'm just curious about the kind of generalization they do. If you get into the office and you try to do this kind of generalization, you're going to fail at your job. What is the first point, you're a programmer, what is the first point when you try to do that generalization? You would lose your job because you can't do the extreme generalization. I don't have any specific examples, but literally, like, take this situation, for instance. You've never been here in this room.

Starting point is 00:28:59 Maybe you've been in this city a few times. I don't know, but there's a fair amount of novelty. You've never been interviewing me. There's a fair amount of novelty in every hour of every day in your life. It's in fact, by and large, more novelty than any LLM could handle. Like, if you just put an LLM in a robot, it could not be doing all the things that you've been doing today. Or take, like, self-driving cars, for instance. You take a self-driving car operating in the Bay Area, do you think you could just

Starting point is 00:29:32 drop it in New York City or drop it in London, where people drive on the left? No, it's going to fail. So not only can you not make it generalize to a change of rules, of driving rules, but you cannot even make it generalize to a new city. It needs to be trained on each specific environment. I mean, I agree that self-driving cars aren't AGI. But it's the same type of model. They are transformers as well. I mean, I don't know. Apes also have brains with neurons in them, but they're less intelligent because they're small.

Starting point is 00:30:05 It's not the same architecture. We can get into that. So I still don't understand a concrete thing of, we also need training. That's why education exists. That's why we had to spend the first 18 years of our life doing drills. We have a memory, but we are not a memory. We are not limited to just a memory. But I'm denying the premise that that's necessarily the only thing these models are doing.

Starting point is 00:30:27 And I'm still not sure what is the task that a remote worker would be doing. Like, suppose you just swap out a remote worker with an LLM and they're a programmer. What is the first point at which you realize this is not a human, this is an LLM? What about I just send them an ARC puzzle and see how they do? No, like part of their job, you know? But you have to deal with novelty all the time. Okay, so is there a world in which all the programmers are replaced, and then we're still saying,

Starting point is 00:30:54 but they're only doing memorization-laden programming tasks, but they're still producing a trillion dollars' worth of, you know, output in the form of code. Software development is actually a pretty good example of a job where you're dealing with novelty all the time. Or if you're not, well, I'm not sure what you're doing. So

Starting point is 00:31:10 I personally use generative AI very little in my software development job. And before LLMs were a thing, I was also using Stack Overflow very little. You know, some people are just copy-pasting stuff from Stack Overflow, or nowadays copy-pasting stuff from an LLM.

Starting point is 00:31:28 Personally, I try to focus on problem solving. The syntax is just a technical detail. What's really important is the problem solving. Like the essence of programming is engineering mental models, like mental representations of the problem you're trying to solve. But, you know, people can interact with these systems themselves, and you can go to ChatGPT and say,

Starting point is 00:31:52 here's a specification of the kind of program I want, they'll build it for you. As long as there are many examples of this program on, like, GitHub and Stack Overflow and so on, sure, they will fetch the program for you from their memory. But you can change arbitrary details. No, it doesn't work. I need it to work on this different kind of server. If that were true, there would be no software engineers today. I agree we're not at full AGI yet in the sense that these models have, let's say,

Starting point is 00:32:17 less than a trillion parameters. A human brain has somewhere on the order of 10 to 30 trillion synapses. I mean, if you were just doing some naive math, you're like at least 10x underparameterized. So I agree we're not there yet. But I'm sort of confused on why we're not on the spectrum where, yes, I agree that there's many kinds of generalization they can't do. But it seems like they're on this kind of smooth spectrum that we see even within humans, where some humans would have a hard time doing an ARC-type test.

Starting point is 00:32:43 We see that based on the performance on Raven's Progressive Matrices-type IQ tests. I'm not a fan of IQ tests because for the most part you can train on IQ tests and get better at them. So they are very much memorization-based. And this is actually the main pitfall that ARC tries not to fall for. I'm still unconvinced. So if all remote jobs are automated

Starting point is 00:33:05 in the next five years, let's say, at least that don't require you to be like sort of a service. It's not like a salesperson where you want the human to be talking, but like programming whatever. In that world, would you say that that's not possible because a lot of what a programmer needs to do

Starting point is 00:33:21 definitely requires things that would not be in any pre-training corpus? Sure. I mean, in five years, there will be more software engineers than there are today, not fewer. But I just want to understand. So I'm still not sure. I mean, I know how to, I studied computer science. If I had become a code monkey out of college, like, what would I be doing?

Starting point is 00:33:38 I go to my job. What is the first thing, my boss tells me something to do? When does he realize I'm an LLM if I was an LLM? Probably on the first day, you know. Again, if it were true that LLMs could generalize to novel problems like this and actually develop software to solve a problem they've never seen before, you would not need software engineers anymore.

Starting point is 00:34:04 In practice, if I look at how people are using LLMs in their software engineering job today, they are using it as a Stack Overflow replacement. So they are using it as a way to copy-paste code snippets to perform very common actions. And what they actually need is a database of code snippets. They don't actually need any of the abilities that actually make them software engineers. I mean, when we talk about interpolating

Starting point is 00:34:29 between the stack overflow databases, if you look at the kinds of math problems or coding problems, maybe to say that they're... Maybe let's step back on interpolation and let me ask the question this way. Why can't creativity, why isn't creativity just interpolation in a higher dimension where if,

Starting point is 00:34:46 A bigger model can learn a more complex manifold, if we're going to use the ML language. And if you read a biography of a scientist, right? They're not zero-shotting new scientific theories. They're playing with existing ideas. They're trying to juxtapose them in their head. They try out some, like, slightly, in the tree of evolution, intellectual descendants, they try out a different evolutionary path.

Starting point is 00:35:10 You sort of run the experiment there in terms of publishing the paper, whatever. It seems like a similar kind of thing humans are doing, just at a higher level of generalization. And what you see across bigger and bigger models is they seem to be approaching higher and higher levels of generalization, where GPT-2 couldn't do a grade-school-level math problem that requires more generalization

Starting point is 00:35:27 than it has capability for, even that skill, then GPT-3 and 4 can. So not quite. So GPT-4 has a higher degree of skill and a higher range of skills. But I don't want to get into semantics here, but the question of why can't creativity be

Starting point is 00:35:45 Just interpolation on a higher dimension. I think interpolation can be creative, absolutely. And to your point, I do think that on some level, humans also do a lot of memorization, a lot of reciting, a lot of pattern matching, and a lot of interpolation as well. So it's very much a spectrum between pattern matching and true reasoning, it's a spectrum.

Starting point is 00:36:08 And humans are never really at one end of the spectrum. They're never really doing pure pattern matching or pure reasoning. They're usually doing some mixture of both. Even if you're doing something that seems very reasoning heavy, like proving a mathematical theorem, as you're doing it, sure, you're doing quite a bit of discrete search in your mind, quite a bit of actual reasoning, but you're also very much guided by intuition, guided by the shape of proofs that you've seen before, by your knowledge of mathematics. So it's never really, you know, all of our thoughts,

Starting point is 00:36:44 everything we do is a mixture of this sort of like interpolative memorization based thinking, this sort of like type 1 thinking and type 2 thinking. Why are bigger models more sample efficient? Because they have more reusable building blocks that they can lean on to pick up new patterns in that training data.

Starting point is 00:37:09 And does that pattern keep continuing as you keep getting bigger and bigger? To the extent that the new patterns you're giving the model to learn are a good match for what it has learned before. If you present something that's actually novel, that is not in the data distribution, like an ARC puzzle, for instance, it will fail.

Starting point is 00:37:25 Let me make this claim. The program synthesis, I think, is a very, very useful intuition pump. Why can't it be the case that what's happening in the transformer is the early layers are doing the figuring out how to represent the input tokens. And what the middle layers do

Starting point is 00:37:39 is this kind of program search, program synthesis, where they combine the inputs to all the circuits in the model where they go from the low level representation to a higher level representation near the middle of the model, they use these programs, they combine these concepts,

Starting point is 00:37:55 then what comes out at the other end is the reasoning based on that high level intelligence. Possibly, why not? But you know, if these models were actually capable of synthesizing novel programs, however simple, they should be able to do ARC, because for any ARC task, if you write down the solution program in Python, it's not a complex program. It's extremely simple.

Starting point is 00:38:22 And humans can figure it out. So why can LLMs not do it? Okay. I think that's a fair point. And if I turn the question around to you, so suppose that it's the case that in a year, a multimodal model can solve ARC, let's say get 80%, whatever the average human would get, then AGI? Quite possibly, yes. I think if you start, so honestly what I would like to see is an LLM type model,

Starting point is 00:38:50 solving ARC at like 80%, but after having only been trained on core knowledge-related stuff. But human kids, I don't think we're necessarily just trained on that. It's not just that we have in our genes, object permanence. Let me rephrase that. Only trained on information that is not explicitly trying to anticipate what's going to be in the ARC test set. But isn't the whole point of ARC that you can't sort of,

Starting point is 00:39:17 it's a new type of intelligence test every single time? Yes, that is the point. So if ARC were a perfect, flawless benchmark, it would be impossible to anticipate what's in the test set. And, you know, ARC was released more than four years ago, and so far it's been resistant to memorization. So I think it has, to some extent, passed the test of time.

Starting point is 00:39:37 But I don't think it's perfect. I think if you try to make by hand hundreds of thousands of arc tasks, and then you try to multiply them by programmatically generating variations, and then you end up with maybe hundreds of millions of tasks. Just by brute forcing the task space, there will be enough overlap between what you're trained on and what's in the test set that you can actually score very highly. So, you know, with enough scale, you can always cheat. If you can do this for every single thing that supposedly requires intelligence, then what good is intelligence? Apparently you can just brute force intelligence.

Starting point is 00:40:11 If the world, if your life were a static distribution, then sure, you could just brute force the space of possible behaviors. Like, you know, the way I would think about intelligence, there are several metaphors I like to use, but one of them is you can think of intelligence as a pathfinding algorithm in future situation space. Like, I don't know if you're familiar with game development, like RTS game development, but you have a map, right? And you have, it's like a 2D map. And you have partial information about it. Like there is some fog of war on your map. There are areas that you haven't explored yet. You know nothing about them.

Starting point is 00:40:51 And then there are areas that you've explored, but you only know how they were like in the past. You don't know how they're like today. And now instead of thinking about a 2D map, think about the space of possible future situations that you might encounter and how they're connected to each other. Intelligence is a pathfinding algorithm. So once you set a goal, it will tell you how to get there optimally. But of course, it's constrained by the information you have. It cannot pathfind in an area that you know nothing about.

Starting point is 00:41:25 It also cannot anticipate changes. And the thing is, if you had complete information about the map, then you could solve the pathfinding problem by simply memorizing every possible path, every mapping from point A to point B. You could solve the problem with pure memory. But the reason you cannot do that in real life is because you don't actually know what's going to happen in the future. Life is ever changing. I feel like you're using words like memorization, which we would never use for human children. If your kid learns to do algebra and then now learns to do calculus, you wouldn't say they

Starting point is 00:42:03 memorized calculus. If they can just solve any arbitrary algebraic problem, you wouldn't say they've memorized algebra. You'd say they've learned algebra. Humans are never really doing pure memorization or pure reasoning. But that's only because you're semantically labeling what the human does as a skill, but it's memorization when the exact same skill is done by the LLM, as you can measure by these benchmarks. And you can just, like, plug in any sort of math problem.

Starting point is 00:42:22 Sometimes humans are doing the exact same as the LLM is doing, which is just, for instance, I don't know, if you learn to add numbers, you're memorizing an algorithm, you're memorizing a program, and then you can reapply it. You are not synthesizing on the fly the addition program. So obviously at some point, some human had to figure out how to do addition. But the way a kid learns it is not that they sort of figure out from the axioms of set theory how to do addition. I think what you learn in school is mostly memorization.

Starting point is 00:42:49 Right. So my claim is that, listen, these models are vastly underparameterized relative to how many flops or how many parameters you have in the human brain. And so, yeah, they're not going to be like coming up with new theorems like the smartest humans can. But most humans can't do that either. What most humans do, it sounds like it's similar to what you are calling memorization, which is memorizing skills or memorizing, you know, techniques that you've learned. And so it sounds like it's compatible in your, tell me if this is wrong. Is it compatible in your world if like all the remote workers are gone, but they're doing

Starting point is 00:43:25 skills which we can potentially make synthetic data of? So we record everybody's screen, every single remote worker's screen. We sort of understand the skills they're performing there. And now we've trained a model that can do all this. All the remote workers are unemployed, we're generating trillions of dollars of economic activity from AI remote workers. In that world, are we still in the memorization regime?

Starting point is 00:43:44 So, sure, with memorization, you can automate almost anything. As long as it's a static distribution, as long as you don't have to deal with change. Are most jobs part of such a static distribution? Potentially, there are lots of things that you can automate. And LLMs are an excellent tool for automation. And I think that's really, but you have to understand that automation is not the same as intelligence.

Starting point is 00:44:07 I'm not saying that LLMs are useless. I've been a huge proponent of deep learning for many years. And you know, for many years, I've been saying two things. I've been saying that if you keep scaling up deep learning, it will keep paying off. And at the same time, I've been saying, if you keep scaling up deep learning, this will not lead to

Starting point is 00:44:23 AGI. So we can automate more and more things. And yes, this is economically valuable. And yes, potentially there are many jobs you could automate away like this, and that would be economically valuable. But you're still not going to have intelligence. So you can ask, you know, okay, so what does it matter if we can generate all this economic value? Maybe you don't need intelligence after all. Well, you need intelligence the moment

Starting point is 00:44:44 you have to deal with change, with novelty, with uncertainty. As long as you're in a space that can be exactly described in advance, you can just, you can just use pure memorization, right? In fact, you can always solve any problem. You can always display arbitrary levels of skill on any task without leveraging any intelligence whatsoever, as long as it is possible to describe the problem and its solution very, very precisely.

Starting point is 00:45:17 But when they do deal with novelty, then you just call it interpolation, right? No, no, interpolation is not enough to deal with all kinds of novelty. If it were, then LLMs would be AGI. Well, I agree they're not AGI. I'm just trying to figure out how do we figure out

Starting point is 00:45:32 we're on the path to AGI. And I think the crux here is maybe that it seems to me that these things are on a spectrum and we're clearly covering the earliest part of the spectrum with LLMs. I think so. And oh, okay, interesting. But here's another sort of thing that I think is evidence for this. Grokking, right? So clearly even within deep learning, there's a difference between the memorization regime

Starting point is 00:45:55 and the generalization regime where at first they'll just memorize the data set of, you know, if you're doing modular addition, how to add digits. And then at some point, if you keep training on that, they'll learn the skill. So the fact that there is that distinction suggests that the generalized circuit, the deep learning can learn, there is a regime it enters where it generalizes if you have an over-parameterized model, which you don't have in comparison to all the tasks we want these models to do right now. Grokking is a very, very old phenomenon. We've been observing it for decades.

Starting point is 00:46:25 It's basically an instance of the minimum description length principle, where, sure, given a problem, you can just memorize a pointwise input-to-output mapping, which is completely overfit, so it does not generalize at all, but it solves the problem on the training data. And from there you can actually keep pruning it, keep making your mapping simpler and simpler and more compressed, and at some point it will start generalizing. And so that's something called the minimum description length principle. It says that the program that will generalize best is the shortest.

Starting point is 00:47:06 Right. And it doesn't mean that you're doing anything other than memorization, but you're doing memorization plus regularization. Right. AKA generalization. Yeah. And that absolutely leads to generalization. Right.
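
Editor's illustration of the minimum description length point made above, using the modular addition example from the conversation. This is a minimal sketch under stated assumptions: the modulus `P`, the train/test split, and the crude "description length" (a count of stored numbers) are all chosen for illustration and are not from the discussion or from any grokking paper.

```python
from itertools import product

P = 7  # modulus, picked arbitrarily for this illustration

all_pairs = list(product(range(P), repeat=2))
train = all_pairs[::2]   # pairs the learner gets to see
test = all_pairs[1::2]   # held-out pairs, disjoint from train

# Hypothesis A: a pointwise input-to-output table (pure memorization).
table = {(a, b): (a + b) % P for a, b in train}

def lookup(a, b):
    # Returns None for any pair that was never stored.
    return table.get((a, b))

# Hypothesis B: the compressed rule the table is an instance of.
def rule(a, b):
    return (a + b) % P

def accuracy(hypothesis, pairs):
    correct = sum(hypothesis(a, b) == (a + b) % P for a, b in pairs)
    return correct / len(pairs)

# Crude stand-in for description length: how many numbers must be stored.
lookup_size = 3 * len(table)  # every entry stores a, b and the result
rule_size = 1                 # only the modulus P

print("table:", lookup_size, accuracy(lookup, train), accuracy(lookup, test))
print("rule: ", rule_size, accuracy(rule, train), accuracy(rule, test))
```

Both hypotheses fit the training pairs perfectly, but only the shorter one carries over to the held-out pairs, which is the sense in which the shortest description generalizes best.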

Starting point is 00:47:20 And then so you do that within one skill. But then the pattern you see here of meta learning is that it's more efficient to store a program that can perform many skills rather than one skill, which is what we might call fluid intelligence. And so as you get bigger and bigger models, you would expect it to go up this hierarchy of generalization where it generalizes to a skill, then it generalizes across multiple skills.

Starting point is 00:47:38 That's correct. That's correct. And, you know, LLMs, they're not infinitely large. They have only a fixed number of parameters, and so they have to compress their knowledge as much as possible. And in practice, so LLMs are mostly storing reusable bits of programs, like vector programs. And because they have this need for compression, it means that every time they're learning a new program,

Starting point is 00:48:01 they're going to try to express it in terms of existing bits and pieces of programs that they've already learned before. Right? Isn't this the generalization? Absolutely. Oh, wait. This is why, you know,

Starting point is 00:48:14 clearly LLMs have some degree of generalization. And this is precisely why. It's because they have to compress. And why is that intrinsically limited? Why can't you just go, at some point, it has to learn a higher level of generalization, higher level and then the highest level is the fluid intelligence. It's intrinsically limited because the substrate of your model is a big parametric curve.

Starting point is 00:48:34 And all you can do with this is local generalization. If you want to go beyond this towards broader or even extreme generalization, you have to move to a different type of model. And my paradigm of choice is discrete program search, program synthesis. So, and if you want to understand that, you can sort of like compare it, contrast it with deep learning. So in deep learning, your model is a parametric curve, a differentiable parametric curve.

Starting point is 00:49:03 In program synthesis, your model is a discrete graph of operators. So you've got like a set of logical operators, like a domain specific language. You're picking instances of it. You're structuring that into a graph. That's a program. And that's actually very similar to like a program you might write in Python or C++ and so on. And in deep learning, your learning engine, because we are doing machine learning here, like we're trying to automatically learn these models, in deep learning your learning engine is gradient descent, right?
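
Editor's illustration of "a discrete graph of operators" picked from a domain specific language. This is a minimal sketch: the three grid operators, the `Node` class, and the `mirror` program are hypothetical and invented for this example; they are not the DSL of any actual ARC solver.

```python
from dataclasses import dataclass
from typing import Tuple

# --- A tiny, made-up DSL over grids (tuples of tuples of ints) ---

def flip_h(g):
    """Mirror the grid left to right."""
    return tuple(tuple(reversed(row)) for row in g)

def transpose(g):
    """Swap rows and columns."""
    return tuple(zip(*g))

def overlay(a, b):
    """Paint the non-zero cells of b on top of a."""
    return tuple(
        tuple(y if y else x for x, y in zip(row_a, row_b))
        for row_a, row_b in zip(a, b)
    )

OPS = {"flip_h": flip_h, "transpose": transpose, "overlay": overlay}

@dataclass(frozen=True)
class Node:
    """One operator instance; its args are the nodes feeding into it."""
    op: str                        # "input" or a key of OPS
    args: Tuple["Node", ...] = ()

def evaluate(node, grid):
    """Run the operator graph on a concrete input grid."""
    if node.op == "input":
        return grid
    return OPS[node.op](*(evaluate(arg, grid) for arg in node.args))

# The program is the graph: overlay(input, flip_h(input)).
INPUT = Node("input")
mirror = Node("overlay", (INPUT, Node("flip_h", (INPUT,))))

print(evaluate(mirror, ((1, 0, 0), (0, 2, 0))))
```

The program here is data (a graph of `Node` objects) rather than a set of continuous weights, which is what makes it something you search over instead of something you fit with gradient descent.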

Starting point is 00:49:37 And gradient descent is very compute efficient because you have this very strong, informative feedback signal about where the solution is, so you can get to the solution very quickly. But it is very data inefficient, meaning that in order to make it work, you need a dense sampling of the operating space. You need a dense sampling of the data distribution. And then you're limited to only generalizing within that data distribution. And the reason why you have this limitation is because your model is a curve. And meanwhile, if you look at discrete program search, the learning engine is combinatorial search. You're just trying a bunch of programs until you find one that actually meets your spec. This process is extremely data efficient. You can learn

Starting point is 00:50:21 a generalizable program from just one example, two examples, which is why it works so well on ARC, by the way. But the big limitation is that it's extremely compute inefficient, because you're running into combinatorial explosion, of course. And so you can sort of see here how deep learning and discrete program search have very complementary strengths and limitations as well. Like every limitation of deep learning has a corresponding strength in program synthesis, and inversely. And I think the path forward is going to

Starting point is 00:50:54 going to be to merge the two, to basically start doing. So another way you can think about it is, so these parametric curves trained with gradient descent, they are a great fit for everything that's system one type thinking, like pattern recognition, intuition, memorization, and so on. And discrete program search is a great fit for type two thinking, system two thinking, for instance, planning, reasoning, quickly figuring out a generalizable model that matches just one or two examples

Starting point is 00:51:30 like for an ARC puzzle, for instance. And I think humans are never doing pure system one or pure system two. They're always mixing and matching both. And right now we have all the tools for system one. We have almost nothing for system two. The way forward is to create a hybrid system. And I think the form it's going to take

Starting point is 00:51:49 is it's going to be mostly system two. So the outer structure is going to be a discrete program search system. But you're going to fix the fundamental limitation of discrete program search, which is combinatorial explosion. You're going to fix it with deep learning. You're going to leverage deep learning to guide, to provide intuition in program space, to guide the program search. And I think that's very similar to what you see, for instance,

Starting point is 00:52:17 when you're playing chess or when you're trying to prove a theorem, is that it's mostly a reasoning thing, but you start out with some intuition about the shape of the solution. And that's very much something you can get via a deep learning model. Deep learning models, they are very much like

Starting point is 00:52:37 intuition machines. They're pattern matching machines. So you start from this shape of the solution And then you're going to do actual explicit discrete program search. But you're not going to do it via brute force. You're not going to try things kind of like randomly. You're actually going to ask another deep learning model for suggestions. Like here's the best likely next step.

Starting point is 00:53:06 Here's where in the graph you should be going. And you can also use yet another deep learning model for feedback. Like, well, here's what I have so far. Is it looking good? Should I just backtrack and try something new? So I think discrete program search is going to be the key, but you want to make it dramatically better, orders of magnitude more efficient, by leveraging deep learning.
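
Editor's illustration of the contrast drawn above between brute-force program search and search guided by an intuition model. This is a minimal sketch under loud assumptions: the three-operator DSL is made up, and `intuition` is a hand-written scoring function standing in for the deep learning model described in the conversation (it was designed with this toy task in mind; a real guide would be learned).

```python
import heapq
from itertools import count, product

def flip_h(g):
    return tuple(tuple(reversed(row)) for row in g)

def flip_v(g):
    return tuple(reversed(g))

def transpose(g):
    return tuple(zip(*g))

OPS = {"flip_h": flip_h, "flip_v": flip_v, "transpose": transpose}

def run(program, grid):
    """Apply a sequence of operator names to a grid."""
    for name in program:
        grid = OPS[name](grid)
    return grid

def brute_force(src, dst, max_len=3):
    """Try every program in order of length; return (program, tries)."""
    tries = 0
    for length in range(1, max_len + 1):
        for program in product(OPS, repeat=length):
            tries += 1
            if run(program, src) == dst:
                return program, tries
    return None, tries

def intuition(grid, dst):
    """Hypothetical stand-in for a learned guide; lower is more promising.
    Counts target rows whose contents (ignoring order) are absent from grid."""
    have = [sorted(row) for row in grid]
    return sum(sorted(row) not in have for row in dst)

def guided(src, dst, max_len=3):
    """Best-first search: always extend the partial program the guide likes most."""
    tries = 0
    tie = count()  # tie-breaker so the heap never compares programs or grids
    frontier = [(intuition(src, dst), next(tie), (), src)]
    while frontier:
        _, _, program, grid = heapq.heappop(frontier)
        if len(program) == max_len:
            continue
        for name, op in OPS.items():
            tries += 1
            new_grid = op(grid)
            new_program = program + (name,)
            if new_grid == dst:
                return new_program, tries
            heapq.heappush(
                frontier,
                (intuition(new_grid, dst), next(tie), new_program, new_grid),
            )
    return None, tries

# One demonstration pair (a 90-degree clockwise rotation) is the whole "spec".
src = ((1, 2), (3, 4))
dst = ((3, 1), (4, 2))

bf_program, bf_tries = brute_force(src, dst)
g_program, g_tries = guided(src, dst)
print(bf_program, bf_tries)
print(g_program, g_tries)

# Either program, found from a single example, carries over to an unseen grid.
print(run(g_program, ((5, 6), (7, 8))))
```

On this toy task the guided search needs 4 candidate evaluations where brute force needs 9. The gap is tiny here, but the number of candidate programs grows exponentially with program length, which is the combinatorial explosion the guide is meant to tame.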

Starting point is 00:53:27 And by the way, another thing that you can use deep learning for is of course things like common sense knowledge and knowledge in general. And I think you're going to end up with this sort of system where you have this on-the-fly synthesis engine that can adapt to new situations. But the way it adapts is that it's going to fetch from a bank of patterns, modules that could be themselves curves, that could be differentiable modules, and some

Starting point is 00:53:58 others that could be algorithmic in nature. It's going to assemble them via this process that's intuition guided. And for every new situation you might be faced with, it's going to provide you with a generalizable model that was synthesized using very, very little data. Something like this would solve ARC. That's actually a really interesting prompt, because I think an interesting crux here is when I talk to my friends who are extremely optimistic

Starting point is 00:54:28 about LLMs and expect AGI within the next couple of years, they also in some sense agree that scaling is not all you need, but that the rest of the progress is undergirded and enabled by scaling. But still, you need to add the system two and the test-time compute atop these models. And their perspective is that it's relatively straightforward to do that because you have this

Starting point is 00:54:55 library of representations that you built up from pre-training. But it's almost, like, you know, it's just like skimming through textbooks. You need some more deliberate way in which it engages with the material it learns. In-context learning is extremely sample efficient. But to actually distill that into the weights, you need the model to like talk through the things that it sees and then add it back to the weights. As far as the system two goes, they talk about adding some kind of RL setup so that it is encouraged to proceed on the reasoning traces that end up being correct. And they think this is

Starting point is 00:55:29 relatively straightforward stuff that will be added within the next couple of years. That's an empirical question. So I think we'll see. Your intuition, I assume, is not that. I'm curious. My intuition is, in fact, this whole like system two architecture is the hard part, is the very hard and not obvious part. Scaling up the interpolative memory is the easy part. All you need is, like, it's literally just a big curve. All you need is more data. It's a representation of a data set, an interpolative representation of a data set. That's the easy part. The hard part is the architecture of intelligence. Memory and intelligence are separate components. We have the memory, we don't have the intelligence yet. And I agree with you that, well, having the memory is

Starting point is 00:56:09 actually very useful. And if you just had the intelligence, but it was not hooked up to an extensive memory, it would not be that useful because it would not have enough material to work from. Yeah. The alternative hypothesis here that a former guest, Trenton Bricken, advanced is that intelligence is just hierarchical associative memory, where higher-level patterns, when Sherlock Holmes goes into a crime scene, and he's extremely sample-efficient. He can just look at a few clues and figure out who was the murderer. And the way he's able to do that is he has learned higher level sort of associations. It's memory in some fundamental sense.

Starting point is 00:56:46 But so here's one way to ask the question. In the brain, supposedly we do program synthesis, but it is just synapses connected to each other. And so physically it's got to be that you just query the right circuit, right? You are, yeah, yeah. You know, it's a matter of degree. But if you can learn it, if training in the environment that the human ancestors are trained in means you learn those

Starting point is 00:57:10 circuits, training on the same kinds of outputs that humans produce, which to replicate require these kinds of circuits, wouldn't that train the same kind of whatever humans have? You know, it's a matter of degree. If you have a system that has a memory and is only capable of doing local generalization from that, it's not going to be very adaptable. To be really general, you need the memory plus the ability to search to quite some depth to achieve, you know, broader or even extreme generalization. You know, like one of my favorite psychologists, Jean Piaget, was the founder of developmental psychology.

Starting point is 00:57:51 He had a very good quote about intelligence. He said, intelligence is what you use when you don't know what to do. And it's like as a human living your life, in most situations, you already know what to do because you've been in this situation before. You already have the answer, right? And you're only going to need to use intelligence when you're faced with novelty, with something you didn't expect,

Starting point is 00:58:15 with something that you weren't prepared for either by your own experience, your own life experience, or by your evolutionary history. Like this day that you're living right now is different in some important ways from every day you've lived before, but it's also different from any day ever lived

Starting point is 00:58:33 by any of your ancestors. And still, you're capable of being functional. Right. How is it possible? I'm not denying that generalization is extremely important and is the basis for intelligence. That's not the crux. The crux is how much of that is happening in the models. But, okay, let me ask a separate question.

Starting point is 00:58:51 We might keep going in circles here. The differences in intelligence between humans. Maybe the intelligence tests, because of reasons you mentioned, are not measuring it well, but clearly there's differences in intelligence between different humans. Sure. What is your explanation for what's going on there? Because I think that's sort of compatible with my story that there's the spectrum of generality and that these models are climbing up to a human level.

Starting point is 00:59:12 And even some humans haven't even climbed up to the Einstein level or the Francois level. That's a great question. You know, there is extensive evidence that differences in intelligence are mostly genetic in nature, right? Meaning that if you take someone who is not very intelligent, there is no amount of training, of like training data you can expose that person to that would make them become Einstein. And this kind of points to the fact that you really need a better architecture. You need a better algorithm. And more training data is not, in fact, all you need. I think I agree with that.

Starting point is 00:59:52 I think maybe a way I might phrase it is that the people who are smarter have, in ML language, better initializations. The neural wiring, if you just look at it, it's more efficient. They have maybe greater density of firing. And so some part of the story is scaling; there is some correlation between brain size and intelligence. And we also see, within the context of the quote-unquote scaling that people talk about with LLMs, architectural improvements, where a model like Gemini 1.5 Flash performs as well as GPT-4 did when GPT-4 was released a year ago, but is 57 times cheaper on output. So part of the scaling story is that with the architectural improvements we're in, like, extremely low-hanging-fruit territory when it comes to those.

Starting point is 01:00:39 Okay, we're back now with the co-founder of Zapier, Mike Knoop. We had to restart a few times there. And you're funding this prize and you're running this prize with Francois. And so tell me about how this came together. What prompted you guys to launch this prize? Yeah. I guess I've been sort of AI curious for 13 years. I co-founded Zapier, been running it for the last 13 years.

Starting point is 01:01:04 And I think I first got introduced to your work during COVID. I kind of went down the rabbit hole. I had a lot of free time. And it was right after you published your On the Measure of Intelligence paper, where you sort of introduced the concept of AGI, that this efficiency of skill acquisition is like the right definition, and the ARC puzzles. But I don't think the first Kaggle contest was done yet. I think it was still running.

Starting point is 01:01:26 And so it was interesting, but I just parked the idea. And I had bigger fish to fry at Zapier. We were in the middle of this big turnaround of trying to get to our second product. And then it was January 2022 when the chain of thought paper came out that really awoke me to sort of the progress. I gave a whole presentation to Zapier on the GPT-3 paper even. So I sort of felt like I had priced in everything that LLMs could do. And that paper was really shocking to me in terms of, oh, these latent capabilities that

Starting point is 01:01:54 LLMs have that I didn't expect that they had. And so I actually gave up my exec team role at Zapier. I was running half the company at that point. I went back to be an individual contributor and just to go do AI research alongside Bryan, my co-founder. And ultimately that led me back towards ARC. I was looking into it again. And I had sort of expected to see this saturation effect that MMLU has, that GSM8K has. And when I looked at the scores and the progress over the last four years, I was really, again, shocked to see.

Starting point is 01:02:27 Actually, we've made very little objective progress towards it. And it felt like a really, really important eval. And as I sort of spent the last year asking people, quizzing people about it in sort of my network and community, very few people even knew it existed. And that felt like, okay, if it's right that this is a really, really, like, globally, singularly unique AGI eval, and it's different from every other eval that exists, which more narrowly measure AI skill, then more people should know about this thing.

Starting point is 01:02:57 I had my own ideas on how to beat ARC as well. So I was working on nights and weekends on that. And I flew up to meet Francois earlier this year to quiz him, show him my ideas. And ultimately, I was like, well, you know, why don't you think more people know about ARC? I think you should actually answer that. I think it's a really interesting question. Like, why don't you think more people know about ARC? Sure.

Starting point is 01:03:18 You know, I think benchmarks that gain traction in the research community are benchmarks that are already fairly tractable. Because the dynamic that you see is that some research group is going to make some initial breakthrough. And then this is going to catch the attention of everyone else. And so you're going to get follow-up papers with people trying to beat the first team and so on. And for ARC, this has not really happened because ARC is actually very hard for existing AI techniques. ARC kind of requires you to try new ideas. And that's very much the point, by the way.

Starting point is 01:03:49 Like the point is not that, yeah, you should just be able to apply existing technology and solve ARC. The point is that existing technology has reached a plateau. And if you want to go beyond that, if you want to start being able to tackle problems that you haven't memorized, that you haven't seen before, you need to try new ideas.

Starting point is 01:04:08 And ARC is not just meant to be this sort of measure of how close we are to AGI. It's also meant to be a source of inspiration. Like I want researchers to look at these puzzles and be like, hey, it's really strange that these puzzles are so simple and most humans can just do them very quickly, why is it so hard for existing

Starting point is 01:04:33 AI systems? Why is it so hard for LLMs and so on? And this is true for LLMs, but ARC was actually released before LLMs were really a thing. And the only thing that made it special at the time was that it was designed to be resistant to memorization. And the fact that it has survived LLMs and gen AI in general so well kind of shows that, yes, it is actually resistant to memorization. This is what nerd-sniped me, because I went and took a bunch of the puzzles myself. I've shown it to all my friends and family too, and they're all like, oh, yeah, this is like super easy. Are you sure AI can't solve this?

Starting point is 01:05:07 Like, that's the reaction, and the same one for me as well. And the more you dig in, you're like, okay, yep, there's not just empirical evidence over the last four years that it's unbeaten, but there are theoretical concepts behind why. And I completely agree at this point that new ideas basically are needed to beat ARC. And there's a lot of current trends in the world that are actually, I think, working against that happening. Basically, I think we're actually less likely to generate new ideas right now. You know, I think one of the trends is the closing up of frontier research, right? The GPT-4 paper from OpenAI had no technical detail shared. The Gemini paper had no technical detail shared on, like, the longer context part of that work.

Starting point is 01:05:43 And yet that open innovation, that open progress and sharing is what got us to transformers in the first place. That's what got us to LLMs in the first place. So it's kind of disappointing a little bit, actually, that so much frontier work has gone closed. It's really making a bet that these individual labs are going to have the breakthrough and not the ecosystem. And I think the Internet and open source have shown that that's the most powerful innovation ecosystem that's ever existed, probably in the entire world. I think that's actually really sad that frontier research is no longer being published.

Starting point is 01:06:13 If you look back, you know, four years ago, well, everything was just openly shared, like all the state-of-the-art results were published. And this is no longer the case. And it's very much, you know, OpenAI single-handedly changed the game. And I think OpenAI basically set back progress towards AGI by quite a few years, probably like five to 10 years, for two reasons.

Starting point is 01:06:37 And one is that, well, they caused this complete closing down of research, frontier research publishing. But also they triggered this initial burst of hype around LLMs. And now LLMs have sucked the oxygen out of the room. Like everyone is just doing LLMs. And I see LLMs as more of an off-ramp on the path to AGI, actually. And all these new resources, they're actually going to LLMs instead of everything else

Starting point is 01:07:11 they could be going to. And, you know, if you look further into the past to like 2015, 2016, there were like a thousand times fewer people doing AI back then. And yet I feel like the rate of progress was higher because people were exploring more directions. The world felt more open-ended. Like you could just go and try, like have a cool idea, launch it, and try it and get some interesting results.

Starting point is 01:07:38 So there was this energy. And now everyone is very much doing some variation of the same thing. And the big labs also tried their hand at ARC, but because they got bad results, they didn't publish anything. Like, you know, people only publish positive results. I wonder how much effort people have put into trying to prompt or scaffold, do some sort of maybe Devin-type approach, into getting the frontier models, and the frontier models of today, not just a year ago,

Starting point is 01:08:10 because a lot of post-training has gone into making them better, so Claude 3 Opus or GPT-4o, into getting good solutions on ARC. I hope that one of the things this episode does is get people to try out this open competition, where they have to put in an open source model to compete, but also to figure out if maybe the capability is latent in Claude Opus and just see if you can show that. I think that would be super interesting. So let's talk about the prize. How much do you win if you solve it, you know, get whatever percent on ARC?

Starting point is 01:08:44 How much do you get if you get the best submission but don't crack it? So we've got a million dollars plus, actually a little over a million dollars, in the prize pool. We're running the contest on an annual basis. We're starting it today and running through the middle of November. And the goal is to get 85%. That's the lower bound of the human average that you guys talked about earlier. And there's a $500,000 prize for the first team that can get to the 85% benchmark. We don't expect that to happen this year, actually. One of the early statisticians at Zapier gave me this line that has always stuck with me: the longer it takes, the longer it takes. So my prior is that like,

Starting point is 01:09:18 ARC is going to take years to solve. And so we're also going to break it down and do a progress prize this year. So there's a $100,000 progress prize, which we will pay out to the top scores. So $50,000 is going to go to the top objective scores this year on the Kaggle leaderboard; we're hosting it on Kaggle. And then we're going to have a $50,000 pot set aside for a paper award for the best paper that explains conceptually the scores that they were able to achieve. And one of the, I think, interesting things we're also going to be doing is we're going to be requiring that, in order to win the prize money, you put the solution or your paper out into the public domain. The reason for this is, you know, typically with contests, you see a lot of closed-up sharing. People are kind of private and secretive. They want to hold their alpha to themselves during the contest period.

Starting point is 01:10:05 And because we expect it's going to be multiple years, we want an iterated game here. So the plan is, you know, at the end of November, we will award the $100,000 prize money to the top progress prize, and then use the downtime between December, January, February to share out all the knowledge from the top scores and the approaches folks were taking in order to re-baseline the community up to whatever the state of the art is, and then run the contest again next year. And keep doing that on a yearly basis until we get 85%. I'll give people some context on why I think this prize is very interesting. I was having conversations with my friends who are very much believers in models as they exist today.

Starting point is 01:10:42 And first of all, it was intriguing to me that they didn't know about ARC. These are experienced ML researchers. And so you show them the, this happened a couple of nights ago. We went to dinner and I showed them an example problem. And they said, of course an LLM would be able to solve something like this. And then we take a screenshot of it. We just put it into our ChatGPT app. And it doesn't get the pattern.

Starting point is 01:11:02 And so I think it's a very interesting, like, it is a notable fact. I was sort of playing devil's advocate against you on these kinds of questions, but this is a very intriguing fact. And I think this prize is extremely interesting because we're going to learn something fascinating one way or another. So with regards to the 85%, separate from this prize, I'd be very curious if somebody could replicate that result, because obviously in psychology and other kinds of fields, which this result seems to be analogous to,

Starting point is 01:11:31 when you run tests on some small sample of people, often they're hard to replicate. I'd be very curious, if you try to replicate this, how does the average human perform on ARC? As for the difficulty of how long it will take to crack this benchmark, it's very interesting because the other benchmarks that are now fully saturated, like MMLU and MATH, actually the people who made them, Dan Hendrycks and Collin Burns, who did MMLU and MATH, I think they were grad students or college students when they made them. And the goal when they made them just a couple of years ago was that this would be a test of AGI, and of course it got totally saturated.

Starting point is 01:12:05 I know you'll argue that these are a test of memorization, but I think the pattern we've seen, in fact, Epoch AI has a very interesting graph that I'll sort of overlay for the YouTube version here, where you see this almost exponential, where it gets, you know, 5%, 10%, 30%, 40% as you increase the compute across models, and then it just shoots up. And in the GPT-4 technical report they had this interesting graph of the HumanEval problem set, which was 22 coding problems, and they had to graph it on the mean log pass curve, basically because early on in training, or even smaller models, can have the right idea of how to solve this problem, but it takes a lot of reliability to make sure they stay on

Starting point is 01:12:50 track to solve the whole problem. And so you really want to upweight the signal where they get it right at least some of the time, maybe one in a hundred times, one in a thousand. And so they go from like one in a thousand, one in a hundred, one in ten, and then they just totally saturate it. I guess the question I have, which this is all leading up to, is why won't the same thing happen with ARC, where people had to try really hard with bigger models, and now they've figured out these techniques, like the one Jack Cole has figured out, with only a 240 million parameter language model that can get 35%? Shouldn't we see the same pattern we saw across all these other benchmarks, where you just sort of eke out progress and then, once you get the general idea, you just go all the way to 100?
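A minimal sketch of the kind of metric described above, assuming per-problem pass rates estimated from repeated sampling. The function name and the sample numbers are illustrative assumptions, not figures from the GPT-4 technical report.

```python
import math

def mean_log_pass_rate(pass_rates):
    """Average of log(pass rate) across problems.

    Each entry is the fraction of sampled attempts that solved one problem.
    On a log scale, going from 1-in-1000 to 1-in-100 counts as much as
    going from 1-in-10 to 1-in-1, so early progress stays visible.
    """
    if any(p <= 0 for p in pass_rates):
        raise ValueError("pass rates must be positive to take a log")
    return sum(math.log(p) for p in pass_rates) / len(pass_rates)

# Hypothetical pass rates for three model sizes on the same three problems.
small = [0.001, 0.001, 0.001]
medium = [0.01, 0.01, 0.01]
large = [0.1, 0.1, 0.1]

for name, rates in [("small", small), ("medium", medium), ("large", large)]:
    print(name, round(mean_log_pass_rate(rates), 3))
```

On a linear scale these three hypothetical models would all look close to zero; on the log scale each step is an equal-sized improvement.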

Starting point is 01:13:27 That's an empirical question. So we'll see in practice what happens. But what Jack Cole is doing is actually very unique. It's not just pre-training an LLM and then prompting it. He's actually trying to do active inference. He's doing test time, right? He's doing like test time fine-tuning. And this is actually trying to lift one of the key limitations of the LLMs,

Starting point is 01:13:47 which is that at inference time, they cannot learn anything new. They cannot adapt on the fly to what they're seeing. And he's actually trying to learn. So what he's doing is effectively a form of program synthesis. Because the LLM contains a lot of useful building blocks, like programming building blocks. And by fine-tuning on the task at test time, you are trying to assemble these building blocks into the right

Starting point is 01:14:13 pattern that matches the task. This is exactly what program synthesis is about. And the way I would contrast this approach with discrete program search is that in discrete program search, so you're trying to assemble a program from a set of primitives, you have very few primitives. So people working on discrete program search on ARC, for instance, they tend to work with DSLs that have like 100 to 200 primitive programs. So very small DSL, but then they're trying to

Starting point is 01:14:43 combine these primitives into very complex programs. So there's a very deep depth of search. And on the other hand, if you look at what Jack Cole is doing with LLMs, he's got this sort of vector program database, a DSL of millions of building blocks in the LLM that are mined by pre-training the LLM, not just on a ton of programming problems, but also on millions of generated ARC-like tasks. So you have an extraordinarily large DSL,

Starting point is 01:15:17 and then the fine-tuning is very, very shallow recombination of these primitives. So discrete program search is very deep recombination of a very small set of primitive programs, and the LLM approach is the same, but on the complete opposite end of that spectrum, where you scale up the memorization by a massive factor, and you're doing very, very shallow search. But they are the same thing. Just different ends of the spectrum. And I think where you're

Starting point is 01:15:47 going to get the most value for your compute cycles is going to be somewhere in between. You want to leverage memorization to build up a richer, more useful bank of primitive programs. And you don't want them to be hard-coded, like what we saw for the typical DSLs. You want them to be learned from examples. But then you also want to do some degree of deep search. As long as you're only doing a very shallow search, you are limited to local generalization. If you want to generalize further, more broadly, this depth of search is going to be critical. I might argue that the reason that he had to rely so heavily on the synthetic data was because he used a 240 million parameter model, because the Kaggle competition at the time

Starting point is 01:16:37 required him to use a P100 GPU, which has like a tenth or something of the FLOPS of an H100. And so obviously he can't use, if you believe that scaling will solve this kind of reasoning, then you can just rely on the generalization, whereas if you're using a much smaller model,

Starting point is 01:16:56 for context for the listeners, by the way, the frontier models today are literally a thousand times bigger than that. And so for your competition, from what I remember, the submission you have to submit can't make any API calls, can't go online, and has to run on an Nvidia Tesla T4. P100.

Starting point is 01:17:17 Oh, is it P100? Yeah. Okay. So again, it's like significantly less powerful. There's a 12-hour runtime limit, basically. There's a forcing function of efficiency in the Eval. But here's the thing. You only have 100 test tasks.
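As a rough check on the per-task budget implied by the two numbers just mentioned (a 12-hour runtime limit and 100 test tasks), a short sketch follows. It assumes the time is split evenly across tasks, which a real submission need not do.

```python
# Per-task time budget implied by the numbers mentioned in the conversation.
RUNTIME_LIMIT_HOURS = 12   # stated runtime limit
NUM_TEST_TASKS = 100       # stated size of the private test set

def minutes_per_task(runtime_hours, num_tasks):
    """Evenly split the total runtime across tasks, in minutes."""
    return runtime_hours * 60 / num_tasks

budget = minutes_per_task(RUNTIME_LIMIT_HOURS, NUM_TEST_TASKS)
print(f"{budget:.1f} minutes per task")  # 7.2 minutes per task
```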

Starting point is 01:17:27 So the amount of compute available for each task is actually quite a bit, especially if you contrast that with the simplicity of each task. So it would be seven minutes per task, basically. Which, you know, people have tried to do these estimates of how many FLOPS a human brain has. And you can take them with a grain of salt, but as a sort of anchor, it's basically the amount of FLOPS an H100 has. And I guess maybe you would argue that, well, a human brain can solve this question faster than 7.2 minutes. So even with a tenth of the compute, you should be able to do it in seven minutes. Obviously, we have less memory than, you know, like petabytes of fast access memory in the brain.

Starting point is 01:18:04 with these 29 or whatever gigabytes in this H100. Anyway, I guess the broader question I'm asking is, I wish there was a way to also test this prize with some sort of scaffolding on the biggest models as a way to test whether scaling is the path to get to, you know, solving ARC. Absolutely. So in the context of the competition,

Starting point is 01:18:28 we want to see how much progress we can make with limited resources. But you are entirely right that it's a super interesting open question, what could the biggest model out there actually do on ARC? So we want to actually also make available a private sort of one-off track where you can submit to us a VM. And so you can put on it any model you want. You can take one of the largest open source models out there, fine-tune it, do whatever

Starting point is 01:18:54 you want. And just give us an image. And then we run it on an H100 for like 24 hours or something and you see what you get. I think it's worth pointing out that there are two different test sets. There is a public test set that's in the public GitHub repository that anyone can use to train, you know, put it in an OpenAI API call, whatever you'd like to do. And then there's the private test set, which is the 100,

Starting point is 01:19:16 that is actually measuring the state of the art. So I think it is pretty open and interesting to have folks attempt to at least use the public test set and go try it. Now, there is an asterisk on any score that's reported against the public test set because it is public. It could have leaked into the training data somewhere. This is actually what people are already doing. You can already try to prompt one of the best models,

Starting point is 01:19:36 like the latest Gemini, the latest GPT-4, with tasks from the public evaluation set. And, you know, again, the problem is that these tasks are available as JSON files on GitHub. These models are also trained on GitHub. So they're actually trained on these tasks. And, yeah, that kind of creates uncertainty about, if they can actually solve some of the tasks,

Starting point is 01:19:59 is that because they memorized the answer or not? You know, maybe you would be better off trying to create your own private, ARC-like, very novel test set. Don't make the tasks difficult. Don't make them complex, make them very obvious for humans. But make sure to make them original as much as possible, make them unique, different. And see how well your GPT-4 and so on, or GPT-5, does on them. Well, they're having tests on whether these models are being overtrained on these benchmarks.
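A minimal sketch of building a private ARC-like task in the same JSON layout as the public tasks, where each task has "train" and "test" lists of input/output grid pairs and each grid is a list of rows of integers 0 to 9. The helper names and the mirror rule are hypothetical placeholders; a left-right mirror is far too common to count as novel, so a real private set would need hand-designed, original rules.

```python
import json
import random

def random_grid(rng, height, width, num_colors=10):
    """Build a grid of random color codes (0-9), as a list of rows."""
    return [[rng.randrange(num_colors) for _ in range(width)] for _ in range(height)]

def mirror_left_right(grid):
    """Placeholder hidden rule: mirror each row horizontally."""
    return [list(reversed(row)) for row in grid]

def make_task(rule, rng, num_train=3, num_test=1):
    """Create one task dict with demonstration pairs and held-out test pairs."""
    def pair():
        grid = random_grid(rng, rng.randint(3, 6), rng.randint(3, 6))
        return {"input": grid, "output": rule(grid)}
    return {
        "train": [pair() for _ in range(num_train)],
        "test": [pair() for _ in range(num_test)],
    }

rng = random.Random(0)  # fixed seed so the task is reproducible
task = make_task(mirror_left_right, rng)
serialized = json.dumps(task)  # keep this file private to avoid leakage
print(len(task["train"]), "train pairs,", len(task["test"]), "test pair")
```

Keeping the generated file out of any public repository is the point of the exercise, since the concern raised above is that public tasks end up in training data.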

Starting point is 01:20:28 Scale recently did this on GSM8K. It was really interesting. They basically replicated the benchmark, but with different questions. And so some of the models actually were extremely overfit on the benchmark, like Mistral and so forth. But the frontier models, Claude and GPT, actually did as well

Starting point is 01:20:47 on their novel benchmark as they did on the specific questions that were in the existing public benchmark. So I would be relatively optimistic about them just sort of training on the JSON. I was joking with Mike that you should allow API access but sort of keep an even more private validation set of these ARC questions. And so allow API access, people can sort of play with GPT-4 scaffolding to enter into this contest.

Starting point is 01:21:14 And if it turns out, maybe later on you run the validation set on the API. And if it performs worse than the test set that you allowed the API access to originally, that means that OpenAI is training on your API calls and you, like, go public with this and show them like, oh my God, they've leaked your data. We do want to evolve the ARC data set. Like, that is a goal that we want to do. I think, Francois, you mentioned, you know, it's not perfect. Yeah, no, ARC is not a perfect benchmark.

Starting point is 01:21:40 I mean, I made it like four years ago, over four years ago, almost five now. This was in a time before LLMs. And I think we learned a lot, actually, since about what potential flaws there might be. I think there is some redundancy in the set of tasks, which is, of course, against the goals of the benchmark. Every task is supposed to be unique; in practice, that's not quite true. I think also every task is supposed to be very novel, but in practice, they might not be. They might be structurally similar to something that you might find online somewhere.

Starting point is 01:22:12 So we want to keep iterating and release an ARC 2 version later this year. And I think when we do that, we're going to want to make the old private test set available. So maybe we won't be releasing it publicly, but what we could do is just create a test server where you can query, get a task, you submit a solution, and of course you can use whatever frontier model you want there.

Starting point is 01:22:37 So that way, because you actually have to query this API, you're making sure that no one is going to, by accident, train on this data. It's unlike the current public ARC data, which is literally on GitHub. So there's no question about whether the models are actually trained on it. Yes, they are because they're trained on GitHub.

Starting point is 01:22:54 So by sort of gating access to querying this API, we would avoid this issue. And then we would see, you know, for people who actually want to try whatever technique they have in mind using whatever resources they want, that would be a way for them to get an answer. I wonder what might happen. I'm not sure.

Starting point is 01:23:13 One answer is that they've come up with a whole new algorithm for AI with some explicit program synthesis, so that now we're on a new track. And another is they did something hacky with the existing models in a way that actually is valid, which reveals that maybe intelligence is more about getting things to the right part of the distribution, but then it can reason. And in that world, I guess that will be interesting.

Starting point is 01:23:37 And maybe that'll indicate that, you know, you had to do something hacky with current models. As they get better, you won't have to do something hacky. I'm also going to be very curious to see whether these multimodal models will perform natively much better at ARC-like tests. If ARC survives three months from here, we'll up the prize. I think we're about to make a really important moment of contact with reality by blowing up the prize, putting a much bigger prize pool against it. We're going to learn really quickly if there's low-hanging fruit of ideas.

Starting point is 01:24:04 Again, I think new ideas are needed. I think anyone listening to this might have the idea in their head. And I'd encourage everyone to give it a try. And I think as time goes on, that adds strength to the argument that we've sort of stalled out in progress and that new ideas are necessary to beat ARC. Yeah. That's the point of having a money prize, that you attract more people. You get them to try to solve it.

Starting point is 01:24:25 And if there's an easy way to hack the benchmark that reveals that the benchmark is flawed, then you're going to know about it. In fact, that was the point of the original Kaggle competition back in 2020 for ARC. I was running this competition because I had released this data set and I wanted to know if it was hackable, if you could cheat. So there was a small money prize at the time. There was like 20K. And this was right around the same time as GPT-3 was released. So people of course tried GPT-3 on the public data. It scored zero. But I think what the first contest told us is that there is no

Starting point is 01:25:03 obvious shortcut. Right. And well, now there's more money. There's going to be more people looking into it. Well, we're going to find out. We're going to see if the benchmark is going to survive. And you know, if we end up with a solution that is not trying to brute force the space of possible ARC tasks, that's just trained on core knowledge, I don't think it's necessarily going to be in and of itself AGI, but it's probably going to be a huge milestone

Starting point is 01:25:34 on the way to AGI. Because what it represents is the ability to synthesize a problem-solving program from just two or three examples. And that alone is a new way

Starting point is 01:25:52 to program. It's an entirely new paradigm for software development, where you can start programming potentially quite complex programs that will generalize very well. And instead of programming them by coming up with the shape of the program in your mind and then typing it up, you're actually just

Starting point is 01:26:11 showing the computer what output you want and you let the computer figure it out. Let's talk a little bit about what kinds of solutions might be possible here, and which you would consider sort of defeating the purpose of ARC and which are sort of valid. Here's one I'll mention, which is my friends, Ryan and Buck, stayed up last night because I told them about this,

Starting point is 01:26:36 and they were like, oh, of course LLMs can solve this. And then so they were trying to prompt, I think, Claude Opus on this. And they say they got 25% on the public ARC test. And what they did was have other examples of some of the ARC tests, and in context explain the reasoning of why you went from one output to another output, and then now you have the current problem. And I think also maybe expressing the JSON in a way that is more amenable to the tokenizer.

Starting point is 01:27:06 And another thing was using the code interpreter. So I'm curious actually if you think the code interpreter, which keeps getting better as these models get smarter, is just the program synthesis right there, because what they were able to do was the actual output of the cells, the JSON output, they got through the code interpreter, like, write the Python program that gets the right output here. Do you think that the program synthesis kind of research

Starting point is 01:27:31 you are talking about will look like just using the code interpreter in large language models? I think whatever solution we see that will score well is going to probably need to leverage some aspects from deep learning models and LLMs in particular. We've shown already that LLMs can do quite well. That's basically the Jack Cole approach. We've also shown that

Starting point is 01:27:51 pure discrete program search from a small DSL does very, very well. Before Jack Cole, this was the state of the art. In fact, it's still extremely close to the state of the art. And there's no deep learning involved at all in these models. So we have two approaches that have basically no overlap that are doing quite well. And they're very much at two opposite ends of one spectrum, where on one end you have these extremely large banks of millions of vector programs, but very, very shallow recombination, like simplistic recombination.
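To make the discrete program search idea mentioned above concrete, here is a toy sketch: a tiny, hypothetical DSL of four grid primitives and a brute-force search for the shortest composition that is consistent with every demonstration pair. The primitives and the example task are invented for illustration; the DSLs described in the conversation have on the order of 100 to 200 primitives and use far more sophisticated search.

```python
from itertools import product

# A toy DSL of grid-to-grid primitives (hypothetical; real DSLs are larger).
def flip_lr(g):
    return [row[::-1] for row in g]

def flip_ud(g):
    return g[::-1]

def transpose(g):
    return [list(col) for col in zip(*g)]

def identity(g):
    return [row[:] for row in g]

PRIMITIVES = {
    "flip_lr": flip_lr,
    "flip_ud": flip_ud,
    "transpose": transpose,
    "identity": identity,
}

def run(program, grid):
    """Apply a sequence of primitive names to a grid, left to right."""
    for name in program:
        grid = PRIMITIVES[name](grid)
    return grid

def search(examples, max_depth=3):
    """Return the shortest primitive sequence consistent with all examples."""
    for depth in range(1, max_depth + 1):
        for program in product(PRIMITIVES, repeat=depth):
            if all(run(program, x) == y for x, y in examples):
                return program
    return None  # nothing within max_depth fits the demonstrations

# Two demonstration pairs for a hidden rule: rotate the grid 90 degrees clockwise.
examples = [
    ([[1, 2], [3, 4]], [[3, 1], [4, 2]]),
    ([[5, 0, 0], [0, 6, 0]], [[0, 5], [6, 0], [0, 0]]),
]
program = search(examples)
print(program)
```

The search space grows as the number of primitives raised to the program depth, which is why this end of the spectrum pairs a small DSL with deep search.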

Starting point is 01:28:20 And on the other end, you have very simplistic DSLs, very simple, like 100 or 200 primitives, but very deep, very sophisticated program search. The solution is going to be somewhere in between. Right. So the people who are going to be winning the ARC competition, and who are going to be making the most progress towards near-term AGI, are going to be those that manage to merge the deep learning paradigm and the discrete program search paradigm into one elegant way. You know, you ask like, what would be legitimate and what would be cheating, for instance? So I think you want to add a code interpreter to the system.

Starting point is 01:28:59 I think that's great. That's sort of legitimate. The part that would be cheating is to try to anticipate what might be in the test, like brute force the space of possible tasks and then train a memorization system on it and then rely on the fact that you're generating so many tasks, like millions and millions and millions, that inevitably there's going to be some overlap between what you're generating and what's in the test set. I think that's defeating the purpose of the benchmark because then you can just solve it without

Starting point is 01:29:27 needing to adapt, just by fetching a memorized solution. So hopefully, ARC will resist that, but you know, no benchmark is necessarily perfect. So maybe there's a way to hack it. And I guess we're going to get an answer very soon. Well, I think some amount of fine-tuning is valid because these models don't natively think in terms of, especially the language models alone, which the open source models that they would have to use to be competitive here, compete here. They're like natively language,

Starting point is 01:29:53 so they'd like need to be able to think in the, in this kind of, um, yes, the ARC-type way. You want to input core knowledge, like ARC-like core knowledge, into the model, but surely you don't need tens of millions of tasks to do this. Like, core knowledge is extremely basic. If you look at some of these ARC-type questions, I actually do think they rely a little bit

Starting point is 01:30:14 on things I have seen throughout my life. And for the same, like for example, like something bounces off a wall and comes back and you see that pattern. It's like I played arcade games and I've seen like Pong or something. And I think, for example, when you see the Flynn effect and people's intelligence as measured on Raven's Progressive Matrices increasing

Starting point is 01:30:34 on these kinds of questions, it's probably a similar story where now, since childhood, we actually see these sorts of patterns in TV and whatever, spatial patterns. And so I don't think this is sort of core knowledge. I think actually this is also part of the quote unquote fine-tuning that humans have as they grow up of seeing different kinds of spatial patterns and trying to pattern match to them. I would definitely file that under core knowledge. Like, core knowledge includes basic physics, for instance, bouncing or trajectories.

Starting point is 01:31:01 That would be included. But yeah, I think you're entirely right. The reason why, as a human, you're able to quickly figure out the solution is because you have this set of building blocks, this set of patterns in your mind that you can recombine. Is core knowledge required to attain intelligence, any algorithm you have? Does the core knowledge have to be, in some sense, hard-coded, or can even the core knowledge be learned through intelligence? Core knowledge can be learned. And I think in the case of humans, some amount of core knowledge is something that you're born with. Like, we're actually born with a small amount of knowledge about the world we're going to live in. We're not blank slates. But most core knowledge is acquired through experience. But the thing with core knowledge is that it's not going to be acquired, like, for instance, in school. It's actually acquired very, very early, in the first

Starting point is 01:31:46 like three to four years of your life. And by age four, you have all the core knowledge you're going to need as an adult. Okay. Interesting. So, I mean, on the prize itself, I'm super excited to see both the open source versions of maybe with a Llama 70B or something, what people can score in the competition itself, then to sort of test specifically the scaling hypothesis, I'm very curious to see if you can prompt on the public version of ARC, which I guess you won't be able to submit to this competition itself. But I'd be very curious to see if people can sort of crack that and get ARC working there and if that would update your views on AGI. It's really motivating. We're going to keep running the contest until

Starting point is 01:32:25 somebody puts a reproducible open source version into the public domain. So even if somebody privately beats the eval, or beats the ARC eval, we're going to still keep the prize money until someone can reproduce it and put the public reproducible version out there. Yeah, exactly. Like, the goal is to accelerate progress towards AGI. And a key part of that is that any sort of meaningful bits of progress need to be shared, need to be public, so everyone can know about it and can try to iterate on it. If there's no sharing, there's no progress. What I'm especially curious about is sort of disaggregating the bets of like, can we make an open version of this versus, is this a thing that's just possible with scaling? And we can, I guess, test both of them

Starting point is 01:33:03 based on the public and the private version. We're making contact with reality as well with this, right? We're going to learn a lot, I think, about what the actual limits of the compute were. If someone showed up and said, hey, here's a closed source model that, like, I'm getting 50 plus percent on. I think that would probably update us on like, okay, perhaps we should increase the amount of compute that we give on the private test set in order to balance. Some of the decisions initially are somewhat arbitrary in order to learn about, okay, what do people want? What does progress look like?

Starting point is 01:33:26 And I think both of us are sort of committed to evolving it over time in order to be the best, or as close to perfect as we can get it. Awesome. And where can people go to learn more about the prize and maybe try their hand at it? arcprize.org. Which goes live today. So, live now. One million dollars is on the line, people.

Starting point is 01:33:41 Good luck. Thank you guys for coming on the podcast. It's super fun to go through all the cruxes on intelligence and get a different perspective and also to announce the prize here. So this is awesome. Thank you for helping break the news. Thank you for having us. Thank you for having us.
