Latent Space: The AI Engineer Podcast - ⚡️How Claude 3.7 Plays Pokémon

Starting point is 00:00:01 Hey everyone, welcome back to another Latenspace Lightning Pod. This is Alessio, partner, and CTOA Decibel. There's no SWICS today. We got a special co-host, Vibu, which, if you're part of the Latent Space community on Discord, you're definitely seen. Welcome, Vibu as a co-host, first time. What's up, guys? In that we had David Hershey from Anthropic today, who's the person behind Cloud Place Pokemon. It's funny, I saw we first DM'd about playing Magic the Gathering together in NSF.

Starting point is 00:00:31 And then people are like... On all of the different nerd angles, you can get me. And then people were like, David is the person doing this. And I was like, okay, I'll DM him. And then, yeah, it was cool. We already had a touch point. So welcome to the show. This is our second Anthropic episode.

Starting point is 00:00:48 We had Eric Schlons from the sweet agent before. So welcome. Thank you. Glad to be here. Excited to talk Pokemon. Yeah. So let's give a little background on this. So Sonnet, Trubon, 7 came out a couple weeks ago.

Starting point is 00:01:01 I don't know. Time goes by this week. This week, I don't know, man. It feels like two weeks ago. And then you had this Cloud Place Pokemon thing that kind of went viral where if people remember there used to be this thing called Twitch Place Pokemon, where people could go on Twitch and kind of type in the chat and then busy like figure out what the next section that the emulator would take us.

Starting point is 00:01:20 What you've done instead is giving it and it's a cloud and basically have Cloud figure out how to walk through it. I'm looking at it right now. So far, it's been stuck in Mount Moon for 52 hours. poor guy, probably met 15,000 Zubets. So yeah, let's talk about what gave you the idea for it, kind of the origin story that we can go through the implementation. Totally.

Starting point is 00:01:42 Yes, I actually started working on it in like June of last year for the first time. And for me, so I work with customers at Anthropic. And I just like really wanted to have some way for myself to be able to like experiment with agents like in a real way. Some framework, some harness where I could actually just like go to town and try some different things and see what actually worked to get caught to do like pretty long running tasks in general. And so I like had that in one hand. And then I was like, okay, what is the thing that will make me the most addicted to making

Starting point is 00:02:10 this work? Like how will I grind the hardest actually trying this? And Pokemon was like a pretty clear answer. Someone else that Anthropica actually like tried once to hook it up. So I had a little bit of like the shell of what I needed to actually put together and to like kick off what became an obsession a little bit in the coming months. So yeah, like I played with it in June in the switch. trying things out. This was like Sona 3.5 came out in June last year, which is when I

Starting point is 00:02:35 started kicked it around. It's very good. But you can see like kind of signs of life, but like not much really happened. And then ever since then as we released new models, it's sort of been like the way that I get to know one of our new models a little bit, right? So we released the new version of Sona 3.5 in October and like use this to like really kind of see like what's it better at. And it got better. Like you could see it start to like, it could get out of the house somewhat reliably, which was not always true and it got a starter and it like even named it sometimes. Like it was like doing stuff. Not great, but like it could move.

Starting point is 00:03:05 Along the way to like, I'm just like we have a Quad Place Pokemon Slack channel. Like I'm sort of just like giving people updates. So over time as I'm like posting JIFs and up on these updates, like I'm this is like slightly growing a popularity of a cult following internally of people who are someone interested. But then like a couple of weeks ago, I was bashing an early version of Sonnet 3.7 and it just like, you can just tell it had like, it was a little different. It's clearly not still good. As you said at the top, like, it's, it's in Mount Moon for its 50-something hour. This is a little bit

Starting point is 00:03:37 worse than average from what I've seen so far by now, but like, this is like, you know, about on brand. It doesn't really have a great sense of direction. It's pretty bad at seeing the screen, stuff like that. But like, it plays the game, you know? Like, it gets Pokemon. It catches Pokemon. Like, it caught its first Pokemon. It got out of the radio the first time. Like, a whole bunch of stuff happened for the first time. We're like, could squint and see a thing play in the game. And yeah, like, posting updates, obviously internally. It was very fun. Like, people were just, like, kind of going wild at the fact that this was actually happening finally. And it was, like, entertaining enough. They were like, like, you could kind of see it. And the other side is, like,

Starting point is 00:04:12 we kind of just, like, got finally a sense that this was, like, an actually useful way to measure what was going on with this model. You know what I mean? Like, there's one thing that it's, like, fun and fun follow along, but like internally, like, I think we got more of a sense that, like, you could actually use this as a bit of a measuring stick for what's going on in the model. I've spent, you know, in a how many hours I spent staring at quad play Pokemon. I've switched. I have to have seen and read, like, millions of words that Claude has generated in the course of playing Pokemon over the last eight months.

Starting point is 00:04:40 So, like, you can kind of get a feel for like what's actually going better or what's getting better at and that kind of thing. And with this particular release, like, I think the fact that it got this much better at this kind of reflects a lot of things that we wanted to be true about the model to begin with and and those sort of lined up were like, okay, maybe this is like an interesting way to actually tell people about what's going on here for a crowd that maybe doesn't like quite know as much about software engineering and all the other ways. We've told people about agents in the past. Yeah. Were there any other games that you consider to me seems like Pokemon is good

Starting point is 00:05:10 because it's like, you know, isometric, you know, it's kind of like flat, so you get any scored and it's, it doesn't have too many hidden facts about objects, you know, kind of like everything it's described. Did you consider anything else or was Pokemon just kind of like by far in away the first choice? I didn't, but it's mainly because like Pokemon was the first game I ever got as a kid, right? This is like purely coming out of my own nostalgia. But also like the choice played Pokemon, like I was also something that I cared a lot about a decade ago or whatever that was. Um, at least it's not a decade ago. I think it's actually a year ago. I'm sorry. Yeah, painfully. Um, and I in.

Starting point is 00:05:47 11 years ago. Yeah. February 2014. Yeah. That's nuts. Pokemon Red is 20 years ago. Oh my God. 20, 25 at least.

Starting point is 00:06:02 So yeah, from you, it was that, like, since then, there have been a lot of people in the problem. We were like,

Starting point is 00:06:05 oh, we can do this, we can do this, we can do this. I think there's like a lot of fun things you can do. Pokemon's actually really nice because, like,

Starting point is 00:06:12 if you don't do anything for five seconds, like, there's typically not a consequence by the nature of, like, doing inference on a model every, like, a snapshot of time.

Starting point is 00:06:19 It's actually a pretty good, game to be able to do this with. But yeah, it was mostly just like my love for Pokemon coming through here. You put together a very nice architecture diagram. Do you want to screenshot that so people on YouTube can follow along and that we'll put in the show notes if you are just listening. I know that Vibu had a bunch of questions on that too. Yeah, let's do it. Very, very straightforward questions. Basically, can we just double click into all of it? Yeah, yeah, yeah. It's easy. I found it off Twitch and like no one was talking about it, so I started sharing it around, and I lost the original source, but basically

Starting point is 00:06:54 everything in here is like pure gold. The memory is a little interesting, but yeah, if you want to just go through high level. Yeah, you got it. Yeah, I want to, like, preface that I do not claim this is, like, the world's most incredible agent harness. In fact, like, I explicitly have, like, tried not to like hyper-engineer this to be like the best chance that exists to beat Pokemon. I think it'd be like trivial to build a better computer program to beat Pokemon with quad in the loop. This is like meant to be some combination of like understand what quad's good at and benchmark

Starting point is 00:07:28 like and understand quad alongside a simple agent harness. So what that boils down to is this is like a pretty straightforward tool using agent from my perspective is how I would frame it. So at the end of the day, like the core loop is, is just like having a conversation that rolls out. And it's essentially like you build the prompt, including like everything we've had up till now. You call the model.

Starting point is 00:07:53 It sends back some tool use typically. You resolve those tools. And then talk about summarization, but like basically some, a few different mechanisms to maintain the information you need to do something long running inside the context window. So like what this boils down to is like when you think about what, actual prompt looks like.

Starting point is 00:08:15 It rolls out kind of like this. You've got tool definitions, which describe three tools that I'll get to in a second. A short system prompt, it's like pretty boring. It basically tells the model how to use the tools. And like there are about six facts about Pokemon that I give it and like a few corrective things that I've seen it do like really horribly wrong. And I'm like, hey, you might want to consider doing this a little bit better. But it's like really not a lot of system prompting going on.

Starting point is 00:08:40 We have that knowledge base which referred to you. I'll talk about. is the main way it stores like long-term concepts and memories as it's operating over time. And then the bulk of things is this conversation history, which is, it's like a chain of tool use. There's no like user interjections at all for the most part. So it's like go and then the model uses the tool and then it gets a result back and then it uses another tool and it gets result back. So pretty straightforward. Feel free to like cut me off to if you've got questions along the way, but otherwise I'm going to keep rocking.

Starting point is 00:09:11 Yeah, yeah, go ahead. Cool. Okay, so most of the money of this is just, like, in the tools themselves. When you think about what's going on, it's really like, it can press buttons and it can, like, mess with its knowledge base, and that's about it. I'll talk about navigator separately because it's, like, a patch for how it actually can deal with some of its vision deficiencies. Using the emulator just basically, like, execute a sequence of button presses. It'll say, like, press A, B, left, right, whatever. It gets back a screenshot and screenshot overlaid with coordinates of the game.

Starting point is 00:09:51 These coordinates are used for this navigator tool that I'll try in the second, but it's just basically like help quad get a slightly better of spatial sense of what's going on on a Game Boy screen. I've been through it a lot. Sorry, does it come with the emulator, or are you adding those in? I add that in. Okay. I have somewhat extensively reverse-engineered Pokemon read by this point to, like, extract roughly every bit of possible information from it.

Starting point is 00:10:14 I don't use most of it, but like I have essentially everything you could know about the current state of the game. I have exposed programmatically to be able to tinker with it at this point. I was just reading this diagram. Like, yep, you just get what spaces are walkable based on what's stored in RAM. And I'm like, oh, you definitely reverse engineered this little. Yeah. Good news is we also released Claude this week, if you saw that. And that has been, this would all not be possible without the help of having Quad also go figure out how to do all of this for me.

Starting point is 00:10:43 because I could have done it, but there's a lot of, like, tedious, here are addresses in memory, map that to a Python program that I had no interest in doing. So, thank goodness for Quad Code. So, yeah, it gets these two screenshots. It gets like a small blurb of state, which I read straight from the game. There's a lot of this here. Actually, like, funny enough, the thing that matters is location. Quad will, like, pretty aggressively hallucinate that it succeeded in transitioning between zones.

Starting point is 00:11:13 if you don't like tell it it did not. This just comes down to like literal vision issues. And so like most of the patching of extra help I've given it been like attempts to make it so that it could still play despite not being very good at seeing Game Boy screens in particular. And then it gets like a handful of like reminders. This is this reminders does a decent amount of work. But it's like things like, you know,

Starting point is 00:11:36 remember to use your knowledge base occasionally. And we tell if it gets like stuck, for example. So if you detect that it hasn't moved in 30 spots or 30 time steps, I once saw it see like a red box on the screen that was like the doormat and think it was a text box and spend 12 hours pressing A overnight to try to clear the text box, which you see that happen once and you add in some helpful reminders to not do that. How much knowledge does the model have about the game itself, you know? So for example, types, right?

Starting point is 00:12:11 Yeah. know about types, weaknesses, and things like that, or how much are you trying to put into it? Yeah, if you go to quad.aI, like, it will tell you about, like, some stuff. I have not yet decided if the knowledge that it has about Pokemon is helpful or harmful towards it playing the game. Like, half of the time when it's like, oh, I know this about Pokemon, it then, like, uses that to hallucinate something.

Starting point is 00:12:35 So, for example, at the beginning of the run on Twitch, you saw it, like, go out of the lab and see this NPC in the bottom of Palatown and be like, it's Professor Oak. I found him. And it's like very much not Professor Oak, but like the fact that it has like indexed on this concept is like a little stuff like that that it's like unclear to me where it is. But it clearly has some information about it.

Starting point is 00:12:56 There's like a million game guides about Pokemon sitting on the internet. It's unsurprising that like there's a decent amount of information there. I don't really give it a lot of extra information. It picks things up. I watched on the stream the other day. like it tried to use Thundershock on a geo dude and it failed and it's like hmm i forgot about that that does not work and so like clearly there's like it knows some stuff it's not perfect it it picks some stuff up as it goes through the run ideally for me like i think it's just

Starting point is 00:13:25 interesting to see like what it actually learns as it's playing so the more it does that is the more i'm like actually interested in it yeah the one of our the score members not jungi had a good question about the sense of self yeah like sometimes it gets confused who is the actually playable character in the scene? Like, how do you steer that? Yeah, I think like sometimes it gets confused. It can be applied to many things in quad playing Pokemon, in particular when it's trying to like look at the screen and understand what's going on. So I have like attempted to prompt it all sorts of ways, like you are at this exact coordinate and you're in the middle of the screen and you're wearing a red hat and things like that.

Starting point is 00:14:05 And like that's all neat, but Quad doesn't particularly understand like the middle of a Game Boy screen and a whole bunch of concepts like that, which means, like, you can prompt all around everywhere, but like this kind of like spatial awareness and where something is with respect to something else is something that Quad's still just like not great at in its current incarnation. So one of the side of physicists is sometimes this track of who it is on the screen and thinks there's something else there. I will keep trekking through this. So I hinted at this like other tool that I give it called Navigator.

Starting point is 00:14:35 And this is just like the only other patch that I have for the, the vision issue. So Navigator basically what it does is like Quad can say it wants to go to one of these coordinates that we provide in the screenshot. And then we like automatically press the buttons to get there. It has to be something on the screen. Like I'm not trying to let Claude just like navigate a whole map by asking to politely. But one thing you'll notice if you run it without this tool is if like Quad wants to get from one side of a wall to another side of the wall, it like happily just tries to walk through the wall repeatedly because it doesn't quite have the concept of like what's between it. And I spent a lot of time like prompting around this and it just like isn't, it's just not, it's one of those things not very good at.

Starting point is 00:15:19 So in order to make it somewhat fun to learn from quad playing Pokemon at all, we use this navigator tool, which like helps it actually get around a little bit better. So since we covered a bit about the different tools, the prompting and the strategies, I'm curious how many tokens all this is using. Like there's a part to conversation history and truncating parts of them. once it's using state. Yeah, like, yeah, at a high level, how many tokens is this using? And then can we kind of go into where those are coming from what's being truncated? Yeah, you got it. When you think about the prompts here, essentially like every step, something that looks

Starting point is 00:15:56 like this gets sent. So if you just go through what each of these looks like, everything in the system prompt is probably like a thousand tokens, pretty small, like a handful of paragraphs. knowledge base, I let get up to like 8,000 tokens. So I put some like arbitrary cap on it. So it doesn't go to like, Claude will write, put a whole bunch of BS in there if you just let it keep writing stuff. So like the cap helps constrain it to like try to think about what's actually important a little bit. And then the conversation history, I haven't like kind of finicky, but it basically rolls out, um, 30 messages. That's actually like something you can tune. I've tuned it to be 30 messages about like the

Starting point is 00:16:38 best performance I've gotten. And so what that means is it basically like, use the tool, get a response back. Use a tool, get a response back. It's allowed to do that 30 times. And then at that point, it triggers the summary, which takes that conversation history, summarizes it, makes it the first user message. And then we'd kind of roll back out again. So the bulk of the tokens end up being in the conversation history once it's his longest.

Starting point is 00:17:03 In fact, like this, the bulk past that ends up being these screenshots, which are scaled up a decent amount to fit in. I do actually like, I allowed to see a number of the previous screenshots, but not all of them because you start like, it ends up being a ton of context. If you'd let it see like even 30 turns worth of screenshots. So I'd trim out a few. That's where the bulk of the actual tokens are. So in practice, this rollout ends up like at max ending up around 100,000 tokens, I think,

Starting point is 00:17:31 is where it is like the longest message you ever send to the API on one of these turns. And it will, it will fluctuate. in like summarization depending on state of knowledge base, probably between like 5,000 and 100,000 tokens. And is that like per action state of the game? And roughly, do you have like a high level ballpark estimate of how long this, how much and how long it costs to run this? Like let's say people want to compete.

Starting point is 00:17:55 Yeah, yeah, yeah, like how much will this be? I think you'd really want to think about running this as a side project in terms of the impact on your personal wallet and how much you care about Pokemon. It's not clear to me that without the blessing of Anthropic, I would have decided to take on, take on this project for my own wallet's sake, especially if you want to, like, experiment and, like, try 10 different things. I mean, it's costly. I don't know, like, I haven't spent a lot of time on the exact number.

Starting point is 00:18:26 It's not that hard to estimate if you, like, I just told you a bunch of numbers, you can kind of back it out. But, like, I think to, like, do a lot of experimentation, there's, like, at least thousands of dollars of tokens being consumed. So it's not a, it is not a cheap rollout. Yeah. But yeah. In the scheme, also, how some people use tokens, it's not terrible.

Starting point is 00:18:48 How many turns are you keeping in memory before you summarize? It's 30 right now. Yeah. I've tried more and less. I think like one thing you see a lot when you talk to people building agents is there's like some effective context length that actually like has the model be the smartest. And that seems to very slightly model by model, but for this model, for whatever purpose, like this 30 message, work better than 20 and better than 40.

Starting point is 00:19:16 So kind of plot in between those that it worked pretty reasonably. Yeah. Does that change based on location? Like, how many would you want to give it to get it out of Monmoon? So we got to bring it in plot home. We can't let him stay for another 57 hours. I actually am not sure. Like, I've tried posting, like, I can have a ton of screenshots, like 20 or 30 screenshots at a time, be able to see.

Starting point is 00:19:42 And it's, like, not obvious that, like, that temporal concept is actually super relevant, relevant to it. And again, this is just, like, trust me, as someone who has spent, like, a lot of hours obsessing over this, you can try to prompt quad a lot of different ways to understand how to navigate better. And anything short telling it exactly what to do does not improve. it's like actual navigation. It's just like not a skill it's great at. It's like good enough to to like random walk its way through some of the complex mazes.

Starting point is 00:20:14 And in like good easy areas, it's pretty good at popping around. But yeah, I think I can tell you if there was like a way to prompt this slightly different that would navigate better. I would believe there is something, but it is not like,

Starting point is 00:20:27 it is not an easy lift. Yeah. Yeah, I just asked Cloud AI right now. How do you get through a moon in Pokemon Red? it does have a plan but I don't I don't know I don't know if it's the right I don't know if it's the right plans

Starting point is 00:20:41 I have seen it come up with a lot of answers to that question and most of them are right this is part of the pain when I talk about I'm not sure if its knowledge is better or worse like you see it usually fixate like oh I know the exit is on the eastern wall and it just like spent 12 hours trying that and yeah it's like unclear to me

Starting point is 00:21:01 that we're actually not just like harming it by having it think it knows the answer Yeah. I think that's the interesting part, right? Like, you don't want it to just know the answer. Yeah. The model clearly knows a lot about the game. There's like EV-I-V-maxing.

Starting point is 00:21:14 Pokemon was very, very extreme. But, like, if that's what you wanted, we could just hook it up to a knowledge base, like hook it up to a guide if you know how to be Pokemon Red. But the interesting piece here is actually, like, can it figure out what to do without just memorizing the path through? That's exactly right. Like, that's part of why, you know,

Starting point is 00:21:32 I don't know, part of what I've realized, putting it out in the world as people will draw their line of where purity is anywhere on the spectrum. Like, is it is this cheating? Like, yeah, maybe. Who knows? Like, frankly, like, I don't particularly care. The main insight that I have is, like, when we put this out, like, you learn a lot about what the model is good and bad at by staring at it.

Starting point is 00:21:56 And that's kind of what I like about it. So evaluating the model is kind of separate than your emulator and how it can use an emulator, right? Like, we can always improve those things. I'm curious, as you switched from 3.5 to 3.7 and sort of reasoning models, were there any degradations there? Like, did it kind of get worse at anything? And was the prompting somewhat consistent? Like, a lot of what we've seen with different reasoning models is like, you kind of prompt them differently, right? You tell them what to do, let them figure it out. But, yeah, any, any insights there.

Starting point is 00:22:27 Yeah. Yeah, that's a good question. One thing that's nice about three or sevens on it is it's like this hybrid reasoning model. So like it kind of can do the old thing and the new thing. And it's actually pretty good at just like being an out of the box model and having this like thinking mode where it can spend time reasoning. So I didn't like really run into any like serious degradations. The one thing I'll say is like literally every model that has come out with Pokemon, like the main change that I have made to this agent is deleting prompt stuff. Like there's a whole bunch of like band-aid prompt stuff I've added in the past. It's like trying to like steer it away from doing a lot of the things that it got horribly stuck doing in the past. And as the models

Starting point is 00:23:13 get better, I found that just like making sure it's as simple as possible and giving them as much sort of like free reign to try to solve a problem as possible is useful. And like the way I think about this is I'm like less confident over time that I understand exactly how a model is intelligent, right? Like, it's capable of all of these, like, ridiculous things. It does PhD level stuff in some ways and, like, is unable to see a screen as well as a four-year-old in other ways. But, like, my confidence in, like, exactly what I need to tell it to do to be smart at playing Pokemon is actually, like, really small right now. If I tell it, this is the way you need to solve this problem. That might not actually be the best way for 3.7 sides to solve this problem.

Starting point is 00:23:53 It's, like, just different than I am in terms of how I thinks about these things. I found that just like kind of like pulling some of the unnecessary instructions where I tried to like use my intuitions about what would make the model better out of the prompt overtime is the thing that just like sort of consistently as models got smarter, gotten more juice out of this. I was watching the stream yesterday or the day before and it was a very tense battle. I think they were like down to like 2 HP each and like the opposing Pokemon like missed a scratch or something and it didn't die. And like you could tell it like if I was.

Starting point is 00:24:28 like, wow, it was like very dramatic and I was talking about the game. How, yeah, is there any thought being put into like trying to have it more? Like, do you prompt it to be more rational to let it know that it's not a real life, that it's a game? It's like, it feels like it gets very distressed when they're actually, the Pokemon's are actually going to die. It's funny. They, it knows it's like, you're playing Pokemon Red. Like, it does know that and it has a sense to that, but it clearly rose us some attachment. I'll tell you a fun story. We tell it to nickname its Pokemon now.

Starting point is 00:25:05 It will occasionally do without it, but it's like more fun if it nicknames its Pokemon. So that's like in the prompt is like, it's fun if you nickname Pokemon, you should consider it. And one thing we found when we started doing that is it got more protective of the Pokemon it nicknamed. Like it's pretty obvious. Like when it catches a Pokemon, now that it has a nickname, it will like go heal it right away if it's hurt. And that did not ever happen before.

Starting point is 00:25:27 Which is pretty, like, so there's some cute little things, cute quirks about Quad who really wants to protect its precious nicknamed Pokemon, which is great. So I will say it's kind of normal. Like, like when I was five playing Pokemon Red and, you know, I had 2 HP in a midst of Scratch, that meant everything. That was existential. I agree.

Starting point is 00:25:46 I agree completely. How about skilled transitioning? So one question that I had, so you're playing Pokemon Red, right? So you want to play silver or gold next? Have you thought about how models can kind of learn from these games and, like, store these learnings and then use them again in the future? I'm sure it's not part of the project today, but here's your thoughts. I've thought about it only a little bit, which is, like, I think there's some, like,

Starting point is 00:26:11 when you actually read one of the knowledge bases that it has gained, like, on some of the longer rollouts when they're good. Like, there's actually some, like, pretty decent tidbits about how it should act and try and do things and, like, some of the ways it's succeeded in. And actually, like, one of the things that's most unique about, 3.7 sonnet that I've seen is like it will have like meta commentary on what it's good at and bad at and its knowledge base like I misperceived of this thing and so like I need to be careful doing that again you occasionally see show up there which is um which pretty cool so like I could imagine there being some way to like translate that knowledge base from one game to another I think my knowledge base is frankly like kind of cluegy of an implementation right now like it's like more or less a python dictionary that's appended to the prompt And I think, like, you could, you could find better ways if, like, your goal is to transfer across games and things like that to manage a knowledge base that Quad can actually, like, use more or well in different scenarios. But there's definitely pieces there that, like, I think it would get, be off on a better foot on the next Pokemon game if it had that.

Starting point is 00:27:20 Or even if, like, I were to restart the stream, it would, like, have some, some tidbits that it would probably, like, speed up if it, like, had access to things that I learned in the past that, It's interesting. Yeah. Yeah, I always think of that in card games, you know, like you have the idea of like temple in a card game and it's like, you know, it's the same magic as it is and, you know, Star Wars, Flesh and Blood, all these different things. I feel like games is similar where like learnings you get from Pokemon you can bring over to similar kind of like open world games.

Starting point is 00:27:50 I think it's also like particularly interesting for some of the things that are like how quad learns how to play a game in general where it's like a pressing too many buttons. and once is a bad idea. Like, I lost trouble. What's going on? That kind of thing. Like, definitely is stuff that it has learned that is, like, interesting in a meta way. That it's, like, hard to give it that sense of self necessarily in training, I think, sometimes.

Starting point is 00:28:12 Like, it's hard for it to know, like, what it's getting bad at in some scenarios. But it's interesting to think about how it can learn across things. Well, like, some of this also is due to a simulator, right? So a lot of what's learning is how do I use a simulator? What am I good and bad at? But the model internally should know quite a bit about Pokemon, right? like if you've played Pokemon going from Pokemon red to Emerald to Diamond having played the first one doesn't help you that much in the second right you kind of get the general concept you get what types are good against other types and the model model knows a good bit of this right but it's still interesting to show this is more so like it's shows that knowledge basis kind of help with understanding how to use the emulator right like it struggled and then it figured it out so you know with Pokemon it's like this thing can now learn how to use it's Yeah, which is pretty cool.

Starting point is 00:29:01 That has been like part of what's been fun seeing all as my progress on this thing. I had a bit of a follow-up question to the last one with the last year. So if people want to blow thousands of dollars and want to, you know, improve this a little bit, is there anything else that you'd want to see done, whether that's like improve emulator, try different stuff? Is this just anything that like anyone watching this, you'd kind of hint them towards what you'd want to work on, what they'd want to work on? Yeah, no doubt.

Starting point is 00:29:28 If I had to guess, like, the, the biggest lift that exists around this is probably something around the memory, which I don't think is, like, hyper optimized right now. The nice thing about the memory is, like, it's always in the prompt. Like, it doesn't go away. Like, sometimes if you leave it up to quad to try to, like, read and load and save to memory basis, like, it will underutilize it or forget things. But I think there's probably something there. I will say, all of the many, many hours I've spent tweaking around the edges of this thing. Nothing quite does it like a new model though. Like fundamentally, I think the limitations right now are like some smarts things.

Starting point is 00:30:09 Like I've seen, and I mean this in the kindest way, but I've seen a lot of people in Twitch tell me about ways that they could fix the navigation capabilities with a better prompt. People would be welcome to try, but I would guess that would be like a somewhat fruitless avenue. I don't think, I think it's just not very good at understanding. At the first time, I'll give you a very quick anecdote, which I think is like my favorite for. like why this is particularly hard. I have this clip of Quad leaving Oaks Lab and being like, great, I left Oaks Lab. Now I need to go up to the north end to go to Route 1. And it just like hits up on the D-pad and goes straight back into the lab.

Starting point is 00:30:46 And it's like, shoot, I'm back in the lab. I need to leave and it hits down. It's like, great, I'm out of the lab. Now I can go up to Route 1 and it's straight up. It just like goes up and down 12 times. And it's like, you're not fixing that with a prompt. It just literally doesn't get it. It doesn't understand.

Starting point is 00:31:03 And so it's pretty hard to make like a little around the edge of changes that like make a huge, huge difference. Yeah. I mean, I've always been fascinated by the fact that Twitch Place Pokemon actually beat the game. Yeah. From a, you just look at it and you're like, this cannot possibly work because you have people trying to sabotage it too in the chat. Not everybody's trying to solve it. What? So I just like that up.

Starting point is 00:31:26 It took 16 days and seven hours for Twitch Place Pokemon to be read. How close do you think? we are to a model that can beat it in less than 16 days. And do you think it needs like some core, like model really big jumps or like, do you think it's like we're close? I think it, I think there is model stuff, at least from quad. Like I am confident there's model stuff that needs to happen for it to be like really capable. I can have like four spots in the game stuck in my head.

Starting point is 00:31:55 It's like, I think there's literally no hope it's going to get through that. So I think there's like a gap that's mostly around like a tabillian. to like see and navigate and remember visually like what's going on that I just don't think is like we figured out yet. So to me that's like a pretty big gap. I do expect like I think it's going to keep getting better. Like I have no reason to believe that this is not just like a fundamental like ability to scale, learn and understand problems thing that I think is getting better as we train models to be more capable as sort of these like long horizon tasks. Like I actually do think this is like a pretty reasonable proxy of that and I think it will continue to get better for a little

Starting point is 00:32:33 while. I don't know if there are like affordances around images and videos and stuff like that that we need to figure out to make it work. It's like unclear to me if that's true or not. But yeah, I think we have a little ways before we can beat the game in 16 days. I do not have a lot of faith that the current stream is going to be the ending in Victory Road in 13 days. What's been your favorite moment from like building this to think of the idea to just seeing it play? Any like major highlight? I think like the the hypeest I have been is, uh, when it beat Brock the first time, where I was just like, you know, I've been doing this for eight months.

Starting point is 00:33:08 And then like a few weeks ago, like I kick off a run, wake up the next morning. And it's like, oh my God, oh my God. And it was the other good thing about it is like I woke up at 8 a.m. And I checked my, I have it send me updates to Slack. This is like ridiculous things. But it's like literally like about to start the Brock battle. Like I opened my phone. It's like, oh, this is like happening right now. And it's like a pretty hype way to start a day. I think that was my highlight. I have a lot of like other cute things, like some of the cute nicknames that's done over time and things like that are endearing. But that was like the peak hype for me.

Starting point is 00:33:40 It was like, we beat a gym leader. Like we've got a badge. Like, quad's doing it, you know. A bit of a follow up. So I noticed that you mentioned it eventually started beating multiple gym leaders. Were these all the same run? Was it different ones? Was it?

Starting point is 00:33:54 Yeah. I have like, the run that you saw that's like on the graph we put out alongside I had, like, in our research blog, is like a single run that I have watched, like, get through at least Surge's gym, and then it got a little past that. And the reason that that's where we stopped reporting is because that's, like, the physical amount of time that occurred between when I started it and when we watched the model. That's like, I was a very hyper, hyper up-to-date graph on the best run we had. Awesome.

Starting point is 00:34:24 I know we're running out of time. My last question is, are we going to work on Magic on Cloud Place Magic? on Cloud Place Magic next? Or maybe we can do like the Magic Arena in Trudge House. Yeah, funny story. There was a project I did right before I joined Anthropic that was like training an open source model to like slightly be better at picking draft

Starting point is 00:34:44 or cards in a draft. Like I was training it on like the 17 lands data that exists to like learn how to how to pick cards out of a Pax a little bit better. And I did talk about that in my interview to get hired at Anthropics. So if I've put time into this, I'm ready, I am ready for that project, too, that I have that code sitting around as well somewhere. I really get into all my nerd. Her nerd ML slash gaming hobbies here.

Starting point is 00:35:12 Yeah, no, I'm ready. I don't know if you're planning on open sourcing any of the Pokemon stuff, but if you want to work in open source on the magic stuff, I'll be happy to collaborate. Awesome. We're talking about it. I don't know yet what the plan is. there's like a certain amount of like this is not my day job that I have to figure out how I want to deal with that we'll see yeah um awesome David any parting thoughts anything people have missed no I think like the one thing I do like to drive home when I've been talking about this is like

Starting point is 00:35:43 I really do think like this is just demonstrating like a thing that is going to make agents better with this model you know like this is a very fun way to see it but like I think the thing is that it has some ability to like course correct update and figure things out a little bit better than models have in the past. And even if there's like stuff it's dumb at, like, it tends to have ability to like power through it in a new way. And so I think what it's decided to me is just like, I think there will be some real world stuff that comes out of this model once people play with it. And I'm pretty excited to see like how people take the skills we put on display a little bit here or lack thereof in some cases and figure out how to turn them into actual agents that do stuff.

Starting point is 00:36:21 Have a quick last question on that actually. Is there any, guidance or any way that you like quantitatively measure the evals of this system like a lot of it is vibes a lot of it is how far it gets where it gets stuck but like are there are there any lessons or any specifics about how you measure how it actually does so i've done a lot of like little small tests of like put it in this scenario and see what it does but i like frankly the best test i have is just like run it 10 times on this configuration and like see how quickly it progresses through milestones of the game it means the best thing about games right like It's why the games are such a useful thing.

Starting point is 00:36:57 There's literal benchmarks of gym badges that are moments of progress in a game, which are ways to evaluate what happens. And so I think how quickly it's able to make progress is actually a pretty reasonable, like, eVow, if a slightly expensive one to calculate. It's an integration test, not a unit test. Awesome, David. Thank you for joining. Thank you, Vibu, for filling in on the host site too.

Starting point is 00:37:20 Yeah, my pleasure. Thank you for having, guys. I appreciate it. Awesome. Good to save.

Latent Space: The AI Engineer Podcast - ⚡️How Claude 3.7 Plays Pokémon

There aren't comments yet for this episode. Click on any sentence in the transcript to leave a comment.