Latent Space: The AI Engineer Podcast - ⚡️How Claude 3.7 Plays Pokémon
Episode Date: March 4, 2025Special lightning pod with David Hershey from Anthropic, the person behind Claude Plays Pokémon. Sonnet 3.7 is currently trying to complete Pokémon Red live on Twitch thanks to a special harness tha...t David built so that it can see the screen, navigate through it, remember facts about the game, and more. (Since recording, it has successfully escaped Mt Moon! You can follow along on Twitch: https://www.twitch.tv/claudeplayspokemon) This is a public episode. If you'd like to discuss this with other subscribers or get access to bonus episodes, visit www.latent.space/subscribe
Transcript
Discussion (0)
Hey everyone, welcome back to another Latenspace Lightning Pod.
This is Alessio, partner, and CTOA Decibel.
There's no SWICS today.
We got a special co-host, Vibu, which, if you're part of the Latent Space community on Discord, you're definitely seen.
Welcome, Vibu as a co-host, first time.
What's up, guys?
In that we had David Hershey from Anthropic today, who's the person behind Cloud Place Pokemon.
It's funny, I saw we first DM'd about playing Magic the Gathering together in NSF.
And then people are like...
On all of the different nerd angles, you can get me.
And then people were like, David is the person doing this.
And I was like, okay, I'll DM him.
And then, yeah, it was cool.
We already had a touch point.
So welcome to the show.
This is our second Anthropic episode.
We had Eric Schlons from the sweet agent before.
So welcome.
Thank you.
Glad to be here.
Excited to talk Pokemon.
Yeah.
So let's give a little background on this.
So Sonnet, Trubon, 7 came out a couple weeks ago.
I don't know.
Time goes by this week.
This week, I don't know, man.
It feels like two weeks ago.
And then you had this Cloud Place Pokemon thing that kind of went viral where if people
remember there used to be this thing called Twitch Place Pokemon, where people could
go on Twitch and kind of type in the chat and then busy like figure out what the next
section that the emulator would take us.
What you've done instead is giving it and it's a cloud and basically have Cloud figure out
how to walk through it.
I'm looking at it right now.
So far, it's been stuck in Mount Moon for 52 hours.
poor guy, probably met 15,000 Zubets.
So yeah, let's talk about what gave you the idea for it,
kind of the origin story that we can go through the implementation.
Totally.
Yes, I actually started working on it in like June of last year for the first time.
And for me, so I work with customers at Anthropic.
And I just like really wanted to have some way for myself to be able to like experiment
with agents like in a real way.
Some framework, some harness where I could actually just like go to town and try some different
things and see what actually worked to get caught to do like pretty long running tasks in general.
And so I like had that in one hand.
And then I was like, okay, what is the thing that will make me the most addicted to making
this work?
Like how will I grind the hardest actually trying this?
And Pokemon was like a pretty clear answer.
Someone else that Anthropica actually like tried once to hook it up.
So I had a little bit of like the shell of what I needed to actually put together and
to like kick off what became an obsession a little bit in the coming months.
So yeah, like I played with it in June in the switch.
trying things out. This was like Sona 3.5 came out in June last year, which is when I
started kicked it around. It's very good. But you can see like kind of signs of life, but like not
much really happened. And then ever since then as we released new models, it's sort of been like
the way that I get to know one of our new models a little bit, right? So we released the new version
of Sona 3.5 in October and like use this to like really kind of see like what's it better at.
And it got better. Like you could see it start to like, it could get out of the house somewhat reliably,
which was not always true and it got a starter and it like even named it sometimes.
Like it was like doing stuff.
Not great, but like it could move.
Along the way to like, I'm just like we have a Quad Place Pokemon Slack channel.
Like I'm sort of just like giving people updates.
So over time as I'm like posting JIFs and up on these updates,
like I'm this is like slightly growing a popularity of a cult following internally
of people who are someone interested.
But then like a couple of weeks ago, I was bashing an early version of Sonnet 3.7 and it
just like, you can just tell it had like, it was a little different. It's clearly not still good.
As you said at the top, like, it's, it's in Mount Moon for its 50-something hour. This is a little bit
worse than average from what I've seen so far by now, but like, this is like, you know,
about on brand. It doesn't really have a great sense of direction. It's pretty bad at seeing the
screen, stuff like that. But like, it plays the game, you know? Like, it gets Pokemon. It catches
Pokemon. Like, it caught its first Pokemon. It got out of the radio the first time. Like, a whole
bunch of stuff happened for the first time. We're like, could squint and see a thing play in the game.
And yeah, like, posting updates, obviously internally. It was very fun. Like, people were just,
like, kind of going wild at the fact that this was actually happening finally. And it was, like,
entertaining enough. They were like, like, you could kind of see it. And the other side is, like,
we kind of just, like, got finally a sense that this was, like, an actually useful way to measure what
was going on with this model. You know what I mean? Like, there's one thing that it's, like, fun and fun follow
along, but like internally, like, I think we got more of a sense that, like, you could actually
use this as a bit of a measuring stick for what's going on in the model.
I've spent, you know, in a how many hours I spent staring at quad play Pokemon.
I've switched.
I have to have seen and read, like, millions of words that Claude has generated in the
course of playing Pokemon over the last eight months.
So, like, you can kind of get a feel for like what's actually going better or what's
getting better at and that kind of thing.
And with this particular release, like, I think the fact that it got this much better at
this kind of reflects a lot of things that we wanted to be true about the model to begin
with and and those sort of lined up were like, okay, maybe this is like an interesting way to
actually tell people about what's going on here for a crowd that maybe doesn't like quite
know as much about software engineering and all the other ways. We've told people about agents
in the past. Yeah. Were there any other games that you consider to me seems like Pokemon is good
because it's like, you know, isometric, you know, it's kind of like flat, so you get any scored and
it's, it doesn't have too many hidden facts about objects, you know, kind of like everything it's
described. Did you consider anything else or was Pokemon just kind of like by far in
away the first choice? I didn't, but it's mainly because like Pokemon was the first game I
ever got as a kid, right? This is like purely coming out of my own nostalgia. But also like
the choice played Pokemon, like I was also something that I cared a lot about a decade ago or
whatever that was. Um, at least it's not a decade ago. I think it's actually a year ago. I'm sorry.
Yeah, painfully. Um, and I in.
11 years ago. Yeah.
February 2014.
Yeah.
That's nuts.
Pokemon Red is 20 years ago.
Oh my God.
20,
25 at least.
So yeah,
from you,
it was that,
like,
since then,
there have been a lot of people
in the problem.
We were like,
oh,
we can do this,
we can do this,
we can do this.
I think there's like a lot of fun things you can do.
Pokemon's actually really nice
because,
like,
if you don't do anything for five seconds,
like,
there's typically not a consequence
by the nature of,
like,
doing inference on a model every,
like,
a snapshot of time.
It's actually a pretty good,
game to be able to do this with. But yeah, it was mostly just like my love for
Pokemon coming through here. You put together a very nice architecture diagram. Do you want to
screenshot that so people on YouTube can follow along and that we'll put in the show notes
if you are just listening. I know that Vibu had a bunch of questions on that too. Yeah, let's do it.
Very, very straightforward questions. Basically, can we just double click into all of it?
Yeah, yeah, yeah. It's easy. I found it off Twitch and like no one was
talking about it, so I started sharing it around, and I lost the original source, but basically
everything in here is like pure gold. The memory is a little interesting, but yeah, if you want to
just go through high level. Yeah, you got it. Yeah, I want to, like, preface that I do not claim
this is, like, the world's most incredible agent harness. In fact, like, I explicitly have, like,
tried not to like hyper-engineer this to be like the best chance that exists to beat
Pokemon.
I think it'd be like trivial to build a better computer program to beat Pokemon with
quad in the loop.
This is like meant to be some combination of like understand what quad's good at and benchmark
like and understand quad alongside a simple agent harness.
So what that boils down to is this is like a pretty straightforward tool using agent
from my perspective is how I would frame it.
So at the end of the day, like the core loop is,
is just like having a conversation that rolls out.
And it's essentially like you build the prompt,
including like everything we've had up till now.
You call the model.
It sends back some tool use typically.
You resolve those tools.
And then talk about summarization,
but like basically some, a few different mechanisms
to maintain the information you need to do something long running
inside the context window.
So like what this boils down to is like when you think about what,
actual prompt looks like.
It rolls out kind of like this.
You've got tool definitions, which describe three tools that I'll get to in a second.
A short system prompt, it's like pretty boring.
It basically tells the model how to use the tools.
And like there are about six facts about Pokemon that I give it and like a few corrective
things that I've seen it do like really horribly wrong.
And I'm like, hey, you might want to consider doing this a little bit better.
But it's like really not a lot of system prompting going on.
We have that knowledge base which referred to you.
I'll talk about.
is the main way it stores like long-term concepts and memories as it's operating over time.
And then the bulk of things is this conversation history, which is, it's like a chain of tool use.
There's no like user interjections at all for the most part.
So it's like go and then the model uses the tool and then it gets a result back and then it uses another tool and it gets result back.
So pretty straightforward.
Feel free to like cut me off to if you've got questions along the way, but otherwise I'm going to keep rocking.
Yeah, yeah, go ahead.
Cool.
Okay, so most of the money of this is just, like, in the tools themselves.
When you think about what's going on, it's really like, it can press buttons and it can, like, mess with its knowledge base, and that's about it.
I'll talk about navigator separately because it's, like, a patch for how it actually can deal with some of its vision deficiencies.
Using the emulator just basically, like, execute a sequence of button presses.
It'll say, like, press A, B, left, right, whatever.
It gets back a screenshot and screenshot overlaid with coordinates of the game.
These coordinates are used for this navigator tool that I'll try in the second,
but it's just basically like help quad get a slightly better of spatial sense of what's going on on a Game Boy screen.
I've been through it a lot.
Sorry, does it come with the emulator, or are you adding those in?
I add that in.
Okay.
I have somewhat extensively reverse-engineered Pokemon read by this point
to, like, extract roughly every bit of possible information from it.
I don't use most of it, but like I have essentially everything you could know about the current state of the game.
I have exposed programmatically to be able to tinker with it at this point.
I was just reading this diagram.
Like, yep, you just get what spaces are walkable based on what's stored in RAM.
And I'm like, oh, you definitely reverse engineered this little.
Yeah.
Good news is we also released Claude this week, if you saw that.
And that has been, this would all not be possible without the help of having Quad also go figure out how to do all of this for me.
because I could have done it, but there's a lot of, like, tedious, here are addresses in memory,
map that to a Python program that I had no interest in doing.
So, thank goodness for Quad Code.
So, yeah, it gets these two screenshots.
It gets like a small blurb of state, which I read straight from the game.
There's a lot of this here.
Actually, like, funny enough, the thing that matters is location.
Quad will, like, pretty aggressively hallucinate that it succeeded in transitioning between zones.
if you don't like tell it it did not.
This just comes down to like literal vision issues.
And so like most of the patching of extra help I've given it
been like attempts to make it so that it could still play
despite not being very good at seeing Game Boy screens in particular.
And then it gets like a handful of like reminders.
This is this reminders does a decent amount of work.
But it's like things like, you know,
remember to use your knowledge base occasionally.
And we tell if it gets like stuck, for example.
So if you detect that it hasn't moved in 30 spots or 30 time steps,
I once saw it see like a red box on the screen that was like the doormat and think it was a text box
and spend 12 hours pressing A overnight to try to clear the text box,
which you see that happen once and you add in some helpful reminders to not do that.
How much knowledge does the model have about the game itself, you know?
So for example, types, right?
Yeah.
know about types, weaknesses, and things like that,
or how much are you trying to put into it?
Yeah, if you go to quad.aI, like, it will tell you about, like, some stuff.
I have not yet decided if the knowledge that it has about Pokemon is helpful or harmful
towards it playing the game.
Like, half of the time when it's like, oh, I know this about Pokemon,
it then, like, uses that to hallucinate something.
So, for example, at the beginning of the run on Twitch, you saw it, like, go out of the lab
and see this NPC in the bottom of Palatown and be like,
it's Professor Oak.
I found him.
And it's like very much not Professor Oak,
but like the fact that it has like indexed on this concept is like a little
stuff like that that it's like unclear to me where it is.
But it clearly has some information about it.
There's like a million game guides about Pokemon sitting on the internet.
It's unsurprising that like there's a decent amount of information there.
I don't really give it a lot of extra information.
It picks things up.
I watched on the stream the other day.
like it tried to use Thundershock on a geo dude and it failed and it's like hmm i forgot about that
that does not work and so like clearly there's like it knows some stuff it's not perfect it
it picks some stuff up as it goes through the run ideally for me like i think it's just
interesting to see like what it actually learns as it's playing so the more it does that is the
more i'm like actually interested in it yeah the one of our the score members not jungi had a good
question about the sense of self yeah like sometimes it gets confused who is the actually
playable character in the scene? Like, how do you steer that?
Yeah, I think like sometimes it gets confused. It can be applied to many things in quad playing
Pokemon, in particular when it's trying to like look at the screen and understand what's going on.
So I have like attempted to prompt it all sorts of ways, like you are at this exact
coordinate and you're in the middle of the screen and you're wearing a red hat and things like that.
And like that's all neat, but Quad doesn't particularly understand like the middle of a Game Boy screen
and a whole bunch of concepts like that, which means, like, you can prompt all around everywhere,
but like this kind of like spatial awareness and where something is with respect to something else
is something that Quad's still just like not great at in its current incarnation.
So one of the side of physicists is sometimes this track of who it is on the screen and thinks
there's something else there.
I will keep trekking through this.
So I hinted at this like other tool that I give it called Navigator.
And this is just like the only other patch that I have for the,
the vision issue. So Navigator basically what it does is like Quad can say it wants to go to
one of these coordinates that we provide in the screenshot. And then we like automatically press
the buttons to get there. It has to be something on the screen. Like I'm not trying to let Claude
just like navigate a whole map by asking to politely. But one thing you'll notice if you run it
without this tool is if like Quad wants to get from one side of a wall to another side of the wall,
it like happily just tries to walk through the wall repeatedly because it doesn't quite have the concept of like what's between it.
And I spent a lot of time like prompting around this and it just like isn't, it's just not, it's one of those things not very good at.
So in order to make it somewhat fun to learn from quad playing Pokemon at all, we use this navigator tool, which like helps it actually get around a little bit better.
So since we covered a bit about the different tools, the prompting and the strategies, I'm curious how many tokens all this is using.
Like there's a part to conversation history and truncating parts of them.
once it's using state.
Yeah, like, yeah, at a high level, how many tokens is this using?
And then can we kind of go into where those are coming from what's being truncated?
Yeah, you got it.
When you think about the prompts here, essentially like every step, something that looks
like this gets sent.
So if you just go through what each of these looks like, everything in the system prompt
is probably like a thousand tokens, pretty small, like a handful of paragraphs.
knowledge base, I let get up to like 8,000 tokens. So I put some like arbitrary cap on it. So it doesn't go to like,
Claude will write, put a whole bunch of BS in there if you just let it keep writing stuff. So like the
cap helps constrain it to like try to think about what's actually important a little bit.
And then the conversation history, I haven't like kind of finicky, but it basically rolls out,
um, 30 messages. That's actually like something you can tune. I've tuned it to be 30 messages about like the
best performance I've gotten.
And so what that means is it basically like, use the tool, get a response back.
Use a tool, get a response back.
It's allowed to do that 30 times.
And then at that point, it triggers the summary, which takes that conversation history,
summarizes it, makes it the first user message.
And then we'd kind of roll back out again.
So the bulk of the tokens end up being in the conversation history once it's his longest.
In fact, like this, the bulk past that ends up being these screenshots, which are scaled
up a decent amount to fit in.
I do actually like, I allowed to see a number of the previous screenshots, but not all of them
because you start like, it ends up being a ton of context.
If you'd let it see like even 30 turns worth of screenshots.
So I'd trim out a few.
That's where the bulk of the actual tokens are.
So in practice, this rollout ends up like at max ending up around 100,000 tokens, I think,
is where it is like the longest message you ever send to the API on one of these turns.
And it will, it will fluctuate.
in like summarization depending on state of knowledge base, probably between like
5,000 and 100,000 tokens.
And is that like per action state of the game?
And roughly, do you have like a high level ballpark estimate of how long this,
how much and how long it costs to run this?
Like let's say people want to compete.
Yeah, yeah, yeah, like how much will this be?
I think you'd really want to think about running this as a side project in terms of the
impact on your personal wallet and how much you care about Pokemon.
It's not clear to me that without the blessing of Anthropic, I would have decided to take on,
take on this project for my own wallet's sake, especially if you want to, like, experiment and, like,
try 10 different things.
I mean, it's costly.
I don't know, like, I haven't spent a lot of time on the exact number.
It's not that hard to estimate if you, like, I just told you a bunch of numbers, you can
kind of back it out.
But, like, I think to, like, do a lot of experimentation, there's, like,
at least thousands of dollars of tokens being consumed.
So it's not a, it is not a cheap rollout.
Yeah.
But yeah.
In the scheme, also, how some people use tokens, it's not terrible.
How many turns are you keeping in memory before you summarize?
It's 30 right now.
Yeah.
I've tried more and less.
I think like one thing you see a lot when you talk to people building agents is there's
like some effective context length that actually like has the model be the smartest.
And that seems to very slightly model by model, but for this model, for whatever purpose,
like this 30 message, work better than 20 and better than 40.
So kind of plot in between those that it worked pretty reasonably.
Yeah.
Does that change based on location?
Like, how many would you want to give it to get it out of Monmoon?
So we got to bring it in plot home.
We can't let him stay for another 57 hours.
I actually am not sure.
Like, I've tried posting, like, I can have a ton of screenshots, like 20 or 30 screenshots at a time, be able to see.
And it's, like, not obvious that, like, that temporal concept is actually super relevant, relevant to it.
And again, this is just, like, trust me, as someone who has spent, like, a lot of hours obsessing over this,
you can try to prompt quad a lot of different ways to understand how to navigate better.
And anything short telling it exactly what to do does not improve.
it's like actual navigation.
It's just like not a skill it's great at.
It's like good enough to to like random walk its way
through some of the complex mazes.
And in like good easy areas,
it's pretty good at popping around.
But yeah,
I think I can tell you if there was like a way
to prompt this slightly different
that would navigate better.
I would believe there is something,
but it is not like,
it is not an easy lift.
Yeah.
Yeah, I just asked Cloud AI right now.
How do you get through a moon in Pokemon Red?
it does have a plan
but I don't I don't know
I don't know if it's the right
I don't know if it's the right plans
I have seen it come up with a lot of answers
to that question and most of them are right
this is part of the pain
when I talk about I'm not sure if its knowledge is better or worse
like you see it usually fixate like oh I know the exit
is on the eastern wall and it just like spent
12 hours trying that
and yeah it's like unclear to me
that we're actually not just like harming it
by having it think it knows the answer
Yeah.
I think that's the interesting part, right?
Like, you don't want it to just know the answer.
Yeah.
The model clearly knows a lot about the game.
There's like EV-I-V-maxing.
Pokemon was very, very extreme.
But, like, if that's what you wanted,
we could just hook it up to a knowledge base,
like hook it up to a guide if you know how to be Pokemon Red.
But the interesting piece here is actually, like,
can it figure out what to do without just memorizing the path through?
That's exactly right.
Like, that's part of why, you know,
I don't know, part of what I've realized,
putting it out in the world as people will draw their line of where purity is anywhere on the spectrum.
Like, is it is this cheating?
Like, yeah, maybe.
Who knows?
Like, frankly, like, I don't particularly care.
The main insight that I have is, like, when we put this out, like, you learn a lot about
what the model is good and bad at by staring at it.
And that's kind of what I like about it.
So evaluating the model is kind of separate than your emulator and how it can use an
emulator, right?
Like, we can always improve those things.
I'm curious, as you switched from 3.5 to 3.7 and sort of reasoning models, were there any
degradations there? Like, did it kind of get worse at anything? And was the prompting somewhat
consistent? Like, a lot of what we've seen with different reasoning models is like, you kind of prompt
them differently, right? You tell them what to do, let them figure it out. But, yeah, any, any insights there.
Yeah. Yeah, that's a good question. One thing that's nice about three or sevens on it is it's like
this hybrid reasoning model. So like it kind of can do the old thing and the new thing.
And it's actually pretty good at just like being an out of the box model and having this like
thinking mode where it can spend time reasoning. So I didn't like really run into any like serious
degradations. The one thing I'll say is like literally every model that has come out with Pokemon,
like the main change that I have made to this agent is deleting prompt stuff. Like there's a whole
bunch of like band-aid prompt stuff I've added in the past. It's like trying to like steer it
away from doing a lot of the things that it got horribly stuck doing in the past. And as the models
get better, I found that just like making sure it's as simple as possible and giving them as much
sort of like free reign to try to solve a problem as possible is useful. And like the way I think
about this is I'm like less confident over time that I understand exactly how a model is
intelligent, right? Like, it's capable of all of these, like, ridiculous things. It does PhD level
stuff in some ways and, like, is unable to see a screen as well as a four-year-old in other ways.
But, like, my confidence in, like, exactly what I need to tell it to do to be smart at playing
Pokemon is actually, like, really small right now. If I tell it, this is the way you need to
solve this problem. That might not actually be the best way for 3.7 sides to solve this problem.
It's, like, just different than I am in terms of how I thinks about these things. I found that
just like kind of like pulling some of the unnecessary instructions where I tried to like use
my intuitions about what would make the model better out of the prompt overtime is the thing
that just like sort of consistently as models got smarter, gotten more juice out of this.
I was watching the stream yesterday or the day before and it was a very tense battle.
I think they were like down to like 2 HP each and like the opposing Pokemon like missed
a scratch or something and it didn't die.
And like you could tell it like if I was.
like, wow, it was like very dramatic and I was talking about the game. How, yeah, is there any
thought being put into like trying to have it more? Like, do you prompt it to be more rational to
let it know that it's not a real life, that it's a game? It's like, it feels like it gets very
distressed when they're actually, the Pokemon's are actually going to die. It's funny.
They, it knows it's like, you're playing Pokemon Red. Like, it does know that and it has a sense
to that, but it clearly rose us some attachment.
I'll tell you a fun story.
We tell it to nickname its Pokemon now.
It will occasionally do without it, but it's like more fun if it nicknames its Pokemon.
So that's like in the prompt is like, it's fun if you nickname Pokemon, you should consider it.
And one thing we found when we started doing that is it got more protective of the Pokemon
it nicknamed.
Like it's pretty obvious.
Like when it catches a Pokemon, now that it has a nickname, it will like go heal it right away
if it's hurt.
And that did not ever happen before.
Which is pretty, like, so there's some cute little things,
cute quirks about Quad who really wants to protect its precious nicknamed Pokemon,
which is great.
So I will say it's kind of normal.
Like, like when I was five playing Pokemon Red and, you know,
I had 2 HP in a midst of Scratch, that meant everything.
That was existential.
I agree.
I agree completely.
How about skilled transitioning?
So one question that I had, so you're playing Pokemon Red, right?
So you want to play silver or gold next?
Have you thought about how models can kind of learn from these games and, like, store these
learnings and then use them again in the future?
I'm sure it's not part of the project today, but here's your thoughts.
I've thought about it only a little bit, which is, like, I think there's some, like,
when you actually read one of the knowledge bases that it has gained, like, on some of the longer
rollouts when they're good.
Like, there's actually some, like, pretty decent tidbits about how it should act and try
and do things and, like, some of the ways it's succeeded in.
And actually, like, one of the things that's most unique about,
3.7 sonnet that I've seen is like it will have like meta commentary on what it's good at and bad at and its knowledge base like I misperceived of this thing and so like I need to be careful doing that again you occasionally see show up there which is um which pretty cool so like I could imagine there being some way to like translate that knowledge base from one game to another I think my knowledge base is frankly like kind of cluegy of an implementation right now like it's like more or less a python dictionary that's appended to the prompt
And I think, like, you could, you could find better ways if, like, your goal is to transfer across games and things like that to manage a knowledge base that Quad can actually, like, use more or well in different scenarios.
But there's definitely pieces there that, like, I think it would get, be off on a better foot on the next Pokemon game if it had that.
Or even if, like, I were to restart the stream, it would, like, have some, some tidbits that it would probably, like, speed up if it, like, had access to things that I learned in the past that,
It's interesting.
Yeah.
Yeah, I always think of that in card games, you know, like you have the idea of like temple
in a card game and it's like, you know, it's the same magic as it is and, you know,
Star Wars, Flesh and Blood, all these different things.
I feel like games is similar where like learnings you get from Pokemon you can bring over
to similar kind of like open world games.
I think it's also like particularly interesting for some of the things that are like how
quad learns how to play a game in general where it's like a pressing too many buttons.
and once is a bad idea.
Like, I lost trouble.
What's going on?
That kind of thing.
Like, definitely is stuff that it has learned that is, like, interesting in a meta way.
That it's, like, hard to give it that sense of self necessarily in training, I think, sometimes.
Like, it's hard for it to know, like, what it's getting bad at in some scenarios.
But it's interesting to think about how it can learn across things.
Well, like, some of this also is due to a simulator, right?
So a lot of what's learning is how do I use a simulator?
What am I good and bad at?
But the model internally should know quite a bit about Pokemon, right?
like if you've played Pokemon going from Pokemon red to Emerald to Diamond having played the first one doesn't help you that much in the second right you kind of get the general concept you get what types are good against other types and the model model knows a good bit of this right but it's still interesting to show this is more so like it's shows that knowledge basis kind of help with understanding how to use the emulator right like it struggled and then it figured it out so you know with Pokemon it's like this thing can now learn how to use it's
Yeah, which is pretty cool.
That has been like part of what's been fun seeing all as my progress on this thing.
I had a bit of a follow-up question to the last one with the last year.
So if people want to blow thousands of dollars and want to, you know, improve this a little bit,
is there anything else that you'd want to see done, whether that's like improve emulator,
try different stuff?
Is this just anything that like anyone watching this, you'd kind of hint them towards what you'd want to work on,
what they'd want to work on?
Yeah, no doubt.
If I had to guess, like, the, the biggest lift that exists around this is probably something around the memory, which I don't think is, like, hyper optimized right now.
The nice thing about the memory is, like, it's always in the prompt.
Like, it doesn't go away.
Like, sometimes if you leave it up to quad to try to, like, read and load and save to memory basis, like, it will underutilize it or forget things.
But I think there's probably something there.
I will say, all of the many, many hours I've spent tweaking around the edges of this thing.
Nothing quite does it like a new model though.
Like fundamentally, I think the limitations right now are like some smarts things.
Like I've seen, and I mean this in the kindest way, but I've seen a lot of people in Twitch tell me about ways that they could fix the navigation capabilities with a better prompt.
People would be welcome to try, but I would guess that would be like a somewhat fruitless avenue.
I don't think, I think it's just not very good at understanding.
At the first time, I'll give you a very quick anecdote, which I think is like my favorite for.
like why this is particularly hard.
I have this clip of Quad leaving Oaks Lab and being like, great, I left Oaks Lab.
Now I need to go up to the north end to go to Route 1.
And it just like hits up on the D-pad and goes straight back into the lab.
And it's like, shoot, I'm back in the lab.
I need to leave and it hits down.
It's like, great, I'm out of the lab.
Now I can go up to Route 1 and it's straight up.
It just like goes up and down 12 times.
And it's like, you're not fixing that with a prompt.
It just literally doesn't get it.
It doesn't understand.
And so it's pretty hard to make like a little around the edge of changes that like make a huge, huge difference.
Yeah.
I mean, I've always been fascinated by the fact that Twitch Place Pokemon actually beat the game.
Yeah.
From a, you just look at it and you're like, this cannot possibly work because you have people trying to sabotage it too in the chat.
Not everybody's trying to solve it.
What?
So I just like that up.
It took 16 days and seven hours for Twitch Place Pokemon to be read.
How close do you think?
we are to a model that can beat it in less than 16 days.
And do you think it needs like some core, like model really big jumps or like,
do you think it's like we're close?
I think it, I think there is model stuff, at least from quad.
Like I am confident there's model stuff that needs to happen for it to be like really capable.
I can have like four spots in the game stuck in my head.
It's like, I think there's literally no hope it's going to get through that.
So I think there's like a gap that's mostly around like a tabillian.
to like see and navigate and remember visually like what's going on that I just don't think is like
we figured out yet. So to me that's like a pretty big gap. I do expect like I think it's going to
keep getting better. Like I have no reason to believe that this is not just like a fundamental like
ability to scale, learn and understand problems thing that I think is getting better as we
train models to be more capable as sort of these like long horizon tasks. Like I actually do think
this is like a pretty reasonable proxy of that and I think it will continue to get better for a little
while. I don't know if there are like affordances around images and videos and stuff like that
that we need to figure out to make it work. It's like unclear to me if that's true or not.
But yeah, I think we have a little ways before we can beat the game in 16 days. I do not have a lot
of faith that the current stream is going to be the ending in Victory Road in 13 days.
What's been your favorite moment from like building this to think of the idea to just seeing
it play? Any like major highlight?
I think like the the hypeest I have been is, uh,
when it beat Brock the first time, where I was just like, you know, I've been doing this for eight months.
And then like a few weeks ago, like I kick off a run, wake up the next morning. And it's like,
oh my God, oh my God. And it was the other good thing about it is like I woke up at 8 a.m.
And I checked my, I have it send me updates to Slack. This is like ridiculous things. But it's like literally like about to start the Brock battle.
Like I opened my phone. It's like, oh, this is like happening right now. And it's like a pretty hype way to start a day.
I think that was my highlight.
I have a lot of like other cute things, like some of the cute nicknames that's done
over time and things like that are endearing.
But that was like the peak hype for me.
It was like, we beat a gym leader.
Like we've got a badge.
Like, quad's doing it, you know.
A bit of a follow up.
So I noticed that you mentioned it eventually started beating multiple gym leaders.
Were these all the same run?
Was it different ones?
Was it?
Yeah.
I have like, the run that you saw that's like on the graph we put out alongside
I had, like, in our research blog, is like a single run that I have watched, like, get through
at least Surge's gym, and then it got a little past that.
And the reason that that's where we stopped reporting is because that's, like, the
physical amount of time that occurred between when I started it and when we watched the model.
That's like, I was a very hyper, hyper up-to-date graph on the best run we had.
Awesome.
I know we're running out of time.
My last question is, are we going to work on Magic on Cloud Place Magic?
on Cloud Place Magic next?
Or maybe we can do like the Magic Arena in Trudge House.
Yeah, funny story.
There was a project I did right before I joined Anthropic
that was like training an open source model
to like slightly be better at picking draft
or cards in a draft.
Like I was training it on like the 17 lands data that exists
to like learn how to how to pick cards out of a Pax a little bit better.
And I did talk about that in my interview
to get hired at Anthropics.
So if I've put time into this, I'm ready, I am ready for that project, too, that I have that code sitting around as well somewhere.
I really get into all my nerd.
Her nerd ML slash gaming hobbies here.
Yeah, no, I'm ready.
I don't know if you're planning on open sourcing any of the Pokemon stuff, but if you want to work in open source on the magic stuff, I'll be happy to collaborate.
Awesome.
We're talking about it.
I don't know yet what the plan is.
there's like a certain amount of like this is not my day job that I have to figure out how I want to
deal with that we'll see yeah um awesome David any parting thoughts anything people have missed
no I think like the one thing I do like to drive home when I've been talking about this is like
I really do think like this is just demonstrating like a thing that is going to make agents better
with this model you know like this is a very fun way to see it but like I think the thing is that
it has some ability to like course correct update and figure things out a little bit better
than models have in the past. And even if there's like stuff it's dumb at, like, it tends to have
ability to like power through it in a new way. And so I think what it's decided to me is just like,
I think there will be some real world stuff that comes out of this model once people play with it.
And I'm pretty excited to see like how people take the skills we put on display a little bit here
or lack thereof in some cases and figure out how to turn them into actual agents that do stuff.
Have a quick last question on that actually. Is there any,
guidance or any way that you like quantitatively measure the evals of this system like a lot of it is
vibes a lot of it is how far it gets where it gets stuck but like are there are there any lessons or
any specifics about how you measure how it actually does so i've done a lot of like little small
tests of like put it in this scenario and see what it does but i like frankly the best test i have is
just like run it 10 times on this configuration and like see how quickly it progresses through
milestones of the game it means the best thing about games right like
It's why the games are such a useful thing.
There's literal benchmarks of gym badges that are moments of progress in a game,
which are ways to evaluate what happens.
And so I think how quickly it's able to make progress is actually a pretty
reasonable, like, eVow, if a slightly expensive one to calculate.
It's an integration test, not a unit test.
Awesome, David.
Thank you for joining.
Thank you, Vibu, for filling in on the host site too.
Yeah, my pleasure.
Thank you for having, guys.
I appreciate it.
Awesome.
Good to save.
