Programming Throwdown - 180: Reinforcement Learning
Episode Date: March 17, 2025

Intro topic: Grills

News/Links:
- You can't call yourself a senior until you've worked on a legacy project
  https://www.infobip.com/developers/blog/seniors-working-on-a-legacy-project
- Recraft might be the most powerful AI image platform I've ever used — here's why
  https://www.tomsguide.com/ai/ai-image-video/recraft-might-be-the-most-powerful-ai-image-platform-ive-ever-used-heres-why
- NASA has a list of 10 rules for software development
  https://www.cs.otago.ac.nz/cosc345/resources/nasa-10-rules.htm
- AMD Radeon RX 9070 XT performance estimates leaked: 42% to 66% faster than Radeon RX 7900 GRE
  https://www.tomshardware.com/tech-industry/amd-estimates-of-radeon-rx-9070-xt-performance-leaked-42-percent-66-percent-faster-than-radeon-rx-7900-gre

Book of the Show:
- Patrick: The Player of Games (Iain M. Banks)
  https://a.co/d/1ZpUhGl (non-affiliate)
- Jason: Basic Roleplaying Universal Game Engine
  https://amzn.to/3ES4p5i

Patreon Plug: https://www.patreon.com/programmingthrowdown?ty=h

Tool of the Show:
- Patrick: Pokemon Sword and Shield
- Jason: Features and Labels (https://fal.ai)

Topic: Reinforcement Learning
- Three types of AI: supervised learning, unsupervised learning, reinforcement learning
- Online vs offline RL
- Optimization algorithms
  - Value optimization: SARSA, Q-Learning
  - Policy optimization: Policy Gradients, Actor-Critic, Proximal Policy Optimization
- Value vs policy optimization
  - Value optimization is more intuitive (value loss)
  - Policy optimization is less intuitive at first (policy gradients)
  - Converting values to policies in deep learning is difficult
- Imitation learning
  - Supervised policy learning
  - Often used to bootstrap reinforcement learning
- Policy evaluation
  - Propensity scoring versus model-based
- Challenges of training RL models
  - Two optimization loops: collecting feedback vs updating the model
  - Difficult optimization target
  - Policy evaluation
- RLHF & GRPO

★ Support this podcast on Patreon ★
Transcript
Programming Throwdown, Episode 180: Reinforcement Learning. Take it away, Patrick.
Welcome to another episode.
This is going to be a good one.
Excited to be here actually, because this is a topic I have been meaning to learn about, and Jason has agreed to put on his professor hat, robe.
I don't know what it is a professor wears.
I got hooded when I got the PhD.
I got hooded, which I thought would be an actual hood, but it's really just a sash.
Wait, what is getting hooded?
That's like what you get when you get, I don't know about this.
Okay.
So when you get a PhD, you get hooded, which means you go through the same ceremony as
the master's students, or I think the same ceremony is everybody, but you get a hood,
which is actually a sash and your PhD advisor actually puts the sash around
you over you as part of the ceremony.
Okay.
I feel like maybe I've heard that term, but I always just kind of had some weird
probably bad association with hoodwinked.
But, uh, anyways, okay.
Where are we off topic?
Anyway, that's fine because I actually, the first thing I think of is actually, because I grew
up in the inner cities.
I was like, okay, we're going back to my childhood here.
Oh, okay, interesting.
Okay, wow.
All right.
So today we've learned there's many associations of the word hood.
So okay, we didn't even talk about cars yet. So, you get a new upgraded carbon fiber hood for your car.
You get hooded.
Okay, we have to have people like, what is going on?
What are we listening to?
Oh God.
Well, that's how you know we're not the AI.
They would not be this off topic.
That's true.
They would definitely stick to the script.
The AI is not allowed to say this stuff.
You would definitely be pushing the down vote button, you know, oh wait, people probably are now.
That's why it's not live.
Okay.
So for, and I'll keep it brief because actually I want to get to the meat of the story today.
Oh, no pun intended, but talking about cooking outside, I had a grill on my back patio which I would use to cook food occasionally. It uses these little pellets of wood, so it's called a pellet grill: pellets feed down and it burns them and makes the heat, and it has an electronic controller. And I would do some, you know, smoking on it and some grilling. Anyways, it broke, and it's old, so, you know, okay,
fine. So I went to go like, okay, I can get a new grill. I did not know. There are so many different
kinds of grills that are, you know, like popular now. And I feel like growing up, my parents
always just had, yours were kind of one of two things. You were the charcoal Weber grill, you know, with the bowl and the charcoals, or you had the propane grill, you know, where you had the tank and you hooked up the hose, and that was the two.
But now there are like, you know, all sorts of things where it's, you know,
infrared cookers, where the propane goes into some sort of catalyst and it ends up like the patio heaters that, you know, are at restaurants sometimes.
Yeah. It's like, I don't know.
And then there's these like egg shaped,
I guess they call them kamado grills, like Big Green Egg and Kamado.
And they're like big ceramic things.
And then you can get various kinds of cabinet smokers. Like anyways, I,
I just, maybe I'm naive, but in my head I just want to buy something straightforward and simple. And then I'm like, oh, I'm overwhelmed by the tyranny of choice.
That's just all I was going to say.
If you've never kind of looked up grill technology, it's actually kind of crazy.
There's a lot of choices here.
So now I don't know what to pick.
That's wild.
I have a propane grill and then I have a smoker.
I have a separate electric smoker that takes the pellets and smokes meat.
Um, but yeah, I think the green egg can do both.
So it's like a two in one.
Um, and yeah, it's just wild.
And then there's like pizza ovens now people are doing.
So like when I went to go look at the grills at the hardware store, it was like,
there are also pizza ovens here and yeah, I, okay.
Yeah. My neighbor has a pizza oven and I think he's used it twice in four years.
Well, I mean, you know, how many times do you eat pizza?
I mean, and also it's like, if you eat pizza,
you're often in a hurry, so you're either ordering it to go
or you're doing the regular oven because you're in a hurry.
Okay, all right, so you're down on the pizza oven.
That's a short on the pizza oven stocks for Jason.
Yeah, I mean, I'm not a big fan of the pizza oven.
I think, I've only ever endorsed one stock
in my entire Programming Throwdown career,
and that was Datadog.
And I think it's like up just as much as everything else.
I made one like in hindsight, in five years of hindsight, like relatively neutral endorsement.
Everyone's bracing themselves for the meme coin announcement now.
Oh, we need a programming throw down coin.
No, we don't. I'm not. No, I'm not.
Right.
Pulling people.
Okay. I know. Okay. We got to keep going. We got to keep.
We are not. We are not. Rug pulling people.
We went the opposite way. We stopped doing ads.
So it's like the opposite of rug pulling people.
All right, time for news of the show.
So I've got the first one,
and this is an article entitled,
you can't call yourself a senior
until you've worked on a legacy project.
So talking about what is a senior engineer,
this is like an age old debate, whatever.
Anyways, this person was kind of pointing out how they hadn't really worked on a legacy code base. There's some specifics here of their thing, and if you want to go read it, go to the article. And the point, though, is pretty interesting: they kind of rightly wanted to avoid working on a legacy code base. They ended up kind of doing it, they were right, they didn't like it, but they actually learned a bunch of stuff that they didn't expect. And I think a couple interesting takeaways for me from the
story and just you know thinking on the topic is about regardless of the label
of senior, like just growing as an engineer is no matter what work you're
doing, finding the takeaways that are applicable.
And lots of analogies, the one I've taken to using recently,
just for myself and for others that I talked to about this,
is like just really trying to compound the growth.
So not just thinking like, hey, how do I do this thing?
But like how do I think about additively like applying things I've learned before
in a way that like my growth sort of grows on top of itself
and you're sort of stacking it up.
And sometimes you need to widen the base, right,
of expanding into new things,
but other times you're trying to build up
and trying to apply these different experiences.
And so I think partly this plays into that.
And then there's an observation here specifically
about legacy code bases in your place
of work, and understanding why maybe something isn't done that way anymore, or why the stuff that you see in the kind of current pieces of tech are how they got there, right? People will say, oh, it's organic growth or, you know, whatever, you kind of get there. But I think there is something different between saying, this is the current
recommended practice and I have done it the other way and I will tell you the
other way sucks, like we're doing it this way.
Those two things come from slightly different places and understanding why you
do something, not just, there is value in actually just not knowing, well, this is
kind of bad, but when you get into style guidelines and stuff, right? I think just picking a way and having everyone do it is useful because it really does matter. Um,
but then there are things also that even if you don't always know exactly why,
uh, eventually kind of figuring them out and digging in. So the one I always use,
uh, in C++ is the ternary operator. So you can write this Boolean expression, put the question mark, and then the thing that it is if true first, and then a colon, and then the thing that it is if false second. And you can use this, and we have it banned in our code base. And the reason why is it literally does nothing unique. You can't, okay,
there's like some very rare, you know, use case.
Someone could come up with it, you know, in a const expression or something.
But for most part, you're just simplifying writing an if-else statement.
But the cognitive load to read the ternary operator, make sure you understand what it does.
And that a new engineer showing up has the same practiced expertise at reading that.
Like, why? Just because features are there
doesn't mean you have to use them.
And I try to explain this to people,
but I would argue even not knowing my explanation
and still doing it leads to good practices,
but knowing why and having tried using all of the
whiz bang features from the latest, you know,
C++ update constantly and refactoring code
just to rewrite it
into those features. Having done that once and burned your hands probably teaches some lessons, and so legacy code bases can be really useful. Totally, totally
agree. I mean the equivalent in Python, which is even more confusing, they have a
ternary operator where you can say like x equals three if foo is true else five.
So it's a ternary operator,
but you switch the first and the second position.
So it's like even harder to read.
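Just to picture the two shapes side by side, here's a tiny sketch; the variable name and the values are made up purely for illustration:

foo = True

# Python's conditional expression: the "true" value comes first, then the condition.
x = 3 if foo else 5

# The same logic as a plain if/else, which is usually easier to scan in review:
if foo:
    x = 3
else:
    x = 5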
And you know, oh yeah.
And so I remember like,
there was someone on my team who would do this a lot,
like all over the place.
And you know, I let it go.
I didn't really push back on it because, to your point, until you ban it, it's not banned.
And so you can't really say like, don't do this
because you have no like moral grounds
other than your intuition.
And then it was a disaster.
And so like people just kept getting burned
by these like really long in line,
you know, like conditions, right? And so now like I can ban it and I don't feel insecure about it or
feel like hesitant about it. I don't feel like, oh, you know, it's not really against the rules.
Now it's like, no, like I've done this. I've seen people like cause all sorts of issues and some issues went to prod.
And so we're not doing it. Now, one of the tough things is, you know, if you're talking to folks who don't have that experience, you have to ban it in a way that shows empathy and doesn't create any resentment or anything. And there is this balance that I think is useful, but often gets brushed away: when bringing in new folks, you actually want them to feel empowered to question things.
So when they see that there's a ban on this and they say,
I love the ternary operator because it makes me look cool.
You know, they're going to say that part,
but you know, I love the ternary operator, you know,
why is it banned?
I don't think it should be banned.
And you actually want to take time to explain them and in some cases be willing
to hear them out and maybe, you know, adapt your practice or be flexible.
But in other times, like you said, I think the word confidence there is like, no,
we've done this, like I hear you, but you're just going to have to trust me that
like we've tried it the other way and the other way, like not banning it leads to problems.
Yeah.
Yeah, totally.
Um, yeah, I mean, I, just to wrap this up, a question I always ask in interviews, uh,
if I'm doing a technical design interview, I'll always start with the question of like,
tell me a time you refactored something and why did you refactor it? Like what led to the decision to do a big refactor?
And that usually like opens up all sorts of interesting things because you know,
people, you know, the, the worst answer is the one that totally neglects the
conflict that comes from scarcity, right?
It's like, you don't have enough time, but the code is garbage.
And it's like, so it's like that creates conflict
and then you have to resolve that conflict
one way or the other.
That's interesting.
If someone's like, yeah, you know, I rewrote it
and it was the right thing to do.
And everyone agreed with me from day zero
to the day I wrote it.
And everyone praised me at the end.
It's like, okay, well, you know,
that's a little unrealistic. I was gonna ask you, does anyone ever tell you,
because I didn't write the code. So therefore it could of course be better. I've got that.
That's how people really feel, but I would be surprised if someone said it. Yeah. And people
are like, oh, it was, you know, another team and that team, you know, we inherited their code and it was garbage.
And so I rewrote it all.
And it's like, okay, not the, not the best answer.
Um, my news story is Recraft might be the most powerful AI
image platform I've ever used.
Here's why.
And it's a Tom's guide article.
Um, honestly, like recraft is definitely the most powerful AI
image system I've ever used.
I found out about it yesterday.
I haven't ever heard about it.
Yeah, this is very, obscure is the right word,
but like I'm really into this stuff, like generative AI,
I'm following it closely,
and I hadn't heard about it until yesterday.
It can do some amazing things.
For one, it can produce vector art, like SVGs.
Now the SVGs are...
Like if you or I were to create a stop sign, for example, we'd create like a white octagon and then we'd create a red octagon inside the white octagon, and that's how we'd get a stop sign with a border around it. Right. But if you use this program, it's going to give you like a red octagon and then a
bunch of white polygons around it.
You see what I'm saying?
Like it doesn't have a concept of what each edge would be.
Yeah.
Yeah.
It's all one layer.
So that's not ideal, but it's a step in the right direction.
It's the only thing I've ever seen that really will give you an SVG.
Like even if they're doing it post-hoc or something.
It's extremely responsive to the prompt. So for example, one thing I've tried with a lot of these AI systems is I've said like a person holding nothing in their hand, because I'll say like a
person doing XYZ and I'll get a person like holding a phone in their hand.
And I'll like, and I'll type the same prompt
like with nothing in their hand.
And then they'll have two phones in their hand.
And then it's like, it's like, it has a hard time
especially with negatives.
So then I'll try like a person unarmed, you know?
So it's like, there's not like a negative there
and it'll still like not work. But with this system, it's like very good at following instructions, even negatives.
So it's phenomenal. The other part of it is it's got this cool workflow where you can take an image
and then you can say, okay, now put a phone in their hand and it'll make like a new image of the same person with a phone in their hand, as opposed to like, you know, getting a totally different person.
So it's really cool. Very, you know, easy to use, relatively cheap. So I would highly recommend folks check it out. It's pretty neat.
What are they using kind of under the hood, do you know? Like, are they using their own? Some of these are built on top of Stable Diffusion or whatever, you know. Is this like a layer on top of stuff? It sounds really good.
But yeah, this is a totally custom thing. It's a pretty big model, it's a 20 billion parameter model. Actually, the 20 billion is the v2; there's a v3 model, which I think is even bigger.
They're not open source, so we don't really know what they're doing. They might have a blog post about it; I haven't seen one yet. I'm assuming it's the same type of technology, where you're doing self-attention and then you're doing, like, you know, masking and trying to uncover masks and whatever. I don't think they're really pushing the envelope on the base model, but then they built a bunch of really impressive things on top of it.
That's awesome. I think that's one of the debates, like, where's the magic? Is it in the UI and the, like, you know, higher level abstractions, or is it in the base, the deep stuff? You know,
I don't know that we've answered that.
I think there's lots of opinions, but.
I will say, I would say that I tried DALL-E.
When was DALL-E popular?
Oh, that's, that's a year and a half, two years ago.
No, that long?
I think so.
The first DALL-E?
No, I don't know.
Like whenever it was making the big rounds,
I feel like it was maybe two, three years ago.
Oh, okay. Okay. Maybe four years ago.
Anyways, I recently did a download. I have a MacBook Air, and I downloaded one of the on-device Stable Diffusion-type apps to run Flux. They just have an app you can download, so you don't have to do it on the command line with Ollama or something. You just download that app and it will download the model from, I think, a Hugging Face link or something.
And it downloads the flux and you'll generate it
and it takes, I don't know, it's like 15 to 20 seconds
or something to generate an image.
But it's crazy. First of all, these images are much better than when DALL-E was making the rounds initially.
You kind of wrote it off, it didn't really obey your prompts.
It would make cool pictures, but anyways,
so now the flux stuff and it runs like on my computer
and it's free. Like the models are open source,
the program's free. So it's running locally.
There's no subscription, you know, and it, you know,
obviously it has to be using less power because it doesn't
have an external GPU or anything.
Yeah, the Flux models are super, super impressive.
There's a package called M-Flux, which is flux optimized for MPS, optimized for the
Apple processor.
So you can use M-flux and it'll run in like half the time or a third of the time or something.
It's really impressive.
And this re-craft thing really takes it to another level.
So to your point, it'll only be a matter of time
before there's an open source version of recraft.
But at the moment, they have a monopoly.
Well, that's another debate, the open source versus closed.
But we'll just keep moving on.
Yeah, right.
Open weight, yeah.
Oh, oh, yeah.
So, okay, no, no, no, okay.
No, okay, respected.
My next one is NASA has a list of 10 rules
for software development.
And this person is taking the sort of publicly disclosed list of software rules, which I've bumped into before, being an embedded engineer previously in my career. It's for writing C code, but they've tried to extend it to C++ as well; there's like a set of embedded guidelines for things not to do.
And this individual is taking a sort of,
I'll say a kind of critique of some of them
and why maybe they don't make much sense or other stuff.
But if you've never seen them before,
I will say it is somewhat interesting to,
and it's a lot harder if
you use something like Python or Java to, I guess, derive value maybe from it, but
if you've ever programmed in C or C++ before or you use Rust, probably
applicable as well, or one of the other sort of systems programming languages, go
kind of looking through there and seeing how you would approach it if these were your rule set. Leaving aside, I guess, that the blog post is kind of talking about how maybe the rule set could be improved or doesn't make the most sense, I'll say: if you showed up on a job and these were the restrictions, because it was a contract that you were trying to honor, how would you accomplish this? So things like never using dynamic memory allocation.
And then, you know, there's kind of two approaches
you end up with.
One is, so in C++, the standard library generally just uses lots of allocation under the hood. So you end up with, you know, of course you can't use that. So some people do a lot of static sizing of things up front and trying to, you know, have all of their things be of a known size. Other people...
Isn't there like a concept in C++ called an arena? Where like you define like a thousand
spots? Yeah. Okay.
Yeah. So then other people use a memory pool. An arena is like a kind of memory pool where
they basically write their own sort of very thin sort of memory management, but it doesn't
necessarily suffer
from some of the same problems.
And so you can use containers that adapt to that.
But if you think in your head,
how would you make sure that your code
never did some of these things?
Never had a loop that couldn't exit, right?
So all loops need to have an upper bound.
And just code coverage, how would that work?
These kinds of things,
some of them are, again,
pretty restrictive.
Every function has to be small enough to fit on a single piece of paper, with the sizing given.
And it's like, well, that is probably good practice,
but maybe sometimes you want to change it
to be one thing or another.
But it is worth reading if you've never read an embedded rule set like this before.
They're not uncommon.
And you can occasionally, if you work in embedded space,
bump into places where this is the style.
There has been an explosion in processing power and real time
operating systems and just the complexity and abilities
of the processors.
But I still think there are some places
where there are probably many lines of code
being written having to follow these guidelines.
Yeah, this is fascinating.
I mean, this is a whole new universe for me.
But this is really interesting.
I mean, this is definitely a person who is kind of like, I think the general critique here is on C. Like this person is basically saying you should use another language.
Yeah, like use Ada, not C. That's a valid commentary, but yeah.
All right.
So my next news story is AMD Radeon RX 9070 XT performance estimates leaked.
Okay, so I want to go do a little rant here.
I hate kind of complaining about products.
Like I feel like that's maybe like not the best use of the show, but I bought a PC, like
a mini PC with a Radeon and I used it for a little while, it was okay.
The drivers were really buggy.
I had to go into safe mode and some stuff
to get it working.
But then I got it working.
You know how you can plug USB to DisplayPort
or USB to HDMI?
You have these cables.
I don't actually know how they work under the hood,
but there's some magic that allows you to go
from a USB-C port to right into your display, right?
I think it's like something called like display port
pass through or something.
Anyways, I plug one of these cables in,
pop, the GPU blows up.
Like hardware blow up dead.
Yeah.
Where did it come from?
It's a cable that I use with my MacBook Pro. Like, I've used this cable for years. Okay. So the cable is fine, at least for the MacBook. I plugged it into the mini PC and it just popped.
And I guess like, I mean, I'm kind of dogpiling here.
I almost feel bad, but like, you know,
George Hotz
has this post on Twitter about this.
Like he's trying to build these tiny boxes
that run PyTorch,
and they use CUDA or ROCm,
which is AMD's equivalent. But like the AMD, like, you know,
anything above the hardware just sucks.
And it's kind of disappointing because everyone wants there to be an alternative.
If nothing else, not only for price, but maybe they could do something interesting.
Maybe they could make a card that has like a gigabyte of RAM and is not very fast, but just has a ton of RAM.
When there's multiple people in the pond, like there's multiple ideas, right? But I was just super disappointed. I mean, I'm one of like a
long line of people now who will just not buy these AMD cards. And I guess maybe just to turn
this into a question, like, how does AMD kind of recover from this? I kind of feel like if I was them, I would hire like some software person,
like some person who's really high up in the stack
to like lead like a whole branch of the company
to just like go through and just ruggedize everything
from a software perspective.
So I feel like this is more in your area,
because when it popped, I was just shell shocked.
So what causes hardware to kind of fail like that
and not be tested, and what do you think you would do?
Like if you were CEO of AMD, Patrick,
how would you save this?
Oh my gosh.
You know, as much as it's been in the news with Nvidia GPUs and stuff
and the scarcity and the crypto stuff and now the AI stuff,
I don't know, I'm not super up on where the profit margins
come from, I know AMD of course makes processors as well,
and I actually have AMD processor and GPU
in the PC I built.
And I feel like I got, you know,
the Ryzen processor was like a good value for dollar
over the Intel one at the time.
Maybe, you know, they used to be two separate companies.
Now like they're combined.
I don't know how internally their company is structured,
where the profit margins are on GPU.
Like you said, you may say, oh, there's a large group of people who would really buy a large
amount of memory, not that much processing power. But it's possible that the die costs and the spin
up for that, especially when Nvidia could basically pay premium for any foundry costs
because they have, you know, far more demand than supply.
If you're sitting in second place, I don't know,
you may be forced to basically pay more for foundry costs
and things in order to be able to like,
get your chips made, right?
So it may make it difficult.
They may not have as much freedom
as they otherwise would want.
And I think the software drivers are hard as well
because there's so many people you need to appease, right?
You need to appease like people like you're saying,
like I just wanna plug it in and get a monitor working.
But then the video game people are like,
I want variable frame rates.
And then you need to appeal to the video game developers
who are needing to optimize the outputs on your card
and the wrappers that sit on top of it,
so OpenGL or DirectX or whatever.
There's all this different stuff swirling around,
and I actually just feel like this,
I don't even, that space just seems so complicated.
It's got to plug into a variety of motherboards, there's a variety of power situations, a variety of connectors, a variety of everything. Like, the amount of compatibility you need on a GPU is, I'd say, on par with, probably more than, almost any other part of the PC. It's just actually bonkers.
That needs to be compatible with all manner of software, all manner of OSs, all manner of hardware,
internal to the computer, external to the computer.
That's a lot.
Maybe that stuff's all really robust and ruggedized,
but I imagine just a ton of time
to get nailed down exactly right.
Yeah, that's a really good call out.
I wonder, and also like a lot of these things are very thankless.
It's like the guy who makes sure that you could plug the USB-C to display port versus the...
Because I had this machine working, display port to display port.
So I know that the machine worked,
but then as soon as I plugged in this other way,
I heard an audio pop and it was done.
So like you have to have a person to like test
all these different ways.
And then like, unless it breaks,
that person is not adding any value.
Like they're just reducing risk
and you don't know what's risky, what's not risky.
So this is one of these things,
it's almost like a really high level
of kind of performance management,
like value of the company kind
of thing that you have to fix.
And then on your specific issue, I mean, it could be a faulty card that, you know, hopefully
they would want to replace.
And then the question is, if you did it again with the same cable with the new card, would
it happen again?
I don't know.
I mean, if I would, I wouldn't risk it, but like, there's two failure modes there.
There's a failure mode of an individual card,
and then there's a design flaw that like every card
where you did that, it would sort of break.
And I don't, from where I am, I just don't know enough
about which specific problem it is.
But certainly, like you said,
it makes people have very bad sentiment.
But even if you look at the, and again, not knocking,
but like the Nvidia consumer cards
that were having like the plug you plug into the GPU
to give it extra power directly from your power supply
had so much power going through it
that they were like melting,
they need to like have it keep pulling on it.
Like, again, like you said, this is sort of thankless,
right, the dude or dudette who is trying to like design
the interface for some wire,
copper cable to come in and deliver power.
And no one has ever thought about them in their whole entire lives.
And now they're like, you know, front page of social media because these
expensive graphics cards are melting.
You know, oh man, this is going to be a distraction with the distraction.
But like one thing that's really interesting is like
what jobs are of that sort?
Where like if you do a good job, nobody cares.
Or like if you do a good job, nobody notices.
And what are the jobs?
Because it feels to me like in every career,
there exists like maybe that's,
maybe I'm trying to stretch too far here.
In many careers, there exists like jobs where you get praised for doing a good job
and nothing happens if you do nothing.
So like jobs where you're trying to like bring in more business or raise the
profit of a company or something.
And then there's, there's jobs where like you're trying to keep the lights on. And so people are kind of, it's really hard for people not to ignore you until something goes wrong. It feels like there's these two kinds of jobs. And it feels to me like the former is almost always like, better than the latter in terms of like your satisfaction and a lot of other things.
I was having this conversation and I won't give the details
because it'll make it sound like a political statement
one way or another and not trying to make it.
But anything where you're, and in this case,
it was government officials,
but anything you're dealing with a probabilistic event.
So something may happen, may not happen.
Even if it happens, the certainty isn't well known, right?
You think like weather forecasting, you know, whatever, like.
Yeah.
And when you get it right, it's sort of like, no one, you
could just do nothing, right?
And it would probably just be fine.
And then that one time, you know, it's standard deviations out, like very high. And then everybody's like, what did you do? And it's like, well, it's true, you probably could have done better or done these other things, but you could have been
doing those other things every single time and it either still wouldn't have made a difference
or wouldn't have been relevant. And so to your point, anytime you're dealing with something where, you know, like think
preparing for a holiday rush on a server that's doing e-commerce, right?
You could spend tons of money like building up, you know, extra CDNs and having flexible
compute and then, you know, like people come, they shop, there's no outages, no one knows
if you did a great job or a bad job,
but it didn't crash.
So like there's that, but if it crashes,
certainly you're getting hauled in
and told how much millions of dollars
you lost the website, right?
Yeah, I mean, it's pretty tragic that, you know,
things are just set up that way,
but I don't know if there's not any clear solution
or anything.
All right.
Well, book of the show.
Book of the show.
What's your book?
All right.
My book, this is going to be a little bit different, but I've not been reading as
much as I should or I want to, but I did just start because I have never read any
books by this author before and I often see it recommended in science fiction.
So I decided I'm going to read a book and tried to find the recommendation and this is
the recommendation I got, and that is a book by Iain M. Banks, and I chose The
Player of Games. Have you read any Iain M. Banks books? I have not. Okay, yeah, neither
have I. So I am trying to embark on reading this one. Apparently some of the
books can be a little hard to read which is okay. But this one people are saying is a good
introduction. There's not, from what I've seen online without trying to read any
spoilers, apparently there's not like a strong order you need to read the books in. It's not necessarily the first book, but it is an often recommended one.
So I'm starting here. That's not really a good recommendation
because I don't know whether to tell you it's good or bad
other than every other person I saw on the internet.
This seemed to be bubbling to the top
as a good starting place.
So I'm embarking on a journey here.
Cool, that sounds awesome.
Yeah, I might check that out.
I have a few books queued up that I have to get to
and then I'll check that out.
My book of the show
is Basic Role-Playing Universal Game Engine which is a reference book. It's
written by a couple of people who have been making tabletop games for
their whole adult lives and they've made a bunch of them. I think some of them
even from like the 70s and the 80s,
so it's kind of wild.
I mean, some of the stories of the creators. But basically
they synthesized.
So there's this question in general, like think about any art form.
You always think about like, what is the essence of this art?
You know, like you might as an artist, like draw a bunch of things
and then think to yourself, OK, what is like the essence? Like if I had to reduce something to its like most basic form, what would that be?
And so these guys got together and thought, well, if we had to reduce these tabletop games, because there's a lot of like lore built into it, you know, like so many games have magic missile. Why? Because Dungeons and Dragons 1.0 had magic missile, but like what really is a magic
missile?
I guess it's like an arrow made out of magic, right?
Or something, but like, you know, pare these things down to their essence and just explain. Like, literally, the first page of the book is like, the
point of an RPG is for your players to have fun.
So it's like, you know, it's like, let's start from like the first principle.
The first principle is that people should be having fun.
And it kind of builds up.
And so it's a combination of an instruction guide to making a game engine and a reference manual of like, here's a list of like hundreds
of skills from like all these skill based games we've ever seen.
I've kind of to be honest been skipping over a lot of the, you know, here's a list like
of like a million different types of armor because it's not what I'm interested in.
And what I'm interested in is like,
how do people create these game engines
and how do they keep them balanced
and how do they keep them interesting?
And so when I say game engine, just to be clear,
it is programming throwdown,
but this has nothing to do with programming.
It's literally like the math that like,
you could use this for like a card game
or a tabletop game or really anything.
It's like the math that keeps people kind of on the edge of their seat, right?
So it's like how do you have all these different options for your players but still keep them
kind of on the edge of their seat?
And at the same time, how do you do that in a minimalist fashion
where it's not like,
okay, I'm now gonna have to roll like 45 dice
to like build my character or whatever.
So these people tackle a lot of that.
And I'm about,
I wanna say maybe about halfway through,
as I said, skipping a lot of the pure reference stuff.
And it's really interesting.
I'm having a really good time reading it.
I've never played an in-person tabletop RPG or even,
I mean, I probably played a video game that somewhere
under the hood was running some sort of like roll checks
and chance checks or something.
I learned the other day, a spoiler,
something I'm gonna talk about in a few minutes,
that Pokemon was actually doing that
when the Pokeball rattles, it's doing like a, you know,
probability check and it can fail at each of the things.
And then that's when the Pokemon comes out, which I didn't know.
And maybe I'm completely wrong, but that's sort of like what the internet was telling
me, which is in line to your point with rolling a dice and getting certain values.
But never had the occasion to play one, but I'm endlessly fascinated by, like you
said, the sort of crafting of the stories and the storytelling and the fact that it's
a less game than, you know, a board game with rigid rules and more about, like
you said, having an adventure together, making it fun and entertaining and
collaborate, like collaboratively doing something which is somewhat still gaming but is also you are sort of being
flexible on the fly as well to you know keep it fun. Yeah exactly yeah exactly
like how do you let each person at the table have a unique character that
brings something unique while still like being able to handle a person not being there.
It's like, oh, you know, Jim's wife is having a baby,
so this chest has to stay locked.
He was the one with the keys.
He's sleeping in the inn.
Yeah, or he's the only one with lock picks or something.
Yeah, so the book is really interesting.
I'd recommend folks check it out.
If nothing else, it's a nice book to have
on your coffee table,
because it has kind of a provocative title,
Universal Game Engine.
All right, well, I spoiled it,
but tool of the show for me is a video game,
and I, for whatever reason,
skipped every modern Pokemon video game.
And so I think the last one I actually legit played
was when I got Pokemon Red in my Game Boy as a child
and played that to no end using,
and I was trying to describe this to my kids,
and I had to go to, when we would go shopping,
I liked the Walmart or Kmart,
and I would look in the strategy guide for why I was stuck.
And so I would go with my mom so I could go to the, you know,
video game section and like open the strategy guide and like, look,
because I wouldn't just buy it.
I probably should have just bought it.
But anyways, I didn't. I'd go home and, you know, get through it anyway.
So so Pokemon, right?
Anyways, I've been aware I've, you know, dabbled various times,
but I hadn't really sat down and played,
but I was sitting down and playing Sword and Shield.
I was playing the Shield variant,
but not super important on the Switch.
And I just hadn't done that in a really long time.
And I know it's a pretty big departure for the series,
but it was really kind of fun.
Like I was really into it.
I realized now that the game is easy, like it's not supposed to be challenging
to actually, you know, quote unquote beat the game.
So it's not that much of an accomplishment, but it was a great time.
And if you've ever been interested and you, you know, have a switch or whatever,
we definitely recommend checking out one of the newer ones, sword and shield.
I guess the other one I'm going to try now is Scarlet. Scarlet and Violet, I think it is.
Um, but.
Pearl or something.
Uh, yeah.
Okay.
I did.
Or that's a different, I think Pearl.
I think that's a remake.
I think that's a remake.
Oh, okay.
They did like a remake.
Yeah.
Like, uh, so anyways, if you hadn't checked one of those out, they definitely went, in some of them, a little bit more with open world sections, and you can kind of control how often you get into a battle versus, you know, just wandering around in a patch of grass
until it happens. So definitely some quality of life improvements over the old ones that make
it less frustrating and you know ability to save sort of everywhere you want. If you never checked
one out I guess this is me telling you the obvious thing of like, it's a thing and it's, it's kind of fun.
Yeah.
I played it with my kids and, uh, it's been maybe a year or two and, and, you
know, they would, uh, get frustrated.
And so I'd help them like, kind of optimize their characters a little bit.
Um, but more or less they could get through it eventually.
And yet the other thing is it's all the bosses and everything.
As far as I know, they have static levels.
So, you know, if you're kind of like my kids, just running around kind of aimlessly for a while, their Pokemon were super over-leveled, and that made the game even easier than if you're trying to speed run it. But yeah, that game is awesome. I think the open world added a lot actually,
like being able to really like see the enemies
and they run into you physically and then the fight starts.
Like that really added a lot, I think.
Yeah, and I think that it goes crazy deep though.
Once you like look on the internet,
there's all this like each Pokemon you catch
has different stats and there's a Rotoma.
I never paid attention to it other than like it has a type and certain moves or whatever. Very basic level strategy and that
was fine to get through the game. But when you look online you find out, oh yeah the competitive
stuff and people playing online and whatever that each Pokemon you catch has like different
base stats that have been rolled for that character. I mean it's not actual dice but
probabilistically generated, and so some are better than others, even if they're the same level.
All right.
So Patrick, do you want me to waste hours and hours of your life?
Is it going to be fun?
It's going to be fun.
Okay.
So, uh, later on go on YouTube.
Okay.
There's this guy who really understands the Pokemon mechanics and purposely.
So there's a, there's a Pokemon, I think it's a web based game.
It's probably not legal.
It's probably already shut down or something, or maybe it's sanctioned, I don't know, but there's this web based game where you can just do
Pokemon battles with other people.
And there's a, the ELO like for chess and everything, you rise up the ranks.
And so it's, it's literally just a battling part of Pokemon.
And so, um, this guy who really understands the mechanics, he makes
builds that are very unintuitively strong and he plays people who, um, uh, and he
must play a lot of people, but inevitably he ends up playing someone who, who starts
off like making fun of him and like, Oh, like you just have one Pokemon.
Like why didn't you build the other four Pokemon?
Haha.
You're so trash or whatever.
And then he wrecks them and they start raging and they start like, and then they
won't make their final moves and he's like, Hey, your time's running out.
And they just get so pissed and everything.
It's like the people who get the scammers upset or whatever.
It's that, but for gaming trolls and it is hilarious.
Oh dear, okay.
Now you have to send this to me,
but I feel like I'm gonna not like you for doing it.
Yeah, I mean, I don't know how many videos he has.
I'm pretty sure I've watched like five or six of them.
They're really, really funny.
I'm pretty sure I've watched like five or six of them. They're really, really funny.
Okay.
All right.
So, oh, my tool of the show is Features and Labels, or fal.ai.
There's a bunch of alternatives.
There's together.ai, there's fireworks.ai,
there's a bunch of them.
But basically these are people who are kind of a middleman between you and the AI models. So they'll host the open source ones. They'll often have agreements with the closed source ones, so you can run like the Google image models; otherwise you'd have to use some proprietary Google API or whatever. So you can think of these like a middle layer, and they often charge you, you know, per thing that you do versus you having to rent a machine for an hour, right?
The thing that, that, so the reason I picked FAL is I actually know the founders. So I, I'll just
put it right out there and say, I don't know if FAL is any better than any of the other ones,
but the user interface is really nice.
They have like a playground mode where you can just build things on the web.
And then you can click on the API button and get the Python code if you wanted to
make that programmatic.
The other thing they did, which I thought was really clever UX, you know, as
engineers, especially as people who have GPUs or maybe an M2 MacBook or something, we think
to ourselves like, yeah, I mean, I should just run Flux myself.
Like Patrick's run Flux, I've run Flux, right?
But when you go to fal, they're like, yeah, so you can run this model like 87 times for
a dollar.
Like basically for every model, it tells you how many times you can run it for a dollar, and that to me is really powerful. Because often, like, I have some code right now where I have a local version of Flux and then I have the fal version of Flux, and you can just toggle between one and the other. And, you know, sometimes I'll think, oh, I'll run the local version because I'm going to work and I'll just let it run
and it'll be done when I get back.
But then I'm like, yeah, it'll be done when I get back
or I could spend like $7
and this is just like done in like a second.
So it's like, they did a good job of kind of like
really laying out the economics,
which are themselves startling, you know, how the economics have changed for AI.
But they just put it right out there.
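One way to picture the local-versus-hosted toggle being described is a small wrapper like the sketch below. To be clear, this is only an illustration: the hosted endpoint, auth header, and payload shape are hypothetical placeholders rather than fal's actual API, and the local path assumes the Hugging Face diffusers FluxPipeline is installed and the machine can hold the weights.

import requests

USE_HOSTED = True  # flip to trade a few dollars for a lot of waiting

def generate_hosted(prompt):
    # Hypothetical hosted call: the URL, header, and response shape are placeholders,
    # not the real fal API; check the provider's docs for the actual interface.
    resp = requests.post(
        "https://example-hosted-inference/flux",
        json={"prompt": prompt},
        headers={"Authorization": "Key YOUR_API_KEY"},
        timeout=300,
    )
    resp.raise_for_status()
    return resp.content  # e.g. the generated image bytes

def generate_local(prompt):
    # Local path, assuming the diffusers FluxPipeline; slower, but no per-image charge.
    import torch
    from diffusers import FluxPipeline
    pipe = FluxPipeline.from_pretrained(
        "black-forest-labs/FLUX.1-schnell", torch_dtype=torch.bfloat16
    )
    return pipe(prompt, num_inference_steps=4).images[0]

def generate(prompt):
    return generate_hosted(prompt) if USE_HOSTED else generate_local(prompt)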
And they recently had something, they had something where they
they're able to do some caching of the,
I think, caching of the tiles.
The way the image transformers work is, you know,
breaks your image up into tiles. And I think they're caching tiles that are very similar
or something. I don't know. It's something I don't remember off top of my head, but it made the price
even cheaper. So I guess long story short, check out these folks. They're all awesome. I know the
fireworks people too. All these services are great, and there's an economy of scale that you can really take advantage of. Yeah, I mean, I think
all of it from like, someone was asking me with 3D printing, how much would it cost you to print
this? You know, I saw it in a shop or something and I was like, well, most obvious thing is how much
plastic it takes. But then you start thinking about it. There's depreciation of your machine, like wear and tear on your machine.
There's like the power to run it.
There's my time to like walk out.
Mine's in the garage.
Like walk out to the garage and like get it off or clean the, you know, build plate.
And so I think what you're saying is interesting too, that running it locally
to me is, I guess I just cheat by, I don't want
to give places my credit card.
I don't know.
And I have stuff and so I feel like I should use the stuff I have.
But you're right.
Like by the time you factor in the power to run it, like it's not free and you know, your
computer getting hot and the time taken.
And so the economies of scale, these really big, you know, server clusters
dedicated to this AI stuff, it's really kind of amazing
even with how expensive those really high end GPUs are.
Yeah, yeah, totally.
I think running it yourself is great.
Everyone should learn how to do it.
Definitely not discounting that, but check out these folks
and there's similar folks too.
It's a really neat service.
If you have something that you then say, oh, I need to run this like 200 more times.
Um, you know, your time is also really important.
So, um, all right, on to our topic reinforcement learning.
Um, so a bit of background here is like the opposite of like assembly language show where in this
case like this is my background, my area that I know a lot about.
I'll kind of dive into it and then Patrick is going to play the role of you folks and
stop me anytime I say something that is a buzzword in my community or doesn't make sense or something.
So I'll start really broad. There's basically three types of AI. There's supervised learning, there's unsupervised learning, and there's reinforcement learning.
So supervised learning is where the right answer is right there.
So, so someone gives you a picture and they draw a box around the stop sign.
They're like, there's the stop sign.
And your job is to learn a function that maps the picture to the bounding box of
the stop sign. And you're given a lot of these as ground truth, right?
And then you're also given a second set that you're purposely not meant to train on called the holdout set.
And if your training did really well, then you're able to interpolate between
all the other stop signs that could exist in the universe.
And so after training, if I was to give you a new image
you've never seen before with a stop sign in it, you could draw the box around the stop
sign. That's supervised learning. And under the hood that works through what we call a
loss function. So a loss function takes the output of your model. So in this case, maybe
it's a bunch of hypothetical bounding boxes. It takes the ground truth, which is the actual bounding box, and it turns all of that into
a number where the further the number is away from zero, the worse you got, the worse you
did.
And so a zero loss would be perfectly nailed that bounding box.
Now a zero loss might not be good, right?
Because you want the model to have some uncertainty. Like for example, imagine we're playing paper
rock scissors, right? And you play paper and I play scissors with like 100% certainty. That's
actually not good, right? Because although I won in this game, you know, we know that someone who
plays scissors a hundred percent of the time is not playing an optimal paper
rock scissors, right?
You could just, you could learn that and then just play, uh, wait, did I get it
wrong?
Anyways, you know, the analogy, you could just play whatever
counters what I just said, and then just win, right?
So, so often you're going to output a mixed answer, right?
A distribution of answers.
And so you're always going to have some amount of loss.
Um, but through, through what's called the learning rate, you don't like
totally change your line of thinking every time an example is
presented, right? You're just slowly moving in different directions as
examples are presented and if the learning rate is low enough and all the
hundred other things kind of stars align then you will create like a mixed, you
know, a mixed response that is optimal. So that's supervised learning.
Did I get that right? Any questions about that part of it?
Or did that make sense?
So you said, the only question I had is you were saying, so this makes sense
that you have the thing that you're trying to match and then you're tested on
something else, but you said interpolate between the results, but it should be
possible even with supervised learning, not strictly like interpolation to me means like
between the points given,
but you should even like in your stop sign example,
like you were mentioning for stop signs that are new,
the idea is hopefully you would also understand that those should have bounding boxes put around them.
Yeah, right.
So what you're hoping is that you could imagine
like a manifold, like a stop sign space.
And in that space, there's like a whole bunch of different kinds of stop signs.
And so inside of that space, there's stop signs that like look really different.
But hopefully they're like, they're within the space of stop signs that you've
already seen. So like an example where extrapolation doesn't happen: we've seen this with Waymo, where people will wear a t-shirt with a stop sign on it, and that's not a real stop sign, so that's an example of extrapolation. And in that case, the model doesn't really know what to do. So it thinks it's a stop sign. So, so really,
so there's a whole area around what's called out of bounds, um,
detection and out of bounds prediction. And long story short,
that's a very, very hard topic, but really important.
But by default, supervised learning will interpolate
at a really high dimensional space, right?
Interpolate between all these things that's seen,
but if you give it something totally new,
it's gonna have trouble.
Got it.
Okay, so unsupervised learning is where you don't have a loss.
Like there's not a ground truth, but what you do have is something that's kind of stateless and easy to evaluate.
So like the most common example is clustering.
So there often isn't like known a perfect clustering.
Like you might have millions of documents
and you wanna break them into a thousand clusters,
each one having 3000 documents.
And you want the entropy of each of those clusters
to be really small.
So you want all of them to be like really close together.
So you might never know the perfect clustering,
like you do at supervised learning,
but it's like trivial to like evaluate.
So I can like show you a set of clusters,
you could put the documents in the clusters
and come back with a score.
This clustering has a score of seven,
then I can make some changes.
They say, oh, this clustering has a score of eight.
It's a little better.
That makes some changes.
Clustering has a score of nine, et cetera, et cetera.
And so that's an example of unsupervised learning.
And so you're not even really trying to figure out
the best way to cluster.
Like a human is doing that,
but the computer is just kind of following the instructions
and then over time getting a better and better clustering.
And you can measure that. And so you might never get to the optimal,
but you can get closer. So unsupervised learning is a little bit
trickier in a sense that you don't have a ground truth.
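Here's a small sketch of that "score it, tweak it, score it again" loop, using a classic k-means style of clustering; the data and numbers are made up for illustration:

import numpy as np

rng = np.random.default_rng(0)
# Toy data: two blobs of 2D points.
points = np.vstack([rng.normal(0, 0.5, (50, 2)), rng.normal(5, 0.5, (50, 2))])

def score(points, centers):
    # "How tight are the clusters?"  Lower is better.  We never know the one true
    # clustering, but this score is trivial to compute for any proposed clustering.
    dists = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
    return dists.min(axis=1).sum()

centers = points[rng.choice(len(points), 2, replace=False)]  # random initial guess
for it in range(10):
    # Assign each point to its nearest center, then move each center to the mean of its points.
    dists = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
    assignment = dists.argmin(axis=1)
    centers = np.array([points[assignment == k].mean(axis=0) for k in range(2)])
    print(it, round(float(score(points, centers)), 2))  # the score keeps improving (dropping)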
Now reinforcement learning is in my opinion,
the hardest of these areas,
not in terms of you have to be the smartest to do it
or anything, but the hardest in terms of
getting good results is the most difficult.
And this is because you have all the challenges
of unsupervised learning,
where you don't have a perfect game
of go or a perfect game of chess to reference, but you also are making decisions.
In the unsupervised learning case, you're not really making any decisions.
There's a human making decisions or a human written algorithm making decisions and you're
just evaluating them.
But here you have to make the decision.
So it's like, here's a set of clusters.
How do I make them better?
And then you do that and then did they actually get better?
So if you were actually on the fly designing your own clustering
algorithm with AI, then that's reinforcement learning.
But the stuff that we talk about when we say at a high level like, oh, flux or this or whatever,
it may be using components that were trained with a variety of these techniques or use a variety of
these techniques, right? So it's not necessarily that a whole, I don't know what the distinction
there is, like a whole program application is one of these.
You're sort of talking like a little lower level.
You're saying like one part of that pipeline was done this way.
Kind of.
So in the case of Flux, that's all supervised learning. Specifically, it's what's called self-supervised learning, where you hide part of an image and you ask the AI to draw it. And then, because you hid it, you know what it used to be.
And so you show that to the AI and say, hey, you know, this pixel actually should be red,
but you drew purple or something.
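As a toy illustration of that hide-and-predict idea (nothing here is Flux's actual architecture; the "model" is just a made-up neighbor average):

```python
import numpy as np

# Toy "image": an 8x8 grayscale patch. Hide one pixel and ask a model to fill it in.
rng = np.random.default_rng(1)
image = rng.random((8, 8))

masked = image.copy()
masked[4, 4] = 0.0                 # hide a pixel (the "mask")
target = image[4, 4]               # the label comes for free from the original image

# A placeholder "model": predict the hidden pixel as the mean of its 8 neighbors.
prediction = (masked[3:6, 3:6].sum() - masked[4, 4]) / 8.0

loss = (prediction - target) ** 2  # ordinary supervised loss on the hidden pixel
print(loss)
```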
And so, yeah, that's pure supervised learning. Another way of saying it is: reinforcement learning
like does stuff, like takes actions
and supervised learning and unsupervised learning
kind of reveal knowledge.
So in the case of the stop sign, you know,
drawing the bounding box around the stop sign kind of reveals or synthesizes knowledge.
Like now you went from pixels to "there's a stop sign there," but it doesn't drive a car or turn a camera or take any action.
So as soon as you want to take an action, either the humans have to write that code, or, if you want the AI to take the action, now you're doing reinforcement learning.
So there's a bunch of different kinds of reinforcement learning algorithms, but there's basically
two axes that you need to think about. One is offline versus online.
And this is just a fancy way of saying, can I make mistakes?
So for example, the AI that plays Go, AlphaGo, let's just stick with AlphaGo Zero, is all pure reinforcement learning. In the beginning of training, it's just playing garbage games of Go. And that's fine, because it's playing against itself and you can't embarrass the computer. So it just plays garbage games
of Go and it gets better and better. But like you couldn't, for example, like drive a self-driving
car randomly until it got better. Like, you know, you can't do that. You'd crash the car,
people would die, be like total mess, right? So offline reinforcement learning is where
whenever you make decisions in the real world, they have to come with some kind of guarantee.
In the case of online reinforcement learning, you can just make decisions in the real world whenever you want at any point.
And so that's like a subtle difference, but it has pretty big consequences for the algorithms and everything else.
So online it's able to change itself and like update
and then offline it's sort of like you're wanting to make guarantees, you wanna know like,
I understand what it's gonna do, I've tested it in some way
and I don't want it sort of like changing what it's doing.
Right, right.
So online, you're willing to put any model into production. And yeah, I think sometimes they call it on-policy versus off-policy, but the nomenclature there doesn't matter as much. Those are the two kinds. Okay. And so then there's a second axis, or second kind of switch here, which
is value-based or policy-based.
So I'll go into this.
So, um, let's say you have to make some decisions, right?
And when you go to make a decision, like imagine a choose your own adventure book.
And whenever you go to make a choice, imagine I were to tell you: if you make this choice, you have this percent chance of reaching the best ending, and if you make this choice, you have this other percent chance.
Like you would just choose the highest percent.
Right. And you would just do that.
It would be like solving a maze with no walls, right?
You would just, you just like pick the highest percent every time until it's a
hundred percent and then you would win. Right.
And so the idea with value-based reinforcement learning is if I know the total
value of a decision and I know that for all my choices, then I've solved the problem.
I just picked the one with the highest value and that's just the optimal policy.
And so value-based kind of ignores the whole decision part of it somewhat and says the
game here really is figuring out the expected value because once I have that I'm set.
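In code, that "solved problem" case is almost embarrassingly small; this sketch assumes someone has already handed you accurate values (the numbers below are made up):

```python
# Hypothetical value estimates for the choices available at one decision point
# (e.g. a "choose your own adventure" page). These numbers are illustrative only.
action_values = {"go left": 0.42, "go right": 0.87, "wait": 0.55}

def greedy_policy(values):
    # Value-based RL reduces acting to this one line: pick the highest value.
    return max(values, key=values.get)

print(greedy_policy(action_values))  # -> "go right"
```

The hard part, as the next bit explains, is that those values depend on the policy you haven't figured out yet.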
Now here's where it gets tough, right?
Is let's say AlphaGo places a stone somewhere on the go board to start the game.
And it's playing itself or some other world champion or something, right?
It places that stone and its value is about 0.5.
It has like a 50-50 chance of winning the game
when it just started, right?
Well, let's say I, Jason, go and play
the world champion of Go.
And I put the same stone in the same position,
just coincidentally, for my first move.
I have a 0% chance of winning, right?
Because I'm not even close to a world champion.
I'm going to get wrecked, right?
So you have this paradox where the value is based on a policy. But if the policy is based on the value, you can see how this is circular reasoning, right?
And so getting the value is actually really, really hard for this reason. And there's
several algorithms. The simplest one is called SARSA, which basically says you make a bunch of
moves, the game ends or the episode ends, you stop driving the car or whatever it is,
then you just go back and you know what happened. So you assign the expected value. So, you know,
I turn the steering wheel here, I turn the steering wheel there, I hit the brakes, I hit the gas,
and then I made it home safe. Therefore, all those actions are plus one, right? You know,
Or: I turn the steering wheel, I hit the gas, I crash into a wall.
Therefore all those actions are minus one.
And then you feed that into your neural network, you do your training, and that's pretty simple.
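A rough sketch of that scheme, closer to a Monte Carlo return than textbook SARSA (which also looks at the value of the next state-action pair); the states and actions below are hypothetical:

```python
from collections import defaultdict

def update_values(episode, outcome, q_values, counts):
    # episode: list of (state, action) pairs taken before the episode ended.
    # outcome: +1 if we made it home safely, -1 if we crashed.
    # Every action in the episode gets credited with the final outcome,
    # averaged over all the episodes we've seen so far.
    for state, action in episode:
        counts[(state, action)] += 1
        n = counts[(state, action)]
        q_values[(state, action)] += (outcome - q_values[(state, action)]) / n
    return q_values

q_values, counts = defaultdict(float), defaultdict(int)
update_values([("approaching_turn", "steer_left"), ("straightaway", "gas")], +1.0, q_values, counts)
update_values([("approaching_turn", "steer_right"), ("straightaway", "gas")], -1.0, q_values, counts)
print(dict(q_values))  # "gas" averages out to 0.0 after one good and one bad episode
```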
Now, that completely ignores the thing we just talked about. There are other algorithms, like Q-learning, that try to address some of these challenges. It's really difficult. That doesn't mean value optimization doesn't have its place, but ignoring the policy makes it really difficult to optimize. It's good in situations where, like,
there's clearly one good action at any given time,
you just don't know what it is.
Like Atari is a great example where,
like the actions are binary,
there's usually like one good action.
Like if you're playing Mario or something
and you're about to run into a Goomba,
you either jump or die.
And so you jump, right?
But as soon as you get into environments where you need a mixed response, um,
like poker or driving a car or really doing anything in the real world,
it becomes difficult.
Any questions about value optimization? Did that make sense?
So I guess it makes sense. In these cases, like you were saying, you're trying to understand the outcome of a game. You're playing a game, you don't know what's going to happen, so you don't know if a move is good or bad until it's sort of too late. But by observing many, many games to their conclusion, you're hoping that when you go to do it for real, and I guess that's the offline part, you've built up an estimate of: given this context, what is the likelihood that each decision is good or bad?
Yeah, right.
Right.
And as your values improve, your policy improves, which means your values now are all inaccurate.
And so you're just like kind of iterating on this over and over again.
Yeah, you totally nailed it.
Okay, so the other type of algorithm is policy optimization.
And in this case, you say, at least in the most naive example, you say, I don't even
really care how good this action is.
Like I don't need to know what's my expected value of taking this action or anything.
All I want to know is I want to take actions that are good and I don't want to take actions
that are bad.
It's like, I cooked some eggs.
They're delicious.
I want to do more of that.
I touched the stove with my hand, not delicious.
Don't want to do that anymore.
Right?
Very simple.
So policy gradient is basically, and I'm going to do a lot of hand-waving here, but basically it says: get the expected value of this action. So you do have to kind of play out a whole series of events. But then, when you go back and look at what happened, take the things that were good, that had positive value, and do more of them, in proportion to how positive the value was; take the things that had negative value and do less of them, also in proportion. So things with really negative value, you do a lot less, right? It's a very simple concept.
One challenge right off the bat that you can see is
your expected value has to be centered at zero.
Like in other words, if all you can do is get points,
but you can never have a negative score,
then your system is gonna say,
do everything an infinite number of times,
and it's just not gonna be able to learn.
So you need what's called a baseline such
that you hope that roughly half the time you're getting a positive score and half the time you're
getting a negative score. Now for something like Go, it's trivial because you play a game and you
either win or lose and so unless you're playing like someone way out of your league in either
direction, you're hopefully going to win and lose about half the time.
You get a negative one for losing, positive one for winning, you're all set.
So Go makes this very easy.
But in the real world, it doesn't work that way.
And so actually figuring out how you can get half of the expected values to be positive is really hard.
And so you actually use a second neural network just to figure that out.
And that's called the critic.
And so if you hear the term actor-critic, the actor is just the policy
gradient that I talked about earlier.
And the critic, all it's doing is it's trying to figure out the baseline.
It's trying to figure out the average move, what would be the expected value of that,
so you can subtract that out and hopefully get a balance between positive and negative.
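A minimal sketch of that baseline subtraction; here the "critic" is just a placeholder constant, where in practice it would be a trained value network:

```python
import numpy as np

def critic_baseline(state):
    # Placeholder critic: in practice this is a second neural network trained to
    # predict the expected return from `state`. Here it is a made-up constant.
    return 0.5

def advantages(states, returns):
    # Advantage = what actually happened minus what the critic expected.
    # Subtracting the baseline recenters the signal around zero, so the policy
    # gradient gets both "do more of this" and "do less of that" updates.
    return np.array([r - critic_baseline(s) for s, r in zip(states, returns)])

states = ["board_1", "board_2", "board_3", "board_4"]
returns = np.array([1.0, 1.0, 0.0, 0.0])    # e.g. win/loss outcomes scored 1 or 0
print(advantages(states, returns))           # -> [ 0.5  0.5 -0.5 -0.5]
```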
And so this is the neural network you would use on a Go board to tell you like how good
your situation is?
Yeah, exactly.
So if you were using Go for an example, let's say you're trying to solve Go with a policy
gradient.
So you would say, I took a bunch of actions, I won.
Now I'm going to look at this one action. I got an expected value of one because I won the game. Now, what was the average expected value? Actually, sorry, what was the advantage is what I need to know. So I need to know: am I playing someone who's a total chump, so that even though I won, I can't really learn anything? Or did I play someone who's a grandmaster, and I actually learned a ton by winning? That's your advantage function, and so
you're gonna take the value function of the current
state. So at this current board state, what's the probability I win? And then
you're going to take the action you took and see what's the value at that state.
So if the probability of winning the game is 50%, but after I took my action
it jumped up to 60% then I know that
taking that action like caused an extra 10% and so that's my advantage and so
you know if you're playing someone of your level half of your actions are
going to cause your win probability to go down and half of them are going to
cause your win probability to go up.
Got it.
Yeah, so as I said, for Go you don't need it as long as you're doing self-play. But for something like Atari, where there is never a negative score, you need to know what was a bad action. And a bad action is one where your expected score went down after you took the action.
It's like, oh, I took the action to run into the Goomba and now my expected score
is a lot lower because I have one less life to go and collect points with.
So is there a conversion between lives and points then, or it's just simply that
because you lost a life, the maximum points you can get are reduced? It's the latter. Yeah, so all of that has to be inferred. So now, you could make it
explicit. So you could say, and this is something that we should talk about, it's called reward
shaping. So let's say in Mario, your goal is to get the most points. But that's kind of a really weird goal, right?
Because often when we play as humans, we don't even look at the score, right? So you might come
up with proxy goals. You might say, well, every time you eat a mushroom, you get points, actually. But let's pretend it didn't work that way. It's like: every time I eat a mushroom, I'm going to give myself extra points. Maybe the game doesn't give you enough points for eating a mushroom, and if you gave yourself more points for eating a mushroom, then the AI learns more easily. And to your point, you don't lose points when you die, but maybe you should. Like maybe if you lost 10,000 points every time you died, the AI would learn a lot more easily.
And so this is called reward shaping. It's where you, instead of learning the task at hand,
you learn a new task because those two tasks kind of go up and to the right at the same time.
They're correlated and the new task is just easier to learn.
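A tiny sketch of reward shaping as a wrapper around the raw score; the bonus and penalty numbers are invented for illustration:

```python
def shaped_reward(raw_score_delta, ate_mushroom, died):
    # Reward shaping: keep the original objective loosely, but add hand-chosen
    # bonuses and penalties that are correlated with it and easier to learn from.
    # The numbers here are illustrative, not tuned.
    reward = raw_score_delta
    if ate_mushroom:
        reward += 500       # nudge the agent toward power-ups
    if died:
        reward -= 10_000    # make death explicitly bad, even if the score doesn't drop
    return reward

print(shaped_reward(raw_score_delta=200, ate_mushroom=True, died=False))   # 700
print(shaped_reward(raw_score_delta=0,   ate_mushroom=False, died=True))   # -10000
```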
Yeah. So I mean, I don't know the scoring of Mario either, I never paid attention. But I guess you could get weird states otherwise, where you try to get a bunch of one-up mushrooms and then, on a particularly high-value level, basically keep dying near the end repeatedly in order to keep gaining points for that level, assuming you don't lose points when you die. And so you could get this very undesired behavior, because it realizes the way to get the maximum score is to keep dying, replaying the level, and gaining those points, when the actual thing you wanted was to get through the levels as fast as possible. You didn't really care about the score; you used it as a proxy, but it was a bad proxy for what you wanted.
Yeah, exactly. Exactly. And then on the flip side, like you might say, well, my goal is to get as far to the right as I can.
Right. But if you don't take score into account, it might just be really hard for the AI to do that.
Like the AI might just desperately do kamikaze jumps to the right where it gets killed,
or the AI might just not be incentivized to get mushrooms
and make it more survivable, that kind of stuff.
So yeah, reward shaping is a really big part of the problem.
And so when people go from manually designing systems, like, if this ad has this chance of getting liked, then put a highlight around it or something. When people build these things by hand, they have to deal with all these conflicting metrics,
and how do you reconcile,
oh, we showed more ads,
but there was more kind of racy, unethical kind of photos
and how do I deal with that?
And so reinforcement learning moves that problem
to the reward shaping phase,
but it doesn't really get rid of it.
You're always going to need to like better
and better understand kind of like your goals
and the nature of the problem and how to
kind of how to best like solve that problem.
Okay, so, okay.
So let's dive into the offline part.
So, you know, one thing a lot of people wonder is,
yeah, like AlphaGo plays against itself.
And so at the end, after you've used like a zillion GPU hours,
it's like a world champion,
but like, how do we do that in the real world?
Like clearly like babies don't just like run into walls
or like fall down.
Actually they do kind of fall downstairs if you let them,
but that's a bad example.
But like, but like, you know, as humans,
like the way we drive a car is, you know,
we have a person helping us,
but we're not just like randomly jerking the steering wheel
until we figure it out.
Like we have the sort of like base of common sense,
like we kind of draw from.
Like a model of how the car should work.
Exactly, exactly.
And we kind of project, using our mind:
we kind of simulate the driving experience
as best we can from watching other people drive,
watching our parents drive.
And we've built a simulation
of that on day one. And so there's a question of like, how do we do that with reinforcement
learning? And a big part of that is using what's called a trust region. And so a trust region is basically, it works like this. So let's say I play a bunch of games of Go.
And I'm a decent player.
I play a bunch of games of Go.
And now I go back and I watch all of my games, right?
This is just me as a human.
I watch all of my games and I look to myself and I say,
oh, I would have done maybe this move differently. I would have done that move a little differently,
but I'm not going to say like I would have done every move differently. Like as a person,
like we can't, that would put us into a really weird state where like we wouldn't really
know what to do. Right. So we would pick like a few key things that we would do differently
and then we would wait until it's the next tournament,
exercise those differences,
and then we'd repeat this process.
And so, you know, with reinforcement learning,
if you take a bunch of data
and have computers try to do policy optimization,
they'll just hallucinate,
just like we see with ChatGPT and these other things. They'll start hallucinating, like, oh,
If I play this move
I'm gonna get every single Go piece on the board, because there's some inaccuracy in the model. And the other thing is, it only needs one action to be inaccurate on the positive side to throw everything off, right?
All your values are now thrown off, everything, right?
So it's inherently kind of unstable.
And so what trust region policy optimization and proximal
policy optimization, what these things do is they basically
say, we're going to keep track of the actions that were taken
in the real world and what the model is doing.
And if the model doesn't match the real world enough times, we're going to stop training.
So, you know, in the beginning of training, the model is going to match the real world perfectly
because it's the same model, right?
Like you rolled this model out in the real world, collected a bunch of data,
and at the very first mini batch of training, the model hasn't changed.
And so it's going to output the same distribution, right?
Over time, the distributions are going to start diverging.
And because, you know, because you own the model, you can actually keep
track of the entire distribution, right?
So even though you actually pressed the gas, you know that the model was like 50-50 about pressing the gas or not.
And that's what you're going to log, right?
So now you're training, and you say, oh, well, the model that drove the car pressed the gas 50% of the time.
But the new model wants to press the gas a hundred percent of the time.
That's a pretty big difference.
And so maybe this would be a good time to like stop training and go drive with the new
model.
And so, you know, there's a lot more math than that, but that's effectively what's going on: these are halting criteria.
So you might not be able to train that much before you have to stop and go to the real
world.
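A simplified sketch of that check: compare the probability the deployed policy assigned to each logged action with the probability the updated policy now assigns, and halt once they drift too far apart. (Real TRPO/PPO do this with a KL constraint or a clipped objective; this is just the intuition, and the threshold is made up.)

```python
import numpy as np

def should_stop_training(behavior_probs, new_probs, max_ratio=1.5):
    # behavior_probs: probability the deployed policy assigned to each logged action.
    # new_probs: probability the policy being trained now assigns to the same actions.
    # If the new policy has drifted too far from the data-collecting policy,
    # stop training and go gather fresh experience with the new policy.
    ratios = new_probs / behavior_probs
    return np.any(ratios > max_ratio) or np.any(ratios < 1.0 / max_ratio)

# Logged: the old policy pressed the gas 50% of the time in these situations.
behavior_probs = np.array([0.5, 0.5, 0.5])

print(should_stop_training(behavior_probs, np.array([0.55, 0.6, 0.5])))  # False: still close
print(should_stop_training(behavior_probs, np.array([1.0, 0.9, 0.95])))  # True: time to re-deploy
```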
Got it.
So it's basically like, you're too far away from what we know, so you need to go try again.
You've changed a bunch of stuff a little, but your outputs are now very different.
So we need to go try again and see if it actually got better.
Yeah, exactly.
And then the last thing that I'll kind of cover here... well, there's a couple of other things.
So one is imitation learning.
And so this is pretty simple.
The idea is, um, you know, we talked about supervised learning, right?
And the stop signs, right?
But I can do the same thing with decisions. I could say, hey, when you see this situation, press the brake, right? And I'm just saying, as an expert, it's a ground truth, unambiguous: press the brake here, press the gas pedal here. That's called imitation learning, and that's just supervised learning.
So you could imitate a person, and then all the regular things apply there, of interpolation and everything we talked about. But that's not really reinforcement learning. And that's the
original AlphaGo, not AlphaGo Zero, where it was trained on all the human games, and they basically said, we want you to imitate the person who won.
Right.
Exactly.
And so AlphaGo did that as a bootstrapping phase, and then did reinforcement learning after that. AlphaGo Zero is where they got rid of the bootstrapping phase. So imitation learning is a good way to bootstrap, and that's kind of what we do.
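A minimal sketch of imitation learning as plain supervised learning on (state, expert action) pairs, here with a tiny PyTorch classifier and synthetic "expert" data:

```python
import torch
import torch.nn as nn

# Fake expert demonstrations: 100 states (4 features each) and the action the
# expert took in each one (0 = brake, 1 = gas). All numbers are synthetic.
states = torch.randn(100, 4)
expert_actions = (states[:, 0] > 0).long()

policy = nn.Sequential(nn.Linear(4, 32), nn.ReLU(), nn.Linear(32, 2))
optimizer = torch.optim.Adam(policy.parameters(), lr=1e-2)
loss_fn = nn.CrossEntropyLoss()

# Imitation learning is just this: supervised training where the "label"
# is whatever the expert did in that state.
for _ in range(200):
    optimizer.zero_grad()
    loss = loss_fn(policy(states), expert_actions)
    loss.backward()
    optimizer.step()

print(loss.item())
```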
Another thing I wanna cover is model-based
reinforcement learning.
Where, basically,
in the case of AlphaGo, it can play itself
because it's just a game, right?
Like it's an artificial environment.
But if you want to do, for example, a self-driving car,
as we talked about, you can't just drive randomly
while you learn what to do.
And so you have to construct a model.
You have to construct like a virtual environment
and then play within that virtual environment.
You know, the challenge now is, of course,
what happens when the virtual environment
doesn't match the real environment?
And so there's different ways to deal with that.
There's something called a joint embedding.
But long story short,
with model-based reinforcement learning,
you have this sim-to-real problem. So if you look up sim-to-real, you'll find a zillion papers on it, but it's
like, how do you take something that was trained on a simulator and bring it to
the real world and back and forth and back and forth?
Um, okay.
Yeah.
The last thing to cover is policy evaluation.
So, you know, with supervised learning, you have the truth.
So it's like, oh, I didn't draw the bounding box around the stop sign.
That's bad.
I drew the bounding box around the stop sign.
That's good.
And you can, there's a million different ways you want to count those errors,
but you can count them those ways
and just output that, right?
In the case of reinforcement learning,
you don't really have like a perfect game of Go
or anything like that.
So what you have to do is,
several different things you can do.
You can either use a simulator and say, oh, in the simulator, my model got better.
That's what AlphaGo does, right?
But in case of, let's say, self-driving, where maybe you don't want to trust the simulator,
there's another thing you can do where you run two models and the first model actually controls the car.
And the second model just produces what we call counterfactuals, which is just a fancy word for "I would have done this."
Right.
So it's literally a backseat driver.
So, so you then take those counterfactuals and what actually happened and you can figure out if the new model is better than the old one.
So for example, let's say the old model doesn't hit the brakes and the new model really, really wants to hit the brakes.
And then like half a second later, the old model slams on the brakes.
Well, that probably could have been avoided by the new model, because the new model was braking earlier, right?
It was like more predictive, right?
And so that would be a sign that the new model is a step up.
Similarly, if the old model slams on the brakes to avoid a collision and a new model would
have hit the gas, that's a bad sign.
That means that new model probably would have got you in a bad position.
So again, a lot of math behind that, but effectively that's the intuition behind policy evaluation.
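One common way to turn those logged counterfactuals into a number is inverse-propensity scoring; this is a sketch of the simplest one-step version, not what any particular self-driving stack actually does, and all the numbers are made up:

```python
import numpy as np

def ips_estimate(logged_rewards, behavior_probs, new_policy_probs):
    # Inverse-propensity scoring: reweight each logged outcome by how much more
    # (or less) likely the new policy was to take the same action than the
    # deployed policy that actually collected the data.
    weights = new_policy_probs / behavior_probs
    return np.mean(weights * logged_rewards)

# Logged data from the old policy: the reward observed after each action, and
# the probability the old policy assigned to that action (its propensity).
logged_rewards = np.array([1.0, 0.0, 1.0, 0.0])
behavior_probs = np.array([0.5, 0.5, 0.5, 0.5])

# Probabilities the candidate (new) policy assigns to those same logged actions.
new_policy_probs = np.array([0.9, 0.1, 0.8, 0.2])

print(ips_estimate(logged_rewards, behavior_probs, new_policy_probs))
```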
And one last thing about that is policy evaluation is harder than
solving the problem. Because if you have a perfect policy evaluator, then you also
have a perfect policy. You just take the action that the evaluator gives the
highest score to. So because of that reason, like versus like in supervised
learning, you could just measure accuracy,
just take the times you were right and divide it by the total time.
Very easy, just algebra.
But here it's like, not only is it hard to evaluate the policy, but it's actually harder
than solving the problem.
And in many cases, it's impossible.
So that's a big challenge.
And that continues to be a challenge today.
And so that's all.
So I've kind of covered all of the technical stuff.
I'll dive into a little bit of the large language model stuff, but before I do
that, any questions about the technical stuff?
So I guess the thing I'm missing is a little bit of the application side, which is understanding: some problems, like you're saying, are clearly not a fit for supervised or unsupervised learning, and so you could think, oh, maybe this is a reinforcement learning task. But then some things maybe have multiple approaches to solving them, so you could use reinforcement learning or you could try something else.
And then from a toolbox standpoint,
even today, I can just Google your stop sign example,
and there's 100 tutorials for opening up TensorFlow or PyTorch,
whatever.
Give it the images, give it the labels.
We talked about labeling, but give it the labels
and monitor your loss function.
Is there the same set of tools for doing reinforcement learning? Or is there also a canonical example that you would go to for the simple case?
Yeah, it's a really good question.
Okay, so, okay, the first part of the question, I think that my general philosophy is to use
the simplest tool for the job, right?
So for example, I'll give a really concrete example.
There was a place I worked at... I can probably say this. I'll just say it. I don't think it's going to be that controversial or anything, it's not that much of an exposé, but when I worked at Meta, we released the Oculus store, right?
And so you can go right now to the Oculus store and buy games for the Oculus Quest.
Right.
And so I talked to the product managers, they asked to meet with my team,
and they wanted to do reinforcement learning
to figure out what items to put on what places
on the storefront.
So when you go to like oculusstore.com
or whatever it is, the URL,
like what should they just show right there on the banner?
They call that the hero position.
What should they put in the hero position, et cetera?
And my response to them was like,
not only should you not use reinforcement learning,
you should also not use AI, right?
What you should do is like take the app that sold the most
and put it in the hero position just manually
and run that way for a month.
And then if you realize, if you just have this intuition, like, oh, there are so many people with so many different interests, and we're showing everyone Beat Saber and it's not going well, so we need to do some AI, then let's go there.
Right.
So like simplest tool for the job.
Like the simplest thing was just like a YAML file
with Beat Saber in it, right?
And so like they launched that.
And then I would say, you know,
if you can do something simple around the decision,
like say, okay, in certain countries,
I'll show Beat Saber and other countries,
I'll show other stuff.
And then now I'm dividing by some other demographics.
And the next thing you know,
you're kind of like building a decision tree by hand.
Okay, let me use the decision tree, right?
And so, and then at some point you run into
like competing interests where, you know,
I want the store to do well,
but I also want game publishers to share the benefit.
I don't want to just kingmake Beat Saber.
Now I have this competing economic model that's very complex.
Now we're starting to talk about reinforcement learning and some of that.
I would say stick with the simplest tool
for the job. Reinforcement learning often is much simpler than trying to like take actions by hand
and stuff like that. So for running a marketplace, for driving a car, you know, reinforcement learning
is a great choice. Yeah, and as far as tooling, the tooling is way way behind. There's
a lot of reasons for this. One of the biggest reasons is reinforcement learning can't really
be commoditized because it's too close to the decisions that companies make, which are sensitive. And so it's just very hard
to commoditize. I mean, we rolled out ReAgent, which was the most popular reinforcement learning platform for a while. Now there's OpenAI Baselines. So there's a bunch of places where
you can get the algorithms, right? But if you want like the real techniques, like how do I do offline evaluation?
A lot of these are proprietary. ReAgent actually has policy evaluation and all of that, so folks can definitely check that out, and the code base, as far as I know, is still active. But I think the field is just still too new for there to be really good practices there. But yeah, those are two awesome questions.
Okay, so I'll move on to RLHF. A lot of people found out about reinforcement learning when ChatGPT rolled out. RLHF is what makes the chat part of ChatGPT, what took it from GPT to ChatGPT. RLHF is a pretty simple idea. Think about it this way: GPT is imitation learning.
So a person wrote,
the frog or the fox jumped over the dog or whatever that is.
When you pick a font,
it always shows you that same sentence.
It's like the quick fox jumped over the lazy dog.
So that's probably all over the internet, right? Because it's in every font.
So GPT will imitate a human, quote unquote, if a human is all the content on the internet averaged, right? And so if you say, "the quick fox jumped," GPT will respond with "over the lazy brown dog," right? And this was trained in a
supervised way, but when you think about it as like it's a decision to put that
token there, like it's a decision to put that word there, then actually GPT is
making decisions. And so it becomes a reinforcement learning problem when
it becomes multi-step. So for example, you know, if I just need to predict the next word and I know
exactly what it is, that's supervised learning. But if I have like 10 different answers from GPT and I want to pick the best answer, like an
entire answer only gets one score, now it's a reinforcement learning problem.
Because I have to figure out, okay, this answer is better than that one, therefore all the
tokens that generated that answer are a little bit better, but we don't know how much.
And so RLHF is just a pretty simple algorithm where you say, give two answers.
If the system is more likely to pick the wrong answer,
it gets a negative point.
If it's more likely to pick the right answer,
it gets a positive point,
and now you do your policy gradient.
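A hedged sketch of that preference part: reward models in RLHF are typically trained with a Bradley-Terry style loss that just pushes the chosen answer's score above the rejected one's. The scores below are stand-ins for a real reward model's outputs:

```python
import torch
import torch.nn.functional as F

# Hypothetical reward-model scores for two candidate answers to the same prompt.
# In real RLHF these come from a neural reward model; here they're just tensors.
score_chosen = torch.tensor([1.2], requires_grad=True)
score_rejected = torch.tensor([0.4], requires_grad=True)

# Bradley-Terry style preference loss: push the chosen answer's score above the
# rejected answer's. The human only said "this one is better", never "by how much".
loss = -F.logsigmoid(score_chosen - score_rejected).mean()
loss.backward()

print(loss.item(), score_chosen.grad, score_rejected.grad)
```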
And so RLHF has been a part of these LLMs for a very long time. The thing DeepSeek did that made reinforcement learning...
Oh yeah, what is RLHF? What does it actually stand for? Reinforcement learning...?
Reinforcement learning from human feedback.
Ah, there we go. Okay. So yeah,
the person, a human is actually saying
this answer is better than that one.
Okay, so the thing that DeepSeek did
that was pretty amazing is it took the human part out.
And so it's just RLF.
And so the idea is it'll generate an answer
to a question that is easily verifiable. So for example,
they give it a word problem and they know the steps of the word problem and they know the answer.
And so they output two hypothetical answers and it comes back and says, hey, this one's better than that one, but it's all algorithmic.
So in math, there's systems that are, you know, they're very expensive to run and they're very specific to math, right?
They only solve math problems, but they're totally autonomous. So you can give them not just the answer, like twenty or something; you can give them the whole reasoning and the answer to a word problem, and the system will actually verify the entire thing. And so they replaced the human feedback with this expensive system, and then they ran it a zillion times. And what they found was the model that came out of it
not only could do math problems better than anything we've ever seen, but it became like
very thoughtful and reflective. And so the reality is what they found is if you treat every question like a math word problem, then you become like
much more reflective and like thought provoking and interesting in your answers. And so that's
basically what the DeepSeek folks have done, which is definitely like a huge leap forward
and really exciting. Does that part make sense? The RLF part of it?
Yeah, I think it makes sense. But ultimately, are they going back and tuning the output of what the LLM is doing, or are they tuning something that comes after the LLM?
No, they're modifying the LLM itself in all these cases.
Okay.
Yep.
So if you think about it, regurgitating the next token is actually a form of imitation learning.
So you're saying like these humans that have
written this stuff on the internet, they're
experts, and I'm trying to do the same action they're doing, where the action is writing letters.
And so then when I change the goal to be like,
solve this reinforcement learning problem,
it's still like a set of actions.
And so you can use the same model.
Oh, another thing I should mention is,
we talked about actor critic,
and we talked about how to get this policy gradient stuff
to work, you need to have positive
values half the time, negative values half the time.
So the challenge here is these models now are huge, right?
Like these LLMs are enormous.
And so if you need a second enormous LLM, then that's going to be really problematic.
And so what they did, which is really interesting, and it actually only works because there are no intermediate rewards, but that's kind of a detail, is they said, okay, we can't afford to have a second model. So what we're going to do is get the expected value of all these different answers to this math problem. So we're going to generate ten answers, get the expected value of all ten of them, and then we're going to basically normalize that number.
So for example, I get the expected value of all 10 answers,
and let's say the expected value is all 1,
except for the 10th answer, which is 2.
So I'm just going to normalize that so that all the ones become
like negative 0.8 and the two becomes
positive 0.8, or something like that. So they replaced an entire neural network with some simple algebra, and that's GRPO, or group relative policy optimization. So it's one of those things that's a really clever trick.
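A sketch of that "critic replaced by algebra" trick as it's usually described for GRPO: score a group of sampled answers and normalize within the group. The exact numbers won't match the off-the-cuff example above, but the shape is the same: the one standout answer goes positive and the rest go slightly negative.

```python
import numpy as np

def group_relative_advantages(rewards, eps=1e-8):
    # GRPO's baseline trick: instead of a learned critic, use the group itself.
    # Normalize each answer's reward by the mean and std of its group, so some
    # answers end up positive and some negative without a second model.
    rewards = np.asarray(rewards, dtype=float)
    return (rewards - rewards.mean()) / (rewards.std() + eps)

# Ten sampled answers to the same math problem; nine scored 1, one scored 2
# (roughly the example walked through above).
rewards = [1, 1, 1, 1, 1, 1, 1, 1, 1, 2]
print(group_relative_advantages(rewards))
```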
I have kind of mixed feelings about it. I do think that with intermediate rewards,
it's going to struggle. I think that maybe coincidentally or maybe on purpose,
but the fact that in this particular domain, you just get a reward
at the very end is one of the important causes for this approach working over like PPO or
these other alternatives.
Another interesting thing where they've kind of diverged is generally what people have done in
situations like this where you need a large actor model and a large critic
model is they've had the two share the same backbone. So for example you have a
neural net where the current state of your universe goes into the net,
and then the neural network outputs two things.
It outputs the distribution of actions you should take,
that's your actor output.
And then it outputs the expected value of the current state,
that's your critic output.
And so you still just have one model,
it just has one tiny extra output on it.
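A minimal PyTorch sketch of that shared-backbone design: one trunk with an actor head and a critic head. The layer sizes are arbitrary.

```python
import torch
import torch.nn as nn

class SharedActorCritic(nn.Module):
    # One shared trunk ("backbone") with two heads: the actor outputs a
    # distribution over actions, the critic outputs a single value estimate.
    def __init__(self, obs_dim=8, n_actions=4):
        super().__init__()
        self.backbone = nn.Sequential(nn.Linear(obs_dim, 64), nn.ReLU())
        self.actor_head = nn.Linear(64, n_actions)   # policy logits
        self.critic_head = nn.Linear(64, 1)          # expected value of the state

    def forward(self, obs):
        features = self.backbone(obs)
        return self.actor_head(features), self.critic_head(features)

model = SharedActorCritic()
logits, value = model(torch.randn(1, 8))
print(logits.shape, value.shape)  # torch.Size([1, 4]) torch.Size([1, 1])
```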
So you might say to yourself,
well, that's like pretty awesome, right?
I mean, that seems like a no brainer.
But the problem is that even though it's just adding one node, both of those
nodes are sharing that network.
And so they're kind of competing with each other.
You know, the critic model is going to be steering the entire network
towards producing
better values.
The actor part of the model is going to be steering the network toward producing
a better policy and they're going to be causing corruption in each other.
And so although you will see people have success with this for Atari and other domains, I think it's actually super destructive. And my guess is that the DeepSeek folks tried to have just one network,
one single network that does the policy and the actor, sorry, the actor and the critic
in just one network. And they realized that doesn't work, that it just causes corruption
and it just never converges and it just is a mess. And so they ended up falling back to this
approach
where they said, okay, well,
we can't have a separate critic model, it's too big.
We can't put a critic head on the LLM
because that causes too much corruption.
And so we're just gonna abandon the entire idea of a critic
and just come up with baselines on the fly.
And that worked for them, which is really cool.
Was that something that was unexpected?
Like, was that like a sort of, I don't know, I call it like an innovation to make that leap?
Or was it just sort of like, no, it was pretty obvious once I got there.
Yeah, so this is where it gets interesting is, I mean, so there's a lot of theories.
So I'll say, you know, I'm not at Meta anymore.
I don't work at OpenAI or these places.
And so I don't, you know, I don't really know like what's on the very cutting edge that
hasn't been released to the public.
There's speculation that OpenAI was already doing something like this, but they hadn't published it.
And so DeepSeek kind of scooped it. Even more speculative is the idea that somebody stole the idea from OpenAI and gave it to DeepSeek.
That is pure speculation.
But I would say the fact that OpenAI has a reasoning model now that is
comparable so quickly makes me think that like either they worked around the clock
or they were coming to the same idea, right?
And it's probably the latter. Probably DeepSeek saw where the wind was blowing.
And, um, and they both kind of came to that answer around the same time.
That would be my guess.
It makes sense.
So there's lots of cases like that, right?
I don't know.
Online, I saw someone using the term nerd snipe.
Wait, what does that mean?
Same kind of idea.
Like, say you're a YouTuber, and you're working on some
like new cool project, you know, you think is like crazy and innovative and someone
else just releases a video of the same thing because you didn't get out fast
enough. Or, like you said, there are, I mean, what, probably a half dozen super serious competitors and like a dozen within striking range of doing these kinds of similar things. I don't want to demean them by just calling them chat, but, like, question-and-answer AI agents, agentic stuff, the reasoning, all of these.
And so.
Like you said, you're hard at work
trying to refine a project.
You're not sure if it's a big enough innovation,
whatever, and then someone else just goes ahead
and releases it.
Yeah, and so you get sort of sniped out of it, right?
Like someone got it before you,
just before you were gonna do it.
Yeah, yeah, totally.
I mean, I think one of the trends was that like math,
answering math questions was becoming like a big benchmark
that was very important.
And so I think that led a lot of people
to the same conclusion.
Like if the metric had been, say, write the best play
or something, then we might've ended up
with a totally different system.
But I think once people got excited about solving
these like high school
math problems, I think then, then that kind of set the course for,
for all these companies.
Answering math stuff was really bad for a long time. So yeah, it's kind of a thing.
And I feel, and maybe I'm kind of wrong, I feel like coding maybe is one of those things. When you think through LeetCode-style problems, there's a definite setup where you're given a very high-level question. And there are benchmarks that already have these in there, but I still feel like performance isn't amazing once you get off the benchmark, onto things they've not seen before, right? So when you give these sort of high-level problems, and you have a very specific known output for the program, and it should be compilable, it's harder, yeah, of course, than the math problem, but it seems within striking distance.
Yeah, I mean, you know, traditionally in machine learning,
we have this concept called leaking the label,
which means like, you know, if you took, okay,
if you took the
examples you trained on for the stop sign trainer and you just fed them back
in and you get them all right, that doesn't mean you have a perfect system
because you're kind of cheating, right?
Like you might have just memorized all those examples, and it's possible you can't know anything else.
Right. Yep.
But the problem is: how do you not leak the label when you're training on the entire internet?
And so I think what they've found in a lot of these cases, where the AI solves math problems or the AI solves LeetCode problems, is that they've leaked the label, and the AI is literally outputting an answer that some other human wrote to that LeetCode problem.
And so they've done experiments where they've released
things that they know are not on the internet
and the AIs have struggled mightily with it.
I feel like until we get proper calculator use
and tool use more broadly,
I think it's gonna be very hard for AI to solve these
problems.
Well, this has been a great topic and very timely.
I know you've been working on reinforcement learning for a long time, but I feel, like you said, it's kind of reached a certain hubbub in everyday discussions recently. So I'm happy to have a sort of great overview
of what it is and what it's about.
Yeah, totally.
If folks have any questions,
they can just reach out on our Discord or email
or in my case, social media.
We need a reinforcement learning algorithm
to get Patrick on X.
That needs to be the next thing.
Okay. Well, but yeah, so I did look it up. DALL-E 1 was four years ago, you were right. That was very good. And then DALL-E 2 is what I was first trying, and that was three years ago.
Nice. So you're actually very accurate, despite it just being off the top of your head.
Well, I remember, you know, it's these things where you connect it to stories. Like, I remember there's this woman, she's very influential in AI. Her name is Fei-Fei Li.
And I remember being at this dinner, and she and her student were there. And she said something like, and this was, again, a long time ago, but she said something like, oh, we were writing captions from images. So basically, given an image, write a caption, for accessibility reasons. And Facebook, I think,
still has that in the product today. It's like if you're blind or something you can click on an
image and it'll say what is going on in the image. I remember her saying, that's cool, but it'd be really cool if you could go
from the description and create the image.
And that always stuck with me.
I mean, it was like a decade ago.
That always stuck with me.
And then I remember when Dolly came out,
I was like, wow, it's like the,
something that I thought was like a joke,
but then it really happened.
Like is, for me it was like an amazing experience.
That's why I remember it.
I think people have started to get a little fatigued on the AI thing. And it's hard to know, because you always hit plateaus: is it plateauing in terms of actual functionality, or is it on an exponential? And exponentials always look self-similar no matter where you look, right? So we just can't feel the growth. And then you tell stories like you're saying, even about DALL-E being only four years ago, and talk about Recraft or Flux now versus DALL-E just three or four years ago. It's not that long, and they're lots better.
Yeah, I mean, yeah, actually that's a good point.
I'll end with where I think this is going. Despite loving reinforcement learning and everything,
I don't think that AI should be making a lot of decisions
in isolation.
I think that it should be working together with people.
And so, you know, Recraft is a great example, where it's not just an API you call and get an image; it's an experience, and you iterate and you say, hey, I want this to be all different, or hey, I want an axe in this person's hand, or a phone, or whatever, right? And so I think it's going to be really about collaboration. Reinforcement learning is always gonna be really important, but it's gonna be important
in the way that the actions are more like
suggesting things to people.
So in other words, reinforcement learning
to like book a flight for you, probably not a good idea.
Cause if one out of a hundred times you go to Tokyo
by accident, right, you're gonna be pretty upset.
I might be happy, that sounds great.
Yeah, actually Tokyo is amazing.
Yeah, actually, I don't wanna name anywhere we don't wanna go, 'cause we have a list... anyway.
So Antarctica.
Okay, sounds horrible.
But you still need reinforcement learning to suggest things: come up with three different hypotheticals, send it to the person, decide should you text them or email them, et cetera. There's still a lot of decisions to be made. But I don't believe tons of people are gonna lose their jobs entirely. I think that work is gonna change, just like it did with the invention of the motor and stuff.
Well, you heard it here first.
The Jason take: the future of AI is not too scary.
Yeah.
Yeah.
Don't be worried about it.
Just be adaptable.
If you're adaptable, I think you'll be just fine.
And, oh, coding will probably be one of the last jobs to be eliminated, by the way. So if you're worried about that, yeah, please stay in coding.
Not just because we want you to keep listening
to the podcast, but tell all your friends
to get into coding, stay in coding.
If they're worried about their job being eliminated,
they should be a coder.
That's like one of the last jobs that's gonna go.
I mean, trust me on this.
Like we are going to lose so many doctors. We should probably lose all the CEOs before we lose the
coders. Oh, no, no, no. All right. All right. We got to wrap. We got to wrap, guys. We got to wrap.
They're phoning me from the other room and telling us we're out of time.
That's right. Hey, I didn't say anything about HR.
What's that? Oh, yeah, we're wrapping up.
All right.
This was so fun.
Thanks everyone for tuning in and thanks Patrick for bearing with me and my rants on
reinforcement learning.
This is great.
Very illuminating.
This is awesome.
I learned a lot today.
So.
Cool.
All right, everyone.
We'll catch you later.
Music by Eric Barndollar.