Programming Throwdown - 180: Reinforcement Learning

Episode Date: March 17, 2025

Intro topic: Grills

News/Links:
- You can't call yourself a senior until you've worked on a legacy project
  https://www.infobip.com/developers/blog/seniors-working-on-a-legacy-project
- Recraft might be the most powerful AI image platform I've ever used — here's why
  https://www.tomsguide.com/ai/ai-image-video/recraft-might-be-the-most-powerful-ai-image-platform-ive-ever-used-heres-why
- NASA has a list of 10 rules for software development
  https://www.cs.otago.ac.nz/cosc345/resources/nasa-10-rules.htm
- AMD Radeon RX 9070 XT performance estimates leaked: 42% to 66% faster than Radeon RX 7900 GRE
  https://www.tomshardware.com/tech-industry/amd-estimates-of-radeon-rx-9070-xt-performance-leaked-42-percent-66-percent-faster-than-radeon-rx-7900-gre

Book of the Show:
- Patrick: The Player of Games (Iain M. Banks)
  https://a.co/d/1ZpUhGl (non-affiliate)
- Jason: Basic Roleplaying Universal Game Engine
  https://amzn.to/3ES4p5i

Patreon Plug: https://www.patreon.com/programmingthrowdown?ty=h

Tool of the Show:
- Patrick: Pokemon Sword and Shield
- Jason: Features and Labels ( https://fal.ai )

Topic: Reinforcement Learning
- Three types of AI: Supervised Learning, Unsupervised Learning, Reinforcement Learning
- Online vs Offline RL
- Optimization algorithms
  - Value optimization: SARSA, Q-Learning
  - Policy optimization: Policy Gradients, Actor-Critic, Proximal Policy Optimization
- Value vs Policy Optimization
  - Value optimization is more intuitive (value loss)
  - Policy optimization is less intuitive at first (policy gradients)
  - Converting values to policies in deep learning is difficult
- Imitation Learning
  - Supervised policy learning
  - Often used to bootstrap reinforcement learning
- Policy Evaluation
  - Propensity scoring versus model-based
- Challenges to training RL models
  - Two optimization loops: collecting feedback vs updating the model
  - Difficult optimization target
  - Policy evaluation
- RLHF & GRPO

★ Support this podcast on Patreon ★

Transcript
Starting point is 00:00:00 programming throwdown episode 180 reinforcement learning. Take it away, Patrick. Welcome to another episode. This is going to be a good one. Excited to be here actually, because this is a topic I have been meaning to learn about and Jason has agreed to put on his professor hat, robe, I don't know what it is a professor wears. I got hooded when I got the PhD. I got hooded, which I thought would be an actual hood, but it's really just a sash. Wait, what is getting hooded?
Starting point is 00:00:47 That's like what you get when you get, I don't know about this. Okay. So when you get a PhD, you get hooded, which means you go through the same ceremony as the master's students, or I think the same ceremony is everybody, but you get a hood, which is actually a sash and your PhD advisor actually puts the sash around you over you as part of the ceremony. Okay. I feel like maybe I've heard that term, but I always just kind of had some weird
Starting point is 00:01:16 probably bad association with hood winked. But, uh, anyways, okay. Where are we off topic? Anyways, it's fine because I actually, I actually, the first thing off topic. Anyway, so just- That's fine because I actually, the first thing I think of is actually, because I grew up in the inner cities. I was like, okay, we're going back to my childhood here. Oh, okay, interesting.
Starting point is 00:01:36 Okay, wow. All right. So today we've learned there's many associations of the word hood. So okay, we didn't even talk about cars yet. So, you get a new upgraded carbon fiber hood for your car. You get hooded. Okay, we have to have people like, what is going on? What are we listening to? Oh God.
Starting point is 00:01:56 Well, that's how you know we're not the AI. They would not be this off topic. That's true. They would definitely stick to the script. The AI is not allowed to say this stuff. You would definitely be pushing the script. They are not allowed to say this stuff. You would definitely be pushing the down vote button, you know, oh wait, people probably are now. That's why it's not live.
Starting point is 00:02:12 Okay. So for, and I'll keep it brief because actually I want to get to the meat of the story today. Oh, no pun intended, but talking about cooking outside, I had a grill on my like back patio which I would use to cook cook food occasionally. It was uses these little pellets of wood so it's called like a pellet grill so like pellets feed down and it burns it and makes the heat and has an electronic controller and I would do some like you know smoking on it and some some grilling anyways it, it broke and it's old so, you know, okay, fine. So I went to go like, okay, I can get a new grill. I did not know. There are so many different
Starting point is 00:02:52 kinds of grills that are, you know, like popular now. And I feel like growing up, my parents always just had a, yours are kind of one of two things. You were the charcoal Weber grill, you know, with the like the bowl and the charcoals or you had the propane grill grill, you know, with the like the bowl and the charcoals, or you had the propane grill that, you know, like you had the tank and you hooked up the hose and that was the two. But now there are like, you know, all sorts of things where it's, you know, infrared cookers where the propane goes into like a, some sort of like catalyst something and like turns out like the patio heaters that, you know, are at restaurants sometimes.
Starting point is 00:03:24 Yeah. It's like, I don't know. And then there's these like egg shaped, I guess they call them Komodo girls, like big green egg and Komodo. And they're like big ceramic things. And then you can get like various kinds of cabinet smoke. Like anyways, I, I just, maybe I'm naive in my, I just bought something straightforward and simple. And then I like, oh, I'm overwhelmed by the tyranny of choice. That's just all I was going to say.
Starting point is 00:03:50 If you've never kind of looked up grill technology, it's actually kind of crazy. There's got a lot of choices here. So now I don't know what to pick. That's wild. I have a propane grill and then I have a smoker. I have a separate electric smoker that takes the pellets and smokes meat. Um, but yeah, I think the green egg can do both. So it's like a two in one.
Starting point is 00:04:13 Um, and yeah, it's just wild. And then there's like pizza ovens now people are doing. So like when I went to go look at the grills at the hardware store, it was like, there are also pizza ovens here and yeah, I, okay. Yeah. My neighbor has a pizza oven and I think he's used it twice in four years. Well, I mean, you know, how many times do you eat pizza? I mean, and also it's like, if you eat pizza, you're often in a hurry, so you're either ordering it to go
Starting point is 00:04:37 or you're doing the regular oven because you're in a hurry. Okay, all right, so you're down on the pizza oven. That's a short on the pizza oven stocks for Jason. Yeah, I mean, I'm not a big fan of the pizza oven. I think, I've only ever endorsed one stock on my entire programming through down career and that was Data Dog. And I think it's like up just as much as everything else.
Starting point is 00:05:03 I made one like in hindsight, in five years of hindsight, like relatively neutral endorsement. Everyone's bracing themselves for the meme coin announcement now. Oh, we need a programming throw down coin. No, we don't. I'm not. No, I'm not. Right. Pulling people. Okay. I know. Okay. We got to keep going. We got to keep. We are not. We are not. Rug pulling people.
Starting point is 00:05:22 We went the opposite way. We stopped doing ads. So it's like the opposite of rug pulling people. All right, time for news of the show. So I've got the first one, and this is an article entitled, you can't call yourself a senior until you've worked on a legacy project. So talking about what is a senior engineer,
Starting point is 00:05:41 this is like an age old debate, whatever. Anyways, this person was kind of pointing out how they hadn't really worked on legacy code base. There's some specifics here of their thing and if you want to go read it, go to article. And the point though is pretty interesting that they kind of rightly wanted to avoid working on a legacy code base. They ended up kind of doing it, they were right, they didn't like it, but they actually learned a bunch of stuff that they didn't. And I think a couple interesting takeaways for me from the
Starting point is 00:06:13 story and just you know thinking on the topic is about regardless of the label of senior, like just growing as an engineer is no matter what work you're doing, finding the takeaways that are applicable. And lots of analogies, the one I've taken to using recently, just for myself and for others that I talked to about this, is like just really trying to compound the growth. So not just thinking like, hey, how do I do this thing? But like how do I think about additively like applying things I've learned before
Starting point is 00:06:43 in a way that like my growth sort of grows on top of itself and you're sort of stacking it up. And sometimes you need to widen the base, right, of expanding into new things, but other times you're trying to build up and trying to apply these different experiences. And so I think partly this plays into that. And then there's an observation here specifically
Starting point is 00:07:04 about legacy code bases in your place of work and understanding why maybe something isn't done that way anymore or why the stuff that you see in like the the kind of current pieces of tech are how they got there right people will say oh it's organic growth or you know whatever you of get there. But I think there is something different between saying, this is the current recommended practice and I have done it the other way and I will tell you the other way sucks, like we're doing it this way. Those two things come from slightly different places and understanding why you do something, not just, there is value in actually just not knowing, well, this is
Starting point is 00:07:44 kind of bad, but when you get into style guidelines and stuff, right? I think just picking away and having everyone do it is, is useful because it, it really does matter. Um, but then there are things also that even if you don't always know exactly why, uh, eventually kind of figuring them out and digging in. So the one I always use, uh, in, uh, in C plus plus is the turn area operator. So you can write this Boolean expression, put the question mark, and then the thing that is true first, and then a colon, and then thing that is false if it's false second. And you can use this, and we have it banned
Starting point is 00:08:16 in our code base. And the reason why is it literally does nothing unique. You can't, okay, there's like some very rare, you know, use case. Someone could come up with it, you know, in a const expression or something. But for most part, you're just simplifying writing an if-else statement. But the cognitive load to read the ternary operator, make sure you understand what it does. And that a new engineer showing up has the same practiced expertise at reading that. Like, why? Like, Like why just because features are there doesn't mean you have to use them.
Starting point is 00:08:47 And I try to explain this to people, but I would argue even not knowing my explanation and still doing it leads to good practices, but knowing why and having tried using all of the whiz bang features from the latest, you know, C++ update constantly and refactoring code just to rewrite it into those features. Having done that once and burning your hands probably
Starting point is 00:09:08 teaches some lessons and so code bases can be really useful. Totally, totally agree. I mean the equivalent in Python, which is even more confusing, they have a ternary operator where you can say like x equals three if foo is true else five. So it's a ternary operator, but that you switch the first and the second position. So it's like even harder to read. And you know, oh yeah. And so I remember like,
Starting point is 00:09:38 there was someone on my team who would do this a lot, like all over the place. And you know, I let it go. I didn't really push back on it because to your point, like it's not until you ban it, it's not banned. And so you can't really say like, don't do this because you have no like moral grounds other than your intuition.
Starting point is 00:09:58 And then it was a disaster. And so like people just kept getting burned by these like really long in line, you know, like conditions, right? And so now like I can ban it and I don't feel insecure about it or feel like hesitant about it. I don't feel like, oh, you know, it's not really against the rules. Now it's like, no, like I've done this. I've seen people like cause all sorts of issues and some issues went to prod. And so we're not doing it. Now that the one of the tough things is, you know, if you're talking to folks who don't have that experience, you have to like ban it in a way that is shows empathy and
Starting point is 00:10:39 doesn't like create any resentment or anything. And there is this balance that I think is useful, but often gets brushed away that bringing new folks and you actually want them to feel empowered to question. So when they see that there's a ban on this and they say, I love the ternary operator because it makes me look cool. You know, they're going to say that part, but you know, I love the ternary operator, you know, why is it banned?
Starting point is 00:11:04 I don't think it should be banned. And you actually want to take time to explain them and in some cases be willing to hear them out and maybe, you know, adapt your practice or be flexible. But in other times, like you said, I think the word confidence there is like, no, we've done this, like I hear you, but you're just going to have to trust me that like we've tried it the other way and the other way, like not banning it leads to problems. Yeah. Yeah, totally.
Starting point is 00:11:28 Um, yeah, I mean, I, just to wrap this up, a question I always ask in interviews, uh, if I'm doing a technical design interview, I'll always start with the question of like, tell me a time you refactored something and why did you refactor it? Like what led to the decision to do a big refactor? And that usually like opens up all sorts of interesting things because you know, people, you know, the, the worst answer is the one that totally neglects the conflict that comes from scarcity, right? It's like, you don't have enough time, but the code is garbage. And it's like, so it's like that creates conflict
Starting point is 00:12:08 and then you have to resolve that conflict one way or the other. That's interesting. If someone's like, yeah, you know, I rewrote it and it was the right thing to do. And everyone agreed with me from day zero to the day I wrote it. And everyone praised me at the end.
Starting point is 00:12:22 It's like, okay, well, you know, that's a little unrealistic. I was gonna ask you, does anyone ever tell you, because I didn't write the code. So therefore it could of course be better. I've got that. That's how people really do, but I would be surprised if someone said it. Yeah. And people are like, oh, it was, you know, another team and that team, you know, we inherited their code and it was garbage. And so I rewrote it all. And it's like, okay, not the, not the best answer. Um, my new story is recraft might be the most powerful AI
Starting point is 00:12:56 image platform I've ever used. Here's why. And it's a Tom's guide article. Um, honestly, like recraft is definitely the most powerful AI image system I've ever used. I found out about it yesterday. I haven't ever heard about it. Yeah, this is very, obscures is the right word,
Starting point is 00:13:13 but like I'm really into this stuff, like generative AI, I'm following it closely, and I hadn't heard about it until yesterday. It can do some amazing things. For one, it can produce vector art, like SVGs. Now the SVGs are... Like if you or I were to create a stop sign, for example, we'd create like a white octagon and then we'd create a red octagon inside the white octagon, and that's how we'd get a stop sign with a border around it. Right. But if you use this program, it's going to give you like a red octagon and then a bunch of white polygons around it.
Starting point is 00:13:49 You see what I'm saying? Like it doesn't have a concept of like each edge would be. Yeah. Yeah. It's all one layer. So that's not ideal, but it's a step in the right direction. It's the only thing I've ever seen that really will give you an SVG. Like even if they're doing it post-hoc or something.
Starting point is 00:14:15 It's extremely responsive to the prompt. So for example, one thing I've tried with a lot of these AI systems is I've said like a person holding nothing in their hand, because I'll say like a person doing XYZ and I'll get a person like holding a phone in their hand. And I'll like, and I'll type the same prompt like with nothing in their hand. And then they'll have two phones in their hand. And then it's like, it's like, it has a hard time especially with negatives. So then I'll try like a person unarmed, you know?
Starting point is 00:14:41 So it's like, there's not like a negative there and it'll still like not work. But with this system, it's like very good at following instructions, even negatives. So it's phenomenal. The other part of it is it's got this cool workflow where you can take an image and then you can say, okay, now put a phone in their hand and it'll make like a new image of the same person with a phone in their hand, as opposed to like, you know, getting a totally different person. So it's really cool. Very, you know, easy to use, relatively cheap. So I would highly recommend folks check it out. It's pretty neat. folks check it out it's pretty neat. What are they using kind of under the hood do you know like are they using their own so some of these craft up other stable diffusion whatever you know is this like a layer on top of stuff it sounds really good but yeah this is a totally custom thing. It's a pretty big model it's a 20 billion actually the 20 billion is the v2 there's a v3 model which i think is even bigger
Starting point is 00:15:46 um they they're not open source so we don't really know what they're doing they might have a blog post about it i haven't seen one yet um i'm assuming it's the same type of technology where you're doing like self self-attention and then you're doing like you know masking and trying to uncover masks and whatever like the the i don't think they're really pushing the envelope on the the base model but then they built a bunch of really impressive things on top of it that's awesome i think that's one of the debates like where's the magic is it in the ui and the like you know higher level abstractions or is it in the the base, the deep stuff, you know, I don't know that we haven't answered.
Starting point is 00:16:27 I think there's lots of opinions, but. I will say, I would say that I tried the Dolly. When was Dolly popular? Oh, that's, that's a four year and a half, two years ago. No, that long? I think so. The first Dolly? No, I don't know.
Starting point is 00:16:41 Like whenever it was making the big rounds, I feel like it was maybe two, three years ago. Oh, okay. Okay. Maybe four years ago. Anyways, I recently't know like whenever was making the big rounds. I feel like it was maybe two three years ago Okay, okay, maybe four years ago. I Anyways, I recently did download on I have a MacBook Air and I downloaded one of the like on device Stable diffusion to run flux and they just have like an app you can download So you don't have to do it the command line with Olamma or something You just like download that and it will download the model from I think a hugging facema or something, you just download that app and it will download the model from, I think a hugging face link or something. And it downloads the flux and you'll generate it
Starting point is 00:17:09 and it takes, I don't know, it's like 15 to 20 seconds or something to generate an image. But it's crazy that first of this image are much better than when Dolly was making the rounds initially. You kind of wrote it off, it didn't really obey your prompts. It would make cool pictures, but anyways, so now the flux stuff and it runs like on my computer and it's free. Like the models are open source,
Starting point is 00:17:30 the program's free. So it's running locally. There's no subscription, you know, and it, you know, obviously it has to be using less power because it doesn't have an external GPU or anything. Yeah, the Flux models are super, super impressive. There's a package called M-Flux, which is flux optimized for MPS, optimized for the Apple processor. So you can use M-flux and it'll run in like half the time or a third of the time or something.
Starting point is 00:18:00 It's really impressive. And this re-craft thing really takes it to another level. So to your point, it'll only be a matter of time before there's an open source version of recraft. But at the moment, they have a monopoly. Well, that's another debate the open source was closed. But we'll just keep moving on. Yeah, right.
Starting point is 00:18:17 Open weight, yeah. Oh, oh, yeah. So, okay, no, no, no, okay. No, okay, respected. My next one is NASA has a list of 10 rules for software development. And this person is taking the sort of like publicly disclosed a list of software rules that I've bumped into before
Starting point is 00:18:37 being an embedded engineer previously in my career that for writing C code, but they've tried to extend it to C++ as well, there's like a set of embedded guidelines for not doing. And this individual is taking a sort of, I'll say a kind of critique of some of them and why maybe they don't make much sense or other stuff. But if you've never seen them before,
Starting point is 00:19:01 I will say it is somewhat interesting to, and it's a lot harder if you use something like Python or Java to, I guess, derive value maybe from it, but if you've ever programmed in C or C++ before or you use Rust, probably applicable as well, or one of the other sort of systems programming languages, go kind of looking through there and seeing like how would you approach if these were your rule sets. Leaving aside, I guess this the blog post is kind of talking about how maybe the rule set could be improved or doesn't make the most sense but I'll say if someone if you showed up on a job and this was the restrictions
Starting point is 00:19:38 because it was a contract that you were trying to honor like how would you accomplish this so things like never using dynamic memory allocation. And then, you know, there's kind of two approaches you end up with. One is, so I can see C++, the standard library generally just uses lots of allocation under the hood. So you end up with, you know, of course you can't use that. So some people do a lot of like static sizing of things
Starting point is 00:20:02 up front and trying to, you know, have all of their things of an unknown size. Other people. Isn't there like a concept in C++ called an arena? Where like you define like a thousand spots? Yeah. Okay. Yeah. So then other people use a memory pool. An arena is like a kind of memory pool where they basically write their own sort of very thin sort of memory management, but it doesn't necessarily suffer from some of the same problems. And so you can use containers that adapt to that.
Starting point is 00:20:29 But if you think in your head, how would you make sure that your code never did some of these things? Never had a loop that couldn't exit, right? So all loops need to have an upper bound. And just code coverage, how would that work? These kinds of things, some of them are, again,
Starting point is 00:20:46 pretty restrictive. Every function has to be smaller than can fit on a single piece of paper with sizing given. And it's like, well, that is probably good practice, but maybe sometimes you want to change it to be one thing or another. But it is worth reading if you've never read an embedded rule set like this before. They're not uncommon.
Starting point is 00:21:09 And you can occasionally, if you work in embedded space, bump into places where this is the style. There has been an explosion in processing power and real time operating systems and just the complexity and abilities of the processors. But I still think there are some places where there are probably many lines of code being written having to follow these guidelines.
Starting point is 00:21:33 Yeah, this is fascinating. I mean, this is a whole new universe for me. But this is really interesting. I mean, this is definitely a person who is kind of like, I think the general critique here is on C. Like this person is like, I draw that you should use another language. Yeah, like use Ada, not C. This is just a valid commentary, but yeah. All right. So my next news story is AMD Radeon RX 9070 XT performance estimates leaked.
Starting point is 00:22:07 Okay, so I want to go do a little rant here. I hate kind of complaining about products. Like I feel like that's maybe like not the best use of the show, but I bought a PC, like a mini PC with a Radeon and I used it for a little while, it was okay. The drivers were really buggy. I had to go into safe mode and some stuff to get it working. But then I got it working.
Starting point is 00:22:35 You know how you can plug USB to DisplayPort or USB to HDMI? You have these cables. I don't actually know how they work under the hood, but there's some magic that allows you to go from a USB-C port to right into your display, right? I think it's like something called like display port pass through or something.
Starting point is 00:22:56 Anyways, I plug one of these cables in, pop, the GPU blows up. Like hardware blow up dead. Yeah. Where did the it come from? It's a cable that I use in my MacBook pro. Like I've used my, I've used this cable for years. Okay. So it's the cables is fine. At least for the Mac book plugged in as many PC and it just popped. And I guess like, I mean, I'm kind of dogpiling here.
Starting point is 00:23:23 I almost feel bad, but like, you know, George Hots has this post on Twitter about this. Like he's trying to build these tiny boxes that use like that run, that run PyTorch and they use CUDA or, or, or Rockham, which is AMD's equivalent, but like the AMD, like, you know, anything above the hardware just sucks.
Starting point is 00:23:44 And it's kind of disappointing because everyone wants there to be an alternative. If nothing else, not only for price, but maybe they could do something interesting. Maybe they could make a card that has like a gigabyte of RAM and is not very fast, but just has a ton of RAM. When there's multiple people in the pond, like there's multiple ideas, right? But I was just super disappointed. I mean, I'm one of like a long line of people now who will just not buy these AMD cards. And I guess maybe just to turn this into a question, like, like how does AMD kind of like recover from this? I kind of feel like of like recover from this, I kind of feel like if I was them, I would hire like some software person, like some person who's really high up in the stack
Starting point is 00:24:32 to like lead like a whole branch of the company to just like go through and just ruggedize everything from a software perspective. So I feel like this is more in your area, because when it popped, I was just shell shocked. So what causes hardware to kind of fail like that and not be tested, and what do you think you would do? Like if you were CEO of AMD, Patrick,
Starting point is 00:25:01 how would you save this? Oh my gosh. You know, as much as it's been in the news with Nvidia GPUs and stuff and the scarcity and the crypto stuff and now the AI stuff, I don't know, I'm not super up on where the profit margins come from, I know AMD of course makes processors as well, and I actually have AMD processor and GPU in the PC I built.
Starting point is 00:25:28 And I feel like I got, you know, the Ryzen processor was like a good value for dollar over the Intel one at the time. Maybe, you know, they used to be two separate companies. Now like they're combined. I don't know how internally their company is structured, where the profit margins are on GPU. Like you said, you may say, oh, there's a large group of people who would really buy a large
Starting point is 00:25:53 amount of memory, not that much processing power. But it's possible that the die costs and the spin up for that, especially when Nvidia could basically pay premium for any foundry costs because they have, you know, far more supply than demand. If you're sitting in second place, I don't know, you may be forced to basically pay more for foundry costs and things in order to be able to like, get your chips made, right? So it may make it difficult.
Starting point is 00:26:25 They may not have as much freedom as they otherwise would want. And I think the software drivers are hard as well because there's so many people you need to appease, right? You need to appease like people like you're saying, like I just wanna plug it in and get a monitor working. But then the video game people are like, I want variable frame rates.
Starting point is 00:26:44 And then you need to appeal to the video game developers who are needing to optimize the outputs on your card and the wrappers that sit on top of it, so OpenGL or DirectX or whatever. There's all this different stuff swirling around, and I actually just feel like this, I don't even, that space just seems so complicated. I get to plug into a variety of motherboards, there's a variety of power situations, a
Starting point is 00:27:09 variety of connectors, a variety of it, like the amount of compatibility you need on a GPU almost on to say like on par, probably more with than almost any other part of the PC is, is just actually bonkers. That needs to be compatible with all manner of software, all manner of OSs, all manner of hardware, internal to the computer, external to the computer. That's a lot. Maybe that stuff's all really robust and ruggedized, but I imagine just a ton of time
Starting point is 00:27:37 to get nailed down exactly right. Yeah, that's a really good call out. I wonder, and also like a lot of these things are very thankless. It's like the guy who makes sure that you could plug the USB-C to display port versus the... Because I had this machine working, display port to display port. So I know that the machine worked, but then as soon as I plugged in this other way, I heard a audio pop and it was done.
Starting point is 00:28:05 So like you have to have a person to like test all these different ways. And then like, unless it breaks, that person is not adding any value. Like they're just reducing risk and you don't know what's risky, what's not risky. So this is one of these things, it's almost like a really high level
Starting point is 00:28:21 of kind of performance management, like value of the company kind of thing that you have to fix. And then on your specific issue, I mean, it could be a faulty card that, you know, hopefully they would want to replace. And then the question is, if you did it again with the same cable with the new card, would it happen again? I don't know.
Starting point is 00:28:39 I mean, if I would, I wouldn't risk it, but like, there's two failure modes there. There's a failure mode of an individual card, and then there's a design flaw that like every card where you did that, it would sort of break. And I don't, from where I am, I just don't know enough about which specific problem it is. But certainly, like you said, it makes people have very bad sentiment.
Starting point is 00:29:01 But even if you look at the, and again, not knocking, but like the Nvidia consumer cards that were having like the plug you plug into the GPU to give it extra power directly from your power supply had so much power going through it that they were like melting, they need to like have it keep pulling on it. Like, again, like you said, this is sort of thankless,
Starting point is 00:29:21 right, the dude or dudette who is trying to like design the interface for some wire, copper cable to come in and deliver power. And all of a sudden they never, anyone ever thought about them in their whole entire lives. And now they're like, you know, front page of social media because these expensive graphics cards are melting. You know, oh man, this is going to be a distraction with the distraction.
Starting point is 00:29:42 But like one thing that's really interesting is like what jobs are of that sort? Where like if you do a good job, nobody cares. Or like if you do a good job, nobody notices. And what are the jobs? Because it feels to me like in every career, there exists like maybe that's, maybe I'm trying to stretch too far here. In many careers, there exists, like maybe that's, maybe I'm trying to stretch too far here.
Starting point is 00:30:05 In many careers, there exists like jobs where you get praised for doing a good job and nothing happens if you do nothing. So like jobs where you're trying to like bring in more business or raise the profit of a company or something. And then there's, there's jobs where like you're trying to keep the lights on. And so people are kind of, it's really hard for people not to ignore you until something goes wrong. It feels like there's these two kinds of jobs. And it feels to me like the former is almost always like, better than the latter in terms of like your satisfaction and a lot of other things. I was having this conversation and I won't give the details because it'll make it sound like a political statement one way or another and not trying to make it.
Starting point is 00:30:54 But anything where you're, and in this case, it was government officials, but anything you're dealing with a probabilistic event. So something may happen, may not happen. Even if it happens, the certainty isn't well known, right? You think like weather forecasting, you know, whatever, like. Yeah. And when you get it right, it's sort of like, no one, you
Starting point is 00:31:17 could just do nothing, right? And it would probably just be fine. And then that one time, you know, it's, it's out of standard deviations, like very high. And then everybody's like, why did you do? And it's like, well, it's true. Probably could have done better or done these other things, but you could have been doing those other things every single time and it either still wouldn't have made a difference or wouldn't have been relevant. And so to your point, anytime you're dealing with something where, you know, like think preparing for a holiday rush on a server that's doing e-commerce, right?
Starting point is 00:31:52 You could spend tons of money like building up, you know, extra CDNs and having flexible compute and then, you know, like people come, they shop, there's no outages, no one knows if you did a great job or a bad job, but it didn't crash. So like there's that, but if it crashes, certainly you're getting hauled in and told how much millions of dollars you lost the website, right?
Starting point is 00:32:15 Yeah, I mean, it's pretty tragic that, you know, things are just set up that way, but I don't know if there's not any clear solution or anything. All right. Well, book of the show. Book of the show. What's your book?
Starting point is 00:32:29 All right. My book, this is going to be a little bit different, but I've not been reading as much as I should or I want to, but I did just start because I have never read any books by this author before and I often see it recommended in science fiction. So I decided I'm going to read a book and tried to find the recommendation and this is the recommendation I got and that is a book by Ian M. Banks and I chose The Player of Games. Have you read any Ian M. Banks books? I have not. Okay, yeah, neither have I. So I am trying to embark on reading this one. Apparently some of the
Starting point is 00:33:04 books can be a little hard to read which is okay. But this one people are saying is a good introduction. There's not, from what I've seen online without trying to read any spoilers, apparently there's not like a strong order you need to read the book since. It's not necessarily the first book. But it is an often recommended one. So I'm starting here. That's not really a good recommendation because I don't know whether to tell you it's good or bad other than every other person I saw on the internet. This seemed to be bubbling to the top
Starting point is 00:33:32 as a good starting place. So I'm embarking on a journey here. Cool, that sounds awesome. Yeah, I might check that out. I have a few books queued up that I have to get to and then I'll check that out. My book of the show is Basic Role-Playing Universal Game Engine which is a reference book. It's
Starting point is 00:33:53 written by a couple of people who have been making tabletop games for their whole adult lives and they've made a bunch of them. I think some of them even from like the 70s and the 80s so it's bunch of them. I think some of them, even from like the 70s and the 80s, so it's kind of wild. I mean, some of the stories of the of the creators, but, but basically they synthesized. So there's this question in general, like think about any art form. You always think about like, what is the essence of this art?
Starting point is 00:34:21 You know, like you might as an artist, like draw a bunch of things and then think to yourself, OK, what is like the essence? Like if I had to reduce something to its like most basic form, what would that be? And so these guys got together and thought, well, if we had to reduce these tabletop games, because there's a lot of like lore built into it, you know, like so many games have magic missile. Why? Because Dungeons and Dragons 1.0 had magic missile, but like what really is a magic missile? I guess it's like an arrow made out of magic, right? Or something, but like, you know, pair these things down to their essence and, and just explain like it literally like the first, the first page of the book is like the point of an RPG is for your players to have fun.
Starting point is 00:35:06 So it's like, you know, it's like, let's start from like the first principle. The first principle is that people should be having fun. And it kind of builds up. And so it's a combination of an instruction guide to making a game engine and a reference manual of like, here's a list of like hundreds of skills from like all these skill based games we've ever seen. I've kind of to be honest been skipping over a lot of the, you know, here's a list like of like a million different types of armor because it's not what I'm interested in. And what I'm interested in is like,
Starting point is 00:35:45 how do people create these game engines and how do they keep them balanced and how do they keep them interesting? And so when I say game engine, just to be clear, it is programming throwdown, but this has nothing to do with programming. It's literally like the math that like, you could use this for like a card game
Starting point is 00:36:04 or a tabletop game or really anything. It's like the math that keeps people kind of on the edge of their seat, right? So it's like how do you have all these different options for your players but still keep them kind of on the edge of their seat? And at the same time, how do you do that in a minimalist fashion where it's not like, okay, I'm now gonna have to roll like 45 dice to like build my character or whatever.
Starting point is 00:36:31 So these people tackle a lot of that. And I'm about, I wanna say maybe about halfway through, as I said, skipping a lot of the pure reference stuff. And it's really interesting. I'm having a really good time reading it. I've never played an in-person tabletop RPG or even, I mean, I probably played a video game that somewhere
Starting point is 00:36:52 under the hood was running some sort of like role checks and chance checks or something. I learned the other day, a spoiler, something I'm gonna talk about in a few minutes, that Pokemon was actually doing that when the Pokeball rattles, it's doing like a, you know, probability check and it can fail at each of the things. And then that's when the Pokemon came out, which I didn't know.
Starting point is 00:37:11 And maybe I'm completely wrong, but that's sort of like what the internet was telling me, which is in line to your point with rolling a dice and getting certain values. But never had the occasion to play one, but I'm endlessly fascinated by, like you said, the sort of crafting of the stories and the storytelling and the fact that it's a less game than, you know, a board game with rigid rules and more about, like you said, having an adventure together, making it fun and entertaining and collaborate, like collaboratively doing something which is somewhat still gaming but is also you are sort of being flexible on the fly as well to you know keep it fun. Yeah exactly yeah exactly
Starting point is 00:37:53 like how do you let each person at the table have a unique character that brings something unique while still like being able to handle a person not being there. It's like, oh, you know, Jim's wife is having a baby, so this chest has to stay locked. It was the one at the keys. He's sleeping in the inn. Yeah, or he's the only one with lock picks or something. Yeah, so the book is really interesting.
Starting point is 00:38:23 I'd recommend folks check it out. If nothing else, it's a nice book to have on your coffee table, because it has kind of a provocative title, Universal Game Engine. All right, well, I spoiled it, but tool of the show for me is a video game, and I, for whatever reason,
Starting point is 00:38:40 skipped every modern Pokemon video game. And so I think the last one I actually legit played was when I got Pokemon Red in my Game Boy as a child and played that to no end using, and I was trying to describe this to my kids, and I had to go to, when we would go shopping, I liked the Walmart or Kmart, and I would look in the strategy guide for why I was stuck.
Starting point is 00:39:08 And so I would go with my mom so I could go to the, you know, video game section and like open the strategy guide and like, look, because I wouldn't just buy it. I probably should have just bought it. But anyways, I didn't go home and, you know, get through anyway. So so Pokemon, right? Anyways, I've been aware I've, you know, dabbled various times, but I hadn't really sat down and played,
Starting point is 00:39:26 but I was sitting down and playing Sword and Shield. I was playing the Shield variant, but not super important on the Switch. And I just hadn't done that in a really long time. And I know it's a pretty big departure for the series, but it was really kind of fun. Like I was really into it. I realized now that the game is easy, like it's not supposed to be challenging
Starting point is 00:39:48 to actually, you know, quote unquote beat the game. So it's not that much of an accomplishment, but not a great time. And if you've ever been interested and you, you know, have a switch or whatever, we definitely recommend checking out one of the newer ones, sword and shield. I guess the other one I'm going to try now is Scarlet. Scarlet and I think violet it is. Um, but. Pearl or something. Uh, yeah.
Starting point is 00:40:09 Okay. I did. Or that's a different, I think Pearl. I think that's a remake. I think that's a remake. Oh, okay. They did like a remake. Yeah.
Starting point is 00:40:16 Like, uh, so anyways, if you hadn't checked one of those out, they definitely went and some of them, like a little bit more with open world sections and you can kind of control how often you get into a battle versus you know just wandering around in a set of grass until it happens. So definitely some quality of life improvements over the old ones that make it less frustrating and you know ability to save sort of everywhere you want. If you never checked one out I guess this is me telling you the obvious thing of like, it's a thing and it's, it's kind of fun. Yeah. I played it with my kids and, uh, it's been maybe a year or two and, and, you
Starting point is 00:40:50 know, they would, uh, get frustrated. And so I'd help them like, kind of optimize their characters a little bit. Um, but more or less they could get through it eventually. And yet the other thing is it's all the bosses and everything. As far as I know, they have static levels. So, you know if if you're kind of like like my kids are just running around kind of aimlessly for a while so their pokemon were like super over leveled and that made the game even easier than if you're
Starting point is 00:41:16 trying to like speed run it but yeah that game is awesome I think uh I think the open world added a lot actually, like being able to really like see the enemies and they run into you physically and then the fight starts. Like that really added a lot, I think. Yeah, and I think that it goes crazy deep though. Once you like look on the internet, there's all this like each Pokemon you catch has different stats and there's a Rotoma.
Starting point is 00:41:43 I never paid attention to it other than like it has a type and certain moves or whatever. Very basic level strategy and that was fine to get through the game. But when you look online you find out, oh yeah the competitive stuff and people playing online and whatever that each Pokemon you catch has like different base stats that have been rolled for that character. I mean it's not actual dice but probabilistically generated and so some are better than, even if they're the same level. All right. So Patrick, do you want me to waste hours and hours of your life? Is it going to be fun?
Starting point is 00:42:14 It's going to be fun. Okay. So, uh, later on go on YouTube. Okay. There's this guy who really understands the Pokemon mechanics and purposely. So there's a, there's a Pokemon, I think it's a web based game. It's probably not legal. It's probably already shut down or something, but there are maybe it's
Starting point is 00:42:33 sanctioned out at a, but there's this web based game where you can just do Pokemon battles with other people. And there's a, the ELO like for chess and everything, you rise up the ranks. And so it's, it's literally just a battling part of Pokemon. And so, um, this guy who really understands the mechanics, he makes builds that are very unintuitively strong and he plays people who, um, uh, and he must play a lot of people, but inevitably he ends up playing someone who, who starts off like making fun of him and like, Oh, like you just have one Pokemon.
Starting point is 00:43:08 Like why didn't you build the other four Pokemon? Haha. You're so trash or whatever. And then he wrecks them and they start raging and they start like, and then they won't make their final moves and he's like, Hey, your time's running out. And they just get so pissed and everything. It's like the people who get the scammers upset or whatever. It's that, but for gaming trolls and it is hilarious.
Starting point is 00:43:32 Oh dear, okay. Now you have to send this to me, but I feel like I'm gonna not like you for doing it. Yeah, I mean, I don't know how many videos he has. I'm pretty sure I've watched like five or six of them. They're really, really funny. I'm pretty sure I've watched like five or six of them. They're really, really funny. Okay.
Starting point is 00:43:46 All right. So, oh, my tool of the show is features and labels or fowl.ai. There's a bunch of alternatives. There's together.ai, there's fireworks.ai, there's a bunch of them. But basically these are people who are kind of a middleman between you and the AI models So they'll host the open source ones
Starting point is 00:44:09 They'll often have agreements with the closed source ones so you can run like the Google image and you know, it's you Otherwise you'd have to use some proprietary Google API or whatever. So you think of these like a middle layer and they often charge you you know, per thing that you do versus like you having to rent a machine for an hour, right? The thing that, that, so the reason I picked FAL is I actually know the founders. So I, I'll just put it right out there and say, I don't know if FAL is any better than any of the other ones, but the user interface is really nice. They have like a playground mode where you can just build things on the web. And then you can click on the API button and get the Python code if you wanted to
Starting point is 00:44:53 make that programmatic. The other thing they did, which I thought was really clever UX, you know, as engineers, especially as people who have GPUs or maybe an M2 MacBook or something, we think to ourselves like, yeah, I mean, I should just run Flux myself. Like Patrick's run Flux, I've run Flux, right? But when you go to Fowl, they're like, yeah, so you can run this model like 87 times for a dollar. Like basically for every model, it tells you how many times you can run it for a dollar and that to me is like really powerful because like often like I have some code
Starting point is 00:45:30 that I have right now and I have a local version of Flux and then I have the foul version of Flux and you can just like toggle between one and the other and and you know like I'll want to run sometimes I'll think I'll run the local version because I'm going to work like I'll want to run, sometimes I'll think, oh, I'll run the local version because I'm going to work and I'll just let it run and it'll be done when I get back. But then I'm like, yeah, it'll be done when I get back or I could spend like $7 and this is just like done in like a second.
Starting point is 00:45:58 So it's like, they did a good job of kind of like really laying out the economics, which are themselves startling, you know, how the economics have changed for AI. But they just put it right out there. And they recently had something, they had something where they they're able to do some caching of the, I think, caching of the tiles. The way the image transformers work is, you know,
Starting point is 00:46:25 breaks your image up into tiles. And I think they're caching tiles that are very similar or something. I don't know. It's something I don't remember off top of my head, but it made the price even cheaper. So I guess long story short, check out these folks. They're all awesome. I know the fireworks people too. All these services are great I know the fireworks people too, that all these services are great and there's an economy of scale that you can really take advantage of. Yeah, I mean, I think all of it from like, someone was asking me with 3D printing, how much would it cost you to print this? You know, I saw it in a shop or something and I was like, well, most obvious thing is how much plastic it takes. But then you start thinking about it. There's depreciation of your machine, like wear and tear on your machine.
Starting point is 00:47:08 There's like the power to run it. There's my time to like walk out. Mine's in the garage. Like walk out to the garage and like get it off or clean the, you know, bill plate. And so I think what you're saying is interesting too, that running it locally to me is, I guess I just cheat by, I don't want to give places my credit card. I don't know.
Starting point is 00:47:28 And I have stuff and so I feel like I should use the stuff I have. But you're right. Like by the time you factor in the power to run it, like it's not free and you know, your computer getting hot and the time taken. And so the economies of scale, these really big, you know, server clusters dedicated to this AI stuff, it's really kind of amazing even with how expensive those really high end GPUs are. Yeah, yeah, totally.
Starting point is 00:47:54 I think running it yourself is great. Everyone should learn how to do it. Definitely not discounting that, but check out these folks and there's similar folks too. It's a really neat service. If you have something that you then say, oh, I need to run this like 200 more times. Um, you know, your time is also really important. So, um, all right, on to our topic reinforcement learning.
Starting point is 00:48:18 Um, so a bit of background here is like the opposite of like assembly language show where in this case like this is my background, my area that I know a lot about. I'll kind of dive into it and then Patrick is going to play the role of you folks and stop me anytime I say something that is a buzzword in my community or doesn't make sense or something. So I'll start really broad. There's basically three types of AI. There's supervised learning, there's unsupervised learning, and there's reinforcement learning. So supervised learning is where you have the right answer is right there. So, so someone gives you a picture and they draw a box around the stop sign. They're like, there's the stop sign.
Starting point is 00:49:12 And your job is to learn a function that maps the picture to the bounding box of the stop sign. And you're given a lot of these as ground truth, right? the stop sign and you're given a lot of these as ground truth, right? And then you're also given a second set that you're purposely not meant to train on called the holdout set. And if your training did really well, then you're able to interpolate between all the other stop signs that could exist in the universe. And so after training, if I was to give you a new image you've never seen before with a stop sign in it, you could draw the box around the stop
Starting point is 00:49:49 sign. That's supervised learning. And under the hood that works through what we call a loss function. So a loss function takes the output of your model. So in this case, maybe it's a bunch of hypothetical bounding boxes. It takes the ground truth, which is the actual bounding box, and it turns all of that into a number where the further the number is away from zero, the worse you got, the worse you did. And so a zero loss would be perfectly nailed that bounding box. Now a zero loss might not be good, right? Because you want the model to have some uncertainty. Like for example, imagine we're playing paper
Starting point is 00:50:34 rock scissors, right? And you play paper and I play scissors with like 100% certainty. That's actually not good, right? Because although I won in this game, you know, we know that someone who plays scissors a hundred percent of the time is not playing an optimal paper rock scissors, right? You could just, you could learn that and then just play, uh, wait, did I get it wrong? Anyways, you know, the analogy, you could just play whatever counters what I just said, and then just win, right?
Starting point is 00:51:06 So, so often you're going to output a mixed answer, right? A distribution of answers. And so you're always going to have some amount of loss. Um, but through, through what's called the learning rate, you don't like totally change your line of thinking every time an example is presented, right? You're just slowly moving in different directions as examples are presented and if the learning rate is low enough and all the hundred other things kind of stars align then you will create like a mixed, you
Starting point is 00:51:39 know, a mixed response that is optimal. So that's supervised learning. Did I get that right? Any questions about that part of it? Or did that make sense? So you said, the only question I had is you were saying, so this makes sense that you have the thing that you're trying to match and then you're tested on something else, but you said interpolate between the results, but it should be possible even with supervised learning, not strictly like interpolation to me means like between the points given,
Starting point is 00:52:09 but you should even like in your stop sign example, like you were mentioning for stop signs that are new, the idea is hopefully you would also understand that those should be heavy bounding boxes put around them. Yeah, right. So what you're hoping is that you could imagine like a manifold, like a stop sign space. And in that space, there's like a whole bunch of different kinds of stop signs.
Starting point is 00:52:33 And so inside of that space, there's stop signs that like look really different. But hopefully they're like, they're within the space of stop signs that you've already seen. So like an example where extrapolation doesn't happen is so we've seen this with Waymo where people will wear a t-shirt with a stop sign on it and that's not it so that's an example of extrapolation. And in that case, the model doesn't really know what to do. So it, it, it thinks it's a stop sign. So, so really, so there's a whole area around what's called out of bounds, um, detection and out of bounds prediction. And long story short,
Starting point is 00:53:20 that's a very, very hard topic, but really important. But by default, supervised learning will interpolate at a really high dimensional space, right? Interpolate between all these things it's seen, but if you give it something totally new, it's gonna have trouble. Got it. Okay, so unsupervised learning is where you don't have a loss.
Starting point is 00:53:47 Like there's not a ground truth, but what you do have is something that's kind of stateless and easy to evaluate. So like the most common example is clustering. So there often isn't a known perfect clustering. Like you might have millions of documents and you wanna break them into a thousand clusters, each one having 3000 documents. And you want the entropy of each of those clusters to be really small.
Starting point is 00:54:16 So you want all of them to be like really close together. So you might never know the perfect clustering, like you do in supervised learning, but it's like trivial to evaluate. So I can show you a set of clusters, you could put the documents in the clusters and come back with a score. This clustering has a score of seven,
Starting point is 00:54:41 then I can make some changes and say, oh, this clustering has a score of eight, it's a little better. Then make some more changes, the clustering has a score of nine, et cetera, et cetera. And so that's an example of unsupervised learning. And so you're not even really trying to figure out the best way to cluster.
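A toy version of that evaluate-and-improve loop in Python, with made-up one-dimensional "documents". The score here is total within-cluster spread, so lower is better (the flip side of the seven-eight-nine framing above), but the point is the same: there's no ground-truth clustering, yet any proposed clustering is easy to score.

```python
import numpy as np

def clustering_score(points, assignments, k):
    """Total within-cluster spread (lower is better). No ground truth needed."""
    total = 0.0
    for c in range(k):
        members = points[assignments == c]
        if len(members) > 0:
            total += np.sum((members - members.mean()) ** 2)
    return total

rng = np.random.default_rng(0)
# Two made-up blobs of "documents" on a number line.
points = np.concatenate([rng.normal(0, 1, 50), rng.normal(10, 1, 50)])

k = 2
assignments = rng.integers(0, k, size=len(points))  # start from a random clustering
best = clustering_score(points, assignments, k)

# Evaluate-and-improve: propose moving one point at a time, keep changes that help.
for _ in range(2000):
    i = rng.integers(len(points))
    proposal = assignments.copy()
    proposal[i] = rng.integers(0, k)
    score = clustering_score(points, proposal, k)
    if score < best:
        assignments, best = proposal, score

print("final within-cluster spread:", round(best, 1))
```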
Starting point is 00:54:59 Like a human is doing that, but the computer is just kind of following the instructions and then over time getting a better and better clustering. And you can measure that. And so you might never get to the optimal, but you can get closer. So unsupervised learning is a little bit trickier in a sense that you don't have a ground truth. Now reinforcement learning is in my opinion, the hardest of these areas,
Starting point is 00:55:28 not in terms of you have to be the smartest to do it or anything, but the hardest in terms of getting good results. And this is because you have all the challenges of unsupervised learning, where you don't have a perfect game of Go or a perfect game of chess to reference, but you also are making decisions. In the unsupervised learning case, you're not really making any decisions.
Starting point is 00:55:58 There's a human making decisions or a human written algorithm making decisions and you're just evaluating them. But here you have to make the decision. So it's like, here's a set of clusters. How do I make them better? And then you do that and then did they actually get better? So if you were actually on the fly designing your own clustering algorithm with AI, then that's reinforcement learning.
Starting point is 00:56:25 But the stuff that we talk about when we say at a high level like, oh, flux or this or whatever, it may be using components that were trained with a variety of these techniques or use a variety of these techniques, right? So it's not necessarily that a whole, I don't know what the distinction there is, like a whole program application is one of these. You're sort of talking like a little lower level. You're saying like one part of that pipeline was done this way. Kind of. So, uh, so in the case of flux, that's all supervised learning.
Starting point is 00:56:59 In the case of flux, it's what's called self-supervised learning, where you hide part of an image and you ask the AI to draw it. And then because you hid it, you know what it used to be. And so you show that to the AI and say, hey, you know, this pixel actually should be red, but you drew purple or something. And so, yeah, so that's pure supervised learning. So another way of saying it is reinforcement learning like does stuff, like takes actions
Starting point is 00:57:36 and supervised learning and unsupervised learning kind of reveal knowledge. So in the case of the stop sign, you know, drawing the bounding box around the stop sign kind of reveals or synthesizes knowledge. Like now you went from pixels to there's a stop sign there, but it doesn't drive a car or turn a camera or take any action. So as soon as you want to take an action, either the humans have to write that code, or, if you want AI to take that action, now you're doing reinforcement learning.
Starting point is 00:58:13 So there's a bunch of different kinds of reinforcement learning algorithms, but there's basically two axes that you need to think about. One is offline versus online. And this is just a fancy way of saying, can I make mistakes? So for example, the AI that plays Go, AlphaGo, in the beginning of training, um, let's just stick with AlphaGo Zero, it's all pure reinforcement learning. So in the beginning of training, it's just playing garbage games of Go. And that's fine because it's playing against itself and you can't embarrass the computer. So it just plays garbage games
Starting point is 00:58:59 of Go and it gets better and better. But like you couldn't, for example, like drive a self-driving car randomly until it got better. Like, you know, you can't do that. You'd crash the car, people would die, it'd be a total mess, right? So offline reinforcement learning is where whenever you make decisions in the real world, they have to come with some kind of guarantee. In the case of online reinforcement learning, you can just make decisions in the real world whenever you want at any point. And so that's like a subtle difference, but it has like pretty big consequences for algorithms and everything else. So online it's able to change itself and like update
Starting point is 00:59:53 and then offline it's sort of like you're wanting to make guarantees, you wanna know like, I understand what it's gonna do, I've tested it in some way and I don't want it sort of like changing what it's doing. Right, right. So online you're willing to put any model in production. So yeah, I think sometimes they call it on policy versus off policy, but the nomenclature there doesn't matter as much. Those are the two kinds. Um, okay. And so then there's a second axis, or second kind of switch here, which is value-based or policy-based. So I'll go into this. So, um, let's say you have to make some decisions, right?
Starting point is 01:00:33 So I'll go into this. So, um, let's say you have to make some decisions, right? And when you go to make a decision, like imagine a choose your own adventure book. And whenever you go to make a choice, I was to tell you, like, if you make this choice, you have like this percent chance of reaching the best ending. And if you make this choice, you have this other percent chance. Like you would just choose the highest percent. Right. And you would just do that. It would be like solving a maze with no walls, right? You would just, you just like pick the highest percent every time until it's a
Starting point is 01:01:11 hundred percent and then you would win. Right. And so the idea with value-based reinforcement learning is if I know the total value of a decision and I know that for all my choices, then I've solved the problem. I just picked the one with the highest value and that's just the optimal policy. And so value-based kind of ignores the whole decision part of it somewhat and says the game here really is figuring out the expected value because once I have that I'm set. Now here's where it gets tough, right? Is let's say AlphaGo places a stone somewhere on the go board to start the game.
Starting point is 01:01:58 And it's playing itself or some other world champion or something, right? It places that stone and its value is about 0.5. It has like a 50-50 chance of winning the game when it just started, right? Well, let's say I, Jason, go and play the world champion of Go. And I put the same stone in the same position, just coincidentally, for my first move.
Starting point is 01:02:22 I have a 0% chance of winning, right? Because I'm not even close to a world champion. I'm going to get wrecked. Right. So you have this paradox where like the value is based on a policy, but if the policy is based on the value, you can see how this is like cyclical reasoning, right? And so getting the value is actually really, really hard for this reason. And there's
Starting point is 01:02:51 several algorithms. The simplest one is called SARSA, which basically says you make a bunch of moves, the game ends or the episode ends, you stop driving the car or whatever it is, then you just go back and you know what happened. So you assign the expected value. So, you know, I turn the steering wheel here, I turn the steering wheel there, I hit the brakes, I hit the gas, and then I made it home safe. Therefore, all those actions are plus one, right? You know, turn the steering wheel, I hit the gas, I crash into a wall. Therefore all those actions are minus one. And then, uh, you feed that into your neural network.
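A stripped-down sketch of that end-of-episode bookkeeping, with hypothetical driving episodes in plain Python. (This is the simple assign-the-final-outcome scheme described here, which is closer to a Monte Carlo return than textbook SARSA, but the spirit matches.)

```python
# Hypothetical logged episodes: each is a list of (state, action) pairs plus an
# outcome of +1 (made it home safe) or -1 (crashed).
episodes = [
    {"steps": [("s0", "steer_left"), ("s1", "gas"), ("s2", "brake")], "outcome": +1},
    {"steps": [("s0", "steer_right"), ("s1", "gas")], "outcome": -1},
]

# Every action in an episode inherits the episode's final outcome as its value target.
value_targets = []
for episode in episodes:
    for state, action in episode["steps"]:
        value_targets.append((state, action, episode["outcome"]))

for row in value_targets:
    print(row)
# These (state, action, target) triples are what you'd feed to the neural network
# as an ordinary supervised regression on the value.
```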
Starting point is 01:03:33 You do, uh, your training and, um, that's pretty simple. You know, that completely ignores the thing we just talked about. There's other algorithms like Q-learning and stuff that try to address some of these challenges. It's really difficult. It doesn't mean value optimization doesn't have its place, but you know, ignoring the policy makes it really difficult to optimize. It's good in situations where, like,
Starting point is 01:04:05 there's clearly one good action at any given time, you just don't know what it is. Like Atari is a great example where, like the actions are binary, there's usually like one good action. Like if you're playing Mario or something and you're about to run into a Goomba, you either jump or die.
Starting point is 01:04:24 And so you jump, right? But as soon as you get into environments where you need a mixed response, um, like poker or driving a car or really doing anything in the real world, it becomes difficult. Um, any questions about value optimization? Or did that make sense? So I guess it makes sense. So I think what you're saying is, in these cases, you're trying to, like you were saying,
Starting point is 01:04:52 like understand the outcome of a game. So you're playing a game, you don't know what's going to happen. So you don't know if it's a good or bad move until sort of like it's too late. But by observing many, many, many games to their conclusion, you're hoping that when you go to do it for real, and I guess that's the offline part, you've built up an estimate of, given this context, what is the likelihood that each decision is good or bad? Yeah, right. Right.
Starting point is 01:05:20 And as your values improve, your policy improves, which means your values now are all inaccurate. And so you're just like kind of iterating on this over and over again. Yeah, but you're totally, you totally nailed it. Okay, so the other type of algorithm is policy optimization. And in this case, you say, at least in the most naive example, you say, I don't even really care how good this action is. Like I don't need to know what's my expected value of taking this action or anything. All I want to know is I want to take actions that are good and I don't want to take actions
Starting point is 01:06:02 that are bad. It's like I cooked some eggs, they're delicious.'t want to take actions that are bad. It's like, I cooked some eggs. They're delicious. I want to do more of that. I touched the stove with my hand, not delicious. Don't want to do that anymore. Right? Very simple.
Starting point is 01:06:14 So, so policy gradient is basically, and I'm going to try to do a lot of hand waving here, but basically it says, get the expected value of this action. So you do have to kind of like play out a whole series of events. But then when you go back and you look at what happened, take the things that were good, that had positive value and do more of them and do them in proportional to how positive the value was, take the things that had negative value, do less of them and do it in proportion. So things that are really negative value really do it a lot less, right? It's a very simple concept. One challenge right off the bat that you can see is
Starting point is 01:07:07 your expected value has to be centered at zero. Like in other words, if all you can do is get points, but you can never have a negative score, then your system is gonna say, do everything infinite amount of time, and it's just not gonna be able to learn. So you need what's called a baseline such that you hope that roughly half the time you're getting a positive score and half the time you're
Starting point is 01:07:32 getting a negative score. Now for something like Go, it's trivial because you play a game and you either win or lose and so unless you're playing like someone way out of your league in either direction, you're hopefully going to win and lose about half the time. You get a negative one for losing, positive one for winning, you're all set. So Go makes this very easy. But in the real world, it doesn't work that way. And so it actually like figuring out how you can get half of the expected values to be positive is really hard. And so you actually use a second neural network just to figure that out.
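Here's a rough sketch of that policy-gradient update with a crude running-average baseline, on a made-up two-action problem in plain Python. (The learned critic described next is what replaces this crude average in a real actor-critic setup.)

```python
import numpy as np

rng = np.random.default_rng(0)
logits = np.zeros(2)      # preferences over two actions
learning_rate = 0.1
baseline = 0.0            # running average reward, so updates can go negative too

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def play(action):
    """Made-up environment: action 1 pays off more often than action 0."""
    return 1.0 if rng.random() < (0.3 if action == 0 else 0.7) else 0.0

for step in range(2000):
    probs = softmax(logits)
    action = rng.choice(2, p=probs)
    reward = play(action)
    baseline += 0.01 * (reward - baseline)
    advantage = reward - baseline       # positive -> do more of it, negative -> less
    grad = -probs                       # gradient of the log prob of the taken action...
    grad[action] += 1.0                 # ...with respect to the logits
    logits += learning_rate * advantage * grad

print(softmax(logits))  # should lean heavily toward the better action (action 1)
```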
Starting point is 01:08:09 And that's called the critic. And so the, the, if you hear the term actor critic, the actor is just a policy gradient that I talked about earlier. And the critic, all it's doing is it's trying to figure out the baseline. It's trying to figure out the average move, what would be the expected value of that, so you can subtract that out and hopefully get a balance between positive and negative. And so this is the neural network you would use on a Go board to tell you like how good your situation is?
Starting point is 01:08:48 Yeah, exactly. So if you were using Go for an example, let's say you're trying to solve Go with a policy gradient. So you would say, I took a bunch of actions, I won. Now I'm going to look at this one action. I got an expected value of one because I won the game. Now what was the average expected value? Actually, sorry, what was the advantage is what I need to know. So I need to know was this one like for example like is this am I playing someone who's like a total chump and so like even though I won I can't really
Starting point is 01:09:32 learn anything right or did I play someone who's like a grandmaster and actually learned a ton by winning. That's your sort of advantage function and so you're gonna take the value function of the current state. So at this current board state, what's the probability I win? And then you're going to take the action you took and see what's the value at that state. So if the probability of winning the game is 50%, but after I took my action it jumped up to 60% then I know that taking that action like caused an extra 10% and so that's my advantage and so
Starting point is 01:10:12 you know if you're playing someone of your level, half of your actions are going to cause your win probability to go down and half of them are going to cause your win probability to go up. Got it. Yeah, so as I said, for Go you don't need it as long as you're doing self play, but for, you know, something like Atari where there is never a negative score, you need to know what was a bad action. And so a bad action is one where your expected score went down after you took the action. It's like, oh, I took the action to run into the Goomba and now my expected score
Starting point is 01:10:54 is a lot lower because I have one less life to go and collect points with. So is there a conversion between lives and points then, or it's just simply that because you lost a life, your maximum point that you can get is reduced? It's the latter. Yeah, so all of that has to be inferred. So now you could make it explicit. So you could say, and this is something that we should talk about, it's called reward shaping. So let's say in Mario, your goal is to get the most points. But that's kind of a really weird goal, right? Because often when we play as humans, we don't even look at the score, right? So you might come up with proxy goals. You might say, well, every time I eat a, well, every time you eat a mushroom,
Starting point is 01:11:36 you get points actually. But let's pretend you didn't. It's like, every time I eat a mushroom, I'm going to give myself extra points. Maybe you don't get enough points for eating a mushroom, and if you gave yourself more points for eating a mushroom, then the AI learns easier. Um, and to your point, like you don't lose points when you die, but maybe you should. Like maybe if you lost 10,000 points every time you died, the AI would learn a lot easier. And so this is called reward shaping. It's where you, instead of learning the task at hand, you learn a new task because those two tasks kind of go up and to the right at the same time. They're correlated and the new task is just easier to learn.
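A toy sketch of what reward shaping looks like in code, with made-up bonus and penalty numbers for a Mario-like environment; picking those numbers well is the hard part being described here.

```python
def raw_reward(event):
    """Unshaped signal: only the game's own score changes count."""
    return event.get("score_delta", 0)

def shaped_reward(event):
    """Shaped signal: correlated with the real goal, but with extra hints.

    All the bonus and penalty sizes here are hypothetical.
    """
    reward = event.get("score_delta", 0)
    if event.get("ate_mushroom"):
        reward += 500          # nudge toward power-ups
    if event.get("died"):
        reward -= 10_000       # strongly discourage dying even if the score doesn't drop
    reward += event.get("distance_right", 0) * 0.1  # nudge toward making progress
    return reward

event = {"score_delta": 100, "ate_mushroom": True, "died": False, "distance_right": 30}
print(raw_reward(event), shaped_reward(event))  # 100 vs 603.0
```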
Starting point is 01:12:24 Yeah. So I mean, I don't know the scoring of Mario either, I never paid attention. I guess like you could get weird states otherwise, where you try to get a bunch of one-up mushrooms and then, on a particularly high-value level, like basically keep dying almost at the end repeatedly in order to like keep gaining points for that level, assuming you don't lose points when you die. And so you could get this very undesired behavior, because it realizes the way to get the maximum score is to keep dying and replay the level and gain those points. When the actual thing you wanted was kind of to get through the levels as fast as possible.
Starting point is 01:13:04 And you didn't really care about the score and you used it as a, but it was like a bad proxy for what you wanted. Yeah, exactly. Exactly. And then on the flip side, like you might say, well, my goal is to get as far to the right as I can. Right. But if you don't take score into account, it might just be really hard for the AI to do that. Like the AI might just like desperately do like kamikaze jumps to the right where it gets killed because it didn't, or the AI might just not be incentivized to get mushrooms and make it more survivable, that kind of stuff. So yeah, reward shaping is a really big part of the problem.
Starting point is 01:13:54 And, um, and so when people go from manually designing systems, like if this ad has this chance of getting liked, then put a highlight around it or something. When people build these things by hand, they have to deal with all these conflicting metrics, and how do you reconcile, oh, we showed more ads, but there was more kind of racy, unethical kind of photos, and how do I deal with that? And so reinforcement learning moves that problem
Starting point is 01:14:22 to the reward shaping phase, but it doesn't really get rid of it. You're always going to need to like better and better understand kind of like your goals and the nature of the problem and how to kind of how to best like solve that problem. Okay, so, okay. So let's dive into the offline part.
Starting point is 01:14:48 So, you know, one thing a lot of people wonder is, yeah, like AlphaGo plays against itself. And so at the end, after you've used like a zillion GPU hours, it's like a world champion, but like, how do we do that in the real world? Like clearly like babies don't just like run into walls or like fall down. Actually they do kind of fall downstairs if you let them,
Starting point is 01:15:10 but that's a bad example. But like, but like, you know, as humans, like the way we drive a car is, you know, we have a person helping us, but we're not just like randomly jerking the steering wheel until we figure it out. Like we have the sort of like base of common sense, like we kind of draw from.
Starting point is 01:15:29 Like a model of how the car should work. Exactly, exactly. And we kind of like kind of project into using our mind, we kind of simulate the driving experience as best we can from watching other people drive, watching our parents drive. And we've built a simulation of that on day one. And so there's a question of like, how do we do that with reinforcement
Starting point is 01:15:53 learning? And a big part of that is using what's called a trust region. And so a trust region is basically, it works like this. So let's say I play a bunch of games of Go. And I'm a decent player. I play a bunch of games of Go. And now I go back and I watch all of my games, right? This is just me as a human. I watch all of my games and I look to myself and I say, oh, I would have done maybe this move differently. I would have done that move a little differently, but I'm not going to say like I would have done every move differently. Like as a person,
Starting point is 01:16:32 like we can't, that would put us into a really weird state where like we wouldn't really know what to do. Right. So we would pick like a few key things that we would do differently and then we would wait until it's the next tournament, exercise those differences, and then we'd repeat this process. And so, you know, with reinforcement learning, if you take a bunch of data and have computers try to do policy optimization,
Starting point is 01:17:02 they'll just hallucinate, just like we see with ChatGPT and these other things. Like they'll start hallucinating, like, oh, if I play this move I'm gonna get every single Go piece on the board, because there's like some inaccuracy in the model. And the other thing is, it only needs one action to be inaccurate on the positive side to throw everything off, right? All your values are now thrown off, everything, right? So it's inherently kind of unstable.
Starting point is 01:17:30 And so what trust region policy optimization and proximal policy optimization, what these things do is they basically say, we're going to keep track of the actions that were taken in the real world and what the model is doing. And if the model doesn't match the real world enough times, we're going to stop training. So, you know, in the beginning of training, the model is going to match the real world perfectly because it's the same model, right? Like you rolled this model out in the real world, collected a bunch of data,
Starting point is 01:18:03 and at the very first mini batch of training, the model hasn't changed. And so it's going to output the same distribution, right? Over time, the distributions are going to start diverging. And because, you know, because you own the model, you can actually keep track of the entire distribution, right? So even though you actually press the gas, you know that the model was like 50, 50 about pressing the gas or not. And that's what you're going to log, right?
Starting point is 01:18:32 So now you're training and you say, Oh, well, the model that drove the car, press the gas 50% of the time. But the new model wants to press the gas a hundred percent of the time. That's a pretty big difference. And so maybe this would be a good time to like stop training and go drive with the new model. And so that's, you know, there's a lot more math than that. That's effectively what's going on is these are like halting, halting criteria.
Starting point is 01:18:59 So you might not be able to train that much before you have to stop and go to the real world. Got it. So it's basically like you're too far away from what we know. So you need to go try again. You've changed a bunch of stuff a little, but your outputs are now very different. So we need to go try again and see if it actually got better. Yeah, exactly.
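A rough sketch of the bookkeeping behind that idea, with hypothetical logged probabilities. Real TRPO/PPO fold this new-versus-old probability ratio into a constrained or clipped objective rather than a bare stop rule, so treat this only as an illustration of the trust-region check.

```python
import numpy as np

# Logged from the model that actually drove: the probability the old policy gave
# to each action it took (made-up numbers).
old_action_probs = np.array([0.50, 0.62, 0.48, 0.55])

def new_policy_probs(step):
    """Stand-in for the policy being trained; it drifts away from the old one."""
    return np.clip(old_action_probs + 0.05 * step, 0.01, 0.99)

MAX_RATIO = 1.2  # how far the new policy may drift before we stop and collect fresh data

for step in range(10):
    ratios = new_policy_probs(step) / old_action_probs
    if np.any(ratios > MAX_RATIO) or np.any(ratios < 1.0 / MAX_RATIO):
        print(f"step {step}: new policy drifted too far from the logged one; "
              "stop training and drive with the new model to get fresh data")
        break
    # ...otherwise keep doing gradient updates on this batch of logged data...
    print(f"step {step}: max ratio {ratios.max():.2f}, keep training")
```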
Starting point is 01:19:24 Um, and then the last thing that I'll kind of cover here, um, but, uh, there's a couple of other things. So one is imitation learning. And so this is pretty simple. The idea is, um, you know, we talked about supervised learning, right? And the stop signs, right? But I can do the same thing with decisions I could say hey when you see this situation press the brake right and I'm just saying as an expert like it's a ground truth like like unambiguous you know press the brake here press
Starting point is 01:20:00 the gas pedal here. That's called imitation learning, and that's just supervised learning. So you know you could imitate a person, and then all the regular things apply there of interpolation and everything we talked about. But that's not really reinforcement learning. And that's the AlphaGo, not Zero, where it was trained on all the human games and they basically said, we want you to imitate the person who won. Right. Exactly. And so AlphaGo did that as like a bootstrapping phase and then did reinforcement learning after that.
Starting point is 01:20:37 AlphaGo Zero is where they got rid of the bootstrapping phase. So imitation learning is a good way to bootstrap, and that's kind of what we do. Another thing I wanna cover is model-based reinforcement learning. Where basically, in the case of AlphaGo, it can play itself because it's just a game, right?
Starting point is 01:21:03 Like it's an artificial environment. But if you want to do, for example, a self-driving car, as we talked about, you can't just drive randomly while you learn what to do. And so you have to construct a model. You have to construct like a virtual environment and then play within that virtual environment. You know, the challenge now is, of course,
Starting point is 01:21:28 what happens when the virtual environment doesn't match the real environment? And so there's different ways to deal with that. There's something called a joint embedding. But long story short, with model-based reinforcement learning, you have this SIM to real problem. So if you look up SIM to real, you'll find like a zillion papers on it, but it's
Starting point is 01:21:50 like, how do you take something that was trained on a simulator and bring it to the real world and back and forth and back and forth? Um, okay. Yeah. The last thing to cover is policy evaluation. So, you know, with supervised learning, you have the truth. So it's like, oh, I didn't draw the bounding box around the stop sign. That's bad.
Starting point is 01:22:17 I drew the bounding box around stop sign. That's good. And you can, there's a million different ways you want to count those errors, but you can count them those ways and just output that, right? In the case of reinforcement learning, you don't really have like a perfect game of Go or anything like that.
Starting point is 01:22:37 So what you have to do is, several different things you can do. You can either use a simulator and say, oh, in the simulator, my model got better. That's what AlphaGo does, right? But in case of, let's say, self-driving, where maybe you don't want to trust the simulator, there's another thing you can do where you run two models and the first model actually controls the car. And the second model just says, well, we call counterfactuals, which is just a fancy word for I would have done this.
Starting point is 01:23:15 Right. So it's just like, it's literally a backseat driver, literally. So you then take those counterfactuals and what actually happened, and you can figure out if the new model is better than the old one. So for example, let's say the old model doesn't hit the brakes and the new model really, really wants to hit the brakes. And then like half a second later, the old model slams on the brakes. Well, that probably could have been avoided by the new model, because the new model was braking earlier, right? It was like more predictive, right? And so that would be a sign that the new model is a step up.
Starting point is 01:24:00 Similarly, if the old model slams on the brakes to avoid a collision and a new model would have hit the gas, that's a bad sign. That means that new model probably would have got you in a bad position. So again, a lot of math behind that, but effectively that's the intuition behind policy evaluation. And one last thing about that is policy evaluation is harder than solving the problem. Because if you have a perfect policy evaluator, then you also have a perfect policy. You just take the action that the evaluator gives the highest score to. So because of that reason, like versus like in supervised
Starting point is 01:24:43 learning, you could just measure accuracy, just take the times you were right and divide it by the total number of times. Very easy, just algebra. But here it's like, not only is it hard to evaluate the policy, but it's actually harder than solving the problem. And in many cases, it's impossible. So that's a big challenge. And that continues to be a challenge today.
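One standard flavor of offline policy evaluation, inverse propensity scoring (the "propensity scoring" idea), can be sketched like this with made-up logged data; the backseat-driver counterfactual comparison described above is a related, more model-based way to get at the same question.

```python
import numpy as np

# Each logged decision: the probability the OLD policy gave the action it took,
# the probability the NEW policy would give that same action, and the observed
# reward. All numbers are hypothetical.
logged = [
    (0.8, 0.9, 1.0),
    (0.5, 0.2, 0.0),
    (0.6, 0.7, 1.0),
    (0.9, 0.4, 0.0),
]

# Inverse propensity scoring: reweight each observed reward by how much more (or
# less) the new policy would have taken that action than the old one did.
new_policy_estimate = np.mean([r * (new_p / old_p) for old_p, new_p, r in logged])
old_policy_average = np.mean([r for _, _, r in logged])

print("estimated value of the new policy:", round(new_policy_estimate, 3))
print("old policy's average reward:      ", old_policy_average)
```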
Starting point is 01:25:06 Um, and so that's all. So I've kind of covered all of the technical stuff. I'll dive into a little bit of the large language model stuff, but before I do that, any questions about the technical stuff? So I guess the thing I'm missing is, I guess it's a little bit about application, which is understanding: some problems, like you're saying, are clearly not a fit for supervised or unsupervised learning, and so you could think about, like, oh, maybe this is a reinforcement learning task. When do you... but then some things maybe there's multiple approaches to solving, and so, you know, you could use a reinforcement learning
Starting point is 01:25:44 You could try something else. And then from a toolbox standpoint, even today, I can just Google your stop sign example, and there's 100 tutorials for opening up TensorFlow PyTorch, whatever. Give it the images, give it the labels. We talked about labeling, but give it the labels and monitor your loss function.
Starting point is 01:26:06 Is it the same, is there like the same set of tools for doing reinforcement learning? Or is there also some really like a canonical example that you would sort of go to to like kind of do like the simple case? Yeah, it's a really good question. Okay, so, okay, the first part of the question, I think that my general philosophy is to use the simplest tool for the job, right?
Starting point is 01:26:35 So for example, I'll give a really concrete example. There was a place I worked at, I could probably say this. I'll just say it. I don't think it's going to be that controversial or anything, but it's not that much of an exposé, but you know, when I worked at Meta, you know, we released the Oculus store, right? And so you can go right now to the Oculus store and buy games for the Oculus Quest. Right. And so I talked to the product managers, they asked to meet with my team,
Starting point is 01:27:06 and they wanted to do reinforcement learning to figure out what items to put on what places on the storefront. So when you go to like oculusstore.com or whatever it is, the URL, like what should they just show right there on the banner? They call that the hero position. What should they put in the hero position, et cetera?
Starting point is 01:27:29 And my response to them was like, not only should you not use reinforcement learning, you should also not use AI, right? What you should do is like take the app that sold the most and put it in the hero position just manually and run that way for a month. And then if you realize that like, if you just have this intuition, like, oh, there's, there's so many people with so many different interests and we're
Starting point is 01:27:53 showing everyone beat saber and it's not going well and, and so we need to do some AI then let's, let's go there. Right. So like simplest tool for the job. Like the simplest thing was just like a YAML file with Beat Saber in it, right? And so like they launched that. And then I would say, you know,
Starting point is 01:28:12 if you can do something simple around the decision, like say, okay, in certain countries, I'll show Beat Saber and other countries, I'll show other stuff. And then now I'm dividing by some other demographics. And the next thing you know, you're kind of like building a decision tree by hand. Okay, let me use the decision tree, right?
Starting point is 01:28:32 And so, and then at some point you run into like competing interests where, you know, I want the store to do well, but I also want game publishers to share the benefit. I don't want to just king make big beat saber. Now I have this competing economic model that's very complex. Now we're starting to talk about reinforcement learning and some of that. I would say stick with the simplest tool
Starting point is 01:29:05 for the job. Reinforcement learning often is much simpler than trying to like take actions by hand and stuff like that. So for running a marketplace, for driving a car, you know, reinforcement learning is a great choice. Yeah, and as far as tooling, the tooling is way, way behind. There's a lot of reasons for this. One of the biggest reasons is reinforcement learning can't really be commoditized because it's too close to the decisions that companies make, which are sensitive. And so it's just very hard to commoditize. I mean, we rolled out ReAgent, which was the most popular reinforcement learning platform for a while. Now there's OpenAI Baselines. So there's a bunch of places where you can get the algorithms, right? But if you want like the real techniques, like how do I do offline evaluation?
Starting point is 01:30:12 A lot of these are proprietary. ReAgent actually has policy evaluation and all of that. So folks can definitely check that out, and the code base, as far as I know, is still active. But I think the field is just still too new for there to be kind of really good practices there. But yeah, those are two awesome questions. Okay, so I'll move on to RLHF. So a lot of people found out about reinforcement learning when ChatGPT rolled out RLHF, which is what makes the chat part of ChatGPT. What took it from GPT to ChatGPT. RLHF is a pretty simple idea. So the idea is thinking about it this way. GPT is imitation learning. So a person wrote,
Starting point is 01:31:15 the frog or the fox jumped over the dog or whatever that is. When you pick a font, it always shows you that same sentence. It's like the quick fox jumped over the lazy dog. So that's probably all over the internet, right? Because it's in every font. So GPT will imitate a human quote unquote, if a human is you know all the content on the internet averaged, right? And so if you say like the Fox, the quick Fox jumped, GPT will
Starting point is 01:31:46 respond with like over the lazy brown dog, right? And this was trained in a supervised way, but when you think about it as like it's a decision to put that token there, like it's a decision to put that word there, then actually GPT is making decisions. And so, And so it becomes a reinforcement learning problem when it becomes multi-step. So for example, you know, if I just need to predict the next word and I know exactly what it is, that's supervised learning. But if I have like 10 different answers from GPT and I want to pick the best answer, like an entire answer only gets one score, now it's a reinforcement learning problem. Because I have to figure out, okay, this answer is better than that one, therefore all the
Starting point is 01:32:39 tokens that generated that answer are a little bit better, but we don't know how much. And so RLHF is just a pretty simple algorithm where you say, give two answers. If the system is more likely to pick the wrong answer, it gets a negative point. If it's more likely to pick the right answer, it gets a positive point, and now you do your policy gradient.
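The two-answers comparison described here is usually turned into a loss roughly like this, a Bradley-Terry-style preference loss on made-up reward-model scores; the learned reward then drives the policy-gradient step.

```python
import math

def pairwise_preference_loss(score_chosen, score_rejected):
    """Small when the reward model scores the human-preferred answer higher.

    This is the standard -log(sigmoid(chosen - rejected)) form; the scores here
    are made-up scalars standing in for a reward model's outputs.
    """
    return -math.log(1.0 / (1.0 + math.exp(-(score_chosen - score_rejected))))

print(pairwise_preference_loss(2.0, -1.0))  # preferred answer scored higher -> low loss
print(pairwise_preference_loss(-1.0, 2.0))  # preferred answer scored lower  -> high loss
```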
Starting point is 01:33:14 And so RLHF has been a part of these LLMs for a very long time. The thing DeepSeek did that made reinforcement learning- Oh yeah. What is RLHF? What does it actually stand for? Reinforcement learning? Reinforcement learning from human feedback. Ah, there we go. Okay. So yeah, the person, a human is actually saying this answer is better than that one. Okay, so the thing that DeepSeek did that was pretty amazing is it took the human part out. And so it's just RLF. And so the idea is it'll generate an answer
Starting point is 01:33:44 to a question that is easily verifiable. So for example, they give it a word problem and they know the steps of the word problem and they know the answer. And so they output two hypothetical answers and it comes back and says, hey, this one's better than that one, but it's all algorithmic. So in math, there's systems that are, you know, they're very expensive to run and they're very specific to math, right? They only solve math problems, but they're totally autonomous. So I can give you not just the answer, like twenty or something, I can give you the whole reasoning and the answer to a word problem, and the system will actually verify the entire thing. And so they replaced the human feedback with this expensive system, and then they ran this a zillion times. And what they found was the model that came out of it not only could do math problems better than anything we've ever seen, but it became like
Starting point is 01:35:01 very thoughtful and reflective. And so the reality is what they found is if you treat every question like a math word problem, then you become like much more reflective and like thought provoking and interesting in your answers. And so that's basically what the DeepSeek folks have done, which is definitely like a huge leap forward and really exciting. Does that part make sense? The RLF part of it? Yeah, I think it makes sense. I think they're, but ultimately they're going back and tuning the output of what the LLM is doing, or they're tuning something that comes after the LLM? Yeah, no, they're modifying the LLM itself in all these cases. Okay.
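A stripped-down sketch of that verifiable-reward idea, using a stand-in verifier and made-up answers. The group-relative averaging at the end previews the GRPO trick that comes up a little later.

```python
import numpy as np

def verifier_reward(answer, correct_answer):
    """Stand-in for the expensive automatic checker: 1 if the final answer matches,
    0 otherwise. The real systems also verify the reasoning steps."""
    return 1.0 if answer == correct_answer else 0.0

# Hypothetical: the model samples a group of answers to one math word problem.
group_answers = [20, 20, 18, 20, 20, 20, 20, 42]
rewards = np.array([verifier_reward(a, 20) for a in group_answers])

# Group-relative baseline: normalize within the group instead of asking a separate
# critic network how good an answer "should" have been.
advantages = (rewards - rewards.mean()) / (rewards.std() + 1e-8)
print(rewards)
print(advantages.round(2))
# Correct answers get a modest positive advantage, wrong ones a larger negative one;
# those advantages then weight the policy-gradient update on the generated tokens.
```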
Starting point is 01:35:45 Yep. So, so if you think about it, like regurgitating the next token is actually a form of imitation learning. So you're saying like these humans that have written this stuff on the internet, they're experts, and I'm trying to do the same action they're doing, where the action is writing letters.
Starting point is 01:36:06 And so then when I change the goal to be like, solve this reinforcement learning problem, it's still like a set of actions. And so you can use the same model. Oh, another thing I should mention is, we talked about actor critic, and we talked about how to get this policy gradient stuff to work, you need to have positive
Starting point is 01:36:28 values half the time, negative values half the time. So the challenge here is these models now are huge, right? Like these LLMs are enormous. And so if you need a second enormous LLM, then that's going to be really problematic. And so what they did, which is really interesting, um, and, and, uh, it actually only works because there's no intermediate rewards, uh, but that's kind of a detail. Um, is they said, okay, we can't afford to have a second model so what we're going to do is we're going to get the expected value of all these different
Starting point is 01:37:12 answers to this math problem so we're going to generate ten answers we're going to get the expected value of all ten of them and then we're going to get, and then we're going to basically normalize that number. So for example, I get the expected value of all 10 answers, and let's say the expected value is all 1, except for the 10th answer, which is 2. So I'm just going to normalize that so that all the ones become like negative 0.8 and the two becomes positive 0.8 or something like that. So they replaced like an entire neural
Starting point is 01:37:51 network with some simple algebra and so that's the GRPO or group relative policy optimization. So it's one of these things that's like a really clever trick. I have kind of mixed feelings about it. I do think that with intermediate rewards, it's going to struggle. I think that maybe coincidentally or maybe on purpose, but the fact that in this particular domain, you just get a reward at the very end is one of the important causes for this approach working over like PPO or these other alternatives. Another interesting thing where they've kind of diverged is generally what people have done in
Starting point is 01:38:45 situations like this where you need a large actor model and a large critic model is they've had the two share the same backbone. So for example you have a neural net where the current state of your of your universe goes into the net and then the neural network outputs two things. It outputs the distribution of actions you should take, that's your actor output. And then it outputs the expected value of the current state, that's your critic output.
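The shared-backbone setup being described looks roughly like this as a toy PyTorch module; the layer sizes are arbitrary, and the point is just that both heads pull on the same trunk.

```python
import torch
import torch.nn as nn

class SharedActorCritic(nn.Module):
    """One backbone, two heads: a policy (actor) head and a value (critic) head."""

    def __init__(self, state_dim=16, hidden_dim=64, num_actions=4):
        super().__init__()
        self.trunk = nn.Sequential(
            nn.Linear(state_dim, hidden_dim),
            nn.ReLU(),
        )
        self.actor_head = nn.Linear(hidden_dim, num_actions)  # action logits
        self.critic_head = nn.Linear(hidden_dim, 1)           # expected value of the state

    def forward(self, state):
        features = self.trunk(state)              # shared by both heads
        return self.actor_head(features), self.critic_head(features)

model = SharedActorCritic()
logits, value = model(torch.randn(1, 16))
print(logits.shape, value.shape)  # torch.Size([1, 4]) torch.Size([1, 1])
```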
Starting point is 01:39:14 And so you still just have one model, it just has one tiny extra output on it. So you might say to yourself, well, that's like pretty awesome, right? I mean, that seems like a no brainer. But the problem is that even though it's just adding one node, both of those nodes are sharing that network. And so they're kind of competing with each other.
Starting point is 01:39:40 You know, the critic model is going to be steering the entire network towards producing better values. The actor model, actor part of the model is going to be steering the network to producing a better policy and they're going to be causing corruption in each other. And so although you will see like people have success with this for Atari and other domains. I think it's actually super destructive and my guess is that the DeepSeek folks tried to have like just one network, one single network that does the policy and the actor, sorry, the actor and the critic in just one network. And they realized that doesn't work, that it just causes corruption
Starting point is 01:40:38 and it just never converges and it just is a mess. And so they ended up falling back to this approach where they said, okay, well, we can't have a separate critic model. It's too big. and it just is a mess. And so they ended up falling back to this approach where they said, okay, well, we can't have a separate critic model, it's too big. We can't put a critic head on the LLM because that causes too much corruption. And so we're just gonna abandon the entire idea of a critic and just come up with baselines on the fly.
Starting point is 01:41:03 And that worked for them, which is really cool. Was that something that was unexpected? Like, was that like a sort of, I don't know, I call it like an innovation to make that leap? Or was it just sort of like, no, it was pretty obvious once I got there. Yeah, so this is where it gets interesting is, I mean, so there's a lot of theories. So I'll say, you know, I, you know, I'm not at Metta anymore. I don't work at OpenAI or these places. And so I don't, you know, I don't really know like what's on the very cutting edge that
Starting point is 01:41:41 hasn't been released to the public. There's speculation that OpenAI was already doing something like this, but they hadn't published it. And so DeepSeek kind of scooped it. There's even more like, even more speculative is the idea that somebody stole the idea from OpenAI and gave it to DeepSeek. That is pure speculation. But I would say the fact that OpenAI has a reasoning model now that is comparable so quickly makes me think that like either they worked around the clock or they were coming to the same idea, right?
Starting point is 01:42:23 And it's probably the latter. Probably DeepSeek saw where the wind was blowing. And, um, and they both kind of came to that answer around the same time. That would be my guess. It makes sense. So there's lots of cases like that, right? I don't know. Online, I saw someone using the term nerd snipe. Wait, what does that mean?
Starting point is 01:42:47 Same kind of idea. Like people, or I think since you're a YouTuber, you're like working on some like new cool project, you know, you think is like crazy and innovative and someone else just releases a video of the same thing because you didn't get out fast enough, or like you said, there are, I mean, what, half dozen, dozen, I know, probably like a half dozen super serious competitors and like a said, there are, I mean, what half dozen, dozen, I probably like a half dozen super serious competitors and like a dozen like within striking range of doing these kind of similar. I don't, I don't want to demean them by saying those like chat, but like question
Starting point is 01:43:16 and answer AI agents, um, agentic stuff, the, the reasoning, like all of these. And so. Like you said, you're, you're hard at work trying to refine a project. You're not sure if it's a big enough innovation, whatever, and then someone else just goes ahead and releases it. Yeah, and so you get sort of sniped out of it, right?
Starting point is 01:43:34 Like someone got it before you, just before you were gonna do it. Yeah, yeah, totally. I mean, I think one of the trends was that like math, answering math questions was becoming like a big benchmark that was very important. And so I think that led a lot of people to the same conclusion.
Starting point is 01:43:52 Like if the metric had been like write the best play or something, then we might've ended up with a totally different system. But I think once people got excited about solving these like high school math problems, I think then that kind of set the course for all these companies. The answering math stuff was really bad for a long time.
Starting point is 01:44:15 So yeah, it's kind of a thing. And I feel, and maybe I'm kind of wrong, I feel the same for, like, coding. Maybe it's one of those things that when you think through like LeetCode-style problems, like there are definite setups where you're given a very high level question. And there are benchmarks that already have these in there,
Starting point is 01:44:35 but I still feel like performance isn't amazing once you get off the benchmark, like on problems they've not seen before, right? So when you give these sort of high level problems and you kind of have a very specific known output for the program, and it should be compilable and it should be, it's harder, yeah, of course, than the math problem,
Starting point is 01:44:51 but it seems within striking distance. Yeah, I mean, you know, traditionally in machine learning, we have this concept called leaking the label, which means like, you know, if you took, okay, if you took the examples you trained on for the stop sign trainer and you just fed them back in and you get them all right, that doesn't mean you have a perfect system because you're kind of cheating, right?
Starting point is 01:45:15 Like you might, you might have just memorized all those examples and you can't know anything else it's possible. Right. Yep. Um, but the problem is how do you not leak the label when you're and you can't know anything else, it's possible, right? Yep. But the problem is how do you not leak the label when you're trading on the entire internet? And so I think what they've found is a lot of these cases
Starting point is 01:45:33 where like the AI solves math problems or the AI solves leak code problems is they've leaked the label and the AI is literally outputting an answer that somebody else, some other human wrote to that lead code problem. And so they've done experiments where they've released things that they know are not on the internet and the AI's have struggled mightily with it.
Starting point is 01:45:57 I feel like until we get proper calculator use and tool use more broadly, I think it's gonna be very hard for AI to solve these problems. Well, this has been a great topic and very timely. I know you've been working on reinforcement learning for a long time, but I feel it has, like you said, it's kind of reached a certain hubbub in the everyday discussions recently. So I'm happy to have a sort of great overview of what it is and what it's about.
Starting point is 01:46:30 Yeah, totally. If folks have any questions, they can just reach out on our Discord or email or in my case, social media. We need a reinforcement learning algorithm to get Patrick on X. That needs to be the next thing. Okay. Well. But yeah. So I did look it up. So DALL-E 1 was four years ago. You were right. That was very good.
Starting point is 01:46:52 And then DALL-E 2 is what I was first trying. And that was three years ago. Nice. So you're actually, you're very accurate despite it just being off the top of your head. Well, I remember, you know, these things where like you connect it to stories. Like I remember there's this woman, she's very influential in AI. Her name is Fei-Fei Li. And I remember being in this dinner and her and her student were there. And she said something like, and this was again, a long time ago, but she said something like, oh, we were writing captions from images. So basically given an image, write a caption, so that we could for accessibility reasons. And Facebook I think still has that in the product today. It's like if you're blind or something you can click on an
Starting point is 01:47:39 image and it'll say what is going on in the image. I remember her saying, that's cool, but it'd be really cool if you could go from the description and create the image. And that always stuck with me. I mean, it was like a decade ago. That always stuck with me. And then I remember when DALL-E came out, I was like, wow, it's like, something that I thought was like a joke,
Starting point is 01:48:01 but then it really happened. Like, for me it was like an amazing experience. That's why I remember it. I think people have started to get a little fatigued on the AI thing. And it's hard to know, is it, you always hit plateaus. Is it like plateauing in terms of like actual functionality? Is it on an exponential?
Starting point is 01:48:21 And exponentials always look self-similar no matter where you look, right? And you just sort of like, we can't feel the growth. And then you tell stories like you're saying, or even about DALL-E being, you know, four years ago only, and like talk about the Recraft or the Flux now versus DALL-E, you know, just three or four years ago. It's not that long and they're lots better.
Starting point is 01:48:41 Yeah, I mean, yeah, actually that's a good point. I'll end with my, where I think this is going. I think that despite loving reinforcement learning and everything, I don't think that AI should be making a lot of decisions in isolation. I think that it should be working together with people. And so, you know, Recraft is a great example
Starting point is 01:49:04 where it's not just an API you call and get an image, but it's like an experience, and like you iterate and you say, hey, I want this to be all different, or hey, I want this, I want an axe in this person's hand or a phone or whatever. Right. And so I think it's going to be really about collaboration. And reinforcement learning is always gonna be really important, but it's gonna be important in the way that the actions are more like suggesting things to people. So in other words, reinforcement learning
Starting point is 01:49:34 to like book a flight for you, probably not a good idea. Cause if one out of a hundred times you go to Tokyo by accident, right, you're gonna be pretty upset. I might be happy, that sounds great. Yeah, actually Tokyo is amazing. Yeah, actually I don't wanna say anywhere we don't wanna go, cause we have a list that are almost like, anyway.
Starting point is 01:49:54 So Antarctica. Okay, sounds horrible. But you still need reinforcement learning to, like, suggest, like, you know, come up with like three different hypotheticals, send it to the person, should you text them or email them, etc. Like there's still a lot of decisions to be made. But I don't believe tons of people are gonna lose their jobs entirely. I think that work is gonna change, just like it did with the invention of the motor and stuff. So. Well, you heard it here first.
Starting point is 01:50:28 The Jason "future AI is not too scary" coin. Yeah. Yeah. Don't be worried about it. Just be adaptable. If you're adaptable, I think you'll be just fine. And, oh, and coding will probably be one of the last jobs to be eliminated, by the way. So
Starting point is 01:50:46 if you're worried about, yeah, please stay in coding. Not just because we want you to keep listening to the podcast, but tell all your friends to get into coding, stay in coding. If they're worried about their job being eliminated, they should be a coder. That's like one of the last jobs that's gonna go. I mean, trust me on this.
Starting point is 01:51:03 Like we are going to lose so many doctors. We should probably lose all the CEOs before we lose the coders. Oh, no, no, no. All right. All right. We got to wrap. We got to wrap, guys. We got to wrap. They're phoning me from the other room and telling us we're out of time. That's right. Hey, I didn't say anything about HR. What's that? Oh, yeah, we're wrapping up. All right. This was so fun. Thanks everyone for tuning in and thanks Patrick for bearing with me and my rants on
Starting point is 01:51:35 reinforcement learning. This is great. Very illuminating. This is awesome. I learned a lot today. So. Cool. All right, everyone.
Starting point is 01:51:42 We'll catch you later. Eric Barndaler.
