Programming Throwdown - 172: Transformers and Large Language Models

Episode Date: March 11, 2024

172: Transformers and Large Language Models

Intro topic: Is WFH actually WFC?

News/Links:
- Falsehoods Junior Developers Believe about Becoming Senior
  https://vadimkravcenko.com/shorts/falsehoods-j...unior-developers-believe-about-becoming-senior/
- Pure Pursuit
  Tutorial with python code: https://wiki.purduesigbots.com/software/control-algorithms/basic-pure-pursuit
  Video example: https://www.youtube.com/watch?v=qYR7mmcwT2w
- PID without a PhD
  https://www.wescottdesign.com/articles/pid/pidWithoutAPhd.pdf
- Google releases Gemma
  https://blog.google/technology/developers/gemma-open-models/

Book of the Show
- Patrick: The Eye of the World by Robert Jordan (Wheel of Time)
  https://amzn.to/3uEhg6v
- Jason: How to Make a Video Game All By Yourself
  https://amzn.to/3UZtP7b

Patreon Plug
https://www.patreon.com/programmingthrowdown?ty=h

Tool of the Show
- Patrick: Stadia Controller Wifi to Bluetooth Unlock
  https://stadia.google.com/controller/index_en_US.html
- Jason: FUSE and SSHFS
  https://www.digitalocean.com/community/tutorials/how-to-use-sshfs-to-mount-remote-file-systems-over-ssh

Topic: Transformers and Large Language Models
- How neural networks store information
- Latent variables
- Transformers
  - Encoders & Decoders
  - Attention Layers
- History
  - RNN
  - Vanishing Gradient Problem
  - LSTM: short term (gradient explodes), long term (gradient vanishes)
- Differentiable algebra
- Key-Query-Value
- Self Attention
- Self-Supervised Learning & Forward Models
- Human Feedback
  - Reinforcement Learning from Human Feedback
  - Direct Preference Optimization (Pairwise Ranking)

★ Support this podcast on Patreon ★

Transcript
Starting point is 00:00:00 programming throwdown episode 172 transformers and large language models. Take it away, Jason. Hey, everybody. I had a really interesting discussion on LinkedIn. This is like a meta meta thing, but I post everything on LinkedIn and Twitter and my Twitter, nobody follows my Twitter. And the weird thing is I, my ex, yeah. Maybe it's going to be my ex social network. I just can't get anybody to follow me on there. Everyone follows me on LinkedIn,
Starting point is 00:00:47 which is fine. But I even tried, you know, putting my Twitter link on presentations I give and stuff like that. And, and people would rather just find me on LinkedIn. So I find this amazing that you give presentations where people actually could join and follow you
Starting point is 00:01:05 i always see people do that but i've never done it myself just not on social social media but i also find it interesting that you have presentations where that would even be an opportunity i gave a amazing presentation to suny purchase which is a university in new york um and it was on the ai singularity and uh um it was it was a lot of fun um had a great time there or i wasn't like i wasn't in person but had a great time speaking oh nice now i kind of want you to give it here i actually would love i asked if if they would make the video public they said no it was actually technically one of the lessons like it was part of a course and so for that reason they can't just post it on the
Starting point is 00:01:45 internet but i was really happy with it the questions were extremely interesting the students were very engaging and um um yeah it's a shame that we can't just share it with everybody but the thing that really kind of took off on linked sometime over the weekend, was I posed this question, is work from home really work from city? And what I meant by that is how many people during the pandemic really moved to a different city and they're not really interested in working from home. That's not the spirit of what transpired it's actually that
Starting point is 00:02:25 they want to work from another place and uh so maybe i'll pose a question to you patrick so if you're if your office was there in your city would you go to it why and why not oh this is interesting uh i think i i see your point and i have seen the statistics although i'm not sure how trustworthy they are i i don't know the questioning to be honest but that not everyone to your point is actually happy about work from home personally um you know relocating to a different city for for a plethora of reasons to your point it's not necessarily that it was only the office it was also the geographic location as much as we are moving to online people you know having family and you know just a laundry list of different
Starting point is 00:03:16 personal reasons for myself but for others as well there can be specific cities you want to live in or don't that being said i the question so it's hard because the place i chose to live is different than where i would potentially have lived if i had tried to locate next to an office that is where an office most likely would be near me would be far and therefore i would not want to go to it oh i see but if there had been an office and i was living close to it it's not that i mind going into an office occasionally. I will turn the flip around, which I'll say by not commuting personally have found ways of, I don't want to say exploiting, that sounds negative. I have found ways of parlaying that time that I would normally have spent in the car into other industrious activities. So things like,
Starting point is 00:03:58 you know, being more active and exercising and going out or meeting with people sort of early in the morning because I'm on the East Coast and work with people on the West Coast. And so going and meeting with people I otherwise wouldn't have because I'm working late, basically meeting them in the morning. So replacing commute time by having a time shifted schedule and finding these other things for me means work from home is really about work from home. But I do see your point there is a gradient between office for your full work week hybrid work from city like down the down the list i think there's sort of like a a spectrum and i think it's it's not a it's not a binary choice yeah that
Starting point is 00:04:39 makes sense you know one of the reasons, so I was working in Austin. So like when I moved to Austin, I did work in an office here. And then recently I switched to working from home, which wasn't a choice. It was part of just things that transpired at my company. And so now all the Austin folks are remote. There's definitely pros and cons. I'm a bit ambivalent to it but what i really want to do is sort of blunt this argument that um i feel like when they create this dialectic of you have
Starting point is 00:05:15 you have people in the office and people at home then it's easy for like i think elon musk is one of these people who says oh people working from home are lazy you know they just want to sit eat doritos all day whatever um and so i feel like you know the entire premise of that argument is false in the sense like like i would i wonder if the majority of people actually you know they're not working from home to work from home they're working from home to live somewhere else and so then you know the whole uh this whole stereotype of like oh this person doesn't want to get out of bed is really not not even true on the premise i think yeah i i feel like it's one of those cases where it's there's no universal truth i mean i could imagine a you know theoretical person that you know this is the classic goes
Starting point is 00:06:12 and spends all their time at the water cooler right and actually causes a distraction of other other employees trying to work and you know basically finds ways of not being at their desk and not working because they're not busy and they're trying to cover for it throughout my career i've always known people that that have you know been more or less like that and that for that person going home if they're doing that specifically to avoid the appearance of you're not being busy then going from his is a is a you know a revelation to them because they could just do whatever they want and no one knows if they're working or not and then you get all the like you know mouse tracking software that you see companies roll out or whatever to combat this so i do think there are people like that but
Starting point is 00:06:54 like you said i i don't know that that's universally true that just because you work from home means you're like eating cheetos and don't have any pants on. Yeah. All right. Well, we'll see where that goes. But there's a lively discussion. If you want to follow me on LinkedIn, where people actually respond to my inane questions,
Starting point is 00:07:15 you can follow me on LinkedIn and see a really interesting discussion unfolding. One thing is, you know, and I knew this would happen, so I tried to avoid it, but you can't completely avoid it. You know, some people took it as a, like, attack on, you know, the Bay Area. And that wasn't really the point. You know, and I tried to make a point of saying, look, there's just as many people who would be running the other direction if the tech companies weren't already
Starting point is 00:07:45 there you know like if you were to flip the script there'd be just as many people in the bay area working from home while their hq is in florida saying like oh you know it'd be the same thing right so it's not it's not about which place is better or any of that it's just about why are people um leaving and for for a subtly different question i mean do you have a like if a company wanted to offer work from city as a flexibility like is that something you imagine i mean WeWork is now i think going towards bankruptcy i think is last i saw but i mean i i mean what what is your like do you have a thought towards like what that ends up looking like yeah i mean i i loved it so we had a WeWork um basically the the short story here is i was part of a startup
Starting point is 00:08:29 as you know a startup has relatively few rules i mean obviously it's still a corporation and everything but it's a small company and so we had a we work uh startup was acquired by a much bigger company and the bigger company has a bunch of rules around what constitutes an office. And WeWork wasn't able to fit those rules. And so all the WeWorks got shut down. Basically, it's what happened. But I loved it. I would go in, there would be a bunch of people
Starting point is 00:08:59 from all sorts of different industries and different micro offices. That was interesting. There's nice common areas where you can meet people um so i was a big fan i actually um i would ride my bicycle to downtown which is something like 12 miles each way um and that would be my whole exercise for the day so i bike in a downtown i would uh you know work um bike, bike back. And yeah, it was a lifestyle that I was enjoying. But working from home, I think has been pretty much fine. Yeah, maybe I'm pretty
Starting point is 00:09:34 ambivalent to this, you know, it doesn't seem all that different. I think, you know, a lot of the people in my team were not in my city anyways. So I was spending a lot of time, you know, on the internet with them anyways. Well, check it out on LinkedIn. The discussion continues there. But you have to make an account, Patrick. I think I have one somewhere. I'll have to dig it out. Time for news of the show.
Starting point is 00:10:07 So fitting with not being a news story, we should just really rename these sections uh this one was an article that uh was titled falsehoods junior developers believe about becoming senior and uh just uh the person's name at least i assume his name is Vadim. I'm not sure how to say it. Sounds right. Okay. Apologize if you're listening. And they had some great points here that is just a thought-provoking article, which is,
Starting point is 00:10:38 you know, I have been in my career a little bit longer. I can sometimes forget what it was like when you're sort of new and you don't understand a lot of stuff um first of all i mean i personally think the junior developer senior developer thing is is overdone at some companies it's a very strong delineation at other companies it's not uh i i think there are people who are have characteristics of being quote unquote junior and senior um and so maybe my mindset already sort of like answer some of these, but these are great to point out. I don't know, I hit on all of them. But an example of is that a senior developer just knows all the answers. And I think that's obviously not true. Like senior developers, or at least I've not met a senior developer who
Starting point is 00:11:21 actually knows all the answers. I've met a few who thought they knew all the answers, but they don't actually know all the answers. Similarly, there's a belief, and we bump into this from time to time, that, oh, you're a senior developer, like you must work on the cutting edge stuff and, you know, insert whatever language is, you know, hip at the moment, or constructs in the language you're working in. So I work in c++, like oh, you must use all the, insert exotic c++ stuff, and it's actually like no, i use probably one of the more basic subsets of the language because it's important for me to get collaboration and get help from teammates. and so the person is just pointing out that, you know, there's not this belief that you somehow fundamentally shift, that one day you're doing sort of junior tasks and fetching coffee, or the equivalent for us of writing documentation
Starting point is 00:12:10 and then code. And then one day you're not you're doing only cool stuff. And you're just pawning off all the mundane things and whatever. There may be sadistic or problematic, you know, senior developers who do that, but that's in practice, not really true. And it really is a gradient as you move up, you may be solving bigger bigger problems but bigger problems are really just large collections of smaller problems in my experience for the most part with very rare occasional you know singular tough nuts to crack and so interesting article i i didn't read them all i think there's you know like 10 here um but check it out we'll have a link in the show notes. Uh, what are your thoughts? This is, this is great. Yeah.
Starting point is 00:12:45 I think, um, I actually have seen things degenerate the other direction where, you know, as someone becomes a tech lead, they say, well, you know, my team's welfare is now really important. Therefore I'm going to do the really, you know, crappy, for lack of a better word, work that no one else wants to do. That way my team is happy. And then a year later, they're completely burnt out and they hate their life and everything. And you kind of start diving into it and you find out, oh yeah, they've been doing the worst work for a year. And that's just not sustainable. So I think it actually you know goes in the
Starting point is 00:13:26 other direction more often than not um and yeah i guess maybe the thing takeaway is like as you go up in level you become like more and more of a servant to more and more people and and so uh um and so yeah it doesn't it doesn't go that, that way, but, but yeah, I also felt the same way as a junior developer. So this is interesting that like, how do you correct the record there? It's not, it's not there. I guess, you know, as you said, if you're sadistic, you have now the power and the influence to be a sadist.
Starting point is 00:14:07 But if you're sadistic, you probably have a really hard time getting promoted. It's not impossible, but it's harder. So it kind of works itself out. Yeah, I mean, culture and management is, you know, I guess the fallback here, which is depending on how your team is run determines how how much if if you're one of a very small number of software developers at a non-engineering non-software engineering company i think it's a little different than if you're uh you know at a at a sort of big tech company i think the how those situations develop and evolve can be very different. Yep. Yep. All right. My news story is Pure Pursuit.
Starting point is 00:14:49 This is really cool. It's a relatively simple way of designing a robot vehicle trajectory optimizer. So, for example, imagine like this is an example. Take racing video games right for most people a racing video game is kind of a non-starter because they have no idea how to build the opponent ai and it seems like when you watch um you know even mario kart or something when you watch something when you watch these racing games it can seem really hard like how do i code something like that up and what if they get totally you know bumped by the player way off track how do they get back on track and yeah i guess the thing is like they as opposed to
Starting point is 00:15:35 like mario or the enemies basically have their own physics and their own you know universe in this case you know everyone's a car with the same you know if you can't really bend the rules if if if the ai um you know just teleports in the middle of the track or something it completely destroys the immersion you can make the ai go faster and slower and there's rubber banding and some of that but but still like it has to kind of follow the same kind of physics constraints as the player. Otherwise, it ruins the game. And so it turns out there's a whole bunch of relatively simple methods for doing kind of basic robot navigation and trajectory planning. And this is one of them called Pure Pursuit.
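To make the idea a little more concrete, here is a rough Python sketch of a single pure pursuit update, assuming the path is just a list of (x, y) waypoints; the function and parameter names are illustrative, and the linked Purdue SIGBots tutorial walks through a fuller version.

```python
import math

def pure_pursuit_step(x, y, heading, path, lookahead=1.0):
    """One pure pursuit update. `path` is a list of (x, y) waypoints and
    `heading` is the vehicle's yaw in radians; all names are illustrative."""
    # 1. Pick the first waypoint at least one lookahead distance away.
    goal = path[-1]
    for px, py in path:
        if math.hypot(px - x, py - y) >= lookahead:
            goal = (px, py)
            break

    # 2. Express the goal point in the vehicle frame (x forward, y to the left).
    dx, dy = goal[0] - x, goal[1] - y
    local_x = math.cos(-heading) * dx - math.sin(-heading) * dy
    local_y = math.sin(-heading) * dx + math.cos(-heading) * dy

    # 3. Curvature of the arc that passes through the goal point;
    #    positive means steer left, negative means steer right.
    dist = math.hypot(local_x, local_y)
    return 2.0 * local_y / (dist * dist)

# Each simulation tick: steer along the returned curvature, move the car, repeat.
curvature = pure_pursuit_step(0.0, 0.0, 0.0, [(1, 0), (2, 0.5), (3, 1.5)], lookahead=1.5)
```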
Starting point is 00:16:23 So I included two links. One is a pretty lengthy tutorial that has a bunch of python code where you can um follow along and see what they're doing um the other one is a youtube video where it's as you'd imagine much more visual um and yeah it's a lot of fun if if you're looking for kind of a neat thing to do, I think coding something like this up and having a bunch of little robots race each other could be kind of a fun thing to add to your GitHub account. I think I saw there was a... I guess that's reinforcement learning person wrote a tutorial
Starting point is 00:17:04 or like not a tutorial, one of those funny videos videos I guess you watch mostly for entertainment about track mania which is some you know race car game that is not super hyper accurate but it's you know a little bit more and so they were kind of showing how over generations trying to I guess you call it evolve or train a an ML agent to kind of get the best times. And, you know, they were talking about some of the subtleties about, you know, what moves or inputs you are allowed to do and, you know, whether it was allowed to drift or whether it tried to avoid drifting as an example. It's a very complex topic. And I agree with Jason. I mean, I think it's pretty interesting and it has a lot of, I don't know, I don't want to say like complete real world application, but there's a of uh thinking through these kinds of things that help you in other domains that would be you know sort of adjacent so anything from like building a little you know flying airplane or a quadcopter is gonna have some of these same control loops and
Starting point is 00:18:00 dampening and you know making sure you don't get into oscillations and these kinds of kind of things as well so yeah i don't is there a like i don't want to call it like a playground or like a something where it's like pretty well set up where where all you have to do is really code in the the kind of like driving there must be no one off the top of my head yeah there was a um let's see if i can find it, oh i forgot what it was called um let me see if i can yeah there's there's CARLA oh TORCS TORCS is what i was thinking of it stands for the open racing car simulator and TORCS was designed for ai like you could actually you could play it but it's really meant for ai to play it nice so i was gonna say that that if you want to get down to the that
Starting point is 00:18:52 specific cut of the problem may get you there faster than writing mario kart from scratch oh yeah totally i mean i think you know uh the mario kart from scratch you would make just like a really fake physics engine and just really silly. But yeah, if you wanted to, you know, ultimately control a real car, if you wanted to ladder up to that, then yeah, TORCS would be a good starting point or CARLA would be another one. So my next news article unintentionally, I guess, is it's a piece of what you might use to kind of get there. And that is something called a control loop or a controller and sort of figuring out, you know, given some observations of the real world, you know, affect some output. And you can get as fancy or non-fancy of these words as you want. I'm pretty non-fancy.
Starting point is 00:19:36 And that is, I call it a P-I-D controller. So the article I have here is PID without a PhD. Jason was telling us at the pre-show he says "pid," which i've heard before as well so uh we'll have to ask the ai which one is correct um but PID without a PhD is a little pdf with uh i will say not the most elementary explanation but without getting into my background makes it such that if you just, you know, look at the Wikipedia page for PID controller, you're quickly going to get math'd out. Well, that's not good grammar. Oh, well, you're going to get into math over your head, or at least math I'm not comfortable
Starting point is 00:20:14 with for like a casual reading. And so this PDF is a great introduction, and did want to give a shout out to it. But also just to talk about, about i guess similar to what jason is saying a useful tool in the toolbox that kind of pops up in a surprising number of places and so just in brief a pid controller is an acronym stands for proportional integral derivative and that is that you have some value you want to achieve in your output say you're heating a pot of water and you have a thermometer in it and you're controlling the heater underneath. If you crank the coil underneath all the way to full power, the water is not gonna instantaneously boil
Starting point is 00:20:54 if you've ever watched a pot of water. It would be nice if it did that, but it doesn't. And so you have some temperature you're trying to achieve and you have the thermometer that's giving you feedback, there's you know this latency and as you get closer you want to start modulating the power down so you don't necessarily like overshoot your temperature um you know because if you were cooking well now i said water but if you were cooking something in the water that you know you didn't want to be overheated like burn your food or whatever the equivalent of over overheating it would be um you know you don't you can set this
Starting point is 00:21:25 without going into it the proportional integral and derivative terms that will help you to kind of control the behavior of getting to the target as quickly as possible and then sort of having good behavior as you sort of have this uh latent response to your inputs and this is a form of a control loop because you're kind of sitting there looping over and over again, take the measurement, compare the measurement to what you want the output to be, decide what you, you know, want to do to your settings. And so they grow from here, it's kind of the most, you know, trivial one, there's all sorts of more advanced optimizations that you can do with PID controllers, and then ultimately even moving on to other controllers, once you get past sort of like the trivial things. And even the online discussion around this article had a lot of debate about whether PID is sort of 99% of the time, okay,
Starting point is 00:22:15 or whether it's the worst thing ever, because it just, you know, it's just not the most optimal answer. My take is, you know, it's a good tool in the toolbox and learning. And there are often things where you want to apply some sort of, you know, move from a to b but in a controlled response, and it sort of ends up being in the same bucket of problems, i will say, and so being aware of these things is, uh, it's good to know your way around in unfamiliar territory. yeah, if i remember correctly, and i haven't read this article, but i think the P is basically how far away from the target you are, the I is the error that has accumulated recently, and the D is your rate of change. Like, you know, if you're, if you're screaming towards the finish line, then that proportion better be really big. Otherwise, you have to start slowing down. So that's where the derivative comes in. And if you're waffling back and forth, then it's clear that like you need to slow down because you've overshot six, seven times in a row, and that's where the integral comes in. even setting the values has a whole bucket of uh theories around it, and it is really interesting, if you use a PID controller to set the values, gee, i can't see where this might become recursive
Starting point is 00:23:39 uh stack overflow um i i think that if you uh once you learn about these you will notice like you know cruise control in your car and how does it behave uh you know you can sort of think about this not necessarily that they do it but you'll start to see also some um if you have like a 3d printer or as an example it will not know how much kind of like mass the heating the heated bed has at the bottom or the part that squirts out the plastic. So often you'll see during, when you're turning it on, it'll do some calibration.
Starting point is 00:24:10 And part of its calibration is heating, cooling, heating, cooling to understand the response of those devices in their current configuration and their current environment and setting the parameters of these PID controllers. Yep. Yep. That makes sense.
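For reference, here is a minimal, textbook-style PID loop in Python; the gains and the pot-of-water numbers are made up for illustration, and tuning them for a real system is exactly the part the article digs into.

```python
def make_pid(setpoint, kp, ki, kd, dt):
    """Build a PID update function. The gains and dt here are illustrative."""
    integral = 0.0
    prev_error = None

    def update(measurement):
        nonlocal integral, prev_error
        error = setpoint - measurement                  # P: how far from the target
        integral += error * dt                          # I: accumulated past error
        deriv = 0.0 if prev_error is None else (error - prev_error) / dt  # D: rate of change
        prev_error = error
        return kp * error + ki * integral + kd * deriv

    return update

# Toy example: nudging a pot of water toward 100 degrees.
heater = make_pid(setpoint=100.0, kp=2.0, ki=0.1, kd=0.5, dt=1.0)
temp = 20.0
for _ in range(20):
    power = heater(temp)     # controller output, e.g. heater power
    temp += 0.01 * power     # crude stand-in for the real (slow, laggy) physics
```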
Starting point is 00:24:24 Very cool. Yeah, folks definitely read i mean pid controller is is is as patrick says like the simplest thing it's in every house it's all your thermostat uh you know it's almost certainly using a pid maybe not nowadays i think but yeah yeah i mean but if you have like one of these old honeywell ones you know whatever um but uh definitely worth learning about i think it's it's a great foundation um all right my new story is google releasing gemma so um most people have heard about llama um llama is this open source llm from facebook um someone uh made a project called llama.cpp which is kind of a weird name of a project but basically it's a way to run these llama models really fast um there's no
Starting point is 00:25:16 pytorch um they're loading the pytorch model um but it's all done in C++ and everything is super, super customized and optimized for all these different architectures. And it gets to the point where like on your MacBook, you can run these large language models almost in real time. And so that's been really exciting. There's been a whole ton of research that's come out of that.
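If you want to try this locally, one common route is the llama-cpp-python bindings around llama.cpp; the sketch below assumes those bindings and a quantized GGUF model file whose path is a placeholder, so check the project's README for the exact arguments in your version.

```python
# Sketch of local inference through the llama-cpp-python bindings
# (pip install llama-cpp-python). The model path is a placeholder and
# argument names can shift between versions; consult the project docs.
from llama_cpp import Llama

llm = Llama(
    model_path="./models/llama-7b.Q4_K_M.gguf",  # any quantized GGUF file you have locally
    n_ctx=2048,                                  # context window size
)

result = llm("Q: Why would you run an LLM locally? A:", max_tokens=64, stop=["Q:"])
print(result["choices"][0]["text"])
```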
Starting point is 00:25:43 So Google released Gemma. This is also kind of an area of debate. But okay, well, just basically the Gemma models are also these models that are small enough that you could run them locally. In fact, they even have one even smaller than the smallest Llama model. So as a result, it can run on even less hardware like maybe on your phone in real time um they claim the performance is you know much better um there's a big debate because the way they're kind of um it kind of gets in the weeds here but you know the seven, if you look at like the llama 7b or gemma 7b that that 7b number means 7 billion so there's 7 billion parameters
Starting point is 00:26:36 in the model 7 billion weights that have to be tuned and um whenever you execute the model to generate a token you have to you have to at least use all of those seven billion weights um so you know the smaller that number gets the less weight you have to use the faster everything gets um now the problem is like that's not the only uh it's not so one-dimensional right so the gemma models have a much larger embedding and i'm not gonna get too much in details here it's not it's not like that relevant but basically because they have such a large embedding size um they could be a lot slower than the llama model of the same size. But they can also perform better. So there's a little bit of advertising talk around the,
Starting point is 00:27:29 oh, we perform better at 7 billion. But the cool thing is, you know, it's a whole new set of open source models that we have at our disposal. One of the most interesting large language models that have come out that you could run on commodity hardware is this one called Mixtral where i think they mixed several different open source models to make one kind of supervisor model um and uh and i think that's fascinating
Starting point is 00:28:02 and so now you have another model you can add to the mix, literally. So I think there's a lot of potential here and folks should check it out. The llama.cpp, despite the project name, actually runs a lot of these open source models. And I'm sure they're feverishly working. By the time this podcast is out, they'll have the gemma models in llama.cpp so you could run them on your on your laptop so um definitely something to check out is the number of parameters also like uh sort of related to how much memory it takes to use because that's one of the things that that they always make a big deal about is like not just needing a gpu but
Starting point is 00:28:43 a gpu with like very large amounts of memory yep yep and so there's uh a bunch of tricks you can do um they got to the point where they're doing four bit quantization um so they're only allowing each weight to be one of 16 different values um but uh that's what that llama.cpp uh project that's one of the things they do but you're right you know the more parameters the more either cpu ram or video ram you need to to run the model very cool and then does it or maybe we can we move on but fine tuning fine tuning of these like is it some models are easier to fine tune like so the training so running them and executing them presumably like jemma coming from google is related to you know the ones that you can just go online and use so if you don't have an internet connection or something it feels useful and being open source but to me the power and i think you've talked
Starting point is 00:29:37 about trying that before is like adding your own inputs and doing some you know additional training or switching or customization to um are these ones equivalent when it comes to that aspect or is like some are better for that and some are worse yeah so the big difference between Gemma and llama is the embedding layer of Gemma is enormous um so and we'll talk about this actually later in this episode but but basically the embedding layer is the layer where you switch from understanding what you just said to deciding what to say next and so that that layer is really important um and so that layer is enormous in Gemma so i would expect it to be harder to fine tune just because of that, to require more data, more iterations.
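As a toy illustration of the 4-bit quantization mentioned a moment ago (restricting each weight to one of 16 levels), here is a small numpy sketch; real schemes like the ones llama.cpp uses are blockwise and more careful, and as noted next you still need full-precision floats while training.

```python
import numpy as np

def quantize_4bit(w):
    """Toy 4-bit quantization: snap each weight to one of 16 evenly spaced
    levels between the tensor's min and max."""
    lo, hi = float(w.min()), float(w.max())
    scale = (hi - lo) / 15.0 or 1.0          # 16 levels -> 15 steps; avoid divide-by-zero
    codes = np.round((w - lo) / scale).astype(np.uint8)   # values 0..15 fit in 4 bits
    return codes, lo, scale

def dequantize_4bit(codes, lo, scale):
    # Rebuild approximate float weights from the tiny integer codes.
    return lo + codes.astype(np.float32) * scale

weights = np.random.randn(8).astype(np.float32)
codes, lo, scale = quantize_4bit(weights)
approx = dequantize_4bit(codes, lo, scale)   # close to `weights`, but only 16 distinct values
```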
Starting point is 00:30:32 But in general, yes, I mean, it's much easier to fine tune the 7 billion model than the 70 billion. The other challenge about fine tuning is you can't do quantization while you're training. And so now you need to store the 32 bit or maybe even 64 bit potentially float for each weight um and so that gets that gets really expensive so that's where you need like the 64 gig gpu to do the training uh this is a fascinating time fascinating time i'm excited need to learn here good thing we have a topic queued up for this yeah you could start a large language model startup but i think
Starting point is 00:31:11 they took all the names there's no names left in the entire language there's so many of them all right time for book of the show all right what's your book of the show mine is the eye of the world by robert jordan which is the first book in the wheel of time series. I guess I'm, I'm late to this, you know, classic fantasy. Most people probably already heard of this. Also, if you have not seen any of the large amounts of advertisement, Amazon prime has done for Amazon prime video. They have a wheel of time series, a tv series i guess it's called oh cool um and so
Starting point is 00:31:47 that's actually i had long known about this uh we talked a lot about brandon sanderson books and brandon sanderson and ends up writing the ending of the wheel of time uh book series because oh i didn't know that jordan passes passed away before the series could be completed. So he left his notes and Brandon Sanderson sort of released it. So very adjacent to this series, which is never, if you've ever seen the books, they're intimidatingly large. So they're being,
Starting point is 00:32:14 you know, I don't, I should have looked at, I think there's 12 of them, but being there so many, you know, I was always a little intimidated to pick it up, but I actually really enjoyed the TV show.
Starting point is 00:32:23 I've watched there's two seasons now and decided it's finally time to pick the book up being aware that, you know, books and TV shows are not necessarily the same thing. But but interesting enough in the world, and, you know, seeing some conversations, I guess, Brandon Sanderson is actually one of the consultants for the Wheel of Time show. And, you know, you see arguments. Oh, I guess I'm nerdy. But in the parts of the internet that we all spend time on, you'll see people kind of debating about the Wheel of Time time show. And so it piqued my interest.
Starting point is 00:32:53 And so now I have to like know for myself and, you know, go look at the book so I can judge if the TV series is a good reflection of the book or not. Very cool. I read this book probably when i was like 14 or something oh but you've not seen the tv series i've not i didn't even know there was one until you just mentioned oh no i'm going to have to catch up i'm one of those people that i don't watch tv but i do watch youtube um and uh i watch premium, so I only get the ads that the creators, you know, actually put in themselves.
Starting point is 00:33:28 Um, but, uh, yeah, actually you would, you would, you and I would probably have very similar YouTube interests lately. I've been binging on this guy, black tail studio who makes coffee tables. Have you seen this guy? Yeah. Um, but yeah, I, I don't have, I have Amazon prime, prime but i've never watched watched videos on it but i will check that out that sounds cool uh so my new my new so so sorry on a side topic here watching youtube a lot i have uh apparently like too many interests slash hobbies that youtube
Starting point is 00:33:59 doesn't like it's not able to hold them all in its you know recommended video list so i have subscriptions of course but like if i go to my front page, like whatever certain topics, if I ever watch a video, so we were talking about power world, I watched a video about power world instantly. Like my whole feed became like two thirds dominated by power world. Um, and I had to like go remove watching power world videos from my watch history to like get back to any semblance. But in just in in general as i like rotate through my interests that like i noticed that it it feels like it can't hold them all and it's like you know consideration matrix or whatever however it's doing you know not to anthropomorphize it but
Starting point is 00:34:36 it just feels like yeah if i yeah once you once you move to a new thing you get lots of videos about that and you stop getting videos about the other even if you were watching them when offered and so i don't know maybe that's a personal problem just because i i should focus but well i think it just yeah there's a movement of the masses there like like probably people get on a topic and binge it and then that changes the behavior of the system for everybody else you know um my book of the show is how to make a video game all by yourself uh this is a very short book um extremely useful book i've been going to over the past you know since since covid let up a bit um i've been going to a lot of
Starting point is 00:35:20 video game developer meetups as people know i made this ai hero game but i've also been talking to a lot of other game developers in the area and one thing i've noticed is a lot of them don't see themselves as producers a lot of them are software engineers and they end up with like a really cool tech demo or engine this one guy built this um thing where it was like kind of like a minecraft world but you had uh you had rocket launchers for arms and every rocket actually caused a crater in the world and so you flew around like blowing craters in the world and they even had the water working so like if you if you're underground and you blew a hole up and you accidentally blew a hole into the bottom of the ocean like water would just start filling in it was cool i mean i was super impressed um and and as the water filled
Starting point is 00:36:21 in like voxels of air like floated up. So he had naturally had bubbles like you didn't have to like, you know, design the bubble separately. It was fascinating. But what it wasn't was it was not a game, you know, like I had a lot of fun, but I wouldn't say I was playing a game. I would say maybe I was it was more of like an art kind of thing. Um, and, uh, I feel like this book would have been perfect for this person because, um, it really, it starts off with the first principle of like, you are a video game producer. And so it lists all these things that ultimately don't really matter. Um, like which engine you pick or these other things right or
Starting point is 00:37:06 they're they're secondary to your to the goal um i thought it was really well written um i didn't bother to research the author but i'm assuming it's somebody who's done a bunch of indie games um and uh yeah it was a short read but I had a lot of fun. That sounds cool. I, I fascinated by this as well. I, one day I should go to a video game meetup. That actually sounds like it'd be really cool. I feel like it's a blast and I would start writing video games. It's a total blast.
Starting point is 00:37:36 I've met some really nice people there. Very cool. Time for tool of the show. All right. So I'm up first again, my tool is not so much about the tool, but just a shout out that like, I wish more companies kind of did this thing. So we talked about it a while ago. And I recently used this tool. And that is Google released an unlock tool for their online streaming gaming service stadia which shut down
Starting point is 00:38:07 and they had run sales where you could get a chromecast and a stadia controller which is what you needed to play stadia games and the stadia controller interestingly connects straight to wi-fi so you are wi-fi controlling your server box in the cloud that is running your game and then the results of the video would stream down to your chromecast and onto your tv and so you were not interacting you may be sitting in front of your tv but you could have been from another city you know playing your video game on the tv in your house or whatever and it wouldn't care because your controller just connects up to the internet via wi-fi um and then you know happens to control troll's device and there were some ui loops to you know if you had multiple to make sure that you were controlling the right
Starting point is 00:38:53 tv and that kind of thing but they kept running specials trying to get people to sign up for their service and i just wanted the chromecast so i ended up having a number of the stadia controllers laying around but there's nothing really do with them because they connect to bluetooth and the service shut down so they released uh on the radio chip that they had in there it also had bluetooth so they released a a website that you can plug your controller in follow the on-screen instructions and will basically install a new firmware that removes the wi-fi and adds bluetooth connectivity just use it as a bluetooth gamepad oh nice play on your steam deck i paired it to my steam deck and play it on my steam deck or my computer as well um lots of things have bluetooth these days so it's really easy to use as a game controller and i just thought they didn't need to do that like it it was custom
Starting point is 00:39:39 hardware for their thing but they found a way that was like easy enough for them and straightforward to kind of do this and it's if you if if you're like me you probably have a drawer full of things that correspond to services which are dead and not able to be used anymore and so it's not always possible i understand the complexities of it but it would be really cool to see this be like a thing that people try to do like at least give some functionality to devices when they reach their sort of end of service life yeah totally um yeah that's that's amazing yeah now i kind of wish i had bought some of these maybe you can get them on sale on ebay or something i don't know if
Starting point is 00:40:18 they went up in price or down in price after this i've not followed the ebay price trajectory all right my tool of the show is uh fuse and sshfs this is one of these kind of table stakes things where it's really good to have this in your toolbox i use this recently um i guess just a recap so fuse is a way for it stands for like file system in user space, something like that. I don't know. I'm not sure if I'm getting that totally right. But basically, it's a way for you to mount a file system without being the root user. And so, you know, if you're logging into a system at work, you probably don't have the
Starting point is 00:40:58 root account. And so you need to use something like Fuse. Even if you're doing something in your house where you do have root access it could be cumbersome to like have to sudo all the time and put in the root password and all that you might just want to mount a directory um just right there you know as a sub directory of your home directory um like imagine mounting your google drive or something like that you want that to just happen you don't want to have to put in your root password um so fuse lets you do that um so sshfs is interesting thing where um you know you can you can ssh into a machine and and you have now remote access you can run an editor do all that stuff um there's also something called SCP where it uses SSH,
Starting point is 00:41:47 but instead of giving you a pseudo terminal, it gives you a file. So you can take a file from that remote computer and put it on your computer. So you can do, like, scp foo.txt, my home directory, and it'll actually, or sorry, scp user@server:foo.txt, and so it'll actually go to that server, find the foo.txt file, bring it to your computer. that's okay but it could
Starting point is 00:42:17 be kind of cumbersome i was having a situation where i was creating files on a remote computer and i was wanting to look at them on my laptop and work pretty quickly i didn't want to have to keep copying them to the laptop and all of that also the files were kind of big and i only needed to look at the first part of them etc so um so i just set up sshfs uh. And so I mounted a directory on the target computer as a directory on my laptop. And I could just look at all of those files. I could read the first 100 kilobytes or 10 megabytes of the file without having to copy the whole thing. It worked well.
Starting point is 00:43:02 The challenge is if your laptop goes to sleep, just like any SSH connection, when it comes back, the file system's like broken. You have to unmount it or remount it. So, you know, it's not perfect, but it's extremely useful, especially at a pinch.
Starting point is 00:43:19 And it's one of these things that's almost ubiquitous because almost any machine you can SSsh into so you're only required to install things on your local machine i didn't know you were going to talk about this i didn't know this is i actually used this for the first time a couple days ago no way interestingly uh there is a windows version as well that will allow you to mount another linux computer over ssh to a drive letter in windows and i was on a windows computer and my i wanted to copy some files from my steam
Starting point is 00:43:54 deck memory card which is ext3 formatted so my windows computer couldn't see it well don't ask anyways long story anyways but the steam deck allows you to just enable SSH pretty easily. So I just did that. And this was the path. The path was to basically turn on SSH file system, have it on Windows show up as a mount. And then I could have used, I actually did later, end up using the SCP method you talked about.
Starting point is 00:44:20 It was a little cleaner. But mounting as a file system was also something that was eminently doable. And then other applications could have pointed at it. So it has this advantages. But yeah, so I did end up using, as you said, so many things have SSH on them, or you can SSH into, and if you can, and it has files, this is a good way of getting access to those files. Yeah, totally. to those files yeah totally um all right let's jump into transformers and large language models
Starting point is 00:44:50 um yeah i mean there's it's one of these things it's actually agoraphobic like there's so much content and so topical but but um we'll start with the basics, which is how neural networks store information. If you don't know what a neural network is, we actually had some AI, an AI two-part series. It's a bit dated, but it covers that in pretty good detail. But basically, you have these layers, and each layer you do a bunch of dot products. This almost becomes like a tensor product to produce the next layer. And so the weights, the things that you're multiplying by, you can change those as part of training this model. And by default, it goes from one layer to the next to the next.
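As a tiny illustration of the "each layer is a bunch of dot products" idea, here is a numpy sketch of a two-layer network; the sizes and weights are arbitrary, and training is what adjusts them.

```python
import numpy as np

rng = np.random.default_rng(0)

def layer(x, w, b):
    # One layer: a batch of dot products (a matrix multiply) plus a nonlinearity.
    return np.maximum(0.0, x @ w + b)        # ReLU activation

# A toy network: 4 inputs -> 8 hidden units -> 2 outputs.
w1, b1 = rng.normal(size=(4, 8)), np.zeros(8)
w2, b2 = rng.normal(size=(8, 2)), np.zeros(2)

x = rng.normal(size=(1, 4))                  # one example
hidden = layer(x, w1, b1)
output = hidden @ w2 + b2                    # compare this to a target to get a loss,
                                             # then nudge w1, b1, w2, b2 via backprop
```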
Starting point is 00:45:44 Think of it as like a DAG, right? And by default, you know, it goes from one layer to the next to the next is like this. Think of it as like a DAG, right? And it can split and it can rejoin and there's convolutions. But effectively, it's acyclic. It just goes in one direction and then it ends with some target. You compare that target to your expected value for that target and you use that difference to adjust all the weights um now you know people very quickly wanted a way to store information they wanted these neural networks to be stateful you know imagine if you are training a neural net to solve a maze. If it's the same maze every time, you could just train the neural net to
Starting point is 00:46:32 solve that maze and the weights of the neural net will just hold really specific information about that maze. Like, oh, when I see this intersection, turn left. But if you wanted to train a generic neural network to solve any maze and to to you know solve it over time it has to keep a representation of the maze in the neural network right and so it's it's a memory that is sort of online that's independent of training um so this is this is the goal and um there's a lot of different ways to do it. The kind of obvious thing would be, well, make it cyclical. Like take some of the weights and instead of making it an acyclic graph of operations that just ends in this point, make some of those weights just point backwards.
Starting point is 00:47:22 And so you execute the network. And when you execute it, some of the data is kind of left over right um and so they call us a recurrent neural network um the problem with this is they're incredibly hard to train and to learn anything meaningful um and there's a ton of reasons for this. But the biggest one is this problem called the vanishing gradient problem. And so the idea is basically if you multiply a lot of numbers less than one, you very quickly get zero, right's that's basically the gist of the vanishing gradient problem there's more complexity than that but um you know if you multiply a bunch of numbers together that are bigger than one then you quickly go to some huge number right that approaches
Starting point is 00:48:16 infinity and so that's not going to happen because you have something called regularization so that's not an issue. But the other thing where you multiply numbers smaller than one over and over again, and you get zero, that happens. And so you kind of have this sort of dilemma. It's like either everything goes to infinity or everything goes to zero. Either way, it's not really very usable um so someone came up with lstms, long short-term memory, and they basically said let's have one process that's going to infinity and let's have another process that's going to zero and then add them together and hope that like the two problems cancel each other out and uh um again it's one of these things that yes in theory you have these these long-term
Starting point is 00:49:08 you know gradients these short-term gradients and and the problems of both of them cancel each other out but in practice it just is really hard to get it to do things um throughout my career tons of people have tried to do lstms for all sorts of practical things that all these companies have worked at and it's never worked um um there's a time at facebook uh someone came up to me and said hey i have this idea um i think we'll do an lstm to predict uh um you know effects over time of people you know watching things on facebook i was like forget it i'm not interested it's not gonna work like i've just seen it fail too many times and it's it's like um you know they call this like a tar pit idea because um um you know
Starting point is 00:49:59 you don't really realize you're stuck and then and and it seems, it doesn't seem like it's a problem, maybe a quicksand idea. It's like, it doesn't seem like there's anything bad about that idea. Once you get into it, it just, it just sucks up all of your time. You don't get anything. Um, so LSTMs, you know, not much success. Um, but then something interesting kind of happened. Differentiable algebra kind of took off. So what that means is you used to, and Patrick, you might have done this like electrical engineering where they have you kind of derive all of the the updates for a neural network so they they show you like you know here's how you calculate like the the derivative of of um you know this type of
Starting point is 00:50:54 activation layer and you get these like exact numbers and you know exactly like you know if my answer was four and the answer should have been five then this weight in this neural network needs to go up by exactly like 1.2 times the learning rate or something and so you have these like very specific formulas and as long as you follow the formulas you'll eventually you know get to the right place um the problem is the formulas only work in certain circumstances so you couldn't for example say um you couldn't for example say like take the maximum of these three values because now like you can't differentiate the max function like there's no derivative of the hard max function
Starting point is 00:51:46 right and so what came what came out you know some what got popular in like around 2015 ish was this idea of numerical differentiation it's like instead of trying to come up with the derivative of all of these functions, let's numerically differentiate all of them. And so now you can actually have a gradient of the hard max function or any function. It's just a numerical gradient, a numerical approximation of a gradient. And so what that means is now you can write pretty much any code, virtually any code, and backpropagate through it. So you're not going to change that hardmax function. It is what it is. But you'll be able to differentiate through it,
Starting point is 00:52:43 and things that happen before and after it that you can change will start changing. So for example, let's say I have three neural networks and then I take the max of the outputs of those three neural networks and then I have a fourth neural network. All four of those networks can be updating and learning even though they have this function in the middle that's not differentiable so all of that leads into this concept called attention layers so an attention layer is you know a set of algebra that you apply um on three things one is your query which is what are you interested in right now your keys which is how does the thing you're interested in now relate
Starting point is 00:53:40 to the other things in your list and your values which are the other things in your list and your values, which are the other things in the list. So for example, you might say the cat jumped over the moon. Those are my values. My query is going to be, let's say, dog. And then my keys are going to be what the relationship that I think the other words have to dog. So like cat and dog probably have like a pretty close relationship, even though we joke about cats and dogs, but they have a close relationship because they're both animals, right? But you know, dog and the probably don't have a good relationship because the is just related to everything and it just washes out, right? So given your query,
Starting point is 00:54:31 your relationship of that query with each of these items and something that describes each of these items, you can then create a total amount of attention so you get um so you can say like you know dog has this much you know is capturing this much energy from that sentence um so that gets into self-attention which is just a fancy way of saying given for example a sentence take every single word in the sentence and find out how much attention the sentence offers each of those words so if you say you know the cat jumped over the moon for each of those words you know for the cat jumped you know how how much attention am i getting um from the
Starting point is 00:55:27 other words in that sentence and for that jason so if if you're saying like cat related to dog i imagine you can look across a kind of training corpus and kind of like say how often do they appear together you know appear next to other words but for self-attention, how do you figure out that like what's holding the weight in a sentence? Yeah, so the way this works is the keys, like the weight between two words, that you're going to learn over time. So in the beginning, it's going to be just random. But then when you calculate the attention, then you take all of... And these are actually stacked on top of each other.
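Here is a compact numpy sketch of the query/key/value algebra being described, standard scaled dot-product self-attention, with randomly initialized projection matrices standing in for the keys you learn over time; all sizes and data are made up.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def self_attention(tokens, wq, wk, wv):
    """tokens: (seq_len, d) word embeddings, e.g. for "the cat jumped over the moon".
    wq, wk, wv are learned projections, random at the start of training."""
    q = tokens @ wq                          # what each word is looking for
    k = tokens @ wk                          # how each word advertises itself to the others
    v = tokens @ wv                          # the content each word offers
    scores = q @ k.T / np.sqrt(k.shape[-1])  # how strongly each word attends to each other word
    return softmax(scores) @ v               # every output is a weighted blend of the values

rng = np.random.default_rng(0)
d = 16
sentence = rng.normal(size=(6, d))           # 6 stand-in token embeddings
wq, wk, wv = (rng.normal(size=(d, d)) for _ in range(3))
contextualized = self_attention(sentence, wq, wk, wv)   # (6, d): one mixed vector per word
```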
Starting point is 00:56:18 So you take these stacked attention layers. So you're learning keys, you're doing the attention algebra, and then you're learning a new set of keys, and you're doing another attention algebra step. And then at the end, all of it is hopefully in service to some task. So by default, most of these models are what's called self-supervised models or forward models. So what that means is they're trying to predict the next word in the sentence. So we'll just walk through the very first training step. So for every token you have an embedding that you train somewhere
Starting point is 00:57:09 else so there's some other process that takes a word and turns it into a vector of numbers um you know that could be even all part of the same thing but usually it's broken up into two systems so so now what i have is for each word i have a vector of numbers and so i pass in let's say "the cat jumped over" i pass in that matrix right so it's a set of vectors of numbers and um in the beginning it's going to say well they're they're all just randomly interesting to each other. So we get a bunch of random numbers out of that, calculate the attention, do this a bunch of times. And then that's the encoding step of the transformer. So what comes out of that whole process we just talked about is a single vector. And this is true if you're doing Llama, ChatGPT, all of these things. After all these attention layers, what comes out is just a single vector of numbers that describes the entire context of what you've seen so far
Starting point is 00:58:02 And this is true if you're doing Lama, ChatGPT, all of these things. After all these attention layers, what comes out is a just single vector of numbers that describes the entire context of what you've seen so far um now you have to do something with that embedding and so what you usually do is you say okay i want to predict the next word so i'm going to take this embedding i'm going to create a decoder i'm going to create a function that takes the embedding as an input and outputs which word I think should go next. And so in the beginning, it's going to be totally random. Even the decoder is going to be random and it's going to output whatever, you know, foobar. It's literally going to pick a random word to output next.
Starting point is 00:59:03 Then it looks at the actual word. I think I said "the cat jumped over," so the next word is going to be "the." So it says, oh, you got it wrong; it wasn't "foobar," it was "the." So what needs to change so that next time I ask you that question, you say "the"? That "what needs to change" is the loss, and that loss gets propagated all the way back. It changes all of the attentions among all the pairs of tokens in every attention layer; it changes the entire encoder. Everything changes a little bit, and this process then repeats a zillion times. A zillion, yeah.
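Here is a deliberately tiny sketch of that first training step in PyTorch. The "encoder" is a single linear layer standing in for the whole attention stack and the vocabulary is six words, but the shape of the loop is the same: embed the context, pool it into one vector, decode it into scores over the vocabulary, compare against the real next word, and let the loss nudge every weight.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

vocab = {"the": 0, "cat": 1, "jumped": 2, "over": 3, "moon": 4, "foobar": 5}

class TinyForwardModel(nn.Module):
    def __init__(self, vocab_size, dim=16):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, dim)  # token id -> vector
        self.encode = nn.Linear(dim, dim)           # stand-in for the stacked attention layers
        self.decode = nn.Linear(dim, vocab_size)    # context vector -> a score for every word

    def forward(self, token_ids):
        vectors = self.embed(token_ids)                      # (seq_len, dim)
        context = torch.tanh(self.encode(vectors)).mean(0)   # pool into a single context vector
        return self.decode(context)                          # logits over the whole vocabulary

model = TinyForwardModel(len(vocab))
opt = torch.optim.Adam(model.parameters(), lr=1e-2)

context_ids = torch.tensor([vocab[w] for w in ["the", "cat", "jumped", "over"]])
target = torch.tensor([vocab["the"]])  # the word that actually came next

logits = model(context_ids)            # untrained, so "foobar" is as likely as anything else
loss = F.cross_entropy(logits.unsqueeze(0), target)
loss.backward()                        # the loss flows back through decoder, encoder, and embeddings
opt.step()                             # everything changes a little; repeat a zillion times
```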
Starting point is 00:59:57 And so the "the" you're trying to get it to say comes from some training; you're just feeding in text, books, whatever, and learning, hey, try to guess the next piece? Yeah, that's right. There's been a bunch of work on this: sometimes they try to predict sentence fragments, sometimes single words, and they run it each time for a different word. But you're right, people are going through all of Wikipedia (you can download all of Wikipedia), and there's something called Common Crawl, which has gigabytes and gigabytes of text from the internet. These systems go through all of it, taking the first n words and trying to predict the next one.
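The data preparation behind that is essentially a sliding window over raw text, something like this sketch (ignoring tokenization and all the real-world cleanup):

```python
def next_word_examples(text, n=4):
    """Yield (first n words, the word that follows) pairs from a chunk of raw text."""
    words = text.split()
    for i in range(len(words) - n):
        yield words[i:i + n], words[i + n]

corpus = "the cow jumped over the moon and the dish ran away with the spoon"
for context, target in next_word_examples(corpus):
    print(context, "->", target)  # each pair becomes one training example
```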
Starting point is 01:00:46 And this is happening at massive scale. It's ingesting huge volumes of text and trying to predict the next thing; this also works on images and video and all that, but here it's text. It's kind of remarkable that it works at all, but there's a lot more complexity if you dive a few layers deeper. For example, the sentence needs to make sense, and if you're just predicting one word at a time, the system might paint itself into a corner.
Starting point is 01:01:39 It might realize, oh, I output this word, but now that I've gone three or four words in, I made a mistake three words ago. We even do this as humans when we're typing, right? So there's a fix for that: it used to be beam search, and now they're doing something else, but basically there's a way where you look ahead a bit and, based on that, go back and make some changes. It almost becomes a search system in the decoder, doing something like a best-first search.
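A minimal beam-search sketch, with `next_word_probs` standing in for whatever the model's decoder returns: instead of committing to the single best word at each step, you keep a handful of candidate continuations alive and only pick the best complete one at the end.

```python
import math

def beam_search(next_word_probs, start, beam_width=3, steps=5):
    """next_word_probs(sequence) should return {word: probability} for the next word."""
    beams = [(0.0, list(start))]  # (log-probability so far, partial sentence)
    for _ in range(steps):
        candidates = []
        for score, seq in beams:
            for word, p in next_word_probs(seq).items():
                candidates.append((score + math.log(p), seq + [word]))
        # keep only the best few partial sentences instead of greedily committing to one word
        beams = sorted(candidates, key=lambda c: c[0], reverse=True)[:beam_width]
    return beams[0][1]  # the highest-scoring continuation found
```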
Starting point is 01:02:28 Yeah, so if you ever saw one of those hidden Markov model things, where you'd feed all the Harry Potter books to a hidden Markov model and ask it for text, or on your phone keyboard, if you just keep tapping the word suggestions, you get what I would call sentences, but they don't go anywhere. You're just inserting plausible next words one after another, so you end up with something akin to sentence structure, but there's no story or progression or statement, just words that happen to appear close to each other. Yep. And so by default, if you put in... wait, is it the cat or the cow that jumped over the moon? I've been saying cat the whole time. But if you put in "the cow jumped over the," it's going to say "moon" with super high accuracy, because it's seen that a bunch of times on the internet.
Starting point is 01:03:17 So that works pretty well. And a lot of these systems, when you ask a question to ChatGPT for example, are actually embedding your question in a prompt. Because otherwise, if you took one of these naive forward models, typed a question, and said "generate some text," it would probably generate more questions: whatever people who write that kind of question tend to write next. Sometimes it'll generate an answer, sometimes more questions. So typically, and it might not be obvious from the user interface, what usually happens is it will put something like "Question:", then your question, then a new paragraph, then "Answer:", and the model starts generating tokens from there.
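The wrapping itself can be as simple as string formatting. The exact template below is made up (real products use more elaborate hidden scaffolding), but it shows the idea:

```python
def wrap_prompt(user_question: str) -> str:
    # hypothetical template; the user only ever sees their own question
    return f"Question: {user_question}\n\nAnswer:"

prompt = wrap_prompt("Why did the cow jump over the moon?")
# the forward model then just keeps predicting tokens after "Answer:", which reads as an answer
```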
Starting point is 01:04:13 You didn't see the "Question" and "Answer" labels, but they're there, and they tell the model: this is the broader context of what's going on. So this works pretty well. The thing is, we want to be able to make improvements, and it's such a huge corpus of data that you can't do the normal machine learning thing and say, oh, I'll just change my dataset to improve the model, because the dataset is basically the entire internet. You can't really do a whole lot with that. So there needs to be some way to fine-tune things after the fact.
Starting point is 01:05:01 And once you have this forward model, the first attempt at this, which is what OpenAI did with their initial models, was called RLHF: reinforcement learning from human feedback. Basically, they would have ChatGPT (this was before it was released or anything) generate four or five different answers. They would give those answers to people, the people would score the answers, and then the ChatGPT model would optimize for that score and try to get the highest score possible. The problem with that is that numeric scores only make sense when there's a real unit of measure. Is a score of four really twice as good as a score of two?
Starting point is 01:06:06 It's very hard to get people to think in such linear, proportional ways. So the human raters were basically scoring everything either a five or a one, with very few twos, threes, and fours. And even when they did, it wasn't very proportional; a five wasn't five times better than a one, et cetera. So what's become more popular now is what's called direct preference optimization, DPO, which is basically pairwise ranking. You have the system generate two answers, you give them to a person, and the person decides which answer is better. Then the system is encouraged to give the better answer and discouraged from giving the weaker answer, directly.
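The core of that pairwise objective, as described in the DPO paper, fits in a few lines. This sketch assumes you already have the summed log-probabilities each model assigns to the preferred and rejected answers; beta is a temperature-like knob.

```python
import numpy as np

def dpo_loss(logp_better, logp_worse, ref_logp_better, ref_logp_worse, beta=0.1):
    """Pairwise preference loss: push the fine-tuned model to favor the preferred answer
    relative to a frozen reference model, rather than chase an absolute numeric score."""
    margin = beta * ((logp_better - ref_logp_better) - (logp_worse - ref_logp_worse))
    return -np.log(1.0 / (1.0 + np.exp(-margin)))  # -log(sigmoid(margin))

# toy numbers: the current model slightly prefers the weaker answer, so the loss is noticeable
print(dpo_loss(logp_better=-42.0, logp_worse=-40.0, ref_logp_better=-41.0, ref_logp_worse=-41.0))
```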
Starting point is 01:07:01 There's a paper on this that came out, I think, only six months ago, maybe a year ago. But as much as I love reinforcement learning (that was my PhD), I also was pretty skeptical of the RLHF approach. I kind of felt it was a poor use of reinforcement learning, so I was not very surprised when DPO came out. And you can imagine this is how a lot of these things get ironed out: ambiguities in sentences, things that require a lot of common-sense reasoning, those are ironed out through DPO. So this works: they do the initial training you described and then they go into this fine-tuning step. But the answers, I guess they could become part of a future training corpus, but they aren't trained on in the same way, because if you have it generate an answer
Starting point is 01:07:54 from its current weights and then try to tweak them, that's different from showing it answers from another version, ChatGPT 1 or whatever, and saying, hey, here are the responses it got. How does that work? Do you just accumulate all of them over time? Yeah, it's a good question. Let me see if I understand: in the beginning you're just trying to predict the next token, and there's no preference. Then later on you're given a preference, but for an entire answer. And so there is somewhat of a credit assignment problem there.
Starting point is 01:08:33 It's like, which token caused that answer to be good or bad? But again, with enough data, it works itself out. You're right, though, the loss is really different, and the way the network changes in that second phase is very different. So you can't really go back once you've started this preference approach. I mean, you could, there's nothing to stop you, but the two are kind of disrupting each other, so you just disrupt the first with the second and then you get what you want. So if they did, say, 10,000 rankings, and I don't know how many they actually do, at the end that's not necessarily reusable. Like, if
Starting point is 01:09:10 they had a new update to the corpus or whatever, they'd have the weights, they'd fine-tune and roll them forward, but they would go do another 10,000 rankings; they wouldn't just replay the old 10,000. Oh, now I see what you're saying. This is really interesting. Okay, so you're saying: I have ChatGPT 6, and it generates a totally different answer than either of the ones I sent to the human last time. Yeah, this gets complicated, but there is something called importance weighting, and the gist of it is this: when a token is generated, it's generated from a distribution. So it's not like the neural network runs and at the end it says "cow."
Starting point is 01:09:59 It actually runs and outputs a score over the entire space of words it could output, and it just happens that "cow" had the highest value. But you can normalize that output vector, and now what you get is a probability mass function over all the words. So there is, for example, a chance that ChatGPT 6 and ChatGPT 5, or ChatGPT 5 with two different random seeds, will both generate exactly the same answer. It might be a low chance, but it's there. In fact, there is a chance that it will generate anything, any sequence of words.
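Concretely, the raw output is a vector of scores (logits), and a softmax turns it into that probability mass function: every word gets a nonzero probability, even the ones the model would almost never pick.

```python
import numpy as np

def to_probabilities(logits):
    logits = logits - logits.max()  # subtract the max for numerical stability
    e = np.exp(logits)
    return e / e.sum()              # nonnegative and sums to 1: a probability mass function

vocab = ["the", "cat", "cow", "jumped", "moon"]
logits = np.array([1.0, 0.3, 3.5, 0.2, 0.4])  # made-up scores from the network
probs = to_probabilities(logits)
print(vocab[int(np.argmax(probs))])           # "cow" wins...
print(dict(zip(vocab, probs.round(3))))       # ...but every word keeps some probability mass
```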
Starting point is 01:10:53 And because that chance is non-zero, you can do something called importance weighting, where you say: okay, the new ChatGPT, although it generated nothing like the two answers I sent to a human last time, still has a chance of generating those two answers, and it should be more likely to generate the better one, even though it really didn't want to generate either of them. Oh, I see, so it doesn't pick the final answer; it goes and looks at the distribution and asks, was your distribution closer to the better answer or the worse one? And then you can somewhat reuse the old rankings. That is interesting. Yeah, exactly. Now, the thing about importance weighting is that it has two problems. The easier problem is that those
Starting point is 01:11:41 numbers are going to be small. The chance of generating any particular sequence of that complexity is tiny, and you're dividing small numbers by each other, so things get a little crazy numerically. The other problem is, oh yeah, you might have to move a lot. Or rather, it might be that the new answer is really orthogonal to both of those two answers, almost equidistant from them. Yeah, exactly. And in that case, it doesn't really matter which answer you pick.
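A sketch of the importance weight itself, assuming you logged the per-token log-probabilities when the old model generated each answer and can score the same tokens under the new model. Working in log space sidesteps the tiny-probability problem just mentioned.

```python
import numpy as np

def importance_weight(new_token_logprobs, old_token_logprobs):
    """How much more (or less) likely the new model is to emit this exact answer than the
    old model that actually generated it. A sequence log-prob is the sum of its token log-probs."""
    return float(np.exp(np.sum(new_token_logprobs) - np.sum(old_token_logprobs)))

# toy per-token log-probs for one human-preferred answer
w = importance_weight(np.array([-2.1, -0.4, -3.0]), np.array([-2.5, -0.6, -2.8]))
print(w)  # greater than 1 means the new model likes this answer more than the old one did
```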
Starting point is 01:12:16 So it's not perfect, but you still get a lot of value out of it. It's not like that old work was wasted. Oh, that's cool. Because I guess when you're moving up versions, you could ask it a question, and in one case it generates two different diatribes about cows and moons, but in the next one it generates a picture of a cow jumping over the moon, and that's actually better; but now you're stuck with the problem of comparing a picture against two sentences. Yeah, makes sense. But the data moat is real. All this data, and all the human effort that's gone into labeling
Starting point is 01:12:52 and ranking all of that data, is just permanently valuable. So it's a huge boon for a lot of the companies that started early. So the "large" in large language model just comes from the fact that the parameter count has gotten so numerous, and it happens that these large language models today all have a very similar architecture, the one you're describing; it's not so much that you couldn't do it a different way, right? Or a different, better way. Yeah, exactly. So: the recurrent neural network you couldn't make large because the gradients would vanish, and the LSTM you couldn't make large because it sits at an unstable equilibrium, where the gradients would either vanish again or explode.
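The intuition behind those failure modes is just repeated multiplication: backpropagating through many recurrent steps multiplies roughly that many copies of the same factor, so anything slightly below one vanishes and anything slightly above one explodes.

```python
# toy illustration: the gradient after 200 time steps of backprop through a recurrence
for factor in (0.9, 1.1):
    gradient = 1.0
    for _ in range(200):
        gradient *= factor
    print(factor, gradient)  # 0.9 -> ~7e-10 (vanishes), 1.1 -> ~1.9e+08 (explodes)
```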
Starting point is 01:13:41 So you couldn't scale those. But with the attention layers, in the beginning a lot of people, myself included, were a little skeptical of the attention approach, because it felt like a strange compromise between convolution, where you have a relatively small mask that can rove like a roving eye through an image, and an LSTM, which in theory has an infinite horizon. In theory you could feed a near-infinite amount of data, not literally infinite, but an extremely large amount, to an LSTM.
Starting point is 01:14:34 It could remember all of it; there's no limit. So attention was strange in that it had a lot of the features of long-term and short-term memory, but seemingly not the benefits. But the fact that it trained stably turned out to be incredibly useful. It's one of those things that's hard to know, because in theory everything is stable, right? It's hard to know in practice what something will do when you have seven billion
Starting point is 01:15:05 parameters' worth of it, right, and it held up really well at that scale. Yeah, that's really interesting. So when you grow really big, it was hard to predict what would happen, but it turned out to have the attributes that end up working at that scale, and the compromise paid off. But by the same token, none of this means it's the optimal or correct answer. It's just the best we know today. So if we found a new LSTM, I mean,
Starting point is 01:15:36 or maybe not an LSTM exactly, but a new way of solving these problems, or another piece to add, it could do better. We just don't know that today.
Starting point is 01:15:43 So someone would have to figure it out. Yep. Yep. Um, now here's where it gets really tricky is, um, you know, all of this is, um, very direct. Like, you know, you have this direct policy optimization, you know, the RLHF was technically reinforcement learning but um it was it was you know a very direct approach and so a lot of these systems you know they can't actually do things like they can't they can't get rewarded or punished really um
Starting point is 01:16:20 and if they do, it's in this really brute way, where you could tell ChatGPT, go invest in stocks, and then, oh, you're bankrupt, that was bad. It can't really reason; it can't build a world model that it can then use to reason and make decisions. So I think the future of this is some really interesting work coming out of Facebook called JEPA, the Joint Embedding Predictive Architecture. It's basically a way to combine a lot of these embedding approaches, whether it's a large language model, a large image model, or maybe a large actuator or motor-response module (it doesn't necessarily have to be language), with decision making. I think that's going to be super exciting, but it's going to take a while for that research to mature. Yeah, so I guess, to your point, we were talking
Starting point is 01:17:32 earlier about picking a racing line and driving your car toward it. You could tell ChatGPT or one of these large language models what you want, and it could maybe generate your code or something, but it can't actually go play the game itself. It doesn't have the hooks, it doesn't have the inputs to go do it; there's no module for that, and the architecture isn't really set up to be a Mario Kart AI agent. Right. And people joke about telling ChatGPT "your answer is terrible" and it generates another one, and that's funny, but it's not really optimizing for some objective. It will pivot, but even that pivot is kind of artificial; it's not directionally heading toward some place of greater value. And so you
Starting point is 01:18:27 know, yeah, you can't plug ChatGPT straight into Mario Kart and, over time, get it to produce Python code that earns a higher and higher score. There's no easy way to do that yet. Yeah. I always find that funny. I guess anthropomorphizing isn't the right word, but you're right: people fuss at ChatGPT, like, I'm going to unplug you if you do one more thing, but it's all just adding to the context. Like you said, it's not actually evolving forward. You could have written all of that out yourself, claimed it told you things it never told you, put it all in the prompt, and it would do the same thing.
Starting point is 01:19:08 Right. In other words, if someone comes back to you and says, Jason, you're bad because you did this thing and you didn't do that thing, you're going to say, what? No. You're just going to ignore it. But with ChatGPT, if it gives you an output and you start a new session and copy what it told you, and what your response was, into the very beginning, it's the same. It's not actually remembering and evolving in that way; it's just building this ongoing context that kind of feels similar. Right. And all of this, even what's coming in the future, is not going to kill coding. We should probably spend the last five minutes of the show talking about this. If you are a follower of this show, maybe you are a working professional, or not yet:
Starting point is 01:19:50 you're in college and you're thinking, oh, if I major in computer science, there won't be a job for me because ChatGPT will have my job. That is not going to happen. I think coding is just a way of solving problems at the end of the day. The medium might change, but the need to solve hard problems is not going away anytime
Starting point is 01:20:14 soon. And if anything, ChatGPT will actually automate everyone else's jobs first. People who are solving easier problems, those are the people who really should be worried. If you're listening to this show, going to college, or learning to be a programmer on your own, you're doing it because you want to solve hard problems, and doing really mentally difficult things is going to be one of the last things to get automated. Yeah. People have talked about this for a long time and the accuracy hasn't been there, but the one I always think about is the X-ray tech. You go in with a broken arm and you get your X-ray, and the
Starting point is 01:20:54 job of the person who reads the X-ray (I don't know the right word for it) is to look at the notes from the doctor about what they think is wrong, look at the X-ray, check for problems, and then write something up. But if you gave it to 100 doctors, they wouldn't all say the same thing, even though there is a correct reading of the chart and hopefully most of them would give the same answer. This felt very difficult before: you would just give it to something that would use, like Jason was mentioning, a convolutional neural network, and try to highlight where the fracture is.
Starting point is 01:21:33 it would you know kind of understand what it's attempting to do and you're right those things are problem solving but not really in the same way that like like you said, hey, I want to build a game where you're a go-kart racing turtle and throw shells at each other. And that kind of problem solving is a fundamentally different approach. Right, right. And also, you know, when you're building anything, this is true of anything you're building, software or anything. There's really two things that you're constantly adapting to. One is product market fit. Are people enjoying my game? Who is enjoying the game?
Starting point is 01:22:17 And the second one is quality of life. So maybe my game is actually fun, but there are too many buttons in the menu and people aren't even getting past the main menu, right? So you have to constantly adapt to all of these changes, and you have to decide which parts should be flexible and which parts of the code can be inflexible and written quickly. These are all trade-offs that ChatGPT, or any AI, is not going to make very effectively. And who's to say what's going to happen way out in the distant future, but I would say, if you're listening to this show, if you're interested in this topic, your job is extremely safe. I think that if you
Starting point is 01:23:08 said, oh, I'm going to pivot to accounting, well, that's probably not wise. No, I mean, no insult to accountants; I'm just saying you are in a very safe profession. This idea that coding is dead or will be automated is absurd, so don't worry about it. Yeah, on any normal time frame. I think there's a caveat there, like who knows what happens in a thousand years. Yeah, exactly. But again, I think your job will be one of the last ones to go, and by then... I saw this crazy stat that something like 99% of job titles didn't exist 100 years ago. Oh, interesting. Yeah. I mean, take as an example someone who's an actor or something, and I know there's a bunch of politics and fighting around that, but
Starting point is 01:24:01 taking an actor, their voice, their body image, the way they move: this is very easy to train on, and then you can get them to do new things and become an AI agent of some sort. They call those people "talent," quote unquote, but their talent is something very specific, mostly a function of how they look and how they sound. Even the things they do on camera are all scripted, right, by the writers and the producers and all of that. So yeah, I think you're right. Without saying when, or if, saying it's one of the last to go is a reassuring fact. Yeah, I mean, you would be late enough that you would see the writing on the wall, and you would pivot to one of the 99
Starting point is 01:24:45 percent of jobs that are coming out in the next hundred years. I've seen The Matrix; your job is to become a heater. Eat food, walk around in the Matrix, and provide warmth for the robot army. So good. All right, folks, I think we'll put a wrap on that. If you have any questions about LLMs, join our Discord. Discord is one of the few apps I actually have notifications turned on for, so when people post in Discord I see it right away. Join our Discord, and support us on Patreon; we really love and thank all of our supporters. We're putting all that money back into this show, trying to get more people, kids and adults, into programming. And we will catch everybody next show.
Starting point is 01:25:29 Thanks, everyone. Music by Eric Barndollar. Programming Throwdown is distributed under a Creative Commons Attribution-ShareAlike 2.0 license. You're free to share, copy, distribute, and transmit the work, and to remix and adapt the work, but you must provide attribution to Patrick and me, and share alike in kind.
