Programming Throwdown - 172: Transformers and Large Language Models
Episode Date: March 11, 2024

Intro topic: Is WFH actually WFC?

News/Links:
- Falsehoods Junior Developers Believe about Becoming Senior
  https://vadimkravcenko.com/shorts/falsehoods-junior-developers-believe-about-becoming-senior/
- Pure Pursuit
  Tutorial with Python code: https://wiki.purduesigbots.com/software/control-algorithms/basic-pure-pursuit
  Video example: https://www.youtube.com/watch?v=qYR7mmcwT2w
- PID Without a PhD
  https://www.wescottdesign.com/articles/pid/pidWithoutAPhd.pdf
- Google releases Gemma
  https://blog.google/technology/developers/gemma-open-models/

Book of the Show
- Patrick: The Eye of the World by Robert Jordan (Wheel of Time)
  https://amzn.to/3uEhg6v
- Jason: How to Make a Video Game All By Yourself
  https://amzn.to/3UZtP7b

Patreon Plug: https://www.patreon.com/programmingthrowdown?ty=h

Tool of the Show
- Patrick: Stadia Controller Wifi to Bluetooth Unlock
  https://stadia.google.com/controller/index_en_US.html
- Jason: FUSE and SSHFS
  https://www.digitalocean.com/community/tutorials/how-to-use-sshfs-to-mount-remote-file-systems-over-ssh

Topic: Transformers and Large Language Models
- How neural networks store information
- Latent variables
- Transformers
  - Encoders & Decoders
  - Attention Layers
- History
  - RNN
    - Vanishing Gradient Problem
  - LSTM
    - Short term (gradient explodes), Long term (gradient vanishes)
- Differentiable algebra
- Key-Query-Value
- Self Attention
- Self-Supervised Learning & Forward Models
- Human Feedback
  - Reinforcement Learning from Human Feedback
  - Direct Policy Optimization (Pairwise Ranking)

★ Support this podcast on Patreon ★
Transcript
Programming Throwdown, episode 172: Transformers and Large Language Models. Take it away, Jason. Hey, everybody. I had a really interesting discussion on LinkedIn.
This is like a meta meta thing,
but I post everything on LinkedIn and Twitter
and my Twitter, nobody follows my Twitter.
And the weird thing is... my X, yeah.
Maybe it's going to be my ex-social network.
I just can't get anybody to follow me on there.
Everyone follows me on LinkedIn,
which is fine.
But I even tried,
you know,
putting my Twitter link on presentations I give and stuff like that.
And,
and people would rather just find me on LinkedIn.
So I find this amazing that you give presentations where people actually
could join and follow you
I always see people do that, but I've never done it myself, just not on social media. But I
also find it interesting that you have presentations where that would even be an opportunity i gave a
amazing presentation to suny purchase which is a university in new york um and it was on the ai
singularity and uh um it was it was a lot of fun um had a great time there or i wasn't like
i wasn't in person but had a great time speaking oh nice now i kind of want you to give it here
i actually would love i asked if if they would make the video public they said no
it was actually technically one of the lessons like it was part of a course
and so for that reason they can't just post it on the
internet but i was really happy with it the questions were extremely interesting the students
were very engaging and um um yeah it's a shame that we can't just share it with everybody but
the thing that really kind of took off on linked sometime over the weekend, was I posed this question,
is work from home really work from city?
And what I meant by that is how many people
during the pandemic really moved to a different city
and they're not really interested in working from home.
That's not the spirit of what transpired it's actually that
they want to work from another place and uh so maybe i'll pose a question to you patrick so if
you're if your office was there in your city would you go to it why and why not
oh this is interesting uh i think i i see your point and i have seen the statistics although
i'm not sure how trustworthy they are i i don't know the questioning to be honest but that not
everyone to your point is actually happy about work from home personally um you know relocating
to a different city for for a plethora of reasons to your point
it's not necessarily that it was only the office it was also the geographic location as much as we
are moving to online people you know having family and you know just a laundry list of different
personal reasons for myself but for others as well there can be specific cities you want to
live in or don't that being said i the question so it's hard because the
place i chose to live is different than where i would potentially have lived if i had tried to
locate next to an office that is where an office most likely would be near me would be far and
therefore i would not want to go to it oh i see but if there had been an office and i was living
close to it, it's not that I mind going into an office occasionally. I will flip it around, though, and say that by not commuting, I personally have found ways of,
I don't want to say exploiting, that sounds negative. I have found ways of parlaying that
time that I would normally have spent in the car into other industrious activities. So things like,
you know, being more active and exercising and going out or meeting with people sort of early
in the morning because I'm on the East Coast and work with people on the West Coast.
And so going and meeting with people I otherwise wouldn't have because I'm working late,
basically meeting them in the morning.
So replacing commute time by having a time shifted schedule and finding these other things
for me means work from home is really about work from home.
But I do see your point there is a gradient between office for your full work week hybrid work from city like down the down the list i think
there's sort of like a a spectrum and i think it's it's not a it's not a binary choice yeah that
makes sense you know one of the reasons, so I was working in Austin.
So like when I moved to Austin, I did work in an office here.
And then recently I switched to working from home, which wasn't a choice.
It was part of just things that transpired at my company.
And so now all the Austin folks are remote.
There's definitely pros and cons.
I'm a bit ambivalent to it but what i really want to do
is sort of blunt this argument that um i feel like when they create this dialectic of you have
you have people in the office and people at home then it's easy for like i think elon musk is one
of these people who says oh people working from home are lazy you know they just want to sit eat doritos all day whatever um and so i feel like you know the
entire premise of that argument is false in the sense like like i would i wonder if the majority
of people actually you know they're not working from home to work from home they're working from home to
live somewhere else and so then you know the whole uh this whole stereotype of like oh this
person doesn't want to get out of bed is really not not even true on the premise i think yeah i i
feel like it's one of those cases where it's there's no universal truth
i mean i could imagine a you know theoretical person that you know this is the classic goes
and spends all their time at the water cooler right and actually causes a distraction of other
other employees trying to work and you know basically finds ways of not being at their
desk and not working because they're not busy and they're trying to cover for it throughout my career i've always known people that that have you know been
more or less like that and that for that person going home if they're doing that specifically to
avoid the appearance of you're not being busy then going from his is a is a you know a revelation to
them because they could just do whatever they want and no one knows
if they're working or not and then you get all the like you know mouse tracking software that
you see companies roll out or whatever to combat this so i do think there are people like that but
like you said i i don't know that that's universally true that just because you work from home means
you're like eating cheetos and don't have any pants on. Yeah.
All right.
Well, we'll see where that goes.
But there's a lively discussion.
If you want to follow me on LinkedIn,
where people actually respond
to my inane questions,
you can follow me on LinkedIn
and see a really interesting discussion unfolding.
One thing is, you know, and I knew this would happen, so I tried to avoid it, but you can't completely avoid it. You know, some people took it as a, like, attack on, you know,
the Bay Area. And that wasn't really the point. You know, and I tried to make a point of saying,
look, there's just as many people who would be running the other direction if the tech companies weren't already there. You know, like if you were to flip the script, there'd be just as many people in the Bay Area
working from home while their hq is in florida saying like oh you know it'd be the same thing
right so it's not it's not about which place is better or any of that it's just about why are
people um leaving and for for a subtly different question i mean do you have
a like if a company wanted to offer work from city as a flexibility like is that something you
imagine i mean we work is now i think going towards bankruptcy i think is last i saw but i
mean i i mean what what is your like do you have a thought towards like what that ends up looking
like yeah i mean i i loved it so we had a we work um basically the the short story here is i was part of a startup
as you know a startup has relatively few rules i mean obviously it's still a corporation and
everything but it's a small company and so we had a we work uh startup was acquired by a much
bigger company and the bigger company has a bunch of rules around what constitutes an office.
And WeWork wasn't able to fit those rules.
And so all the WeWorks got shut down.
Basically, it's what happened.
But I loved it.
I would go in, there would be a bunch of people
from all sorts of different industries
and different micro offices.
That was interesting.
There's nice common areas where
you can meet people um so i was a big fan i actually um i would ride my bicycle to downtown
which is something like 12 miles each way um and that would be my whole exercise for the day so i
bike in a downtown i would uh you know work um bike, bike back. And yeah, it was a lifestyle that I was
enjoying. But working from home, I think has been pretty much fine. Yeah, maybe I'm pretty
ambivalent to this, you know, it doesn't seem all that different. I think, you know, a lot of the people in my team were not in my city anyways.
So I was spending a lot of time, you know, on the internet with them anyways.
Well, check it out on LinkedIn.
The discussion continues there.
But you have to make an account, Patrick.
I think I have one somewhere.
I'll have to dig it out.
Time for news of the show.
So fitting with not being a news story story we should just really rename these sections uh this one was a article
that uh was titled falsehoods junior developers believe about becoming senior and uh just uh the
the person's name at least i assume his name is Vadim.
I'm not sure how to say it.
Sounds right.
Okay.
Apologize if you're listening.
And they had some great points here that is just a thought-provoking article, which is,
you know, I have been in my career a little bit longer.
I can sometimes forget what it was like when you're sort of new and you don't understand a lot of stuff um first of all i mean i personally think the junior developer senior developer thing is
is overdone at some companies it's a very strong delineation at other companies it's not uh i i
think there are people who are have characteristics of being quote unquote junior and senior um and
so maybe my mindset already sort of
like answer some of these, but these are great to point out. I don't know, I hit on all of them.
But an example of is that a senior developer just knows all the answers. And I think that's
obviously not true. Like senior developers, or at least I've not met a senior developer who
actually knows all the answers. I've met a few who thought they knew all the answers, but they don't actually know all the answers. Similarly, there's a belief,
and we bump into this from time to time that, oh, you're a senior developer, like you must work like
the cutting edge stuff and, you know, insert whatever language is, you know, hip at the moment
or constructs in the language you're working in. So I work in C++: like, oh, you must use all the exotic C++ stuff. And it's actually like, no, I use probably one of the more basic subsets of the
language because it's important for me to get collaboration and get help from teammates and so
the person just pointing out that, you know, it's not true that you somehow fundamentally shift, that one day you're doing sort of junior tasks and fetching coffee, the equivalent of which is, say, writing documentation and then code, and then one day you're doing only cool stuff and you're just pawning
off all the mundane things and whatever. There may be sadistic or problematic, you know, senior
developers who do that, but that's in practice, not really true. And it really is a gradient as
you move up, you may be solving bigger bigger problems but bigger problems are really just large collections of smaller
problems in my experience for the most part with very rare occasional you know singular tough nuts
to crack and so interesting article i i didn't read them all i think there's you know like 10
here um but check it out we'll have a link in the show notes. Uh, what are your thoughts? This is, this is great.
Yeah.
I think, um, I actually have seen things degenerate the other direction where, you know, as someone
becomes a tech lead, they say, well, you know, my team's welfare is now really important.
Therefore I'm going to do the really crappy, for lack of a better word, work that no one else wants to do.
That way my team is happy.
And then a year later, they're completely burnt out and they hate their life and everything.
And you kind of start diving into it and you find out, oh yeah, they've been doing the worst work for a year.
And that's just not sustainable.
So I think it actually you know goes in the
other direction more often than not um and yeah i guess maybe the thing takeaway is like as you
go up in level you become like more and more of a servant to more and more people and and so uh
um and so yeah it doesn't it doesn't go that, that way, but, but yeah, I also felt
the same way as a junior developer.
So this is interesting that like, how do you correct the record there?
It's not, it's not there.
I guess, you know, as you said, if you're sadistic, you have now the power and the influence to be a sadist.
But if you're sadistic, you probably have a really hard time getting promoted.
It's not impossible, but it's harder.
So it kind of works itself out.
Yeah, I mean, culture and management is, you know, I guess the fallback here, which is: how your team is run determines a lot. If you're one of a very small number of software developers at a non-engineering, non-software-engineering company, I think it's a little different than if you're, you know, at a sort of big tech company. I think how those situations develop and evolve can be very different. Yep. Yep.
All right. My news story is Pure Pursuit.
This is really cool.
It's a relatively simple way of designing a robot vehicle trajectory optimizer.
So, for example, imagine like this is an example.
Take racing video games right for
most people a racing video game is kind of a non-starter because they have no idea how to build
the opponent ai and it seems like when you watch um you know even mario kart or something when you
watch something when you watch these racing games it can seem really hard like how do i code something like that up and what if they get totally you know bumped by the player
way off track how do they get back on track and yeah i guess the thing is like they as opposed to
like mario or the enemies basically have their own physics and their own you know universe in this
case, you know, everyone's a car with the same physics, and you can't really bend the rules. If the AI, you know, just teleports into the middle of the track or something,
it completely destroys the immersion you can make the ai go faster and slower and there's rubber
banding and some of that but but still like it has to kind of follow the same kind of physics constraints as the player.
Otherwise, it ruins the game.
And so it turns out there's a whole bunch of relatively simple methods for doing kind of basic robot navigation and trajectory planning.
And this is one of them called Pure Pursuit.
So I included two links.
One is a pretty lengthy tutorial that has a bunch of python code where you can um follow along and see
what they're doing um the other one is a youtube video where it's as you'd imagine much more visual
um and yeah it's a lot of fun if if you're looking for kind of a neat thing to do, I think coding something like this up
and having a bunch of little robots race each other
could be kind of a fun thing to add to your GitHub account.
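For a flavor of what a pure pursuit controller actually does, here's a minimal Python sketch; it's not taken from the linked tutorial, and the waypoints, lookahead distance, and toy simulation are all made up for illustration:

```python
import math

def find_lookahead_point(path, x, y, lookahead):
    # Start from the closest waypoint and walk forward until we're at least
    # `lookahead` away; this avoids chasing points behind the robot.
    dists = [math.hypot(px - x, py - y) for px, py in path]
    start = dists.index(min(dists))
    for i in range(start, len(path)):
        if dists[i] >= lookahead:
            return path[i]
    return path[-1]

def pure_pursuit_step(x, y, heading, path, lookahead=1.0, speed=1.0):
    """One control step for a differential-drive style vehicle."""
    gx, gy = find_lookahead_point(path, x, y, lookahead)
    dx, dy = gx - x, gy - y
    # Goal point expressed in the robot's local frame.
    local_x = math.cos(heading) * dx + math.sin(heading) * dy
    local_y = -math.sin(heading) * dx + math.cos(heading) * dy
    dist_sq = local_x ** 2 + local_y ** 2
    # Pure pursuit: curvature of the arc that passes through the goal point.
    curvature = 2.0 * local_y / dist_sq if dist_sq > 1e-9 else 0.0
    return speed, speed * curvature  # (linear velocity, angular velocity)

# Toy run: start at the origin and chase a straight path sitting at y = 2.
path = [(i * 0.5, 2.0) for i in range(40)]
x, y, heading, dt = 0.0, 0.0, 0.0, 0.1
for _ in range(200):
    v, w = pure_pursuit_step(x, y, heading, path)
    x += v * math.cos(heading) * dt
    y += v * math.sin(heading) * dt
    heading += w * dt
print(f"final position: ({x:.2f}, {y:.2f})")  # should end up tracking y ≈ 2
```

The whole trick is just steering along the arc that reaches a point a fixed distance ahead on the path, over and over.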
I think I saw there was a...
I guess a reinforcement learning person wrote a tutorial, or not really a tutorial, one of those funny videos you watch mostly for entertainment, about TrackMania, which is some, you know, race car game that is not super hyper accurate, but it's a little bit more realistic. And so they were kind of showing how, over generations, trying to, I guess you'd call it evolve or train, an ML agent to get the best times. And, you know,
they were talking about some of the subtleties about, you know, what moves or inputs you are
allowed to do and, you know, whether it was allowed to drift or whether it tried to avoid
drifting as an example. It's a very complex topic. And I agree with Jason. I mean, I think
it's pretty interesting and it has a lot of, I don't know, I don't want to say like complete
real world application, but there's a of uh thinking through these kinds of things that help you
in other domains that would be you know sort of adjacent so anything from like building a little
you know flying airplane or a quadcopter is gonna have some of these same control loops and
dampening and you know making sure you don't get into oscillations and these kinds of
kind of things as well so yeah i don't is there a like i don't want to call it like a
playground or like a something where it's like pretty well set up where where all you have to
do is really code in the the kind of like driving there must be no one off the top of my head
Yeah, there was a... let's see if I can find it. Oh, I forgot what it was called. Let me see if I can... yeah, there's Carla. Oh, TORCS. TORCS is what I was thinking of.
It stands for The Open Racing Car Simulator, and TORCS was designed for AI. Like, you could actually play it, but it's really meant for AI to play it. Nice. So I was going to say that if you want to get down to that specific cut of the problem, it may get you there faster than writing Mario Kart from scratch.
oh yeah totally i mean i think you know uh the mario kart from scratch you would make just like
a really fake physics engine and just really silly.
But yeah, if you wanted to, you know, ultimately control a real car, if you wanted to ladder up to that, then yeah, TORCS would be a good starting point, or Carla would be another one.
So my next news article unintentionally, I guess, is it's a piece of what you might use to kind of get there. And that is something called a control loop or a controller and sort of figuring out,
you know, given some observations of the real world, you know, affect some output.
And you can get as fancy or non-fancy of these words as you want.
I'm pretty non-fancy.
And that is, I call it PID controller.
So the article I have here is PID without a PhD.
Jason was telling us in the pre-show he says "pid," which I've heard before as well, so we'll have to ask the AI which one is correct. But PID Without a PhD is a little PDF with, I will say, not the most elementary explanation, but without getting
into my background makes it such that if you just, you know, look at the
Wikipedia page for PID controller, you're quickly going to get masked out. Well, that's not good
grammar. Oh, well, you're going to get into math over your head, or at least math I'm not comfortable
with for like a casual reading. And so this PDF is a great introduction, and did want to give a
shout out to it. But also just to talk about, I guess similar to what Jason was saying, a useful tool in the toolbox that kind of pops up in a surprising number of places. So, just in brief, a PID controller, the acronym stands for proportional-integral-derivative, and that is that you have some value you want to achieve in your output. Say
you're heating a pot of water and you have a thermometer in it
and you're controlling the heater underneath.
If you crank the coil underneath all the way to full power,
the water is not gonna instantaneously boil
if you've ever watched a pot of water.
It would be nice if it did that, but it doesn't.
And so you have some temperature you're trying to achieve
and you have the thermometer that's giving you feedback, there's you know this latency and as you get closer you want to start modulating
the power down so you don't necessarily like overshoot your temperature um you know because
if you were cooking... well, now I said water, but if you were cooking something in the water that you didn't want to be overheated, like burning your food or whatever the equivalent of overheating it would be. You can set, without going into it, the proportional, integral and derivative terms, and those will help you control the behavior of getting to the target as quickly as possible and then sort of having good behavior as you have this latent response to your inputs. And this is a form of a control loop, because you're kind of sitting there looping over and over again: take the measurement, compare the measurement to what you want the output to be, decide what you want to do to your settings. And so they grow from
here, it's kind of the most, you know, trivial one, there's all sorts of more advanced optimizations
that you can do with PID controllers, and then ultimately even moving on to other controllers, once you get past sort of like the trivial things. And even the online discussion
around this article had a lot of debate about whether PID is sort of 99% of the time, okay,
or whether it's the worst thing ever, because it just, you know, it's just not the most optimal
answer. My take is, you know, it's a good tool in the toolbox and learning. And there are often
things where you want to apply some sort of, know move from a to b but in a controlled response
and it it sort of ends up being in the same bucket of problems i will say and so being aware of these
things is, it's good to know your way around in unfamiliar territory.
Yeah, if I remember correctly, and I haven't read this article, but I think the P is basically how far away from the target you are, the I is what has happened over time, and the D is your rate of change. Like, you know, if you're screaming towards the finish line, then that rate of change is really big and you'd better start slowing down; that's where the derivative comes in. And if you're waffling back and forth, it's clear that you need to slow down because you've overshot six, seven times in a row, and that's where the integral comes in. Even setting the values has a whole bucket of theories around it, and it is really interesting if you use a PID controller to set the values. Gee, I can't see where this might become recursive.
Stack overflow. I think that once you learn about these, you will notice, like, you know, cruise control in your car and how it behaves. You can sort of think about this, not necessarily that they do it this way, but you'll start to see it. Also, if you have like a 3D printer, as an example, it will not know how much mass the heated bed has
at the bottom or the part that squirts out the plastic.
So often you'll see during,
when you're turning it on,
it'll do some calibration.
And part of its calibration is heating,
cooling, heating, cooling to understand the response of those devices
in their current configuration
and their current environment
and setting the parameters of these PID controllers.
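To make that loop concrete, here's a minimal textbook-style PID controller in Python against a crude, made-up pot-of-water model; the gains and plant constants are purely illustrative:

```python
# A minimal PID loop: P reacts to the current error, I to accumulated error,
# D to how fast the error is changing.
class PID:
    def __init__(self, kp, ki, kd, setpoint):
        self.kp, self.ki, self.kd = kp, ki, kd
        self.setpoint = setpoint
        self.integral = 0.0
        self.prev_error = None

    def update(self, measurement, dt):
        error = self.setpoint - measurement        # P: how far from the target
        self.integral += error * dt                # I: history of the error
        deriv = 0.0 if self.prev_error is None else (error - self.prev_error) / dt
        self.prev_error = error                    # D: rate of change of the error
        return self.kp * error + self.ki * self.integral + self.kd * deriv

# Toy plant: heater power warms the water, room air cools it back down.
temp, ambient, dt = 20.0, 20.0, 1.0
pid = PID(kp=2.0, ki=0.1, kd=1.0, setpoint=80.0)
for _ in range(300):
    power = max(0.0, min(100.0, pid.update(temp, dt)))   # heater limited to 0-100%
    temp += (0.1 * power - 0.05 * (temp - ambient)) * dt
print(f"temperature after 300 steps: {temp:.1f}")        # settles near the 80 setpoint
```

There's no anti-windup or other refinement here; it's just the three terms and a loop, which is the whole idea.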
Yep.
Yep.
That makes sense.
Very cool.
Yeah, folks, definitely read it. I mean, a PID controller is, as Patrick says, like the simplest thing. It's in every house; your thermostat is almost certainly using a PID. Maybe not nowadays, I think, but yeah, if you have like one of these old Honeywell ones, you know, whatever. But definitely worth learning about. I think it's a great foundation.
All right, my news story is
Google releasing Gemma. So most people have heard about Llama. Llama is this open source LLM from Facebook. Someone made a project called llama.cpp, which is kind of a weird name for a project, but basically it's a way to run these Llama models really fast. There's no PyTorch; they're loading the PyTorch model, but it's all done in C++ and everything is super, super customized
and optimized for all these different architectures.
And it gets to the point where like on your MacBook,
you can run these large language models
almost in real time.
And so that's been really exciting.
There's been a whole ton of research
that's come out of that.
So Google released Gemma.
This is also kind of an area of debate.
But okay, well, just basically the Gemma models are also these models that are small enough that you could run them locally.
In fact, they even have one even smaller than the smallest Llama model.
So as a result, it can run on even less hardware like
maybe on your phone in real time um they claim the performance is you know much better um there's a
big debate because of the way they're kind of... it kind of gets in the weeds here, but, you know, the seven: if you look at, like, the Llama 7B or Gemma 7B, that 7B number means 7 billion. So there's 7 billion parameters in the model, 7 billion weights that have to be tuned. And whenever you execute the model to generate a token, you have to use at least all of those seven billion weights. So, you know, the smaller that number gets, the fewer weights you have to use, and the faster everything gets. Now the problem is, it's not so one-dimensional, right? So the Gemma models have
a much larger embedding and i'm not gonna get too much in details here it's not it's not like
that relevant but basically because they have such a large embedding size um they could be a lot
slower than the llama model of the same size. But they can also perform better.
So there's a little bit of advertising talk around the,
oh, we perform better at 7 billion.
But the cool thing is, you know,
it's a whole new set of open source models
that we have at our disposal.
One of the most interesting large language models
that have come out that you could run
on commodity hardware is this one called Mixtral, where I think they mixed several different open source models to make one kind of supervisor model. And I think that's fascinating,
and so now you have another model you can add to the mix, literally.
So I think there's a lot of potential here and folks should check it out.
The llama.cpp, despite the project name, actually runs a lot of these open source models.
And I'm sure they're feverishly working.
By the time this podcast is out, they'll have the Gemma models in llama.cpp, so you could run them on your laptop. So, definitely something to check out.
Is the
number of parameters also like uh sort of related to how much memory it takes to use because that's
one of the things that that they always make a big deal about is like not just needing a gpu but
a GPU with, like, very large amounts of memory?
Yep, yep. And so there's a bunch of tricks you can do. They got to the point where they're doing four-bit quantization, so they're only allowing each weight to be one of 16 different values. That's one of the things that llama.cpp project does. But you're right, you know, the more parameters, the more either CPU RAM or video RAM you need to run the model.
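As a rough sketch of the four-bit idea (llama.cpp's real quantization formats are more elaborate, with per-block scales and multiple schemes, so this is only the gist):

```python
import numpy as np

weights = np.random.randn(8).astype(np.float32)      # stand-in for a block of weights

scale = np.abs(weights).max() / 7.0                   # signed 4-bit range: -8..7
q = np.clip(np.round(weights / scale), -8, 7).astype(np.int8)   # 16 possible values
dequant = q.astype(np.float32) * scale                # what inference actually multiplies with

print("original :", np.round(weights, 3))
print("4-bit    :", q)
print("recovered:", np.round(dequant, 3))             # close, but slightly lossy
```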
Very cool. And then, or maybe we can move on, but fine-tuning of these: are some models easier to fine-tune? So running them and executing them, presumably, like Gemma coming from Google, is related to, you know, the ones that you can just go online and use. So if you don't have an internet connection or something, it feels useful, and being open source. But to me the power, and I think you've talked about trying that before, is adding your own inputs and doing some additional training or switching or customization. Are these ones equivalent when it comes to that aspect, or are some better for that
and some are worse?
Yeah, so the big difference between Gemma and Llama is that the embedding layer of Gemma is enormous. And we'll talk about this actually later in this episode, but basically the embedding layer is the layer where you switch from understanding what you just said to deciding what to say next. So that layer is really important, and that layer is enormous in Gemma. So I would expect it to be harder to fine-tune just because of that,
to require more data, more iterations.
But in general, yes, I mean, it's much easier to fine tune
the 7 billion model than the 70 billion.
The other challenge about fine tuning is you can't do quantization
while you're training.
And so now you need to store the 32 bit or maybe even 64 bit potentially float for each weight um and so that gets that
gets really expensive so that's where you need like the 64 gig gpu to do the training
This is a fascinating time. Fascinating time. I'm excited. I need to learn here. Good thing we have a topic queued up for this.
Yeah, you could start a large language model startup, but I think they took all the names. There's no names left in the entire language; there's so many of them.
All right, time for Book of the Show.
All right, what's your book of the show?
Mine is The Eye of the World by Robert Jordan, which is the first book in the Wheel of Time series.
I guess I'm, I'm late to this, you know, classic fantasy.
Most people probably already heard of this.
Also, if you have not seen any of the large amounts of advertisement,
Amazon prime has done for Amazon prime video.
They have a wheel of time series, a tv series i guess it's called oh cool um and so
that's actually i had long known about this uh we talked a lot about brandon sanderson books and
Brandon Sanderson, and he ends up writing the ending of the Wheel of Time book series because... oh, I didn't know that... Jordan passed away before the series could be completed. So he left his notes and Brandon Sanderson sort of released it.
So very adjacent to this series,
which is never,
if you've ever seen the books,
they're intimidatingly large.
So they're being,
you know,
I don't,
I should have looked at,
I think there's 12 of them,
but being there so many,
you know,
I was always a little intimidated to pick it up,
but I actually really enjoyed the TV show.
I've watched there's two seasons now and decided it's finally time to pick the book up being aware that, you know,
books and TV shows are not necessarily the same thing. But but interesting enough in the world,
and, you know, seeing some conversations, I guess, Brandon Sanderson is actually one of the
consultants for the Wheel of Time show. And, you know, you see arguments.
Oh, I guess I'm nerdy.
But in the parts of the internet that we all spend time on, you'll see people kind of debating
about the Wheel of Time time show.
And so it piqued my interest.
And so now I have to like know for myself and, you know, go look at the book so I can
judge if the TV series is a good reflection of the book or not.
Very cool.
I read this book probably when i was
like 14 or something oh but you've not seen the tv series i've not i didn't even know there was
one until you just mentioned oh no i'm going to have to catch up i'm one of those people that i
don't watch tv but i do watch youtube um and uh i watch premium, so I only get the ads that the creators, you know, actually
put in themselves.
Um, but, uh, yeah, actually you would, you would, you and I would probably have very
similar YouTube interests lately.
I've been binging on this guy, black tail studio who makes coffee tables.
Have you seen this guy?
Yeah.
Um, but yeah, I, I don't have, I have Amazon prime, prime but i've never watched watched videos on it but
i will check that out that sounds cool uh so my new my new so so sorry on a side topic here
watching youtube a lot i have uh apparently like too many interests slash hobbies that youtube
doesn't like it's not able to hold them all in its you know recommended video list so i have
subscriptions of course but like if i go to my front page, like whatever certain topics,
if I ever watch a video, so we were talking about power world, I watched a video about power world
instantly. Like my whole feed became like two thirds dominated by power world. Um, and I had
to like go remove watching power world videos from my watch history to like get back to any
semblance. But in just in in general as i like rotate through
my interests that like i noticed that it it feels like it can't hold them all and it's like you know
consideration matrix or whatever however it's doing you know not to anthropomorphize it but
it just feels like yeah if i yeah once you once you move to a new thing you get lots of videos
about that and you stop getting videos about the other even if you were watching them when offered
and so i don't know maybe that's a personal problem just
because i i should focus but well i think it just yeah there's a movement of the masses there like
like probably people get on a topic and binge it and then that changes the behavior of the system
for everybody else you know um my book of the show is how to
make a video game all by yourself uh this is a very short book um extremely useful book i've
been going to over the past you know since since covid let up a bit um i've been going to a lot of
video game developer meetups as people know i made this ai hero game but i've also been
talking to a lot of other game developers in the area and one thing i've noticed is a lot of them
don't see themselves as producers a lot of them are software engineers and they end up with like
a really cool tech demo or engine this one guy built this um thing where it was like kind of like a minecraft
world but you had uh you had rocket launchers for arms and every rocket actually caused a crater
in the world and so you flew around like blowing craters in the world and they even had the water working so like if you if you're underground and
you blew a hole up and you accidentally blew a hole into the bottom of the ocean like water
would just start filling in it was cool i mean i was super impressed um and and as the water filled
in like voxels of air like floated up.
So he had naturally had bubbles like you didn't have to like, you know, design the bubble separately.
It was fascinating.
But what it wasn't was it was not a game, you know, like I had a lot of fun, but I wouldn't say I was playing a game.
I would say maybe I was it was more of like an art kind of thing. Um,
and, uh, I feel like this book would have been perfect for this person because, um, it really, it starts off with the first principle of like, you are a video game producer.
And so it lists all these things that ultimately don't really matter. Um, like which engine you
pick or these other things right or
they're they're secondary to your to the goal um i thought it was really well written um i didn't
bother to research the author but i'm assuming it's somebody who's done a bunch of indie games
um and uh yeah it was a short read but I had a lot of fun. That sounds cool.
I'm fascinated by this as well.
I, one day I should go to a video game meetup.
That actually sounds like it'd be really cool.
I feel like it's a blast and I would start writing video games.
It's a total blast.
I've met some really nice people there.
Very cool.
Time for tool of the show.
All right.
So I'm up first again, my tool is not so much
about the tool, but just a shout out that like, I wish more companies kind of did this thing.
So we talked about it a while ago. And I recently used this tool. And that is Google released
an unlock tool for their online streaming gaming service stadia which shut down
and they had run sales where you could get a chromecast and a stadia controller which is what
you needed to play stadia games and the stadia controller interestingly connects straight to
wi-fi so you are wi-fi controlling your server box in the cloud that is running your game and then the
results of the video would stream down to your chromecast and onto your tv and so you were not
interacting you may be sitting in front of your tv but you could have been from another city you
know playing your video game on the tv in your house or whatever and it wouldn't care because
your controller just connects up to the internet via Wi-Fi and then, you know, happens to control that device. And there were some UI loops to, you know, if you had multiple, make sure that you were controlling the right
tv and that kind of thing but they kept running specials trying to get people to sign up for their
service and i just wanted the chromecast so i ended up having a number of the stadia controllers
laying around, but there was nothing really to do with them, because they connect over Wi-Fi and the service shut down. But the radio chip that they had in there also had Bluetooth, so they released a website where you can plug your controller in, follow the on-screen instructions, and it will basically install a new firmware that removes the Wi-Fi and adds Bluetooth connectivity, so you can just use it as a Bluetooth gamepad.
Oh, nice. Play it on your Steam Deck.
I paired it to my Steam Deck and play it on my Steam Deck or
my computer as well um lots of things have bluetooth these days so it's really easy to
use as a game controller and i just thought they didn't need to do that like it it was custom
hardware for their thing but they found a way that was like easy enough for them and straightforward
to kind of do this and it's if you if if you're like me you probably have a drawer full of
things that correspond to services which are dead and not able to be used anymore and so
it's not always possible i understand the complexities of it but it would be really
cool to see this be like a thing that people try to do like at least give some functionality
to devices when they reach
their sort of end of service life yeah totally um yeah that's that's amazing yeah now i kind of
wish i had bought some of these maybe you can get them on sale on ebay or something i don't know if
they went up in price or down in price after this. I've not followed the eBay price trajectory.
All right, my tool of the show is FUSE and SSHFS. This is one of these kind of table-stakes things where it's really good to have in your toolbox. I used this recently. I guess just a recap: FUSE stands for, like, Filesystem in Userspace, something like that.
I don't know.
I'm not sure if I'm getting that totally right.
But basically, it's a way for you to mount a file system without being the root user.
And so, you know, if you're logging into a system at work, you probably don't have the
root account.
And so you need to use something like Fuse.
Even if you're doing something in your house where you do have root access it could be cumbersome to like have to sudo all the time and put in the root password
and all that you might just want to mount a directory um just right there you know as a
sub directory of your home directory um like imagine mounting your google drive or something
like that, you want that to just happen. You don't want to have to put in your root password. So FUSE lets you do that.
So SSHFS is this interesting thing where, you know, you can SSH into a machine and you now have remote access; you can run an editor, do all that stuff. There's also something called SCP, where it uses SSH, but instead of giving you a pseudo-terminal, it gives you a file. So you can take a file from that remote computer and put it on your computer. So you can do, like, scp user@server:foo.txt and then your home directory, and it'll actually go to that server, find the foo.txt file, and bring it to your computer. That's okay, but it could
be kind of cumbersome i was having a situation where i was creating files on a remote computer and i was wanting to look at
them on my laptop and work pretty quickly i didn't want to have to keep copying them to the laptop
and all of that also the files were kind of big and i only needed to look at the first part of
them, etc. So I just set up SSHFS, and I mounted a directory on the target computer as a directory on my laptop.
And I could just look at all of those files.
I could read the first 100 kilobytes or 10 megabytes of the file without having to copy
the whole thing.
It worked well.
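A small sketch of that workflow in Python, wrapping the sshfs and fusermount commands with subprocess; the host, remote path, and file name are placeholders:

```python
import subprocess
from pathlib import Path

remote = "user@remote-server:/data/logs"            # hypothetical remote directory
mountpoint = Path.home() / "mnt" / "remote-logs"
mountpoint.mkdir(parents=True, exist_ok=True)

# Mount the remote directory locally over SSH (no root needed, thanks to FUSE).
subprocess.run(["sshfs", remote, str(mountpoint)], check=True)

# Peek at just the start of a big remote file without copying the whole thing.
with open(mountpoint / "big_output.log", "rb") as f:
    print(f.read(100_000)[:200])                    # read the first 100 KB, show a bit

# Unmount when done (also the fix when a stale mount breaks after sleep).
# On Linux this is fusermount -u; macOS uses umount instead.
subprocess.run(["fusermount", "-u", str(mountpoint)], check=True)
```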
The challenge is if your laptop goes to sleep,
just like any SSH connection,
when it comes back,
the file system's like broken.
You have to unmount it or remount it.
So, you know, it's not perfect,
but it's extremely useful,
especially in a pinch.
And it's one of these things
that's almost ubiquitous
because almost any machine
you can SSH into, so you're only required to install things on your local machine.
I didn't know you were going to talk about this. I actually used this for the first time a couple of days ago.
No way.
Interestingly, there is a Windows version as well that will allow you to mount another Linux computer over SSH to a drive letter in Windows. And I was on a Windows computer, and I wanted to copy some files from my Steam Deck memory card, which is ext3 formatted, so my Windows computer couldn't see it. Well, don't ask. Anyways, long story. But the Steam Deck allows you to just enable SSH pretty easily.
So I just did that.
And this was the path.
The path was to basically turn on SSH file system,
have it on Windows show up as a mount.
And then I could have used, I actually did later,
end up using the SCP method you talked about.
It was a little cleaner.
But mounting as a file system was also something that was eminently doable.
And then other applications could have pointed at it.
So it has this advantages.
But yeah, so I did end up using, as you said, so many things have SSH on them, or you can
SSH into, and if you can, and it has files, this is a good way of getting access to those
files.
Yeah, totally.
All right, let's jump into transformers and large language models.
Yeah, I mean, it's one of these things, it's actually agoraphobic, like there's so much content and it's so topical. But we'll start with the basics, which is how neural networks store information.
If you don't know what a neural network is, we actually had some AI, an AI two-part series.
It's a bit dated, but it covers that in pretty good detail.
But basically, you have these layers, and each layer you do a bunch of dot products.
This almost becomes like a tensor product to produce the next layer.
And so the weights, the things that you're multiplying by, you can change those as part of training this model.
And by default, it goes from one layer to the next to the next.
Think of it as like a DAG, right?
And it can split and it can rejoin and there's convolutions.
But effectively, it's acyclic.
It just goes in one direction and then it ends with some target.
You compare that target to your expected value for that target, and you use that difference to adjust all the weights.
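As a bare-bones illustration of the layers-of-dot-products picture (the shapes and values here are made up):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=4)               # input vector
w1 = rng.normal(size=(4, 8))         # layer 1 weights (trainable)
w2 = rng.normal(size=(8, 3))         # layer 2 weights (trainable)

hidden = np.maximum(0.0, x @ w1)     # dot products + ReLU nonlinearity
output = hidden @ w2                 # more dot products -> 3 outputs
print(output)
# Training compares `output` against the expected values and nudges
# w1 and w2 to shrink the difference.
```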
Now, you know, people very quickly wanted a way to store information. They wanted these neural networks to be stateful. You know, imagine if you are training a neural net to solve a maze. If it's the same maze every time, you could just train the neural net to
solve that maze and the weights of the neural net will just hold really specific information
about that maze. Like, oh, when I see this intersection, turn left. But if you wanted
to train a generic neural network to solve any maze, and to, you know, solve it over time, it has to keep a representation of the maze in the neural network, right? And so it's a memory that is sort of online, that's independent of training. So this is the goal, and there's a lot of different ways to do it.
The kind of obvious thing would be, well, make it cyclical.
Like take some of the weights and instead of making it an acyclic graph of operations that just ends in this point, make some of those weights just point backwards.
And so you execute the network.
And when you execute it, some of the data is kind of left over, right? And so they call this a recurrent neural network.
the problem with this is they're incredibly hard to train and to learn anything meaningful
um and there's a ton of reasons for this. But the biggest one is this
problem called the vanishing gradient problem. And so the idea is basically if you multiply a
lot of numbers less than one, you very quickly get zero, right? That's basically the gist of the vanishing gradient problem. There's more complexity than that, but, you know, if you multiply a bunch of numbers
together that are bigger than one then you quickly go to some huge number right that approaches
infinity and so that's not going to happen because you have something called regularization so that's
not an issue. But the
other thing where you multiply numbers smaller than one over and over again, and you get zero,
that happens. And so you kind of have this sort of dilemma. It's like either everything goes to
infinity or everything goes to zero. Either way, it's not really very usable. So someone came up with LSTMs, long short-term memory networks, and they basically said, let's have one process that's going to infinity and let's have
another process that's going to zero and then add them together and hope that like the two problems
cancel each other out. And again, it's one of these things that, yes, in theory you have these long-term gradients, these short-term gradients, and the problems of both of them cancel each other out, but in practice it's just really hard to get it to do things. Throughout my career, tons of people have tried to do LSTMs for all sorts of practical things at all these companies I've worked at, and it's never worked. There was a time at
Facebook, someone came up to me and said, hey, I have this idea: I think we'll do an LSTM to predict, you know, effects over time of people watching things on Facebook. I was like, forget it, I'm not interested, it's not going to work. Like, I've just seen it fail too many times. And, you know, they call this like a tar pit idea, because you don't really realize you're stuck and it doesn't seem like it's a problem. Maybe a quicksand idea.
It's like, it doesn't seem like there's anything bad about that idea.
Once you get into it, it just, it just sucks up all of your time.
You don't get anything. Um, so LSTMs, you know, not much success.
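For a sense of the vanishing and exploding gradient problem driving all of this, a tiny numeric sketch:

```python
# Backprop through many steps multiplies many factors together:
# factors below 1 collapse toward zero, factors above 1 blow up.
factor_small, factor_big = 0.9, 1.1
grad_small, grad_big = 1.0, 1.0
for _ in range(200):
    grad_small *= factor_small
    grad_big *= factor_big
print(f"0.9^200 = {grad_small:.3e}, 1.1^200 = {grad_big:.3e}")
# roughly 7e-10 (vanished) and 1.9e+08 (exploded)
```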
Um, but then something interesting kind of happened.
Differentiable algebra kind of took off.
So what that means is, you used to, and Patrick, you might have done this in, like, electrical engineering, where they have you kind of derive all of the updates for a neural network. So they show you, like, here's how you calculate the derivative of, you know, this type of activation layer, and you get these exact numbers. And you know exactly, like, if my answer was four and the answer should have been five, then this weight in this neural network needs to go up by exactly, like, 1.2 times the learning rate or something. And so you have these very specific formulas, and as long as you follow the formulas, you'll eventually get to the right place. The problem is the formulas only work
in certain circumstances. So you couldn't, for example, say, take the maximum of these three values, because now you can't differentiate the max function; like, there's no derivative of the hard max function, right? And so what got popular around 2015-ish
was this idea of numerical differentiation it's like instead of trying to come up with
the derivative of all of these functions, let's numerically
differentiate all of them.
And so now you can actually have a gradient of the hard max function or any function.
It's just a numerical gradient, a numerical approximation of a gradient. And so what that means is now you can write pretty
much any code, virtually any code, and backpropagate through it. So you're not going to
change that hardmax function. It is what it is. But you'll be able to differentiate through it,
and things that happen before and
after it that you can change will start changing.
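A minimal finite-difference sketch of that idea; the little pipeline with a hard max in the middle is invented just to show a gradient estimated around a non-differentiable step:

```python
def numerical_grad(f, x, eps=1e-5):
    # Central-difference approximation of df/dx_i for each input.
    return [(f(x[:i] + [xi + eps] + x[i+1:]) - f(x[:i] + [xi - eps] + x[i+1:])) / (2 * eps)
            for i, xi in enumerate(x)]

def pipeline(x):
    # Two "networks" (here just scalars), a hard max in the middle, a final stage.
    a, b = 2.0 * x[0], -3.0 * x[1]
    return 0.5 * max(a, b)

x = [1.0, 1.0]
print(numerical_grad(pipeline, x))   # the gradient flows to whichever branch won the max
```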
So for example, let's say I have three neural networks and then I take the max of the outputs
of those three neural networks and then I have a fourth neural network.
All four of those networks can be updating and learning even though they have
this function in the middle that's not differentiable. So, all of that leads into this concept called attention layers. An attention layer is, you know, a set of algebra that you apply on three things: one is your query, which is what you are interested in right now; your keys, which are how the thing you're interested in now relates to the other things in your list; and your values, which are the other things in the list.
So for example, you might say the cat jumped over the moon. Those are my values.
My query is going to be, let's say, dog. And then my keys are going to be the relationship that I think the other words have to dog. So, like, cat and dog probably have a pretty close relationship, even though
we joke about cats and dogs, but they have a close relationship because they're both animals,
right? But, you know, dog and "the" probably don't have a good relationship, because "the" is just related to everything and it just washes out, right?
So given your query,
your relationship of that query with each of these items
and something that describes each of these items,
you can then create a total amount of attention. So you can say, like, you know, dog is capturing this much energy from that sentence.
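Here's a minimal scaled dot-product attention sketch in the spirit of that query/key/value description; the toy word vectors are hand-made, and a real transformer layer would learn the projections that produce the queries, keys, and values:

```python
import numpy as np

def attention(query, keys, values):
    scores = keys @ query / np.sqrt(query.shape[0])   # relevance of each item to the query
    weights = np.exp(scores) / np.exp(scores).sum()   # softmax -> attention distribution
    return weights @ values, weights                  # weighted blend of the values

# Hand-made toy vectors: dimensions loosely mean [animal, action, object, filler].
words  = ["the", "cat", "jumped", "over", "moon"]
values = np.array([[0.0, 0.0, 0.0, 1.0],   # the
                   [1.0, 0.2, 0.0, 0.0],   # cat
                   [0.1, 1.0, 0.0, 0.0],   # jumped
                   [0.0, 0.1, 0.2, 0.8],   # over
                   [0.0, 0.0, 1.0, 0.1]])  # moon
keys  = values                              # keep it simple: keys == values here
query = np.array([1.0, 0.1, 0.0, 0.0])      # think "dog": mostly "animal"

mixed, weights = attention(query, keys, values)
for word, w in zip(words, weights):
    print(f"{word:>7}: {w:.2f}")            # "cat" gets the largest share of attention
```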
So that gets into self-attention, which is just a fancy way of saying: given, for example, a sentence, take every single word in the sentence and find out how much attention the sentence offers each of those words. So if you say, you know, "the cat jumped over the moon," for each of those words, "the," "cat," "jumped," how much attention am I getting from the other words in that sentence?
And for that, Jason, so if you're saying, like, cat related to dog, I imagine you can look across a kind of training corpus and say how often they appear together, you know, appear next to other words. But for self-attention, how do you figure out what's holding the weight in a sentence?
Yeah, so the way this works is the keys, like the weight between two words, that you're going to learn over time.
So in the beginning, it's going to be just random.
But then when you calculate the attention,
then you take all of...
And these are actually stacked on top of each other.
So you take these stacked attention layers.
So you're learning keys, you're doing the attention algebra, and then you're
learning a new set of keys, and you're doing another attention algebra step. And then at the
end, all of it is hopefully in service to some task. So by default, most of these models are
what's called self-supervised models or forward models.
So what that means is they're trying to predict the next word in the sentence.
So we'll just walk through the very first training step.
So you have a token embedding that you train somewhere else. So there's some other process that takes a word and turns it into a vector of numbers. You know, that could even all be part of the same thing, but usually it's broken up into two systems. So now what I have is, for each word, I have a
vector of numbers. And so I pass in, let's say, "the cat jumped over." I pass in that matrix, right? So it's a set of vectors of numbers. And in the beginning, it's going to say, well, they're all just randomly interesting to each other.
So we get a bunch of random numbers out of that, calculate the attention, do this a bunch of times.
And then that's the encoding step of the transformer.
So what comes out of that whole process we just talked about is a single vector.
And this is true if you're doing Llama, ChatGPT, all of these things.
After all these attention layers, what comes out is a just single vector of numbers
that describes the entire context of what you've seen so far um now you have to do something with that embedding
and so what you usually do is you say okay i want to predict the next word so i'm going to take this
embedding i'm going to create a decoder i'm going to create a function that takes the embedding as an input and outputs which word I think should go next.
And so in the beginning, it's going to be totally random.
Even the decoder is going to be random and it's going to output whatever, you know, foobar.
It's literally going to pick a random word to output next.
Then it's going to look at the actual word. So I think I did "the cat jumped over"; the next word is going to be "the." And so it's going to say, oh, you got it wrong, it wasn't foobar, it was "the." And so, what needs to change so that next time I ask you that question, you say "the"? That "what needs to change" is the loss, and that loss is going to be propagated all the way back, and it's going to change all of the attentions among all the pairs of objects in every attention layer. It's going to change the entire encoder. Like, everything is going to change a little bit, and this process then repeats a zillion times.
A zillion, yeah.
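A toy version of that loop, with a crude stand-in for the attention stack: the "encoder" below just averages token embeddings, so it only shows the predict-compare-nudge cycle, not a real transformer:

```python
import numpy as np

vocab = ["the", "cat", "jumped", "over", "moon"]
tok = {w: i for i, w in enumerate(vocab)}
rng = np.random.default_rng(0)

dim = 8
emb = rng.normal(0, 0.1, size=(len(vocab), dim))   # token embeddings
dec = rng.normal(0, 0.1, size=(dim, len(vocab)))   # "decoder": context vector -> vocab logits

context = [tok[w] for w in ["the", "cat", "jumped", "over", "the"]]
target = tok["moon"]                               # the word we want predicted next

lr = 0.5
for _ in range(200):
    ctx_vec = emb[context].mean(axis=0)            # crude stand-in for the encoder output
    logits = ctx_vec @ dec
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    loss = -np.log(probs[target])                  # cross-entropy on the true next token

    dlogits = probs.copy()
    dlogits[target] -= 1.0                         # gradient of the loss w.r.t. the logits
    dctx = dec @ dlogits                           # gradient flowing back to the context vector
    dec -= lr * np.outer(ctx_vec, dlogits)         # everything changes a little...
    np.add.at(emb, context, -lr * dctx / len(context))   # ...including the embeddings

print(f"final loss: {loss:.4f}")
print(vocab[int(np.argmax(emb[context].mean(axis=0) @ dec))])   # now predicts "moon"
```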
And so then, we talked about it trying to get to saying "the"; that comes from some training. So you're just sticking in texts, books, whatever, and learning: hey, try to guess this next sentence.
Yeah, that's right. Yep. So
um and there's been a bunch of work on this so sometimes they try to predict sentence fragments
sometimes they try to predict single words and they run it each time for for a different word
um but you're right i mean people are are going through you know all of wikipedia you can download
all of wikipedia there's something called common crawl which has like you know gigabytes and
gigabytes of text from the internet um and these systems are going through all of this
um you know taking the first n words and trying to predict the next one
um and this is happening at massive massive
scale um it's just ingesting like huge volumes of text um this also works on images and video
and all that as well, but it's ingesting huge volumes of text and trying to predict the next thing. And so, yeah, it's kind of remarkable that it works at all, but there's a lot more complexity
It's kind of remarkable that it works at all, but there's a lot more complexity if you dive a few layers deeper. For example, the sentence needs to make sense. If you're just predicting one word at a time, the system might paint itself into a corner.
And it might realize, oh, actually, I outputted this word,
but now that I've kind of gone three, four words in,
I realize I made a mistake three words ago. We even do this as humans when we're typing, right? It used to be beam search, now they're doing other things, but basically there's a way where you kind of look ahead and, based on that, go back and make some changes. It almost becomes a search system in the decoder, doing something like a best-first search.
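Here's a minimal sketch of classic beam search, one of those look-ahead decoding strategies (modern systems use fancier sampling and re-ranking, as mentioned). The `step_logprobs` callback and the toy word table are placeholders standing in for the model's real next-token distribution.

```python
import heapq
import math

def beam_search(step_logprobs, start, width=3, length=5):
    """Keep the `width` most promising partial sentences instead of greedily
    committing to one word at a time; a partial sentence that looked good two
    words ago can be dropped later if it paints itself into a corner.

    `step_logprobs(prefix)` stands in for the model: it returns a list of
    (next_word, log_probability) pairs for the given prefix."""
    beams = [(0.0, [start])]                  # (total log-prob, words so far)
    for _ in range(length):
        candidates = []
        for score, words in beams:
            for nxt, lp in step_logprobs(words):
                candidates.append((score + lp, words + [nxt]))
        beams = heapq.nlargest(width, candidates, key=lambda c: c[0])
    return beams[0][1]                        # best full sequence found

# Toy "model": after "the", prefer "cow"; after "cow", prefer "jumped"; otherwise flat.
toy = {"the": [("cow", math.log(0.6)), ("cat", math.log(0.4))],
       "cow": [("jumped", math.log(0.9)), ("sat", math.log(0.1))]}
step = lambda words: toy.get(words[-1], [("over", math.log(0.5)), ("the", math.log(0.5))])
print(beam_search(step, "the", width=2, length=3))   # something like ['the', 'cow', 'jumped', 'over']
```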
Yeah, it's like one of those hidden Markov model toys where you'd feed all the Harry Potter books to a hidden Markov model and ask it for text, or the word suggestions on your phone's keyboard: if you just keep tapping the suggestions, you get what I would call sentences, but they don't go anywhere. You're just inserting plausible next words right after each other, so you end up with something akin to sentence structure, but there's no story or progression or statement, just words that tend to appear close to each other. Yep, yep.
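That keep-tapping-the-suggestions behaviour is easy to reproduce with a tiny Markov chain; this is just a sketch of the idea, not how any particular phone keyboard actually works. Because it only ever looks at the previous word, the output is locally plausible and globally aimless.

```python
import random
from collections import defaultdict

def build_chain(text):
    """Count which word tends to follow which, roughly what the old
    keyboard-suggestion models were doing."""
    chain = defaultdict(list)
    words = text.split()
    for a, b in zip(words, words[1:]):
        chain[a].append(b)
    return chain

def babble(chain, word, length=10):
    out = [word]
    for _ in range(length):
        if word not in chain:
            break
        word = random.choice(chain[word])   # a locally plausible next word...
        out.append(word)
    return " ".join(out)                    # ...but no overall story or point

chain = build_chain("the cat jumped over the moon and the cat sat on the mat")
print(babble(chain, "the"))
```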
And so by default, if you put in... is it the cat or the cow that jumped over the moon? Yeah, isn't that the same? I've been saying cat the whole time, but if you put in "the cow jumped over the", it's going to say "moon" with super high accuracy, because it's just seen that a bunch of times on the internet, right? So that works pretty well. And a lot of these things, when you for example ask a question to ChatGPT, are actually embedding your question in a prompt. Because otherwise, if you just took one of these naive forward models, typed a question, and said generate some text, it would probably generate more questions; it produces whatever people who wrote that kind of question would write next, so sometimes that's an answer and sometimes it's more questions. So typically, and it might not be obvious from the user interface, what usually happens is it will put something like "Question:" and then your question, then a new paragraph, "Answer:", and then the model starts generating tokens. You didn't see the "Question" or the "Answer", but they're there, and that tells the model, hey, this is the broader context of what's going on.
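Something like this hypothetical wrapper; the real templates are not public and are certainly more elaborate, with system instructions and special tokens, but the idea is the same.

```python
def wrap_prompt(user_text):
    """Hypothetical template: frame the user's text so the model continues
    with an answer rather than with more questions."""
    return f"Question: {user_text}\n\nAnswer:"

print(wrap_prompt("Why did the cow jump over the moon?"))
```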
So this works pretty well. The thing about it is, we want to be able to make improvements, and it's such a huge corpus of data that we can't. It's not like normal machine learning, where you say, oh, I'm just going to change my data set to improve the model, because the data set is basically the entire internet. You can't really do a whole lot with that, so there needs to be some way to fine-tune things after the fact.
So once you have this forward model, the first attempt at this, which is what OpenAI did with their initial models, was called RLHF: reinforcement learning from human feedback. Basically the way this works is they would have ChatGPT, and this is before it was released or anything, generate four or five different answers. They would give those answers to people, the people would score the answers, and then the model would optimize for that score and try to get the highest score possible.
The problem with that is numeric scores only make sense when there's a real unit of measure.
Is a score of four really twice as good as a score of two?
It's very hard to get people to think in such linear, proportional ways. So basically the human raters were scoring everything either a five or a one, with very few twos, threes, and fours; and even when they did spread out, it wasn't very proportional, a five wasn't five times better than a one, and so on. So what's become more popular now is, I think it's called direct preference optimization, but basically it's pairwise ranking.
So you have the system generate two answers.
You give it to a person and the person decides which answer is better um and then the system is is encouraged to give the better
answer and discouraged from giving the weaker answer directly um and so there's a paper that
came out i think only six months ago maybe a year ago on this uh but it's as much as i love
reinforcement learning you know that was my ph, I also was pretty skeptical of this.
I kind of felt like this was a poor use of reinforcement learning.
And so I was not very surprised when direct policy optimization came out.
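For flavor, here's a minimal sketch of that pairwise preference loss in the spirit of the DPO paper, assuming we've already summed up the log-probabilities of each full answer under the current model and under a frozen reference copy. The beta value and the toy numbers are arbitrary.

```python
import torch
import torch.nn.functional as F

def dpo_loss(logp_chosen, logp_rejected, ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """DPO-style pairwise loss: push the model's probability of the preferred
    answer up relative to a frozen reference model, and the rejected answer
    down. Inputs are summed log-probabilities of whole answers."""
    chosen_ratio = logp_chosen - ref_logp_chosen
    rejected_ratio = logp_rejected - ref_logp_rejected
    return -F.logsigmoid(beta * (chosen_ratio - rejected_ratio)).mean()

# Toy numbers standing in for real sequence log-probs.
loss = dpo_loss(torch.tensor([-12.0]), torch.tensor([-11.0]),
                torch.tensor([-12.5]), torch.tensor([-11.2]))
print(loss.item())
```

Conceptually, if the model prefers the rejected answer relative to the reference, the loss grows and the gradient pushes probability back toward the chosen one.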
And you can imagine this is how a lot of these things get ironed out: a lot of ambiguities in sentences, things that require a lot of common-sense reasoning, those are ironed out through DPO. So this works: they do the initial training you described and then go into this fine-tuning step. But the answers, I mean, I guess they could become part of a future training corpus, but they aren't trained on in the same way. Because if you have it generate an answer from its current weights and then you try to tweak them, that's different from showing it answers from another version, you know, ChatGPT 1 or whatever, and saying, hey, here are the responses they got. Or how does that work? Do you just sort of accumulate all of them over time?
Yeah, it's a good question. Let me see if I understand: in the beginning you're just trying to predict the next token, and there's no preference, right? Then later on you're given a preference, but for an entire answer.
And so there is somewhat of a credit assignment problem there.
It's like, which token caused that answer to be good or bad?
But again, with enough data, it works itself out.
But you're right, the loss is really different, and so the way the network changes in that second phase is very different. You can't really go back once you've started this preference approach; I mean, you could, there's nothing to stop you, but the two are kind of disrupting each other, so you just disrupt the first with the second and then you get what you want. So if you did, say, 10,000 rankings, I don't know how many they do, at the end that's not necessarily reusable. Like if they had a new update to the corpus or whatever, they'd have the weights, they'd fine-tune, roll them forward, but then they'd go do another 10,000 rankings; they wouldn't just play back the old 10,000. Oh, now I see what you're saying. This is really interesting. OK, so you're saying,
I have ChatGPT 6 and it generates a totally different answer than either of the ones I sent to the human last time. Yeah, so this gets complicated, but there is something called importance weighting, and basically the gist of it is this.
When a token is generated, it's generated from a distribution.
So it's not like the neural network runs and at the end it says cow.
It actually runs and it outputs a probability over the entire space of all words that it could output.
And it just happens that cow had the highest value, right?
But you can normalize that output vector, and now what you get is a probability mass function over all the words. So, for example, there is a chance that ChatGPT 6 and ChatGPT 5, or ChatGPT 5 with two different random seeds, will both generate exactly the same answer. It might be a low chance, but it's there. There's actually a chance that it will generate anything, any sequence of words. And because that chance is non-zero, you can do something called importance weighting, where you say, OK, the new ChatGPT, although it generated nothing like the two answers I sent to a human last time, still has a chance of generating those two answers, and it should be more likely to generate the better answer, even though it really didn't want to generate either of them.
Oh, I see. So it doesn't matter which answer the new model actually picks; you go and look at the distribution and ask, was your distribution closer to the better answer or the worse one? And then you can somewhat reuse the old rankings. I see, that is interesting. Yeah, exactly.
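Here's a rough NumPy sketch of that idea, under the simplifying assumption that we can get per-step vocabulary scores (logits) for a fixed answer from both the old and new models; the shapes and token IDs are made up for illustration.

```python
import numpy as np

def sequence_logprob(logits_per_step, token_ids):
    """At each step the model outputs scores over the whole vocabulary; softmax
    turns them into a probability mass function, and the sequence log-probability
    is the sum of the chosen tokens' log-probabilities."""
    total = 0.0
    for logits, tok in zip(logits_per_step, token_ids):
        logprobs = logits - logits.max() - np.log(np.sum(np.exp(logits - logits.max())))
        total += logprobs[tok]
    return total

def importance_weight(new_logp, old_logp):
    """How much more (or less) likely the new model is to produce the same
    answer the old model showed to the human; computed in log space because
    the raw probabilities are tiny."""
    return np.exp(new_logp - old_logp)

rng = np.random.default_rng(0)
vocab_size, answer = 10, [3, 1, 4]
old_logits = [rng.normal(size=vocab_size) for _ in answer]   # stand-in for the old model
new_logits = [rng.normal(size=vocab_size) for _ in answer]   # stand-in for the new model
w = importance_weight(sequence_logprob(new_logits, answer),
                      sequence_logprob(old_logits, answer))
print(w)   # >1: the new model likes this answer more; <1: less
```

In practice these ratios involve very small probabilities, which is exactly the numerical trouble described next.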
Now, the thing about importance weighting is that it has two problems. The easier one is that those numbers are going to be small: the chance of the model producing anything of that complexity is tiny, and you're dividing small numbers by each other, so things get a little crazy numerically. The other thing is, you might have to move a lot. Or rather, it might be that the new answer is really orthogonal to either of those two answers, almost equidistant from both of them.
Yeah, exactly.
And so in that case, it doesn't matter what answer you pick then.
So it's not perfect, but you still get a lot of value out of it.
It's not like that old work was wasted.
Oh, that's cool. Yeah, I guess when you're moving up versions, you could ask a question and in one case it generates two different diatribes about cows and moons, but in the next it generates a picture of a cow jumping over a moon, and that's actually better, but now you're stuck with the problem of comparing a picture to two sentences. Yeah, I hear you, that makes sense. But the data moat is real: all this data, all the human effort that's gone into labeling and ranking it, is just permanently valuable. So it's a huge boon for a lot of these companies that started early. So, the "large" in large language model, we've talked about all of this, just comes from the fact that the parameter count has gotten so numerous. And it happens to be that these large language models today all have a very similar architecture, the one you've been describing, and it's not so much that you couldn't do it a different way, right?
Or a different, better way? Yeah, exactly. So the recurrent neural network, you couldn't make it large, because the gradients would vanish. The LSTM, you couldn't make it large, because it was an unstable equilibrium: the gradients would either vanish again or explode. So you couldn't do it. But with the attention layers, in the beginning a lot of people, myself included, were a little skeptical of the attention approach, because it felt like a strange compromise between convolution, where you have a relatively small mask that can rove like a roving eye through an image, and an LSTM, which in theory has an infinite horizon. In theory you could feed a near-infinite amount of data, not literally infinite, but an extremely large amount, to an LSTM and it could remember all of it; there's no limit.
So attention was strange in that it had a lot of the features of the long-term/short-term approach, but it didn't have the benefits. But the fact that training was stable, that was incredibly useful. It's one of these things that's hard to know, because in theory everything is stable, right? It's hard to know in practice what something will do when you have seven billion parameters' worth of it, and attention held up really well at that scale. Yeah, that's really interesting. So when you grow really big, it was hard to predict what would happen, but it has the attributes that end up working at that scale, and the compromise worked out. But by the same token, none of this means it's the optimal or correct answer; it's just the best we know today.
So if we found a new LSTM, I mean, maybe not an LSTM exactly, but a new way of solving those problems, or of adding another thing, it could do better; we just don't know that today. Someone would have to figure it out.
Yep, yep. Now, here's where it gets really tricky: all of this is very direct. You have this direct preference optimization, and RLHF was technically reinforcement learning, but it was a very direct approach. A lot of these systems can't actually get rewarded or punished, really, and if they do, it's in a really brute way. You could tell ChatGPT, go invest in stocks, and then, oh, you're bankrupt, that was bad. But it can't really reason; it can't build a world model that it can then use to reason and make decisions. So I think the future of this is some really interesting work coming out of Facebook called JEPA, which I think is Joint Embedding Predictive Architecture. It's basically a way to combine a lot of these embedding approaches, whether it's a large language model, a large image model, or maybe a large actuator/motor-response module, it doesn't necessarily have to be language, with decision making.
And I think that's going to be super exciting, but it's going to take a while for that research to mature. Yeah, so I guess, to your point from earlier about picking a racing line and driving your car toward it: you could tell ChatGPT or one of these large language models what you want, and maybe it could generate code for you, but it can't actually go play the game itself. It doesn't have the hooks, it doesn't have the inputs to go do it, there's no module for that, and the architecture isn't really set up to be a Mario Kart AI agent. And people joke about telling ChatGPT "your answer is terrible" and having it generate another one, and that's funny, but it's not really optimizing for some objective. It will pivot, but even that pivot is kind of artificial; it's not directionally heading toward some place of greater value. So you can't plug ChatGPT straight into Mario Kart and get it to produce Python code that gives you a higher and higher score over time; there's no easy way to do that yet. Yeah. I always find that funny. I guess anthropomorphizing isn't the right way to put it, but you're right, people fuss at ChatGPT, like, I'm going to unplug you if you do one more thing, but it's all just adding to the context. Like you said, it's not actually evolving forward. You could have just written all of that out yourself, claimed that's what it told you even though it didn't, put it all in the prompt, and it would do the same thing.
Right. In other words, if someone comes back to you and says, Jason, you're bad because you did this thing and didn't do that thing, you're going to say, what? No. You're just going to ignore it. But with ChatGPT, if it gives you an output and you start a new session and copy what it told you and what your response was into the very beginning, it's the same. It's not actually remembering and evolving in that way; it's just building this ongoing context that kind of feels similar.
Right. And all of this, even what's coming in the future, is not going to kill coding. We should probably spend the last five minutes of the show talking about this. If you're a follower of this show, maybe you're a working professional, or not yet, you're in college and you're thinking, oh, if I major in computer science there won't be a job for me because ChatGPT will have my job: that is not going to happen. I think coding is just a way of solving problems at the end of the day. The medium might change, but the need to solve hard problems is not going away anytime soon. If anything, ChatGPT will automate everyone else's jobs; people who are solving easier problems are the ones who really should be worried. If you're here listening to the show, going to college or learning to be a programmer on your own, you're doing it because you want to solve hard problems, and doing hard, really mentally difficult things is going to be one of the last things to get automated.
Yeah, I mean, people have talked about this for a long time and the accuracy hasn't been there, but the one I always think about is an X-ray tech. You go in with a broken arm, you get your X-ray, and the job of the person who, I don't know the right word, reads the X-ray is to look at the notes from the doctor about what they think is wrong, look at the X-ray, check for problems, and then write up a report. If you gave it to a hundred doctors, they wouldn't all say the same thing, but there is a correct reading of the chart, and hopefully most of them would give the same answer. That one felt very difficult before: you would have just given it to something like the convolutional neural network Jason was mentioning and tried to highlight where the fracture is. But now you're getting to the point where you could give it the context of the doctor's notes, "hey, I was in an accident and my arm is hurting, here's my X-ray", and it would kind of understand what it's attempting to do. You're right that those things are problem solving, but not in the same way as, like you said, "hey, I want to build a game where you're a go-kart-racing turtle and you throw shells at each other." That kind of problem solving is a fundamentally different approach.
Right, right.
And also, when you're building anything, and this is true of anything you're building, software or otherwise, there are really two things you're constantly adapting to.
One is product market fit.
Are people enjoying my game?
Who is enjoying the game?
And then the second one is quality of life.
So maybe my game is actually fun, but there are too many buttons in the menu and people aren't even getting past the main menu.
Right.
And so you have to constantly adapt to all of these changes, and you have to decide which parts of the code should be flexible and which parts should be inflexible and written quickly. These are trade-offs that ChatGPT, or any AI, is not going to make very effectively. And who's to say what's going to happen way out in the distant future, but I would say, if you're listening to this show, if you're interested in this topic, your job is extremely safe. I think that if you
said, oh, I'm going to pivot to accounting, well, that's probably not wise. No, I mean, I've got in-laws who are accountants, so it's not an insult to accountants, but I'm just saying, you are in a very safe profession. This idea that coding is dead or will be automated is absurd; don't worry about it.
And yeah, on any normal time horizon. I think there's a caveat there, like who knows what happens in a thousand years. Yeah, exactly. But again, I think your job will be one of the last ones to go, and by then... I saw this crazy stat that something like 99% of job titles didn't exist 100 years ago. Oh, interesting. Yeah.
Yeah, I mean, take as an example someone who's an actor, and I know there's a bunch of politics and fighting around this, but take an actor, their voice, their body image, the way they move: that's very easy to train on, and then you can get them to do new things and become an AI agent of some sort. They call those people "talent", quote unquote, but their talent is something very specific, mostly a function of how they look and how they sound; even the things they do on camera are all scripted by the writers and the producers and all of that. So yeah, I think you're right, without saying when or if. I mean, being one of the last to go is a reassuring fact. Yeah, you'd be late enough that you'd see the writing on the wall and pivot to one of the 99 percent of jobs that are coming out in the next hundred years. I've seen The Matrix: your job is to become a heater, eat food, walk around in the Matrix, and provide warmth for the robots, the robot army. So good. All right, folks, I think we'll put a wrap on that. If you have any questions
about LLMs, join our Discord. Discord is one of the few apps I actually have notifications turned on for, so when people post in Discord I see it right away. So join our Discord, and support us on Patreon; we really love and thank all of our supporters. We're putting all that money back into this show, trying to get more people, kids and adults, into programming.
And we will catch everybody next show.
Thanks, everyone.
Music by Eric Barndollar. Programming Throwdown is distributed under a Creative Commons Attribution-ShareAlike 2.0 license. You're free to share, copy, distribute, and transmit the work, and to remix and adapt the work, but you must provide attribution to Patrick and me, and share alike in kind.