Programming Throwdown - 145: Unsupervised Machine Learning
Episode Date: October 24, 2022

Today we discuss adventures, books, tools, and art discoveries before diving into unsupervised machine learning in this duo episode!

00:00:22 Introductions
00:01:28 Email & inbox organizati...on is very important
00:07:28 The Douglas-Peucker algorithm
00:11:48 Starter project selection
00:17:01 Tic-Tac-Toe
00:21:41 Artemis 1
00:26:25 Space slingshots
00:29:47 Flex Seal tape
00:32:38 The Meditations
00:37:58 Flour, Water, Salt, Yeast
00:40:55 Pythagorea
00:46:13 Google Keep
00:48:05 Visual-IF
00:50:49 Data insights
01:03:07 Self-supervised learning
01:10:26 A practical example of clustering
01:15:10 Word embedding
01:24:02 Farewells

Want to learn more? Check out these previous episodes:
Episode 27: Artificial Intelligence Theory
https://www.programmingthrowdown.com/2013/05/episode-27-artificial-intelligence.html
Episode 28: Applied Artificial Intelligence
https://www.programmingthrowdown.com/2013/06/episode-28-applied-artificial.html
Episode 109: Digital Marketing with Kevin Urrutia
https://www.programmingthrowdown.com/2021/03/episode-109-digital-marketing-with.html

Resources mentioned in this episode:

News/Links:
Simplify lines with the Douglas-Peucker Algorithm
https://ilya.puchka.me/douglas-peucker-algorithm/
How to pick a starter project
https://amir.rachum.com/blog/2022/08/07/starter-project/
Tic-Tac-Toe in a single call to printf()
https://github.com/carlini/printf-tac-toe
Artemis 1
https://www.nasa.gov/artemis-1/
Visual-IF
https://www.visual-if.com/

Book of the Show:
Jason’s Choice: “The Meditations” by Marcus Aurelius
https://amzn.to/3C3Kg7b
Patrick’s Choice: “Flour, Water, Salt, Yeast” by Ken Forkish
https://amzn.to/3CqFwKa

Tool of the Show:
Jason’s Choice: Pythagorea
Android: https://play.google.com/store/apps/details?id=com.hil_hk.pythagorea&hl=en&gl=US
iOS: https://apps.apple.com/us/app/pythagorea/id994864779
Patrick’s Choice: Google Keep
https://keep.google.com/

References:
Clustering: https://en.wikipedia.org/wiki/Cluster_analysis
Autoencoding: https://en.wikipedia.org/wiki/Autoencoder
Contrastive Learning: https://towardsdatascience.com/understanding-contrastive-learning-d5b19fd96607
Matrix Factorization: https://en.wikipedia.org/wiki/Matrix_factorization_(recommender_systems)
Stochastic factorization: https://link.medium.com/ytuaUAYBjtb
Deep Learning: https://en.wikipedia.org/wiki/Deep_learning

If you’ve enjoyed this episode, you can listen to more on Programming Throwdown’s website: https://www.programmingthrowdown.com/
Reach out to us via email: programmingthrowdown@gmail.com
You can also follow Programming Throwdown on Facebook | Apple Podcasts | Spotify | Player.FM
Join the discussion on our Discord
Help support Programming Throwdown through our Patreon
Transcript
Welcome to another, as we've termed it, duo episode.
That's just Jason and myself.
No special guests this time.
It's because we have a lot to say.
We have it all pent up.
So expect a high energy episode today.
That's right.
We'll start off by talking about email.
What's more high energy?
Spoiler alert.
I do tell this to people in my meetings.
I'm like, if you're in an afternoon meeting,
I'm typically more ramped up. Like,
you know, you'd think you'd be tired by the end of the day, but I get kind of
riled up throughout the day. Yeah, I wonder if maybe we're more common, because I feel like,
isn't that more natural? You wake up, your brain still hasn't really kicked in yet.
So maybe. I don't know. And then I just crash at the end of the day, but that's another
problem for another time.
So as Jason foreshadowed,
we're going to talk a bit about email.
So this is actually a topic
Jason and I were talking about.
So we decided to talk about it on air,
which is if you are not in a large organization,
maybe this is a foreign thing to you.
But most folks, I would say,
I've never looked at it actually.
But I would assume a lot of people work in big companies. And in big companies, you get a lot of email, a lot of email,
not junk mail, not spam, although I guess you could. But at my company, they're pretty good
about filtering that out. But just like emails from random teams, random automated announcements,
automated meeting notices, everything just gets pushed to email. And
it's important to stay organized using either rules in your client or on the server, or just
making sure you put stuff into various folders or flagging stuff. I think everyone has a different
scheme. I got criticized actually by my kids because I have like over a thousand unread messages.
Oh, see, I can't handle that.
Oh, no. Oh, you're going to criticize me too. But I have a method. Like it is,
I'm going to say like methodical. Like I do have a method to my madness.
Like it is organized for me. I can't not treat incoming email.
Actually, wait, wait, hang on. So wait, do you have a thousand unread because you,
like, you might somehow know the last
one you looked at, you know, the subject line? So you actually read them all. They're just
all, like, handled. Okay, okay. So it's not that they're not looked at. It's just like,
sometimes, like you said, I'll move the meter. This is on me,
this is a bad habit, I would not recommend it. Like you said, I'll move the high watermark forward
as, like, what I've kind of read through
by looking at summaries.
And for most of them I don't need to do anything,
but I don't necessarily take the extra few milliseconds,
seconds, whatever, to kind of put those
in an appropriate folder or get them out of my inbox.
So then every so often I'll just pile through
and, via searching, you know,
cull all the junk that needs to get marked as read. So it's
just sort of like another folder for me. Like there's the unread and read folder, and they sit
in the same logical folder in my inbox. I would not recommend this approach. You should definitely
be more organized than me.
So I'll tell you what I do for unread: I have the, well, at least on desktop, I have like the double-paned Gmail where you
still see the list of emails while you can see, you know, an email you currently have clicked on.
And then I have it set up to where whenever I hit the up arrow, it marks it as read,
you know, like as soon as it gets to an email it marks it as read. So I could just go up, up, up, up,
up, up on the up arrow.
And, you know, I'll end up with everything read at the end.
But I won't actually look at ones that are like marketing and stuff like that.
I do.
I am pretty good about taking the time to unsubscribe from as many things as possible.
Does that work, like, in general?
I guess with corporate mail, that should work pretty well, right? Like, unsubscribing from something should make it stop emailing you. Most of the time it's just a one-click unsubscribe. But I think the most important thing you can do is to
have folders. Folders are a total lifesaver. I'm not too sure what the difference is between
the server-side rules and the client-side rules. That's one thing I wasn't ever totally sure about.
So I think in some cases, so like you mentioned Gmail, I think it's a bit different because
the most common ways of accessing Gmail are either
web apps or progressive web apps. Like, they're on your phone, but they're still a web app.
And so the state is all kept on the server side. But if you're using, like,
you know, I guess that's POP or IMAP or whatever, you could have multiple different
clients. And so if the clients are filtering, they're kind of notifying the server and moving things,
and you could have two clients with two different sets of rules. So if you access your email from two different places and the clients
weren't synced, you could end up with something not good. So I think in Gmail, as an example, when
you write a rule, it's just always server side, right? It always runs in the background when the
mail comes in, not when you fetch the mail. The server-side stuff gets executed.
Oh, that makes sense. So you have a client-side rule, but your desktop's
not on. And so now none of those rules are taking effect. So you're looking at email on
your phone and it's unfiltered. Yep. Something like that. Yeah. I think most places have moved
to server side. Although I will say, we at my company don't use Gmail. And in fact,
I actually don't like Gmail. I preferred, they killed
it, this is a rant for another time, Inbox. I jive with the Inbox way of doing it. I was very organized
when Inbox was a thing. It was nice. They killed it, and now it's Gmail, and now I'm unorganized again.
So I blame bad tools. And I know there are other ones out there. People will send us links. I
know there are lots of other people and folks doing things that are clever.
I just, yeah, I don't know.
You know, I don't know if at some point I stopped getting ads.
There was a period of time where I was getting, you know, Gmail ads.
So like you go to look at your email and then the first email would actually be fake.
It wouldn't be a real email.
That was super,
super annoying. They got rid of that maybe about a year ago or so. And so that made Gmail kind of
like, there's not that many downsides to Gmail. I did like Inbox. What was so magical about Inbox?
I also remember loving Inbox, being really sad sad when they shut down. But now it's been
so long, I don't remember what was good about it. They've morphed Gmail over time to copy some of
the UI things. But it was just like the way of handling the UI and like what was shown in the
ordering of like how you went through messages and stuff, which is different. I think the complexity
on their side is around like needing to support both the old way and the new way.
Well, one of the things I think they added was the swiping, right?
So, like, in Inbox, and then later in Gmail, you could swipe right to archive, swipe left to delete, or I don't know, maybe I have it backwards.
But basically, you swipe and it goes away.
You just know, it's just one of those things you could just do.
All right.
Well, we get on to our news of the show. Yeah, my news of the show is simplifying lines with the Douglas-Peucker, did I get that right? Peucker. Peucker algorithm. Anyways,
I know what this is, but I don't know how it's said. I have no idea how to pronounce that person's name.
I'm sure I butchered that person's last name.
Anyways, this is really cool.
And I found a particularly cool article where they kind of walk you graphically through how to do this.
But I was always really interested in the game Worms.
Do you ever play this game?
Yes.
Yeah.
So Worms was this, you know, pixel based physics thing where it, you know, there's also Scorched Earth that was kind of similar, but much more simplistic.
Yeah, and Scorched Earth, like, things couldn't float, like the ground always settled.
And so it was somewhat of an easier problem.
But, you know, in Worms, there would be these floating islands, and you could walk on them and you could bounce grenades off of them and stuff. But anytime there's any kind of explosion or anything, it would chip away at these
islands and it would start to take away some of the pixels. And so I was always fascinated with
this. Like, how do you bounce a grenade off of this 2D array,
this binary array of pixels? And so that kind of led me to this algorithm,
which is really cool. So basically, you could imagine kind of tracing the outline of one of
these islands. And then, you know, anytime there's an explosion or something,
retracing that part of it. And, you know, maybe an explosion caused an
island to split into multiple islands, and that's fine, you could detect that. But then you have this
problem of, okay, now I've traced this island and I have the outline, right, the
shell of this island. How do I put that into a physics engine? Right? And even
if you were to put every point in a physics engine,
the angles would be all really jagged and stuff, right? And so looking at some game development
sites kind of took me to this, which basically this algorithm is this really fast way of saying,
okay, I have this, it could be a line, it could be a polygon, it could really be anything,
but I have this, you know, line that's
made up of a ton of points. And I want to keep sort of the structure, you know, the essence of
that line, but get rid of, you know, a ton of the points. So it's a simple, unsupervised way to do
that, which I thought was really cool. And the gist of it is you essentially
like go from the start to the finish. I guess this wouldn't work if it's a loop. So there has
to be other ways to do that. Oh, I think it actually still works. You have to adapt it a
little bit. But you go from the start to the finish and you kind of draw a fat line,
you know, like imagine like a rectangle that's oriented, right?
Like a fat line from the start to the finish.
And then you see like, okay, are all the points just in that line, in that rectangle, in that zone?
If they are, you could just delete all of them.
And now you have just a start point, a finish point, and a straight line.
But if there are some points that fall outside of that zone,
then that means, you know, if you were to delete that point, you'd kind of be changing like the
essence of this. So maybe it kind of like is more of like a parabola or something, right?
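The line-simplification procedure described in this segment (keep the two endpoints, check whether every point sits inside the "fat line", and split at an outlying point when one does not) can be sketched in Python roughly as follows. This is a minimal illustration, not the code from the linked article: the names `perpendicular_distance`, `simplify`, and `tolerance` are hypothetical, the half-width of the fat line is modeled as `tolerance`, and the sketch assumes the common variant that splits at the farthest point and measures distance to the infinite line through the endpoints.

```python
import math


def perpendicular_distance(point, start, end):
    """Distance from point to the line through start and end."""
    (px, py), (sx, sy), (ex, ey) = point, start, end
    dx, dy = ex - sx, ey - sy
    length = math.hypot(dx, dy)
    if length == 0.0:
        # start and end coincide (for example a closed loop), so fall back
        # to the plain distance from the shared endpoint.
        return math.hypot(px - sx, py - sy)
    return abs(dy * (px - sx) - dx * (py - sy)) / length


def simplify(points, tolerance):
    """Return a subset of points that stays within tolerance of the original."""
    if len(points) < 3:
        return list(points)
    start, end = points[0], points[-1]

    # Find the interior point that lies farthest outside the "fat line".
    farthest_index, farthest_distance = 0, 0.0
    for i in range(1, len(points) - 1):
        distance = perpendicular_distance(points[i], start, end)
        if distance > farthest_distance:
            farthest_index, farthest_distance = i, distance

    # Everything fits inside the fat line: keep only the two endpoints.
    if farthest_distance <= tolerance:
        return [start, end]

    # Otherwise divide and conquer around the farthest point.
    left = simplify(points[: farthest_index + 1], tolerance)
    right = simplify(points[farthest_index:], tolerance)
    return left[:-1] + right  # drop the duplicated split point


outline = [(0, 0), (1, 0), (2, 2), (3, 0), (4, 0)]
print(simplify(outline, 1.0))  # [(0, 0), (2, 2), (4, 0)]
```

A larger `tolerance` throws away more points, so for a game outline it would be a tuning knob between smooth collision shapes and fidelity to the pixels.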
So if you find a point that's outside of the zone, then you kind of do this subdivide type thing,
or it's like a divide and conquer approach where you say, okay, let's go
to that point and create a rectangle from the start to that point. And then a second rectangle
from that point to the end. And now with these two rectangles, did I cover everything? And so
you kind of keep going in this way. And yeah, it also looks really mesmerizing. So
if you ever wanted to make your own Worms game, this is
how you would do it. Yeah, I think there's a few ways of doing this line simplification. This
one uses, like you're kind of saying, this fat line as, like, lateral stuff. I think there's some that
use sort of circular distance as well, depending on what you want to do. But yeah, you're trying to
preserve the shape but also simplify the number of points. Yeah, exactly.
Cool. My news article is how to pick a starter project. Now, ironically, or I guess, this takes the
stance initially of telling you how to pick a starter project for someone you want to get rid
of. And by illustration of all the things that you do to get rid of someone, it's highlighting
how to not pick a starter project.
And it's kind of a funny thing, but I guess nothing in here is sort of earth shattering.
But reminder that starter projects really are important for onboarding new folks, like those
first things that you have them do and making sure that it works well. It also gets in a bit
to choosing mentors and that kind of stuff. And I just wanted to highlight it because
I think this is a really poorly thought through thing, like how to onboard new people to teams.
And some companies will have like a company wide onboarding process. But I think like still the importance of within a team,
making sure new folks get up to speed
and how that's like a joint thing across the manager,
but also the members on the team.
And so here it's highlighting things like
pick a gigantic project that has vague requirements
and goes cross team and cross discipline
and is tracked at a very high frequency
by upper management. These are all ways to inspire someone to quit on the spot and to not get brought
up to speed. I think like oftentimes, even under poor circumstances, people do integrate into the
team eventually. But I think like often those formative first few months are critical to kind of
having a really positive experience and making sure that the team knows that they should be
taking time away from their other tasks to onboard a new person. And I think like,
whenever you change jobs, it's always difficult in the beginning. And I can look at these comments
and think back to some of the times I've switched
teams or changed companies and had starter projects that more or less, I guess,
were intending to push me away. And maybe I was too dumb to realize that I should have changed
jobs. No, just kidding. Your starter project is to delete all the code that your boss wrote.
I did have a starter project that was, you're new to the team. The team's been
divisive about coding style. So why don't you come in and start enforcing coding guidelines?
Oh, no. You've got to be kidding. Is that real?
Nope. That's absolutely true.
Anyways, it is important.
And I think as well, if you say, well, okay, I'm not a manager or whatever, I'm not
assigning starter projects, I think it's important to know that this is a thing that
people don't always do well. And so if you're new to a team and you're not getting good work,
be communicative to your boss or your coworkers. Like, hey, this seems kind
of vague. I'm happy to work on it, but is there other stuff that I could be doing as well? Be proactive
in being an advocate for yourself and, you know, assuming that they're not doing it on
purpose, try to get work that you think is cleaner. I would say one of the more common
unintentional mistakes is giving someone something that you didn't realize was
a lot harder than it seemed. And so you give someone a project, you think it's really easy,
and then they really, really struggle. This has happened to me before. And then it turns out,
once you sort of start reviewing what you're doing, people are like, oh, I did not realize
this was that complicated. I just thought you were kind of slow. And it's like, well,
people really should be assigning work that almost they would just sit down and do in an
afternoon. Like it really should be too easy, better for them to finish too fast than to get
bogged down. Yeah. The other mistake I see a lot is where, if there's a junior engineer,
you know, a manager might say, oh, I really wish
I had time to work on Project X, which is some new project that no one on the team has time to do.
But if we did it, it's like this big win.
And then somebody joins the company.
They say, oh, this person has 100% free time.
They just joined.
We're going to put them on Project X.
And it ends up being this
really risky thing. And from your perspective as a manager, you might feel like, well,
I'm giving them opportunity. Yeah, this person has a chance to hit a home run. And there's no
downside because they just joined and they can pivot and they're not going to get evaluated right away or anything.
This is actually a terrible idea, right?
So the reality is people are under kind of the most pressure to perform in the beginning.
And so this advice here in the article about putting somebody on the main bread and butter
project.
I see people mess that up
so many times. But yeah, as Patrick said, you want, especially for junior folks, but even for
senior folks, you want them to be exposed to, you know, the core essence of the company, even if
your long-term plan is for them to try and invent something new. Cool. So my article is tic-tac-toe in a single call to printf.
These code golf things, I love these things. I've never tried it. I don't think I would enjoy
actually building this. It's more of like an art installation than a coding exercise,
but it is so freaking cool. So as you might imagine, there's a ton of pound defines to make this a reality.
There's a certain way in printf
you can actually capture input.
I did not know that.
I didn't know it either.
Yeah, it's, I'm trying to find it here now.
I mean, there's a whole document here
on how they actually did it.
Anyways, there's a way you can capture input with printf
and they're hijacking that.
And so basically this prints a tic-tac-toe board, lets you type in input and play games against yourself.
The entire thing is done in one printf call.
There is a while loop, while printf.
I guess that's because every time you enter a key or something, right?
But really freaking cool.
And they actually took the time to make it
actually visually artistic. So the entire program is spaced in a way where there's
like ASCII art of a percent n in the program. So it's, oh, actually, I take
it back. There is a scanf. Yeah, I see that.
In the argument to printf, there's a scanf.
Oh, okay.
So actually.
No, but that's still crazy.
I didn't know you could. It does not seem like a good idea.
Oh, the scanf is inside the printf.
Yes.
Oh my gosh, my mind was just blown.
Oh, that's not that bad.
It's just in the argument string. So like comma, scanf, open paren. So it's
just a function call that's taking place, and then the result going in as an argument. Okay, okay, okay.
This is not as, like, devious, as cynical. Yeah, okay. This is so crazy, but I got it now.
Okay, okay, got it. Yeah, I got it too. Yeah. So okay, so it's a call to printf, but that printf
has as an argument a call to scanf. So there's two functions, but the whole thing is freaking cool. I just love seeing stuff like this. So definitely check it out. It will give you a chuckle and you can read through how it's implemented, which is very clever.
It will give you a chuckle if you're either very against C or very deep in admiration
for the C programming language.
I think that those two folks
may like it. Everyone else,
Jason's description may be sufficient
for you to realize
you could save yourself the click.
Well, you could copy paste this
into your terminal and GCC it
and then play tic-tac-toe. It'd be a fun
exercise. They even have instructions
on how to do that if you're new to the program.
If there were only a web browser where you could just
type a website and play web games that
don't involve incredibly
obfuscated C code.
Okay,
this is pretty cool. I would say the name
almost is also a good competitor
for obfuscated. They did a good job naming it to seem really sinister. It also claims printf is
Turing complete, which doesn't surprise me but sounds horrible. Yeah. Oh my gosh, that's wild.
That's probably like a complexity measure of your language, which is like, how many features
of your language are themselves Turing complete? Oh, interesting. Yeah. I wonder, you know, I wonder what,
so there's historically been languages where I've had a hard time working with them.
And I feel like that's pretty common. So one of them is Scala. Another one is Haskell.
Those are two languages where I don't have strong opinions. I mean, I actually enjoyed Scala.
But I found that, you know, when you started to scale up the team size,
Scala and Haskell programs both became really difficult for me, at least, to understand. And so, yeah, I
wonder if there's somehow a connection. Like, I wonder if you could somehow take all this anecdotal evidence and regress
it to say, okay, somehow some complexity metric of the language causes it
to balloon in this way. Interesting. So why is Python so bad as a programming language then?
What is the zeitgeist on Python? Oh, no, no, no, no.
We're going to get demoted.
We've got to keep making progress.
Okay.
I don't know.
I don't know.
We also have to throw JavaScript in here, too.
Oh, my gosh.
Okay.
Mine is completely unrelated.
It's not an obfuscated C code,
although it probably does have C code on it.
And that is the Artemis 1 project.
By the time you listen to this,
hopefully it has successfully launched. At the time of recording, they've sort of attempted the first
launch, which got scrubbed. But I want to say this had been a big deal for a long time for
the United States government, trying to send a rocket effectively back to the moon, this first one unmanned,
eventually, you know, a manned moon base. Contrary to the last time the United States did this, this one's actually being done much more internationally. Some of the sections of
the rocket are actually done by the European Space Agency in collaboration with NASA. So I feel like
this is a bit different in that way, leaving all the politics aside of the last one. But I want to say, I feel like this came a bit out of the blue. Like, it is a big deal, I think. And I think that it had been going on for so long, so much of a run, people kind of forgot about it. And then it was sort of like, oh, it's on the launchpad getting ready to go. And so that's pretty exciting.
So yeah, I mean, again, politics aside,
but I thought that back in like 2010 or something, they totally defunded NASA. So I was really
surprised to see this. So I really don't know what transpired there.
Yeah, I don't know about defunded NASA. But there's been a history of telling various government
agencies in the United States what to do, but then not giving the funding. So it's like a two-step process. First you
give them a new mandate, and then you're supposed to go in the budget and sort of
budget for that new mandate. And so there can be, like, hey, you need to go do X, and then we go to
make the budget and we don't give you money to do X. So are we really telling you to do X or not? It's a
bit ambiguous.
And so I think that's true of NASA, but true of other government agencies too, but I'm not sure on the particulars. But no, it hadn't been defunded. It had been sort of, in some ways, people will
say, uncancellable, because it was involving so many states, so many companies, so much international
cooperation. There's just all this stuff in it. Sometimes, you know, estimation for timelines, it's not that software engineers are the only ones who get them wrong.
Other companies get it wrong too. But no, I mean, leaving that aside, I think it's exciting.
People quibble over the expense of it, but I think that giving that sort of optimism and
hope of going to space and putting people onto the moon, and the immense amount of
technological research that falls out as part of this, I think it's an exciting thing. I won't
justify that it's worth the cost. I don't know. It's hard to say. Yeah, it's exciting. And
like, for, you know, myself having children who are elementary age, their teacher turned it on,
they get excited about things, about science.
I think from that stuff, it's really hard to measure just how big of an impact these kinds of things have.
Yeah, I was talking to somebody about something related to this. Look at, like, the Sistine Chapel and, you know, the pyramids, like how we did these
things that were kind of really powerful and then also wholly unimportant at the same
time. Like the pyramids. So the pyramids are extremely cool as a tourist attraction,
and I'm sure Giza generates a lot of tourist revenue from it. But at the time, it was just maybe a thing to do, or like a religious endeavor. I don't actually know the history of the pyramids, but I feel like humanity is full of these sort of endeavors where it's not totally clear what to do, but you do feel like this is sort of a milestone in humanity,
whether it turns into something economical or not, right?
It's always hard, right? I think these things are complicated, but as a celebration of what
humanity can do along this, like, technology, you know, pushing the envelope, doing these new
things, I think it's incredible. Also, there's been a theme of various space and rocket related things throughout the history of
the podcast. But I mean, I think I said it even for our predictions for this year, which
probably end up needing to slide out or whatever, but just the amount of new rocket
hardware coming online and people doing things. We haven't had as many of these duo episodes,
but there's a number of other sort of interesting rockets coming to fruition.
It's really cool that soon our ability to access orbit and access things like the moon
are going to be just so different than they were before.
And I'm really excited to see what we're not anticipating about such a transformation.
What do you think about the space slingshot?
Is it spin launch?
I think that's the name, right?
Oh, yeah.
I don't know.
You're talking about the one that spins in the centrifuge that's in a vacuum and then
launches.
I mean, I think it's one of those things. People have done stuff similar before in research, with
giant, you know, artillery guns, basically, that would
shoot rockets out. The rocket equation, which, oh, this is not my area of expertise, is basically
the fact that you need a huge amount of rocket fuel to lift your rocket. But you add rocket fuel,
and rocket fuel is heavy, so therefore you need more rocket fuel to lift the rocket fuel to lift
your rocket, and you end up with this sort of cascade of stuff. And the Earth is in a pretty heavy gravity well, right? The atmosphere is dense,
the Earth is massive, so it's pretty hard to get to orbit. And so if you can get just a
little percentage of Earth's diameter away, or Earth's radius up, and sort of lessen the
effects of gravity,
lessen the effects of the atmosphere, and just sort of get through those really quick
by energy held in your launch device, not in your rocket, you simplify a ton of things.
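The "fuel to lift the fuel" cascade mentioned here is usually written as the Tsiolkovsky rocket equation, delta_v = v_e * ln(m0 / mf), where m0 is the mass at liftoff and mf is the mass once the propellant is burned. The Python sketch below only illustrates how quickly the propellant share grows. It uses the ideal equation with no gravity or drag losses and no staging, and the function name and every number in it are assumptions chosen for illustration, not figures from the episode or from any real vehicle.

```python
import math


def propellant_fraction(delta_v, exhaust_velocity):
    """Share of liftoff mass that must be propellant (ideal rocket equation)."""
    mass_ratio = math.exp(delta_v / exhaust_velocity)  # m0 / mf
    return 1.0 - 1.0 / mass_ratio


# Assumed, illustrative exhaust velocity in metres per second.
EXHAUST_VELOCITY = 3000.0

# Each extra step of delta-v pushes the propellant share closer to 100%.
for delta_v in (3000.0, 6000.0, 9000.0):
    share = propellant_fraction(delta_v, EXHAUST_VELOCITY)
    print(f"delta-v {delta_v:.0f} m/s needs {share:.1%} propellant")
```

Because the relationship is exponential, any velocity supplied by the launch device rather than by the rocket removes a disproportionate amount of propellant, which is the appeal of the slingshot idea.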
Of course, doing that has its own caveats with acceleration and stuff. But actually,
SpinLaunch seems to know what they're doing. That is not necessarily always enough, but they
built something at a scale people didn't really think was possible. I feel there's a lot of armchair
quarterbacking. So I'm hopeful that it'll work, because the idea is you have this giant
circle that stands up on its side. It gets almost all the way to a vacuum, with just very,
very little air left in it. And then they rotate this huge arm in the
middle with a rocket on one end, basically. And then at the right time, they unleash the rocket
and it, you know, shoots out like a sling, shoots up into the atmosphere, and it gets
through all that hard, dense part of the air, gets to a pretty high altitude, and then lights its
rockets. And so it can be much cheaper, smaller, more compact, and you can launch a lot of times, because you just load a new one up and you do the
same thing again. And so, I mean, if it works for small satellites, it would be a huge unlock.
Yeah. I mean, it looks really freaking cool. I mean, Patrick did an amazing job describing it,
but you have to watch the video.
It's just really, really cool to watch. Fingers crossed, I hope it works. And also, good luck, Artemis.
And if you already know what happens in the future, well... Yeah, wait, actually, so we should predict. Do we
predict Artemis will launch in the next month? Okay, well, we're recording at the very, very end of
August. You're saying by the end of September?
Yeah. So I think this show will go out in October, but anyways, let's say September. Will
the rocket launch in September? Well, the rocket will have launched
in September. Yeah, that's right. Nice job. I'm going to say yes, just because I want it to be true.
All right. Yeah, I feel like it will too. I mean,
they're so close, right? I mean, I think it was just some kind of liquid leak or something. I mean,
they could easily fix that. Flex Seal: just slap it on there and it's good to go.
Now you know why Jason and I are not rocket scientists. All right. Yeah, exactly. Oh my gosh.
The basic Flex... I bought Flex Seal tape the other day. It actually is pretty awesome. I used it to
fix a hole in our pool. You know how a pool has that thing which goes on
the bottom of the pool and cleans the bottom? Okay, yeah, like a vacuum, call it.
Yeah, basically a pool-vacuum type thing. Well, "pool vacuum" means something else; that's a thing that you manually do. Anyways, I
used Flex Seal to fix a hole in that. It actually worked pretty well. I was impressed. I think there
was this one I found online that I also ended up using for some other project that
actually gets hot when you stretch it, and the tape
chemically bonds to itself or whatever. It's wild. I mean, they have amazing, crazy tapes now on
Amazon. We can talk about the tape that's electrically conductive through the thin
part but not across the tape. So along the long part it's not conductive. So electricity can
pass through the tape but not along the tape. Whoa, that's wild. Okay, anyways. All
right. Oh, actually, one more shameless plug. So
there's a gentleman that I know who started a
company called BitRip, and BitRip is basically... I mean, I'm probably going to
totally butcher this, but imagine just a roll of tape with QR codes. And so the idea is, if you're
out in the field, like you're an electrician or, you know, a public utilities worker
or something, you could just slap this on anything, and then you scan the QR code and
put some data into some app, and then someone behind you
can scan the same tape, like, a year later. I see. So it's like every rip is a unique
code? So is it like a non-repeating pattern, like, what do they call that? Oh man, you know, I
don't know the details. I always thought it was just tape with QR codes on it, but I haven't
actually seen the product.
Oh, okay.
Let me see.
BitRip.
Yeah, it looks a little bit like QR codes,
but they're not exactly QR codes.
Oh, yeah, you're right.
It's almost like a barcode or something,
but somehow there's a unique fingerprint.
Now I'm curious with it.
You brought this up, man.
Now we got to know what the secret is.
We need to get the BitRip guy on the show. 600 GPS tracking tags, embed photos, documents, audio. Oh yeah, it's definitely not a
QR code. It's some kind of, like, a Penrose tiling. If I just had to take a naive guess, I
think it's a Penrose tiling, which is like a non-repeating infinite series. I think you're
right. Yep, yep. But okay. Oh,
we probably messed up his pattern. We probably should be here, like, reverse engineering tape on...
Yeah, so how many of you think that in the month of September we'll get sued by the BitRip guy?
All right, it's time for Book of the Show. My Book of the Show is The Meditations by Marcus Aurelius.
So I remember first hearing about stoicism a long, long time ago and thinking to myself, oh, that's pretty much me.
I live like a really simple, relatively simple life, always trying to simplify things.
And it kind of really resonated with me.
And so years and years later, I decided to read this book.
Marcus Aurelius is, I wouldn't say he's the founder of Stoicism, but he's the person who really kind of popularized it.
And I guess, you know, it was a bit of a, what's the word?
Like, I think I'd overhyped it too much in my mind.
You know, it's kind of like, actually, Patrick, you brought this up before the show, how another one of these books is The Art of War by Sun Tzu,
where everyone talks about these books all the time and you think, oh, this is
going to be something that's going to totally blow my mind. And what I actually found was that so many
people have already talked about this book that I already kind of knew what was going to
happen. And so it's almost like watching the movie of Jurassic Park after you've read the book,
or maybe Harry Potter or whatever, any of these. So it's kind of like, you know,
it was a little bit underwhelming because I'd already kind of known the material. But
I felt like it was still a good book about halfway through it. If you don't know what stoicism is,
or if it sounds interesting,
if you're interested in how to lead a simple life, like what that means for like metaphysically and
everything, check it out. It's a good book. It's also, you know, obviously really dated. I think
Marcus Aurelius was what, like a Greek emperor, I think, or Roman emperor. I think Roman emperor.
Yeah, Caesar, right. A Roman emperor.
Roman emperor. There you go. So, yeah. So,
I think it's going to be a very dated book, but it's a little difficult to read. But I think it's
nice. And it could be sort of something that you do while you're in the car and just kind of have
it in the background. And there's probably a lot more contemporary books on Stoicism and the other kind of philosophies I highly recommend.
I think understanding some of those philosophies and approaches, even if you're not saying,
hey, I'm actively seeking to model my life or to do this as like a key tenet, I think
is still useful for helping see how other people think, to just have new ideas, and to kind of question if there's
some nugget of value there for some part of your life rather than necessarily reading a self-help
book where you're saying, I'm going to adopt this as like everything about who I am and make it my
core identity. I feel sometimes there's, I don't want to say like a pressure, but in my head,
at least like a thought, like, well, if I read this, I'm kind of wanting to adopt it as like, and I don't think that's accurate.
I think like Jason's pointing out, you can read it and still learn a lot from it.
Yeah, I think, you know, I was really fascinated when, you know, we were working on YouTube.
I was really fascinated at like, what made things go viral?
You know, like what actually makes things go viral? And one thing
that I learned in that process is that a lot of things that you think are organic are actually
not organic, right? So there's actually a lot of viral videos where you look at that
and say, oh man, you know, perfect timing, but it's actually highly, highly scripted. So it's very hard to tell what's real,
what's fake. But either way, I think that what I noticed was a lot of viral content on really anything, any type of media, it sort of taps into this kind of like latent, I don't know how you
describe it, just like latent, like shared common sort of culture, like the sort of like hive mind of humanity or maybe hive mind of, if you want to be more local, like hive mind to your country or your region or whatever.
It's like it doesn't directly say like, hey, you know, we're going to talk about stoicism today.
But it just sort of taps into a lot of that sort of latent energy.
Harry Potter is the example that keeps coming up in my mind. If you actually look, almost all
of those stories... like, at one point they fight a snake, and with Harry Potter
fighting the snake, you can see the parallel to this other ancient story that
people read for hundreds and hundreds of years. And so it's like you have this evolutionary footprint,
you know, and really popular content sort of taps into, piggybacks on, that footprint.
And so, yeah, reading a lot of these canonical books will give you kind of an
understanding into that, which would help you if you ever wanted to produce some content, like write a
book or anything like that, that wants to tap into that same energy.
Well, as tradition holds, Jason gave us a very highbrow, very thoughtful book. This may be a bit late, but we've not done as many of
these recently. So it was all the fad during sort of like the COVID at home stuff to kind of bake
sourdough. I see that. I see that meme. I remember that.
Mine is a book about bread baking, Flour, Water, Salt, Yeast by a gentleman named Ken Forkish. And there are a lot of books about bread making and artisan bread.
And I think this one, for whatever reason, just reading it in the stories,
it really kind of like made me excited to try the recipes, to do them, to take the approach.
I just thought it was a very thoughtful way of thinking about bread baking.
That's a tough one.
There we go. And, you know, just using
simple ingredients. I guess they call it sort of a lean bread, which is, there's no
fat in it, right? Just flour, water, salt, and yeast. They of course have... Is that literal? Like, do
they literally put fat in bread? Yeah, so an enriched bread. So, I mean, like a challah bread or, you know, something
like an egg roll, right? Or like a yeast... you put egg or butter or oil. Like, a lot of pizza doughs
have oil in them. Wow, okay, that makes sense. Yeah, pizza dough. Yeah. And so those would
be enriched in some way with some kind of fat. And a lot of bread we eat
has a little bit of that. And so
kind of going back to this very basic, sort of loosely shaped, you know, round bread gained popularity, I
know, on the west coast of America. Portland and San Francisco, these
places have kind of developed a renaissance of this style of bread making.
But if you're interested in bread making, which most of you probably aren't, that's fine.
I would encourage you to check out this book. I really like this book. I baked a lot of the
recipes in here and had a good time. You know, I don't get super, super into it. Probably should
do it more. But it does take quite a while. It's an endeavor, but it feels good at the end to really
eat it. And there is something hugely different about eating a loaf of bread, you know, smelling it, cooling it, eating it that you made
versus, you know, going to the store and buying one. And so I think it's something that people
should try at least once. Yeah, this is super cool. So I was looking up the author to see if
Ken Forkish is the real name or not. It just seems like too good to be true to
write a book about cooking and he has fork in it. But so far, everything I look up says
it's not a pseudonym. There's actually a real guy named Ken Forkish.
I didn't even think about that until you said it. Yeah, I think he runs a restaurant in,
I think, Portland.
That's right. Yeah, you got it. He runs a bakery in Portland. And he actually,
before opening that bakery in 2001, he worked in
Silicon Valley as a tech worker for 20 years. Oh, maybe this is why it resonated. I didn't
know this. Yeah. So that's wild. So 20 years, so that means he joined, he went to Silicon Valley
in 1981. That was probably when it was literally Silicon, you know, like making chips and everything.
Then yes, worked there 20 years, opened a bakery.
Good for him.
Very cool.
All right.
So time for tool of the show.
My tool of the show is this app called Pythagorea.
I guess sticking with the Greco-Roman theme that I have going here. So Pythagorea is this game where you basically have to solve geometric puzzles. And so that sounds like it would be really boring, like,
you know, doing math homework as a game or something. But they actually do a great job
of making it really engaging. You know, one of the things that they do really nicely is,
you know, the game is played on this 16-point grid, or maybe, you know, it's more than that. It's a 36-point grid. So there's
a six by six grid of dots. And so, you know, the first level is very simple. There's a dot on the
left side, dot on the right side. They're like, you know, find the dot in the middle, you tap the
middle of the screen, you move on. And then, you know, it kind of tells you, hey, you know, you have two dots, you know, make like an isosceles triangle. And so you can drag
lines between the dots to make triangles, but you're restricted by these dots. And the dots,
the fact that you can only draw a line either from dots to dots, or you can make new dots where two
lines intersect, that's where it starts to get
really complicated. So for example, there was one puzzle where you kind of needed a point that was
sort of in the middle of four points. So you needed a point where there wasn't one.
And so what you have to do to solve that is you can just make a little X, right? So imagine in your mind, like, four points, you know, in a square shape, right?
And if you draw the two diagonals, now you have an X, right?
And so then the middle of that X, you can actually tap that and now make a fifth point
in the middle.
And so you can now like when you get to the harder levels, now it really opens it up because
you can really make a point anywhere as long as you can figure out how to get two lines to intersect at that place, right?
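The "draw two lines and tap where they cross" trick described above can be sketched in a few lines of Python. This is only an illustration of the geometry, not anything from the game itself; the function name `intersect` and the coordinates are assumptions made for the example.

```python
from fractions import Fraction

def intersect(p1, p2, p3, p4):
    """Return the point where line p1-p2 crosses line p3-p4, or None if parallel.

    Points are (x, y) tuples of integers, like dots on a grid. Fractions keep
    the result exact, so a point such as (1/2, 1/2) is not rounded.
    """
    (x1, y1), (x2, y2), (x3, y3), (x4, y4) = p1, p2, p3, p4
    denom = (x1 - x2) * (y3 - y4) - (y1 - y2) * (x3 - x4)
    if denom == 0:
        return None  # parallel (or identical) lines never give a single point
    a = x1 * y2 - y1 * x2
    b = x3 * y4 - y3 * x4
    px = Fraction(a * (x3 - x4) - (x1 - x2) * b, denom)
    py = Fraction(a * (y3 - y4) - (y1 - y2) * b, denom)
    return (px, py)

# Four dots in a square: the two diagonals cross at the new "fifth point".
center = intersect((0, 0), (1, 1), (0, 1), (1, 0))
```

For the unit square, `center` comes out as the point (1/2, 1/2), which is the spot where no grid dot existed before.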
So the hard puzzles start getting really hard where it's like, okay, I need to go to like seven
sixteenths of the way between these two points. And so like, what lines can I draw to like make
that happen? And so it's actually
really fun. I mean, it's one of these things that's very hard to explain
through audio, but I highly recommend you check it out. The other thing is it's completely
free. It's a donationware game, so I went and gave them, I think it's like a dollar or
whatever that they're asking for. But you can play the entire game cover to cover totally free, no ads, nothing like that. And I found
it really stimulating. The other thing is, as soon as you get it right, it kind of dings and you
know you got it right. You're not really guessing. And so some of the levels, you know, I would stare
at it, stare at it, stare at it, and I kind of find out, okay, here's sort of the trick.
And then you get that trick, you solve the puzzle.
It's very satisfying.
I felt like they did a good job with the pacing.
A game like this, it's very easy to make a level that's extremely difficult,
and then you just can't move on,
and it's really frustrating.
They did a good job of ramping up the difficulty. And one of the
other things they did to help with the pacing is the game is broken down into chapters, but the
chapters aren't in increasing complexity. They're just different phenomena, geometric phenomena.
And you can actually play all the chapters asynchronously. So if you get stuck in chapter
two, you just go to chapter three. So I felt like that was also really clever game design. And yeah, definitely check it out.
Totally free, so there's nothing to lose. Awesome, that's really cool. So now you're going around
with a compass and ruler and making dodecagons and showing all your friends. Everything
looks like a geometric problem. It's like, okay, you know, this door won't close, let me
pull out Pythagorea. You know, that's awesome. Mine is, we might have even had this as a tool
of the show before, but that is Google Keep. I feel a lot of people may have heard of this before,
but some may have not, and it's a way of doing note taking. But I think the power that I
had recently realized about Google Keep is sticking with it and using it and jotting what
amounts to kind of like Post-it notes in the app or on the web, or sending links as a way of
doing bookmarks, and putting pictures in, and just sort of gathering a lot of unorganized data and
then just being able to search it, to be able to go to dates. You can do categorization, but even just leaving it
messy, and then putting stuff there over time and building it up. And then, you know, something
happened to me where I was like, I think I had written this down one time. I was cooking some
dinner, and I was like, oh, I think last time I debated what temperature to put this at. And like,
well, let me go see. Oh yeah, sure enough, I took a note here because it felt like something I'd want to remember.
And so just putting in these little shots of a note or a picture, a recipe, or a link, and being
able to find those things later. Anytime I try to figure out something that I knew and
don't, or couldn't find it, I always try to make sure to go put it in there the second time.
Cause if I needed it, you know, twice, probably going to need it again.
And so building that up over time,
I think there are some open source
or different alternatives
and other platforms that people use.
So this is, my tool is Google Keep,
which Google allows you to use for free.
It comes with all the traditional Google cons, I guess.
But that is a pro.
But then, you know-
Are there ads or no ads?
I don't think I've seen ads, but people in general have a love hate relationship with Google, which I completely
understand. And so don't put any private information there, I guess that you're not
willing to share with Google. But, you know, using a tool like this, I guess would be my
shout out, which is something where you can just, very low overhead, not super structured,
just sort of put your information in and allow it to kind of accumulate. We talked about the importance
of email organization. If you're super organized, maybe you don't need this. I already fessed up in
the beginning that it's something I need to do better at. But, you know, I think here,
this is a way for me to kind of not lose those little ideas.
Yeah. Do you rely on search, then, to retrieve the notes? Okay, so
the notes aren't hierarchical or anything? No. I've seen stuff and always been intrigued about
doing that. I feel like hierarchical or linked notes and stuff would
be really cool, except that I just know that I'm going to worry more about where in the
hierarchy it goes, and then therefore I'm not going to put stuff in if I don't think it'll fit in the
hierarchy. That makes sense. Here it's like, I need to just record it. Yep, yep, totally makes sense.
I wonder if maybe we could automatically generate the hierarchy.
Oh, what do you mean? Pretty cool. Like maybe with some unsupervised learning.
Speaking of which, if only our episode today was about... How did that happen?
Cluster my notes.
Clustermynotes.com.
That needs to be...
Oh, did I tell you? It's a bit of a side topic,
then we'll jump in on unsupervised learning.
I made a website called visual-if.com.
The idea is, the UI is very clunky, but it's an interactive fiction game,
but it runs DALL-E as you're playing the game.
So, you know, you type in, like, you know, you get past the intro screen,
and it's like "look up," "look down," "move east." One of these interactive fictions is like Zork or Adventure, right? But
anytime there's a room description or anything, DALL-E is running in the background and you get
some crazy, you know, art installation of whatever that is. And it really actually makes the games really fun.
If you've played interactive fiction before,
if you've played a particular game before,
you could play it again
and just see all these really trippy pictures with it.
See if it matches up with what you were envisioning
when you played it the first time.
I'm not sure if I went to the right spot or not.
It's either really well done to be confusing
or I'm on some other random person's website.
So it's Photopia is the game that starts when you...
Okay, yeah, yeah, yeah, yeah.
Yes, it says, would you like instruction?
Yes, the UI sucks.
I'll admit it.
So it says, would you like instruction?
You actually have to click on that
and then type no or yes. Oh, okay, okay. It says, "Will you read me a story?" You have to actually click again,
and then now you're in the game. I need some way to... maybe I should make a little intro
screen before the game to kind of tell people what they're in for. Nice, this is cool. I mean, these
pictures are kind of creepy, I'm not going to lie, the ones I'm
getting. Yeah, yeah, it's like there's a person skiing or something, at least that's what I
see. But yeah, it was pretty fun. I might try and double down on that if I have time.
So I found out people didn't really understand how to use that product.
I just asked people and they're like, I don't really understand what I'm supposed to be clicking on.
But as you kind of work on any type of product, definitely you can show it to your friends.
You could try it out yourself.
But you will eventually want to show it to strangers, right?
And as we talked about with Kevin in the marketing episode,
right, eventually you're going to give your either app or website or whatever you're building
to the public. And you're not going to be able to really look over their shoulder and find out
what they were thinking about it. So, you know, definitely we talked about marketing and surveys
and all of that. So there's a whole human element too. But another thing you want to do is you want to
kind of gain insights from data. So ideally, you know, imagine if you're making like this app,
for example, this visual IF, you know, I could, I mean, I didn't, I didn't implement this, but
you could imagine, you know, I could track where people are moving their mouse or what
they're clicking on, or if they're clicking, how long they're spending on that site. And that could,
you know, all go into some kind of report that would give me information. And then I could even
go a step further and A, B test. I could try a new version of the site, see if it improves, right?
The challenge is, you know, all of this data that you're going
to get is going to be highly unstructured, right? So imagine, you know, it's going to start with
these print logs that maybe you did when you were doing development. So, you know, print, you know,
person clicked on a page or print the page ID or print, you know, person moved the mouse to the bottom of the page, right?
And you have to somehow turn that into something that you can look at,
some type of graph that you could look at and say, oh, here's something that I can do to make things better.
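As a rough sketch of that first step, turning raw print-style logs into something countable, here is a small Python example. The log format, the event names, and the `count_events` helper are all hypothetical, invented just to illustrate the idea; real logs are usually messier.

```python
from collections import Counter

# Hypothetical raw print-style logs; this "event detail" format is an
# assumption made for illustration only.
RAW_LOGS = [
    "clicked page=intro",
    "moved_mouse region=bottom",
    "clicked page=intro",
    "clicked page=room1",
]

def count_events(lines):
    """Count how often each (event, detail) pair occurs in the raw log lines."""
    counts = Counter()
    for line in lines:
        parts = line.split(maxsplit=1)
        if len(parts) != 2:
            continue  # skip lines that do not match the assumed format
        event, detail = parts
        counts[(event, detail)] += 1
    return counts

event_counts = count_events(RAW_LOGS)
```

A table of counts like `event_counts` is the kind of thing you could then plot or compare between two versions of a site.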
And so that's, at a high level,
one of the main things that we want to do with unsupervised learning:
take, you know, a
lot of raw information. One of the biggest examples that is used repeatedly
is Wikipedia. You know, take all of Wikipedia, and can you just learn something from
having a computer read all of Wikipedia? So that's basically what unsupervised machine
learning is all about. And yeah, I think, Patrick, you brought this show topic up to the table.
I think it's a great show topic and something that I am really passionate about.
Yeah, I mean, I think like Jason was saying, this is taking your data and sort of helping to structure it. And I made the joke about clustering earlier. I feel that sometimes...
Jason is a machine learning, let's say, practitioner, right? Like, that's his
trade. I'm not. So in general I try not to do it, not because I don't know what it is or can't do it, but
just because it comes with certain, I don't know, honestly, expectations. And so when I set out to do
work, the
engagement model I have with the data that I have in front of me and with the task at hand
is a bit different than the one Jason might have as a machine learning practitioner.
At the same time, I think this area in parts has a lot of overlap between, and I'm just making this up, I don't
know, there's probably a better word, but sort of machine learning practitioners and sort of
other folks. And so I think there are things like, for example, like clustering, where there may be features of your data or heck, there might even already be numbers that, you know, you have along multiple dimensions, you know, just even two or three dimensions, where if you clustered them together and looked at them, that would already be quite helpful.
And you may say, well, that's not machine learning.
You know, there's no neural network. There's no, you know, whatever. But I think that's okay.
I think things where you're fitting a, you know, a regression to your data and trying to say,
you know, hey, look, there's numbers here, and I'm fitting a line to it. So I can think about what the next number would be, or between numbers or out past the last data point I have, right?
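Fitting a line to numbers and reading off the next value can be sketched with ordinary least squares in plain Python. The function names `fit_line` and `predict` and the sample numbers are assumptions for illustration; in practice a library such as NumPy or scikit-learn would normally do this.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Spread of x, and how x and y move together around their means.
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

def predict(slope, intercept, x):
    """Evaluate the fitted line at x (between points or past the last one)."""
    return slope * x + intercept

# Made-up measurements that happen to lie exactly on y = 2x + 1.
slope, intercept = fit_line([0, 1, 2, 3], [1, 3, 5, 7])
next_value = predict(slope, intercept, 4)  # extrapolate past the last point
```

Extrapolating past the last data point, as in `next_value`, is the riskiest use, since the line only summarizes the range you actually measured.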
I think those are kinds of things where you are overlapping in a lot of this and learning about those things and thinking about them.
And there's another tool in the toolbox.
We talk about that all the time.
And so I'm pretty excited to talk about some of these things because I think a lot of them have value even to people who wouldn't call themselves machine learning practitioners, but to anyone who has data, which turns out to be a lot of people.
Most people have some amount of data that they're trying to work with, and ways of organizing,
you know, cutting that data down, smoothing the data, thinking about the data, all those kinds
of things. Yeah, right. Yeah. So I mean, imagine like you have a bunch of data around, you know, people who visit your website.
Right. And so you want to set up, you know, clusters, you want to set up kind of cohorts.
For example, you might learn that there's somehow like there's a lot of identity around age.
So it's like the people visiting your website are either, you know, really young or really old for whatever reason.
And so maybe they're coming for two different reasons. You have to figure that part out. But with clustering,
looking at the centroids (the centroids are the centers of the clusters
that you develop) can give you a lot of information. So for example, you might have a whole bunch of different features,
and then you might do clustering, which basically says for each of the data points,
and a data point would be sort of like a set of features. So maybe a data point is a person who's
come to play your game and a set of features about them. Like how do you describe this person,
right? And so then,
after you're finished with a clustering algorithm, each person is
going to be assigned to one cluster, if it's a hard clustering. If it's a soft clustering,
then they could be assigned to some mixture of clusters, right? But that's neither here nor
there. So now you can look at these clusters that people are assigned to, and you could find the center of them. In other words, given this group of people who are all assigned to cluster A, you know, what is the center point there? So what is the person who would be most aligned with cluster A, this hypothetical person who just perfectly lands in the middle of the cluster? And you say, okay, this is sort of an archetype. There's something unique about this group of people. You can also
do this with faces if you're trying to do face recognition, or even if you're trying to do object
recognition. You can even do clustering on images and say, I have this huge bank of images and they're falling into
one of several categories. So I think, you know, clustering has, yeah, it's been used for
tons of different things. You can also then, you know, do machine learning on top of the clusters.
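To make the hard-clustering and centroid idea concrete, here is a minimal k-means sketch in plain Python. The episode does not name a specific algorithm, so k-means is a swapped-in choice; the visitor data (age, minutes on site) is made up, and seeding with the first k points is a simplification that real libraries avoid.

```python
import math

def kmeans(points, k, iterations=20):
    """Plain k-means (Lloyd's algorithm) with hard assignments.

    Returns (centroids, assignments), where assignments[i] is the index of
    the cluster that points[i] belongs to.
    """
    centroids = [points[i] for i in range(k)]  # naive init: first k points
    assignments = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: each point goes to its nearest centroid.
        assignments = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        # Update step: each centroid moves to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignments) if a == c]
            if members:
                centroids[c] = tuple(sum(dim) / len(members) for dim in zip(*members))
    return centroids, assignments

# Hypothetical visitors as (age, minutes on site): a young and an older cohort.
visitors = [(16, 30), (62, 5), (18, 28), (65, 6), (17, 33), (70, 4)]
centroids, assignments = kmeans(visitors, k=2)
```

Each centroid is the "hypothetical person in the middle" of a cohort. A soft clustering would instead give every visitor a weight for each cluster rather than a single index.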
What have you used clusters for in your work? Yeah, I think one of the things that came up recently is
we noticed, which I guess is just thinking about your data, we were doing some
processing, and sometimes certain configurations of input were causing the data to take a
lot longer than other configurations. Sorry, I'm speaking vaguely, but whatever. And so we were
trying to kind of understand, is there a difference in the features? And so what you kind of alluded to is exactly right, which is, hey, we have a whole bunch of measurements, like the size of the input data. Let's just say it was text, right? Like, how many characters are there? How many lines are there? You know, how many punctuation marks would there be, right? Things which we could kind of look at and say, hey, is there something about this that's making the processing take longer or not?
And once we sort of, you know, started plotting it out and saying, okay, hang on,
let's look at what the clusters are of these sort of easy inputs and hard inputs. And like,
what... Oh, okay. Well, look, these hard inputs all have, you know, a lot of extra
punctuation in them. And then, you know, realizing that the processing we were doing was going to
cost a lot more when that happened, right? But it was vague enough that it's
a bit difficult to sort of know that in advance and just look at your code and say, hey, actually, I
see here we do all this extra work on punctuation. It was
the sort of second-order effect that was causing it. And so by doing this clustering
and just looking at the results and saying, oh, look, these things are different than those things,
that allowed us to kind of say, hey, up front, let's check for that and handle them specially,
or maybe decompose them further or do something special. And so that's what we used it for.
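The kind of measurements mentioned here (characters, lines, punctuation marks) are easy to compute, and comparing their averages between "easy" and "hard" inputs is a simple first look before any clustering. This is a sketch under assumptions: the helper names and sample texts are hypothetical, and the real system discussed was deliberately left vague.

```python
import string

def text_features(text):
    """Describe a text input with a few simple numeric features."""
    return {
        "chars": len(text),
        "lines": len(text.splitlines()),
        "punctuation": sum(1 for ch in text if ch in string.punctuation),
    }

def mean_feature(texts, name):
    """Average value of one feature across a group of inputs."""
    return sum(text_features(t)[name] for t in texts) / len(texts)

# Hypothetical inputs, grouped by how long processing took.
easy_inputs = ["hello world", "plain text here"]
hard_inputs = ["a, b; c: d!", "wait... what?!"]
punctuation_gap = (
    mean_feature(hard_inputs, "punctuation") - mean_feature(easy_inputs, "punctuation")
)
```

A large gap in one feature between the groups, like `punctuation_gap` here, is a hint about which feature to check for up front.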
Yeah, that makes sense.
I mean, one area where you see a lot of clustering
is around log ingest and log reading.
So imagine you have a website.
The website has a MySQL database.
It's got servers, backend servers.
It's got the JavaScript on the client.
And all these things are generating logs, right?
Your server, your database is generating all these logs.
Like, oh, I'm getting full up or, oh, the utilization is too high.
Actually, to be honest, have you ever looked at a database log, like a MySQL log?
No.
I've never done either.
I just assume it works. It shows that we're terrible
DBAs. But, you know, it's generating a ton of logs, and, you know, if your database goes down or
all of a sudden it takes... Actually, there's a lot of popular websites
that are only run by like 10 people, you know what I mean? I think Craigslist is famous for
having just an extremely small staff for such a popular website.
But eventually, you'll run into this where your site just doesn't load.
And you're going to have to go step through all of these.
So you look at the client, say, OK, the client's fine.
You go to the server logs.
And it just says it's just waiting on the database access,
waiting on database results. Then you go to the database.
It's like, yeah, utilization is 100 percent, and you end up having to do something.
Right.
So you're getting all these logs.
A lot of it is code that you haven't written.
Right.
Because there are logs from programs that you're using, and you need some way to say,
okay, can I separate the signal from the
noise, right? Like, is this log actually interesting? And actually, a lot
of these systems like Sentry and Bugsnag and these other systems use clustering. So what they'll do is
they'll take every log line and they'll do what's called an embedding, which we can get to later.
But they'll basically take every log line and turn it into a point in some space. So imagine some
cube, but it's a hypercube. It's like, you know, a 200-dimensional cube or something. Actually,
we talked about... Yeah, I've got the 200-dimensional cube, I'm picturing it in my mind. That's not like Magnus Carlsen plays chess or
whatever. You know, but we talked to Edo Liberty about embeddings on that show. And so, yeah,
you have this big cube and you've figured out a way to take these lines of log and put them
in this cube, right?
So if I have a log line that's printing every second,
that's like, you know, things are good.
You know, it's like 19:01, things are good.
19:02, things are good.
That's going to look the same, right?
It's always going to say things are good and then some kind of date, right?
And so since it's so similar, and even the date, the number, the timestamp is also
kind of similar, those will likely end up close together in this space, right, once you've done
this embedding. And so you can throw all these logs into that space and then do clustering.
And chances are the, you know, "things are good" message will get its own cluster, since there are so
many of them, and then you could just throw them all away. You could also do things
like say, OK, this line of log isn't even really near any of the clusters, so it must be something
pretty unique, pretty special. So maybe for this one you should send an alert or something like that. And that's called
outlier detection. It's a really hard problem, but there's a bunch of great
libraries for it. There's PyOD, for example.
And it's all very related to clustering.
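The log example can be sketched in a few lines of Python. This is a toy under loud assumptions, not how Sentry, Bugsnag, or PyOD work: the "embedding" is just a bag of words with every number replaced by a placeholder, the clustering is a greedy pass with a made-up similarity threshold of 0.8, and the function names and log lines are hypothetical.

```python
import math
import re
from collections import Counter


def embed(line):
    """Toy embedding: bag of lowercase tokens, with every number mapped to <num>."""
    tokens = re.findall(r"[a-z]+|\d+", line.lower())
    return Counter("<num>" if tok.isdigit() else tok for tok in tokens)


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(count * b[tok] for tok, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def cluster(lines, threshold=0.8):
    """Greedy clustering: a line joins the first cluster whose first member is similar enough."""
    clusters = []  # each entry: (vector of the first member, [member lines])
    for line in lines:
        vec = embed(line)
        for rep, members in clusters:
            if cosine(vec, rep) >= threshold:
                members.append(line)
                break
        else:
            clusters.append((vec, [line]))
    return [members for _, members in clusters]


logs = [
    "19:01 things are good",
    "19:02 things are good",
    "19:03 things are good",
    "19:04 disk utilization at 100 percent",
    "19:05 things are good",
]
groups = cluster(logs)
noise = max(groups, key=len)                      # the big repetitive cluster
outliers = [g[0] for g in groups if len(g) == 1]  # lines that joined no existing cluster
print(len(noise), outliers)
```

The repetitive lines collapse into one cluster that can be thrown away, and the line that lands in a cluster by itself is the candidate for an alert. A real system would swap in a learned embedding and a proper outlier detector.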
So what are some other examples of unsupervised learning?
Yeah, I think a lot of these words like unsupervised learning, reinforcement learning,
a lot of them have become kind of really nebulous, right? As all of these fields have kind of like
overflowed, right? But now the hot thing is sort of self-supervised
learning. And the idea with that is it's still unsupervised in the sense that you don't have a
human... Actually, we should probably talk about that. So supervised learning is typically where you have a human in the loop. So imagine if I'm playing chess and I train some model to mimic my moves.
So if I move the pawn, then I tell some model, hey, when you see this board, I want you to
move the same pawn to the same place.
And so it's supervised.
I'm a supervisor, right?
And it is just trying to mimic this, right?
Now unsupervised would be a little different.
Unsupervised would be where, for example,
you might train an autoencoder.
So you might say, here's a picture of a chess board.
I want this algorithm to embed that picture. So find a function
that takes this picture of this chessboard and creates a point for that picture somewhere in
this space. And then I want another function that takes that point and creates the picture again, right?
And so you're going from the picture to the point back to the original picture.
And so when you do this and you train this model, it ends up having to represent, you know, the essence of that picture in that point.
So, for example, let's say all of my pictures
have the same chessboard and it's on a black table, right? And it's the same camera setup.
It's like a tripod. So it's very reliable, very stationary. It's all pictures of this black table
with this chessboard on it. So it can just recover the black table without needing any extra
information, right? Because for every single point that we draw in that cube, when you go back to
draw the picture, you're going to need that black table. And so that's where this really powerful
compressive ability comes in. The points now don't need to
differentiate based on the table. They're all going to have the table in them. And so, you know,
if two points are close together, then that means that the two images they generate must be similar,
even given the similarities that there are broadly.
Even at a hyperlocal level, they must be similar.
Otherwise, those two points will get pulled apart.
And so the way the autoencoder works is, you know, you generate the chessboard.
It's not going to look exactly right.
So you have some error.
And then you say, OK, you know, this pixel is too dark, or you drew a pawn here and
you really shouldn't have, it's empty.
And so, given that you know the right answer, you just tell the model, hey, here's the right answer,
adjust yourself. And it will figure out how to use that volume, that embedding volume, in the best way,
or in a good way, to be able to generate all of those pictures, not just one of them. Does that make sense?
Yeah, I think so.
I mean, that was pretty deep. But I guess when we were talking about clustering,
like you mentioned, there's no right answer there, no way to necessarily
feed back. You might, as a human, tune something, like the number
of clusters, and there may be algorithms
that do that, but there's really no right answer. With autoencoding, it's still
unsupervised, but you're kind of giving it a problem. You're constricting the
amount of information that can be shared between the left half and the right half, and
then saying, you need to simplify down to a representation, and then reconstitute that representation back to the original, and then look and compare the two.
So you have a well-defined metric for saying, hey, how well did you do at your task?
And so you algorithmically are supervising it.
But as a human, you're not at each interval labeling something or giving a behavior to emulate.
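To make the "squeeze it down, rebuild it, compare" loop concrete, here is a minimal sketch of a linear autoencoder in plain Python. Everything in it is an assumption for illustration: the three-pixel "images" (one constant pixel standing in for the black table), the single-number latent space, the starting weights, and the learning rate. Real autoencoders use neural networks and a framework rather than hand-written gradients.

```python
def encode(x, w):
    """Encoder: squeeze a 3-pixel 'image' into a single latent number."""
    return sum(wi * xi for wi, xi in zip(w, x))


def decode(z, v, b):
    """Decoder: rebuild all 3 pixels from the latent number."""
    return [vi * z + bi for vi, bi in zip(v, b)]


def mean_loss(data, w, v, b):
    """Mean squared reconstruction error: the 'how well did you do' metric."""
    total = 0.0
    for x in data:
        x_hat = decode(encode(x, w), v, b)
        total += sum((h - xi) ** 2 for h, xi in zip(x_hat, x))
    return total / len(data)


def train(data, w, v, b, steps=5000, lr=0.05):
    """Full-batch gradient descent on the reconstruction error."""
    n, dims = len(data), len(w)
    for _ in range(steps):
        gw, gv, gb = [0.0] * dims, [0.0] * dims, [0.0] * dims
        for x in data:
            z = encode(x, w)
            err = [h - xi for h, xi in zip(decode(z, v, b), x)]
            dz = sum(2 * e * vi for e, vi in zip(err, v))  # error flowing back into the latent
            for i in range(dims):
                gv[i] += 2 * err[i] * z / n
                gb[i] += 2 * err[i] / n
                gw[i] += dz * x[i] / n
        w = [wi - lr * g for wi, g in zip(w, gw)]
        v = [vi - lr * g for vi, g in zip(v, gv)]
        b = [bi - lr * g for bi, g in zip(b, gb)]
    return w, v, b


# Pixel 0 is the constant "table"; pixels 1 and 2 change from image to image.
data = [[0.2, t, 1.0 - t] for t in (0.0, 0.25, 0.5, 0.75, 1.0)]
start = ([0.1, 0.1, -0.1], [0.1, 0.1, -0.1], [0.0, 0.0, 0.0])
initial_loss = mean_loss(data, *start)
w, v, b = train(data, *start)
final_loss = mean_loss(data, w, v, b)
print(initial_loss, final_loss)
```

No labels appear anywhere: the target is the input itself. The constant table pixel can be absorbed by the decoder's bias, so the single latent number is free to describe what actually differs between images.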
Yeah, that's right. And the reason why
all of this kind of comes together is, you know, clustering, you know, imagine you're looking at a
group of people, like you're in a helicopter, and you're looking down at a stadium full of people
or something, or you're looking at a football game or something, right? There's little dots,
like they're maybe the size of ants or something, running around on this football field. So when you cluster, you're going to be using sort of the geometry of the
field, right? So if somebody is twice as far away, then that really has a big impact on whether
they're going to be in that cluster or not with this other group of people, right? And so for all of clustering, you need to have a space that's
pretty uniform. So like, for example, let's say you fed a bunch of features into some clustering.
And one of your features is person's age in milliseconds. And the other feature is person's
height in meters, right? Well, like one is enormous, right? Your age in milliseconds, it's a huge number.
And so the clustering algorithm
will totally ignore the other feature
because your height in meters is, you know,
I don't know from, you know,
I guess 0.5 to three or something.
You know, it's such a small range.
Actually, I guess there's nobody nine feet tall,
But anyways, so your height...
I'm trying to figure out, the tallest person in the world
is what, eight foot?
Anyway, so your range is tiny.
It's like three units, right?
But your age in milliseconds is enormous.
And so the clustering algorithm will just cluster on ages
and, you know, not pay attention to the other one.
And so, you know, if you're trying to cluster images or text or some of these things,
you quickly run into this problem where the thing isn't geometric.
And so the clustering can't really take into account different dimensions
in a way that's fair.
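The age-in-milliseconds problem is easy to see with numbers. The sketch below uses three made-up people and compares plain Euclidean distances before and after standardizing each feature (subtract the mean, divide by the standard deviation). Standardization is a common hand-applied fix and an assumption here, not something named in the episode; the episode's own answer is to learn a better space.

```python
import math
import statistics

MS_PER_YEAR = 365.25 * 24 * 60 * 60 * 1000

# Hypothetical (age in milliseconds, height in meters) rows.
people = [
    (20 * MS_PER_YEAR, 1.60),
    (20 * MS_PER_YEAR, 2.00),  # same age as person 0, very different height
    (21 * MS_PER_YEAR, 1.60),  # same height as person 0, one year older
]


def standardize(rows):
    """Rescale every column to zero mean and unit standard deviation."""
    columns = list(zip(*rows))
    means = [statistics.mean(col) for col in columns]
    stdevs = [statistics.pstdev(col) for col in columns]
    return [
        tuple((val - m) / s for val, m, s in zip(row, means, stdevs))
        for row in rows
    ]


raw_height_gap = math.dist(people[0], people[1])  # differs only in height
raw_age_gap = math.dist(people[0], people[2])     # differs only in age

scaled = standardize(people)
scaled_height_gap = math.dist(scaled[0], scaled[1])
scaled_age_gap = math.dist(scaled[0], scaled[2])

print(raw_height_gap, raw_age_gap)
print(scaled_height_gap, scaled_age_gap)
```

On the raw data the one-year age gap is tens of billions of units while the 40 cm height gap is 0.4, so any distance-based clustering effectively sees only age. After rescaling, the two gaps are comparable.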
And so the nice thing about this auto-encoding is the way that the
loss kind of propagates backwards from the correct chessboard to that latent space to the input
chessboard, the way that those dots move and the way that the things kind of shake up ends up
creating like really nice spaces
where all the dimensions have relatively the same importance.
This is interesting.
So yeah, so you're training both halves,
but you may be taking,
I guess you were calling it like the latent space in the middle,
the encoded thing,
and using it as input to other parts of your system
or sort of like clustering in that sort of more
well-formed space, so that you can say things about it even if you never end up reconstituting. Like,
there's really no reason for you to get back to the original chessboard. You had it as input.
You could just use it. You didn't really need that part, but it helps you to get that middle part
that you could then use to do clustering on.
Yeah, exactly. So now let's
imagine we have a bunch of people who go to your website, or we could even stick with the football
analogy. We have a bunch of football players, right? And we have a bunch of statistics about
them. And these statistics are all over the place. Some of them are important. Some of them aren't
important. The units are all different. And so if we just feed these players into some clustering algorithm, then it's going
to have a really hard time. Maybe basketball. I know more about basketball. So in basketball,
people score a lot of points, but their height in feet, let's say in decimal feet, is going to be relatively small.
So they might score, you know, 20, 30 points, but they're only like seven feet tall. And so you have
different scales there as well, right? You know, or assists or rebounds, number of
minutes played. And so all of these have different units, and even slight differences
in units can really matter. You could do some type of, let's say, contrastive learning, which is a
self-supervised approach. So you might say, here's a list of players who I felt played very similarly.
So I could come and say, okay, you know, Shaquille O'Neal and Dikembe Mutombo, they're both centers, really tall, strong people who are strong enough that they can just push their way through and dunk the basketball, right?
Those people are very similar.
So I'm going to pull these two people together.
So, you know, whatever their features are, you know, they're going to, we're going to create sort of a point for these two
people based on their features. And then we're going to say these two points need to be closer
together. Then I'm going to take, you know, Shaquille O'Neal and like Anthony Hardaway.
And so Anthony Hardaway is like a three-point shooter, a small person, from a basketball
standpoint, who goes and shoots shots from far away. So these people clearly are
far apart. You might even actually just use the positions, right? You might say, okay,
all the centers should be close together. And then take two people who are from two different
positions, they should be far apart. And so in this way, you're not, it's not supervised learning because you're not saying,
okay, you know, Shaquille should be right here or, you know, Shaquille should make like this
many points or something. You're basically saying, you know, these people should be closer together.
These pairs should be close together. These pairs should be far apart. And it's contrastive
learning. It's self-supervised if you can automate all of that
without a human in the loop. Basketball is a weird example because at some point a human did decide
you should be a center, right? So maybe not the best example, but you could even imagine like
doing contrastive learning on images. So you could say, here's a bunch of images that are on the same
website. And because just by virtue of them being on the same website, they should be pulled together.
And then here's two images from two different websites.
They should be pushed apart.
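The pull-together, push-apart idea can be written as a pairwise contrastive loss. The sketch below is one common margin-based form, chosen here as an assumption: similar pairs are penalized by their squared distance, and dissimilar pairs are penalized only while they sit closer than a margin. The 2-D starting points for the players are invented, and the points themselves are treated as the learnable parameters instead of coming from a network.

```python
import math


def contrastive_loss(a, b, similar, margin=1.0):
    """Squared distance for similar pairs; squared margin shortfall for dissimilar ones."""
    d = math.dist(a, b)
    if similar:
        return d ** 2
    return max(0.0, margin - d) ** 2


def step(a, b, similar, lr=0.1, margin=1.0):
    """One gradient-descent step on both points; returns the updated pair."""
    d = math.dist(a, b)
    if similar:
        scale = 2.0                       # gradient of d^2 w.r.t. a is 2 (a - b)
    elif 0.0 < d < margin:
        scale = -2.0 * (margin - d) / d   # gradient of (margin - d)^2 w.r.t. a
    else:
        scale = 0.0                       # already far enough apart: nothing to do
    grad_a = [scale * (ai - bi) for ai, bi in zip(a, b)]
    new_a = [ai - lr * g for ai, g in zip(a, grad_a)]
    new_b = [bi + lr * g for bi, g in zip(b, grad_a)]  # b's gradient is the negative of a's
    return new_a, new_b


# Made-up starting embeddings.
shaq, mutombo, hardaway = [0.0, 0.0], [0.6, 0.2], [0.3, 0.1]

pulled = step(shaq, mutombo, similar=True)     # two centers: pull together
pushed = step(shaq, hardaway, similar=False)   # center vs. shooter: push apart

print(math.dist(shaq, mutombo), math.dist(*pulled))
print(math.dist(shaq, hardaway), math.dist(*pushed))
```

Nothing here says where any player should end up, only which pairs belong close together and which belong far apart. If the pair labels come from an automated rule, such as two images sharing a website, no human is in the loop.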
And so if you do this and you have a low learning rate, because that's going to be a weak signal,
right?
But if you do this and you have a ton of images and you've scraped a lot of the internet,
you'll end up with an embedding that's
really powerful. And so contrastive learning and autoencoding, where you feed in the same thing
that you're trying to predict, are two ways of generating really nice spaces that you
can then do clustering and other things with.
This is awesome. So we were sort of giving examples
and sort of saying the algorithmic approach,
I guess, to doing this.
What are some applications of what people do with the...
I mean, we talked about outlier detection for logs.
I think that was a good one.
We were talking about classifying things.
What are some other examples of applications of this process?
Yeah, I mean, you know, all of the language processing is now pretty much done in this way.
So, for example, you know, it used to be that if you wanted to train a model, let's say, to translate French to English, you know, you would have to pay people to,
you know, manually translate tons and tons of things, like literally millions of sentences.
And then you would train, you know, your model on these sentences. And you would have some
translation that just goes from French to English, right? It's extremely expensive. Right. So now what they do is, you know, they will do what's called a word embedding.
So basically, there's a whole bunch of different ways to do this.
So one way would be a self-supervised approach where you say, given all the words up to this word...
So, you know, like, what was it called? Like, the brown... What is that one that you see all the time?
Brown fox jumps over the lazy dog.
Yeah, that's it.
OK, so you say... I didn't know this, that the reason that's a sentence is because, for handwriting,
it's like the shortest sentence that uses all the letters of the alphabet, or a very
short sentence which uses all the letters of the alphabet.
So it was a penmanship test.
I never knew that.
I learned that like a week ago.
What?
Shut the front door.
Wait a minute.
And then I was like sitting there counting them all.
Yeah.
Nope.
Nope.
Yep. They're all there.
Wow.
Oh my gosh.
My mind is totally blown.
I feel like that.
Have you seen that video where the guy pretends to do a magic trick and he
takes the straw
and he has his friend like put the straw behind the other guy's back and it blows his mind.
Anyways.
Yeah.
That's me right now.
So, wow, that's freaking awesome.
Okay.
So what is it, the quick brown fox? Or no, the quick... I thought the dog was brown.
The brown fox jumps over the lazy dog.
I thought the dog was brown.
Anyways.
So let's say the quick brown fox. Any word probably works.
So, you know, this algorithm will learn, you know, like you give it the quick brown and then it has to produce fox, right?
Now, if you look at that in isolation, that's like almost impossible.
But you give it a ton of these sentences.
And, you know, in every sentence, you say, okay, here's the first word,
predict the second one. Okay. Here's the first two words, predict the third one, predict the fourth
one. Right. And you give it, you know, all of Wikipedia or something. Right. And so, you know,
yeah, I mean, for some of these subjects and objects, it's going to be really difficult,
like "the quick brown", it could be anything. But you're going to also see a lot of correlation. So you'll notice, like, whenever you see
"of", maybe you see "the" afterwards very commonly. And so you'll actually learn a lot of structure
from doing that, from what they call a forward model, right? And the awesome thing is it's
effectively free. This is another thing: because of Moore's law and because
computers have become so cheap and so efficient, you know, it's really the people time that's the
killer for a lot of these things. Like if you can eliminate the time that a human has to do something,
you're in really good shape. And so, you know, with these forward models, you just download
Wikipedia. I mean, you could do this on your laptop right now and predict the next word.
You don't have to pay any humans to rate any sentences or anything.
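The reason no human labeling is needed is that the text supplies its own labels: every prefix of a sentence is an input and the word that follows is the answer. The sketch below builds those (context, next word) pairs and then fits the crudest possible predictor, a count of which word follows which. The counting model is a stand-in chosen for brevity, not the neural forward model described above, and the three-sentence corpus is made up.

```python
from collections import Counter, defaultdict


def next_word_examples(sentence):
    """Turn one sentence into (words so far, next word) training pairs."""
    words = sentence.lower().split()
    return [(tuple(words[:i]), words[i]) for i in range(1, len(words))]


def train_bigram(corpus):
    """Count which word follows each word, using only the last word of the context."""
    counts = defaultdict(Counter)
    for sentence in corpus:
        for context, target in next_word_examples(sentence):
            counts[context[-1]][target] += 1
    return counts


def predict(counts, word):
    """Most frequent follower of `word`, or None if the word was never seen as context."""
    following = counts.get(word)
    return following.most_common(1)[0][0] if following else None


corpus = [
    "the quick brown fox jumps over the lazy dog",
    "the capital of the country",
    "the king of the hill",
]
examples = next_word_examples("the quick brown fox")
counts = train_bigram(corpus)
print(examples)
print(predict(counts, "of"))
```

Even this toy picks up that "of" tends to be followed by "the". A real forward model replaces the counts with a network whose internal representation of the context is the embedding.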
So now you have this embedding, which says, given a part of a sentence, I have some point
in some space based on what word's coming next. And so, you know, sentence fragments
where the next word is going to be fox will all kind of be close together, right? It turns out now
if you do that same translation problem, but you work with that embedded space instead of with
whole sentences, you need, and it's been a while since I saw this, but I think it's like one one-thousandth of the data or something like that. I mean, it's extraordinary. I mean,
the difference is unbelievable. And so, you know, what used to take millions and millions of
sentences, you know, now after like 10,000 sentences, you're done. And there's even models
now where they've embedded, they've actually done this jointly with different languages. And so they embed like literally every language into the same space. And then all they have to do is train the second half of it, which does the translation part. Yeah, so all of natural language processing completely redone with self-supervised learning.
Like it's massively changed that field.
And I think even with image processing, you're starting to see a lot of interesting things.
The image equivalent of this is where you basically cut a piece of an image out and you say, reconstruct that missing piece of the image.
It's like, remember the magic eraser, that Adobe Photoshop thing that was really popular like 10 years ago?
Content-aware?
Yeah, yeah. So you could erase a person and it'll fill in
behind them, right? So imagine, you know, you cut the person out, but your intent in this case isn't to literally
cut them out of the picture. It's to see how good your reconstruction can be. And you immediately
know what the algorithm did right and didn't do right. So in this case, you actually wanted to
generate. So you wouldn't do something really difficult like cut out an entire person. You'd
randomly cut out squares.
And some of the time it would be impossible because you cut out a whole car or something.
But most of the time you'll cut off parts of things and you'll be able to... like, if you cut
out one person's eye, you just copy their other eye or whatever.
Yeah, exactly.
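The cut-out-a-square setup can be sketched as follows. The image is a small grid of brightness values, a random square is hidden, and the reconstruction is scored only on the hidden cells. The "model" here is a deliberately dumb placeholder that fills the hole with the average visible brightness, and all function names and the 6x6 image are hypothetical. A real system would train a network to do the filling.

```python
import random


def mask_square(image, size, rng):
    """Hide a random size-by-size square; return (masked copy, hidden (row, col) cells)."""
    rows, cols = len(image), len(image[0])
    top = rng.randrange(rows - size + 1)
    left = rng.randrange(cols - size + 1)
    hidden = [(r, c) for r in range(top, top + size) for c in range(left, left + size)]
    masked = [row[:] for row in image]
    for r, c in hidden:
        masked[r][c] = None  # None marks a cut-out pixel
    return masked, hidden


def fill_with_mean(masked):
    """Placeholder 'model': fill hidden pixels with the mean of the visible ones."""
    visible = [p for row in masked for p in row if p is not None]
    mean = sum(visible) / len(visible)
    return [[mean if p is None else p for p in row] for row in masked]


def masked_error(original, reconstruction, hidden):
    """Mean squared error over the hidden pixels only."""
    return sum((original[r][c] - reconstruction[r][c]) ** 2 for r, c in hidden) / len(hidden)


rng = random.Random(0)
image = [[(r + c) / 10 for c in range(6)] for r in range(6)]  # simple brightness gradient
masked, hidden = mask_square(image, size=2, rng=rng)
reconstruction = fill_with_mean(masked)
error = masked_error(image, reconstruction, hidden)
print(hidden, error)
```

Because the original pixels are known, the error is available immediately and for free, which is what makes this self-supervised. Training would mean repeating this with fresh random squares and adjusting the model to drive the error down.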
And so, same kind of thing. So you have this model that reconstructs things by putting them into this big space we described. And then they trained another model on images.
And then they created another thing which said,
I have captions for images.
So I have like a picture of the quick brown fox jumping over the lazy dog.
And then I have that caption.
Those two points should be close together.
And then I'm going to take captions that don't
belong with their picture, mismatched caption picture pairs, and those points should be far
apart. And I'm going to take those input embeddings and now train, you know, another what's called a
joint embedding that tries to unify or push apart those pairs. And that's how DALL-E works. So then,
you know, when you go to OpenAI and you say, you know, astronaut eating ice cream on the moon,
it's taking those three models, the language model... I guess it doesn't need the image model
anymore, but it's taking the language model and... oh no, it does. It needs the language model,
the image model, and this joint
model. It's using all three of them to generate that picture of that astronaut on the moon.
And the question is, is the Artemis capsule floating around the moon in the background?
To tie this together, we should just type Artemis into DALL-E and see if it's on the moon or not.
I think they have exclusions for, like, a
lot of proper names and nouns and stuff, so I don't know how that works.
Ah, really? Yeah, I
used it a little bit. I found it to be, you know, really captivating. There's something
powerful about that. I've always been a really big fan of DALL-E, but other than this visual interactive
fiction, I haven't found a practical use for it.
Well, it feels good to do a duo episode. I
know it's been a while, but going through the habit of the first
few, well, many, many episodes of you and I doing this together, it feels good to do it again,
do our tools of the show, book of the show, news, and then this discussion about machine
learning was a really good time. Yeah, definitely. I think these are all very
accessible, approachable things. You can use SageMaker or other tools. You can train
on all of Wikipedia without having to download it to your desktop if you don't want to.
I think I saw, you know, training that model I just talked about runs you like 30 bucks or something, which, you know, is the price of going to the movies. So it's not
nothing. But it's also pretty amazing that for 30 bucks, you could train a model on,
you know, the entire Wikipedia corpus,
and it'll come out correct and everything. So, you know, it's a lot of fun, amazing times we're
living in. And I guess as a final thing, anything you build, you're going to need to
collect some type of metric to understand the people who are using your product. And so this is a really
good area for folks to brush up on.
All right. And on that note, yeah, it's really awesome doing
a duo episode. Looking forward to seeing this one come out. Looking forward to seeing if we're
right or not about our prediction. And really looking forward to your emails. We've been getting
a ton of really great emails.
So appreciate everybody out there.
We do read them, even if Patrick has them marked as unread.
He has looked at the subject.
I actually, I think both of us literally read every email that we get on Programming Throwdown.
So we really appreciate your support and supporting us on Patreon and Audible.
So thanks so much.
Definitely subscribe if you're not subscribed to the show using whatever podcast catcher.
We should be on all of them at this point.
If we're not, let us know.
And we will catch you all in two weeks.
Music by Eric Barndollar. Programming Throwdown is distributed under a Creative Commons
Attribution-ShareAlike 2.0 license. You're free to share, copy, distribute, transmit the work, to remix, adapt the work,
but you must provide an attribution to Patrick and I and share alike in kind.