Programming Throwdown - 145: Unsupervised Machine Learning

Starting point is 00:00:00 Welcome to another, as we've termed it, duo episode. That's just Jason and myself. No special guests this time. It's because we have a lot to say. We have it all pent up. So expect a high energy episode today. That's right. We'll start off by talking about email.

Starting point is 00:00:37 What's more high energy? Spoiler alert. I do tell this to people in my meetings. I'm like, if you're in an afternoon meeting, I'm typically more ramped up. You'd think you'd be tired by the end of the day, but I get kind of riled up throughout the day. Yeah, I wonder if maybe we're more common, because I feel like, isn't that more natural? You wake up, your brain still hasn't really kicked in yet,

Starting point is 00:00:59 so maybe, I don't know. And then I just crash at the end of the day, but that's another problem for another time. So as Jason foreshadowed, we're going to talk a bit about email. So this is actually a topic Jason and I were talking about. So we decided to talk about it on air, which is if you are not in a large organization,

Starting point is 00:01:18 maybe this is a foreign thing to you. But most folks, I would say, I've never looked at it actually. But I would assume a lot of people work in big companies. And in big companies, you get a lot of email, a lot of email, not junk mail, not spam, although I guess you could. But at my company, they're pretty good about filtering that out. But just like emails from random teams, random automated announcements, automated meeting notices, everything just gets pushed to email. And it's important to stay organized using either rules in your client or on the server, or just

Starting point is 00:01:54 making sure you put stuff into various folders or flagging stuff. I think everyone has a different scheme. I got criticized actually by my kids because I have like over a thousand unread messages. Oh, see, I can't handle that. Oh, no. Oh, you're going to criticize me too. But I have a method. Like it is, I'm going to say like methodical. Like I do have a method to my madness. Like it is organized for me. I can't not treat incoming email. Actually, wait, wait, hang on. So wait, do you have a thousand unread because you, like you might somehow know the last

Starting point is 00:02:26 one you looked at, you know, the subject line. So you actually read them all. They're just all like handled. Okay, okay. Okay. So it's not that they're not looked at. It's just like sometimes, like you said, I'll move the meter forward. This is on me, this is a bad habit, I would not recommend it. Like you said, I'll move the high watermark forward as like what I've kind of read through by looking at summaries. And for most of them I don't need to do anything, but I don't necessarily take the extra few milliseconds,

Starting point is 00:02:54 seconds, whatever, to kind of put those in an appropriate folder or get them out of my inbox. So then every so often I'll just pile through and, via searching, you know, cull all the junk that needs to get marked as read. So it's just sort of like another folder for me. Like there's the unread and read folder and they sit in the same logical folder in my inbox. I would not recommend this approach. You should definitely be more organized than me. So I'll tell you what I do for unread is I have the, well, at least on desktop, I have like the double paned Gmail where like you

Starting point is 00:03:26 still see the list of emails while you can see, you know, an email you currently have clicked on. And then I have it set up to where whenever I like hit the up arrow, it marks it as read, you know, like as soon as it gets to an email it marks it as read. So I could just go up, up, up, up, up, up on the up arrow. And, you know, I'll end up with everything read at the end. But I won't actually look at ones that are like marketing and stuff like that. I am pretty good about taking the time to unsubscribe from as many things as possible.

Starting point is 00:04:00 Does that work like in general? I guess at corporate mail, that should work pretty well, right? Like unsubscribing from something should make it stop emailing you. Most of the time it's just like a one-click unsubscribe. But I think the most important thing you can do is to have folders. Folders are a total lifesaver. I'm not too sure what's the difference between the server side rules and the client side rules. That's one thing I wasn't ever totally sure about. So I think in some cases, so like you mentioned Gmail, I think it's a bit different because the most common ways of accessing Gmail are either web apps or progressive web apps, like they're on your phone, but they're still a web app. And so they're like the state of the kind we said, but if you're using like,

Starting point is 00:04:53 you know what, I guess that's POP or IMAP or whatever, you could have multiple different clients. And so if the clients are filtering, they're kind of notifying the server and moving, and you could have two clients with two different sets of rules. So if you access your emails from two different places and the clients weren't synced, you could end up with sort of like, not good. So I think in Gmail, as an example, when you write a rule, it's just always server side, right? Like it's always in the background when the mail comes in, not when you fetch the mail, that the server side stuff gets executed. Oh, that makes sense. So like you have a client side rule, but your desktop's like not on. And so now like none of those rules are taking effect. So you're looking at email on your phone and it's unfiltered. Yep. Something like that. Yeah. I think most places have moved

Starting point is 00:05:38 to server side. Although I will say, we at my company don't use Gmail. And in fact, I actually don't like Gmail. I preferred, they killed it, this is a rant for another time, Inbox. I jive with the Inbox way of doing it. I was very organized when Inbox was a thing. It was nice. They killed it, and now it's Gmail, and now I'm unorganized again. So I blame bad tools. And I know there are other ones out there. People will send us links. I know there are lots of other people and folks doing things that are clever. I just, yeah, I don't know. You know, I don't know if at some point I stopped getting ads.

Starting point is 00:06:12 There was a period of time where I was getting, you know, Gmail ads. So like you go to look at your email and then the first email would actually be fake. It wouldn't be a real email. That was super, super annoying. They got rid of that maybe about a year ago or so. And so that made Gmail kind of like, there's not that many downsides to Gmail. I did like Inbox. What was so magical about Inbox? I also remember loving Inbox, being really sad when they shut it down. But now it's been so long, I don't remember what was good about it. They've morphed Gmail over time to copy some of

Starting point is 00:06:50 the UI things. But it was just like the way of handling the UI and like what was shown in the ordering of like how you went through messages and stuff, which is different. I think the complexity on their side is around like needing to support both the old way and the new way. Well, one of the things I think they added was the swiping, right? So like in Inbox, and then later in Gmail, you could swipe right to archive, swipe left to delete, or I don't know, maybe I have it backwards. But basically, you swipe and it goes away. You know, it's just one of those things you could just do. All right.

Starting point is 00:07:29 Well, let's get on to our news of the show. Yeah, my news of the show is simplifying lines with the Douglas-Peucker, did I get that right? Peucker, Peucker algorithm. Anyways, I know what this is, but I don't know how it's said. I have no idea how to pronounce that person's name. I'm sure I butchered that person's last name. Anyways, this is really cool. And I found a particularly cool article where they kind of walk you graphically through how to do this. But I was always really interested in the game Worms. Did you ever play this game? Yes.

Starting point is 00:08:00 Yeah. So Worms was this, you know, pixel based physics thing where it, you know, there's also Scorched Earth that was kind of similar, but much more simplistic. Yeah, and Scorched Earth, like, things couldn't float, like the ground always settled. And so it was somewhat of an easier problem. But, you know, in Worms, there would be these floating islands and you could walk on them and you could bounce grenades off of them and stuff. But anytime there's any kind of explosion or anything, you know, it would chip away at these islands and it would start to take away some of the pixels. And so I was always fascinated with this, like, how do you bounce a grenade off of this, you know, this binary 2D array of pixels? And so that kind of led me to this algorithm,

Starting point is 00:08:48 which is really cool. So basically, you could imagine kind of tracing the outline of one of these islands. And then just, you know, anytime there's an explosion or something, you know, retracing that part of it. And, you know, maybe an explosion caused an island to split into multiple islands, and that's fine, you could detect that. But then you have this problem of, okay, now I've traced this island and I have, you know, the outline, right, the shell of this island. How do I put that into a physics engine? Right. And, you know, even if you were to put every point in a physics engine, the angles would be all really jagged and stuff, right? And so looking at some game development

Starting point is 00:09:31 sites kind of took me to this, which basically this algorithm is this really fast way of saying, okay, I have this, it could be a line, it could be a polygon, it could really be anything, but I have this, you know, line that's made up of a ton of points. And I want to keep sort of the structure, you know, the essence of that line, but get rid of, you know, a ton of the points. So it's a simple, unsupervised way to do that, which I thought was really cool. And the gist of it is you essentially like go from the start to the finish. I guess this wouldn't work if it's a loop. So there has to be other ways to do that. Oh, I think it actually it still works. We have to adapt a

Starting point is 00:10:17 little bit. But you go from the start to the finish and you kind of draw a fat line, you know, like imagine like a rectangle that's oriented, right? Like a fat line from the start to the finish. And then you see like, okay, are all the points just in that line, in that rectangle, in that zone? If they are, you could just delete all of the points in between. And now you have just a start point, a finish point, and a straight line. But if there are some points that fall outside of that zone, then that means, you know, if you were to delete that point, you'd kind of be changing like the

Starting point is 00:10:50 essence of this. So maybe it kind of like is more of like a parabola or something, right? So if you find a point that's outside of the zone, then you kind of do this subdivide type thing, or it's like a divide and conquer approach where you say, okay, let's go to that point and create a rectangle from the start to that point. And then a second rectangle from that point to the end. And now with these two rectangles, did I cover everything? And so you kind of keep going in this way. And yeah, it also looks like really mesmerizing. So if you ever wanted to make your own Worms game, this is how you would do it. Yeah, I think there's a few ways of doing this, like line simplification. This

Starting point is 00:11:30 one uses, like you're kind of saying, this fat line as like lateral stuff. I think there's some that use sort of circular distance as well, depending on what you want to do. But yeah, you're trying to preserve the shape but also simplify the number of points. Yeah, exactly. Cool. My news article is how to pick a starter project. Now ironically, or I guess, this takes the stance initially of telling you how to pick a starter project for someone you want to get rid of, and by illustration of all the things that you do to get rid of someone, highlighting how to not pick a starter project. And it's kind of a funny thing, but I guess nothing in here is sort of earth shattering.
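
Going back to the line simplification discussed above: here is a minimal sketch of the "fat line" idea, assuming the usual recursive form of Douglas-Peucker (also written Ramer-Douglas-Peucker). The function names `perpendicular_distance` and `simplify` and the `tolerance` parameter (the half-width of the fat line) are made up for illustration and do not come from the article mentioned on the show.

```python
import math


def perpendicular_distance(point, start, end):
    """Distance from `point` to the infinite line through `start` and `end`."""
    (px, py), (sx, sy), (ex, ey) = point, start, end
    dx, dy = ex - sx, ey - sy
    length = math.hypot(dx, dy)
    if length == 0.0:
        # start and end coincide (for example a closed loop), so fall back
        # to the plain distance from the shared endpoint.
        return math.hypot(px - sx, py - sy)
    return abs(dy * (px - sx) - dx * (py - sy)) / length


def simplify(points, tolerance):
    """Return a subset of `points` that stays within `tolerance` of the original.

    `tolerance` is the half-width of the "fat line" (an assumed parameter name).
    """
    if len(points) < 3:
        return list(points)

    start, end = points[0], points[-1]

    # Find the interior point farthest from the start-to-end line.
    farthest_index, farthest_distance = 1, 0.0
    for i in range(1, len(points) - 1):
        d = perpendicular_distance(points[i], start, end)
        if d > farthest_distance:
            farthest_index, farthest_distance = i, d

    # Everything fits inside the fat line: keep only the two endpoints.
    if farthest_distance <= tolerance:
        return [start, end]

    # Otherwise split at the farthest point and simplify each half.
    left = simplify(points[: farthest_index + 1], tolerance)
    right = simplify(points[farthest_index:], tolerance)
    return left[:-1] + right  # drop the duplicated split point


if __name__ == "__main__":
    outline = [(0, 0), (1, 0), (2, 4), (3, 0), (4, 0)]
    print(simplify(outline, 1.0))
```

A larger `tolerance` removes more points. For a closed outline like a Worms island, where the first and last points coincide, this sketch falls back to plain distance from that shared point, which is one simple way to adapt it to loops.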

Starting point is 00:12:18 But it's a reminder that starter projects really are important for onboarding new folks, like those first things that you have them do and making sure that it works well. It also gets in a bit to choosing mentors and that kind of stuff. And I just wanted to highlight it because I think this is a really poorly thought through thing, like how to onboard new people to teams. And some companies will have like a company wide onboarding process. But I think like still the importance of within a team, making sure new folks get up to speed and how that's like a joint thing across the manager, but also the members on the team.

Starting point is 00:12:54 And so here it's highlighting things like pick a gigantic project that has vague requirements and goes cross team and cross discipline and is tracked at a very high frequency by upper management. These are all ways to inspire someone to quit on the spot and to not get brought up to speed. I think like oftentimes, even under poor circumstances, people do integrate into the team eventually. But I think like often those formative first few months are critical to kind of having a really positive experience and making sure that the team knows that they should be

Starting point is 00:13:32 taking time away from their other tasks to onboard a new person. And I think like, whenever you change jobs, it's always difficult in the beginning. And I can look at these comments and think back to some of the times I've switched teams or changed companies and had starter projects that more or less, I guess, were intended to push me away. And maybe I was too dumb to realize that I should have changed jobs. No, just kidding. Your starter project is to delete all the code that your boss wrote. I did have a starter project that was, you're new to the team. The team's been divisive about coding style. So why don't you come in and start enforcing coding guidelines?

Starting point is 00:14:12 Oh, no. You've got to be kidding. Is that real? Nope. That's absolutely true. Anyways, it is important. And I think as well, if you say, well, okay, I'm not a manager or whatever, I'm not assigning starter projects, I think it's important to know that this is a thing, and I think that people don't always do it well. And so if you're new to a team and you're not getting good work, like be communicative to your boss, you know, or your coworkers like, Hey, this seems like kind of vague, like, I'm happy to work on it. But is there other stuff that I could, you know, be doing as well, like be, be proactive,

Starting point is 00:14:50 and being an advocate for yourself and trying to, you know, assuming that they're not doing it on purpose, like trying to get work that you think is cleaner. I would say like, one of the more common unintentional mistakes is giving someone something that you didn't realize was a lot harder than it seemed. And so you give someone a project, you think it's really easy, and then they really, really struggle. This has happened to me before. And then it turns out, once you sort of start reviewing what you're doing, people are like, oh, I did not realize this was that complicated. I just thought you were kind of slow. And it's like, well, people really should be assigning work that almost they would just sit down and do in an

Starting point is 00:15:24 afternoon. Like it really should be too easy, better for them to finish too fast than to get bogged down. Yeah. The other mistake I see a lot is, is where, um, if there's a junior engineer, um, you know, someone will, um, you know, a manager might say, oh, you know, I really wish I had time to, you know, work on project X, which is some new project that the team, no one on the team has time to do Project X. But if we did it, it's like this big win. And then somebody joins the company. They say, oh, this person has 100% free time. They just joined.

Starting point is 00:16:02 We're going to put them on Project X. And it ends up being this really risky thing. And for your perspective, you feel like as a manager, you might feel like, well, I'm giving them opportunity. Yeah, this person has a chance to hit a home run. And there's no downside because they just joined and they can pivot and they're not going to get evaluated right away or anything. This is actually a terrible idea, right? So the reality is people are under kind of the most pressure to perform in the beginning. And so this advice here in the article about put somebody on the main bread and butter

Starting point is 00:16:43 project. I see people mess that up so many times. But yeah, as Patrick said, you want, especially for junior folks, but even for senior folks, you want them to be exposed to, you know, the core essence of the company, even if your long-term plan is for them to try and invent something new. Cool. So my article is tic-tac-toe in a single call to printf. These code golf things, I love these things. I've never tried it. I don't think I would enjoy actually building this. It's more of like an art installation than a coding exercise, but it is so freaking cool. So as you might imagine, there's a ton of pound defines to make this a reality.

Starting point is 00:17:26 There's a certain way in printf you can actually capture input. I did not know that. I didn't know it either. Yeah, it's, I'm trying to find it here now. I mean, there's a whole document here on how they actually did it. Anyways, there's a way you can capture input with printf

Starting point is 00:17:42 and they're hijacking that. And so basically this prints a tic-tac-toe board, lets you type in input and play games against yourself. The entire thing is done in one printf call. There is a while loop, while printf. I guess that's because every time you enter a key or something, right? But really freaking cool. And they actually took the time to make it like actually visually artistic. So the entire program is spaced in a way where there's a

Starting point is 00:18:14 like ASCII art percent n in the program. So it's, oh, actually, I take it back. There is a scanf. Yeah, I see that. In the argument to printf, there's a scanf. Oh, okay. So actually. No, but that's still crazy. I didn't know you could, it does not seem like a good idea. Oh, the scanf is inside the printf.

Starting point is 00:18:36 Yes. Oh my gosh, my mind was just blown. Oh, that's not that bad. It's just in the argument list. So like comma, scanf, open paren. So it's just a function call that's taking place and then the result going in as an argument. Okay, okay, okay. This is not as, like, devious, as cynical, as, yeah. Okay, this is so crazy, but I got it now. Okay, got it. Yeah, I got it too. Yeah, so okay, so it's a call to printf, but that printf has as an argument a call to scanf. So there's two functions, but the whole thing is freaking cool. I just, I love seeing stuff like this. So definitely check it out. It will give you a chuckle and you can read through how it's implemented, which is very clever. It will give you a chuckle if you're either very against C or very deep in admiration

Starting point is 00:19:26 for the C programming language. I think it's those two folks, maybe. For everyone else, Jason's description may be sufficient for you to realize you could save yourself the click. Well, you could copy paste this into your terminal and GCC it

Starting point is 00:19:43 and then play tic-tac-toe. It'd be a fun exercise. They even have instructions on how to do that if you're new to the program. If there were only a web browser where you could just type a website and play web games that don't involve incredibly obfuscated C code. Okay,

Starting point is 00:19:59 this is pretty cool. I would say the name almost is also a good competitor for obfuscated. They did a good job naming it to seem really sinister. It also claims printf is Turing complete, which doesn't surprise me, but sounds horrible. Yeah. Oh my gosh, that's wild. That's probably like a complexity measure of your language, which is like, how many features of your language are themselves Turing complete? Oh, interesting. Yeah. I wonder, you know, so there have historically been languages where I've had a hard time working with them. And I feel like that's pretty common. So one of them is Scala. Another one is Haskell.

Starting point is 00:20:40 Those are two languages where I don't have strong opinions. I mean, I actually enjoyed Scala, but I found that the programs, you know, when you started to scale up the team size, Scala and Haskell both became really difficult for me at least to understand. And so yeah, I wonder if there's somehow a connection. Like I wonder if you could somehow take all this anecdotal evidence and regress it to say, okay, you know, somehow some complexity metric of the language causes it to balloon in this way. Interesting. So why is Python so bad as a programming language then? What is the zeitgeist on Python? Oh, no, no, no, no. We're going to get demoted.

Starting point is 00:21:26 We've got to keep making progress. Okay. I don't know. I don't know. We also have to throw JavaScript in here, too. Oh, my gosh. Okay. Mine is completely unrelated.

Starting point is 00:21:37 It's not obfuscated C code, although it probably does have C code on it. And that is the Artemis 1 project. By the time you listen to this, hopefully it's successfully launched. At the time of recording, they've sort of attempted the first launch, which got scrubbed. But I want to say this had been a big deal for a long time for the United States government, trying to send a rocket effectively back to the moon, this first one unmanned, eventually, you know, a manned moon base. Contrary to the last time that the United States did this, this one's actually being done much more internationally. Some of the sections of

Starting point is 00:22:15 the rocket are actually done by the European Space Agency in collaboration with NASA. So I feel like this is a bit different in that way, leaving all the politics aside of the last one. But I want to say, I feel like this came a bit out of the blue. Like, it is a big deal, I think. And I think that it had been going on for so long, so much of a run, people kind of forgot about it. And then it was sort of like, oh, it's on the launchpad getting ready to go. And so that's pretty exciting. So yeah, I mean, again, politics aside, but I thought that back in like 2010 or something, they totally defunded NASA. So I was like really surprised to see this. So I really don't know like what really transpired there. Yeah, I don't know about defunded NASA. But there's been a history of telling various government agencies in the United States like what to do, but then not giving the funding. So it's like a two-step process. First you give them like a new mandate, and then you're supposed to like go in the budget sort of

Starting point is 00:23:13 like budget for that new mandate. And so there can be like, hey, you need to go do X, and then we go to make the budget, we don't give you money to do X. So are we really telling you to do X or not? It's a bit ambiguous. And so I think that's true of NASA, but true of other government agencies, but I'm not sure on the particulars. But no, it hadn't been defunded. It had been sort of, in some ways, people will say uncancellable because it was involving so many states, so many companies, so much international cooperation. There's just all this stuff in it. It's just, sometimes, you know, estimation for timelines, it's not that software engineers are the only ones who get them wrong. Other companies get it wrong too. But no, I mean, leaving that aside, I think it's exciting. Like people quibble over the expense of it, but I think that giving that sort of optimism and

Starting point is 00:24:00 hope of going to space and putting people onto the moon, and the immense amount of technological research that falls out as part of this, I think it's an exciting thing. I won't justify that it's worth the cost. I don't know. It's hard to say. Yeah, it's exciting, and like, you know, having myself children who are elementary age, you know, their teacher turned it on, they get excited about things, about science. I think like from that stuff, it's really hard to measure just how big of an impact these kinds of things have. Yeah, I was talking to somebody about something related to this. Look at like the Sistine Chapel and, you know, the pyramids, like how, you know, we did these things that were kind of like really powerful and then also like wholly unimportant at the same

Starting point is 00:24:53 time. Like the pyramids. So the pyramids are extremely cool as a tourist attraction, and I'm sure Giza generates a lot of tourist revenue from it. But like at the time, like it was just maybe like a thing to do or is like a religious endeavor. I don't actually know the history of the pyramids, but I feel like humanity is full of these sort of endeavors where it's like not totally clear what to do, but you do feel like this is sort of like a milestone in humanity, whether it like turns into something economical or not, right? It's always hard, right? I think these things are complicated, but from a celebration of what humanity can do along this, like, technology, you know, pushing the envelope, doing these new things, I think it's incredible. Also, there's been a theme of various space and rocket related things throughout the history of the podcast. But I mean, I think I said it, I think even for our predictions for this year, which probably end up needing to slide out or whatever, but just the amount of new rocket

Starting point is 00:26:01 hardware coming online and people doing things. We haven't had as many of these duo episodes, but there's a number of other sort of interesting rockets coming to fruition. It's really cool that soon our ability to access orbit and access things like the moon is going to be just so different than it was before. And I'm really excited to see what we're not anticipating about such a transformation. What do you think about the space slingshot? Is it SpinLaunch? I think that's the name, right?

Starting point is 00:26:31 Oh, yeah. I don't know. You're talking about the one that spins in the centrifuge that's in a vacuum and then launches. I mean, I think it's one of those, people have done stuff similar before in research, and giant, you know, artillery guns basically that would shoot rockets out. The rocket equation, which is, oh, this is not my area of expertise, but basically the fact that you need a huge amount of rocket fuel to lift your rocket, but you add rocket fuel

Starting point is 00:26:57 and rocket fuel is heavy, so therefore you need more rocket fuel to lift the rocket fuel to lift your rocket, and you end up with this sort of like cascade of stuff. And the Earth is in a pretty heavy gravity well, right? Like the atmosphere is dense, the Earth is massive, so it's pretty hard to get to orbit. And so if you can get just a, you know, little percentage of Earth's diameter away, or Earth's radius up, and sort of lessen the effects of gravity, lessen the effects of the atmosphere and just sort of get through those really quick by energy held in your launch device, not in your rocket, you simplify a ton of things. Of course, doing that has its own caveats with acceleration and stuff. But I actually,

Starting point is 00:27:39 SpinLaunch seems to know what they're doing. That is not necessarily always enough, but they built something at a scale people didn't really think was possible. I feel there's a lot of armchair quarterbacking. So I'm hopeful that it'll work, because the idea is you have this giant circle that stands up on its side. It gets almost all the way to a vacuum, just very, very little air left in it, and then they rotate this huge arm in the middle with a rocket on one end, basically. And then at the right time, they unleash the rocket and it, you know, shoots out like a sling, you know, shoots up into the atmosphere and it gets

Starting point is 00:28:16 through all that hard, dense part of the air, gets to a pretty high altitude and then lights its rockets. And so it can be a much cheaper, smaller, more compact, and you can launch a lot of times because you just load a new one up and you do the same thing again. And so, I mean, if it works for small satellites, it would be a huge unlock. Yeah. I mean, it looks really freaking cool. I mean, Patrick did an amazing job describing it, but you have to watch the video. It's just really, really cool to watch. Fingers crossed, I hope it works. And also good luck, Artemis. And if you already know what happens in the future, well, yeah. Wait, actually, so we should predict. Do we predict Artemis will launch in the next month? Okay, well, we're recording at the very, very end of

Starting point is 00:29:03 August. You're saying by the end of September? Yeah. So I think this show will go out in October, but anyways, so let's say September. Will the rocket launch in September? Well, the rocket will have had launched in September. Yeah, that's right. Nice job. I'm gonna say yes, just because I want it to be true. All right. Yeah, I feel like it will too. I mean, they're so close, right? I mean, I think it was just some kind of liquid leak or something. I mean, they could easily fix that. Flex Seal, just slap it on there and it's good to go. Now you know why Jason and I are not rocket scientists. All right. Yeah, exactly. Oh my gosh.

Starting point is 00:29:45 The basic Flex, I bought Flex Seal tape the other day. It actually is pretty awesome. I used it to fix a hole in our pool. Like, you know, your pool has that thing which goes on the bottom of the pool and cleans the bottom, you know? Okay, yeah, like a vacuum, call it. Yeah, basically a pool vacuum type. Well, pool vacuum means something else, that's like a thing that you manually do. Anyways, so I used Flex Seal to fix a hole in that. It actually worked pretty good. I was impressed. I think there was this, there's this one I found online that I also ended up using for some other project that actually, it gets hot when you stretch it and, like, the tape like

Starting point is 00:30:27 chemically bonds to itself or whatever. It's like wild. I mean, they have amazing crazy tapes now on Amazon. We can talk about the tape that's like electrically conductive sort of through the thin part but not across the tape. So along the long part it's not conductive, but, like, electricity can pass through the tape but not along the tape. Whoa, that's wild. Okay, anyways. All right. Oh, actually one more shameless plug. So there's a gentleman that I know who started a company called BitRip, and BitRip is basically, I mean, I'm probably gonna totally butcher this, but imagine just like a roll of tape with QR codes. And so the idea is if you're out in the field, like you're an electrician, you know, like a, you know, a public utilities worker

Starting point is 00:31:17 or something, you could just slap this on anything, and then you scan the QR code and put some data into some app, and then someone behind you can scan the same tape, like, a year later. I see. So it's like every rip is a unique code? So is it a non-repeating pattern, like, what do they call that? Oh man, you know, I don't know the details. I always thought it was just tape with QR codes on it, but I haven't actually seen the product. Oh, okay. Let me see.

Starting point is 00:31:47 BitRip. Yeah, it looks a little bit like QR codes, but they're not exactly QR codes. Oh, yeah, you're right. It's almost like a barcode or something, but somehow there's a unique fingerprint. Now I'm curious with it. You brought this up, man.

Starting point is 00:32:00 Now we got to know what the secret is. We need to get the BitRip guy on the show. 600 GPS tracking tags, embed photos, documents, audio. Oh yeah, it's definitely not a QR code. It's some kind of, like, a Penrose tiling. If I just had to take a naive guess, I think it's a Penrose tiling, which is like a non-repeating infinite pattern. I think you're right. Yep, yep. But okay. Oh, we probably messed up his pattern. We probably shouldn't be here, like, reverse engineering tape on... yeah. So how many of you think that in the month of September we'll get sued by the BitRip guy? All right, it's time for book of the show. My book of the show is Meditations by Marcus Aurelius. So I remember first hearing about stoicism a long, long time ago and thinking to myself, oh, that's pretty much me.

Starting point is 00:32:55 I live like a really simple, relatively simple life, always trying to simplify things. And it kind of really resonated with me. And so years and years later, I decided to read this book. Marcus Aurelius is, I wouldn't say he's the founder of Stoicism, but he's the person who really kind of popularized it. And I guess, you know, it was a bit of a, what's the word? Like, I think I'd overhyped it too much in my mind. You know, it's kind of like, actually, Patrick, you brought this up before the show, how another one of these books is The Art of War by Sun Tzu, where everyone talks about these books all the time and you think, oh, this is, like,

Starting point is 00:33:33 going to be something that's going to totally blow my mind. And what I actually found was that so many people have already talked about this book that I already kind of knew what was going to happen. And so it's almost like watching the movie of Jurassic Park after you've read the book, or maybe Harry Potter or whatever, any of these. So it's kind of like, you know, it was a little bit underwhelming because I'd already kind of known the material. But I felt like it was still a good book about halfway through it. If you don't know what stoicism is, or if it sounds interesting,

Starting point is 00:34:05 if you're interested in how to lead a simple life, like what that means metaphysically and everything, check it out. It's a good book. It's also, you know, obviously really dated. I think Marcus Aurelius was what, like a Greek emperor, I think, or Roman emperor. I think Roman emperor. Yeah, Caesar, right. A Roman emperor. Roman emperor. There you go. So, yeah. So, I think it's going to be a very dated book, but it's a little difficult to read. But I think it's nice. And it could be sort of something that you do while you're in the car and just kind of have it in the background. And there's probably a lot more contemporary books on stoicism and the other kind of philosophies I highly recommend.

Starting point is 00:34:47 I think understanding some of those philosophies and approaches, even if you're not saying, hey, I'm actively seeking to model my life or to do this as like a key tenet, I think still are useful for helping see how other people think, to just have new ideas and to kind of question if there's some nugget of value there for some part of your life, rather than necessarily reading a self-help book where you're saying, I'm going to adopt this as like everything about who I am and make it my core identity. I feel sometimes there's, I don't want to say like a pressure, but in my head, at least like a thought, like, well, if I read this, I'm kind of wanting to adopt it as like, and I don't think that's accurate. I think like Jason's pointing out, you can read it and still learn a lot from it.

Starting point is 00:35:30 Yeah, I think, you know, I was really fascinated when, you know, we were working on YouTube. I was really fascinated at, like, what made things go viral? You know, what actually makes things go viral? And one thing that I learned in that process is that a lot of things that you think are organic are actually not organic, right? So there's actually a lot of viral videos where you look at that and say, oh man, you know, perfect timing, but it's actually highly, highly scripted. So it's very hard to tell what's real, what's fake. But either way, I think that what I noticed was a lot of viral content on really anything, any type of media, it sort of taps into this kind of, like, latent, I don't know how you describe it, just latent, shared, common sort of culture, like the sort of hive mind of humanity, or, if you want to be more local, the hive mind of your country or your region or whatever.

Starting point is 00:36:34 It's like it doesn't directly say, hey, you know, we're going to talk about stoicism today. But it sort of taps into a lot of that latent energy. Harry Potter is the example that keeps coming up in my mind. If you actually look at almost all of those stories, like, at one point they fight a snake, and Harry Potter fighting the snake, you can see the parallel to this other ancient story that people read for hundreds and hundreds of years. And so it's like you have this evolutionary, like, footprint, you know? And really popular content sort of taps into, like, piggybacks on that footprint.

Starting point is 00:37:18 And so, yeah, reading a lot of these canonical books will give you kind of an understanding into that, which would help you if you ever wanted to, you know, produce some content, like write a book or anything like that, that wants to tap into that same energy. Well, as tradition holds, Jason gave us a very highbrow, very thoughtful book. This may be a bit late, but we've not done as many of these recently. So it was all the fad during sort of like the COVID at home stuff to kind of bake sourdough. I see that. I see that meme. I remember that. Mine is a book about bread baking, Flour, Water, Salt, Yeast by a gentleman named Ken Forkish. And there are a lot of books about bread making and artisan bread. And I think this one, for whatever reason, just reading it in the stories,

Starting point is 00:38:12 it really kind of like made me excited to try the recipes, to do them, to take the approach. I just thought it was a very thoughtful way of thinking about bread baking. That's a tough one. There we go. And, you know, just using simple ingredients. I guess they call it sort of a lean bread, which is, there's no fat in it, right? Just flour, water, salt, and yeast. They of course have... Is that literal? Like, do they literally put fat in bread? Yeah, so an enriched bread. I mean, like a challah bread or, you know, something like an egg roll, right? Or like a yeast... you put egg or butter or oil. Like, a lot of pizza doughs

Starting point is 00:38:52 have oil in them. Wow, okay, that makes sense. Yeah, pizza dough. Yeah. And so those would be enriched in some way with some kind of fat. And a lot of bread we eat has a little bit of that. And so kind of going back to this very basic, sort of loose shaped, you know, round bread gained popularity. I know on the West Coast of America, sort of like Portland and San Francisco, these places have kind of developed a renaissance of this style of bread making. But if you're interested in bread making, which most of you probably aren't, that's fine. I would encourage you to check out this book. I really like this book. I baked a lot of the

Starting point is 00:39:33 recipes in here and had a good time. You know, I don't get super, super into it. Probably should do it more. But it does take quite a while. It's an endeavor, but it feels good at the end to really eat it. And there is something hugely different about eating a loaf of bread, you know, smelling it, cooling it, eating it that you made versus, you know, going to the store and buying one. And so I think it's something that people should try at least once. Yeah, this is super cool. So I was looking up the author to see if Ken Forkish is the real name or not. It just seems like too good to be true to write a book about cooking and he has fork in it. But so far, everything I look up say it's not a pseudonym. There's actually a real guy named Ken Forkish.

Starting point is 00:40:14 I didn't even think about that until you said it. Yeah, I think he runs a restaurant in, I think, Portland. That's right. Yeah, you got it. He runs a bakery in Portland. And he actually, before opening that bakery in 2001, he worked in Silicon Valley as a tech worker for 20 years. Oh, maybe this is why it resonated. I didn't know this. Yeah. So that's wild. So 20 years, so that means he joined, he went to Silicon Valley in 1981. That was probably when it was literally Silicon, you know, like making chips and everything. Then yes, worked there 20 years, opened a bakery.

Starting point is 00:40:49 Good for him. Very cool. All right. So time for tool of the show. My tool of the show is this app called Pythagoria. I guess sticking with the Greco-Roman theme that I have going here. So Pythagoria is this game where you basically have to solve geometric puzzles. And so that sounds like it would be really boring, like, you know, doing math homework as a game or something. But they actually do a great job of making it really engaging. You know, one of the things that they do really nicely is,

Starting point is 00:41:23 you know, the game is played on this 16-point grid, or maybe, you know, it's more than that. It's a 36-point grid. So there's a six by six grid of dots. And so, you know, the first level is very simple. There's a dot on the left side, dot on the right side. They're like, you know, find the dot in the middle, you tap the middle of the screen, you move on. And then, you know, it kind of tells you, hey, you know, you have two dots, you know, make like an isosceles triangle. And so you can drag lines between the dots to make triangles, but you're restricted by these dots. And the fact that you can only draw a line either from dot to dot, or you can make new dots where two lines intersect, that's where it starts to get really complicated. So for example, there was one puzzle where you kind of needed a point that was

Starting point is 00:42:13 sort of in the middle of four points. So you needed a point where there wasn't one. And so what you have to do to solve that is you can just make a little X, right? So imagine your mind like four points, you know, in a square shape, right? And if you draw the two diagonals, now you have an X, right? And so then the middle of that X, you can actually tap that and now make a fifth point in the middle. And so you can now like when you get to the harder levels, now it really opens it up because you can really make a point anywhere as long as you can figure out how to get two lines to intersect at that place, right? So the hard puzzles start getting really hard where it's like, okay, I need to go to like seven

Starting point is 00:42:57 sixteenths of the way between these two points. And so, like, what lines can I draw to make that happen? And so it's actually really fun. I mean, it's one of these things that's very hard to explain, you know, through audio, but I highly recommend you check it out. The other thing is it's completely free. It's a donationware game, so I went and gave them, I think it's like a dollar or whatever that they're asking for. But you can play the entire game cover to cover, totally free, no ads, nothing like that. And I found it really stimulating. The other thing is, as soon as you get it right, it kind of dings and you know you got it right. You're not really guessing. And so some of the levels, you know, I would stare

Starting point is 00:43:41 at it, stare at it, stare at it, and I'd kind of find out, okay, here's sort of the trick. And then you get that trick, you solve the puzzle. It's very satisfying. I felt like they did a good job with the pacing. A game like this, it's very easy to make a level that's extremely difficult, and then you just can't move on, and it's really frustrating. They did a good job of ramping up the difficulty. And one of the

Starting point is 00:44:06 other things they did to help with the pacing is the game is broken down into chapters, but the chapters aren't in increasing complexity. They're just different phenomena, geometric phenomena. And you can actually play all the chapters asynchronously. So if you get stuck in chapter two, you just go to chapter three. So I felt like that was also really clever game design. And yeah, definitely check it out. Totally free, so there's nothing to lose. Awesome, that's really cool. So now you're going around with, like, a compass and ruler and making dodecagons and showing all your friends? Everything looks like a geometric problem. It's like, okay, you know, this door won't close, let me pull out Pythagoria. That's awesome. Mine is, we might have even had this as a tool

Starting point is 00:44:53 of the show before, but that is Google Keep. I feel a lot of people may have heard of this before, but some may have not, and it's a way of doing note taking. But I think the power that I recently realized about Google Keep is sticking with it and using it and jotting what amounts to kind of like Post-it notes in the app or on the web, or sending links as a way of doing bookmarks, and putting pictures in, and just sort of gathering a lot of unorganized data and then being able to search it, to be able to go to, like, dates. You can do categorization, but even just leaving it messy, and then putting stuff there over time and building it up. And then, you know, something happened to me where I was like, I think I had written this down one time, or I was cooking some

Starting point is 00:45:38 dinner. And I was like, oh, I think last time I debated what temperature to put this at. And like, well, let me go see. Oh, yeah, sure enough, I took a note here because it felt like something I'd want to remember. And so just putting in these little, like, shots of a note or a picture, a recipe, or a link, and being able to find those things later. Anytime I try to figure out something that I knew and couldn't find, I always try to make sure to go put it in there the second time. Because if I needed it, you know, twice, I'm probably going to need it again. And so building that up over time. I think there are some open source

Starting point is 00:46:09 or different alternatives and other platforms that people use. So this is, my tool is Google Keep, which Google allows you to use for free. It comes with all the traditional Google cons, I guess. But that is a pro. But then, you know- Are there ads or no ads?

Starting point is 00:46:26 I don't think I've seen ads, but people in general have a love hate relationship with Google, which I completely understand. And so don't put any private information there, I guess, that you're not willing to share with Google. But, you know, using a tool like this, I guess, would be my shout out, which is something where you can, with very low overhead, not super structured, just sort of put your information in and allow it to kind of accumulate. We talked about the importance of email organization. If you're super organized, maybe you don't need this. I already fessed up in the beginning that it's something I need to do better at. But, you know, I think here, this is a way for me to kind of not lose those little ideas.

Starting point is 00:47:01 Yeah. Do you rely on search then to retrieve the notes? Okay, so the notes aren't, like, hierarchical or anything? No. I've seen stuff and always been intrigued about doing that. I feel like hierarchical or linked notes and stuff would be really cool, except that I just know that I'm going to worry more about where in the hierarchy it goes, and then therefore I'm not going to put stuff in if I don't think it'll fit in the hierarchy. That makes sense. Here it's like, I need to just record it. Yep, yep, totally makes sense. I wonder if maybe we could automatically generate the hierarchy. Oh, what do you mean? Pretty cool. Like maybe with some unsupervised learning.

Starting point is 00:47:51 Speaking of which, if only our episode today was about... How did that happen? Cluster my notes. Clustermynotes.com. That needs to be... Oh, did I tell you? It's a bit of a side topic, then we'll jump in on unsupervised learning. I made a website called visual-if.com.

Starting point is 00:48:16 The idea is, the UI is very clunky, but it's an interactive fiction game that runs DALL-E as you're playing the game. So, you know, you get past the intro screen, and you type in, like, look up, look down, move east, like one of these interactive fiction games like Zork or Adventure, right? But anytime there's a room description or anything, DALL-E is running in the background and you get some crazy, you know, art installation of whatever that is. And it really actually makes the games really fun. If you've played interactive fiction before, if you've played a particular game before,

Starting point is 00:48:51 you could play it again and just see all these really trippy pictures with it. See if it matches up with what you were envisioning when you played it the first time. I'm not sure if I went to the right spot or not. It's either really well done to be confusing or I'm on some other random person's website. So it's Photopia is the game that starts when you...

Starting point is 00:49:15 Okay, yeah, yeah, yeah, yeah. Yes, it says, would you like instruction? Yes, the UI sucks. I'll admit it. So it says, would you like instruction? You actually have to click on that and then type no or yes. Oh, okay, okay. It says, will you read me a story? You have to actually click again, and then now you're in the game. I need some way to, like... maybe I should make a little intro

Starting point is 00:49:35 screen before the game to kind of tell people what they're in for. Nice, this is cool. I mean, these pictures are kind of creepy, I'm not going to lie, the ones I'm getting. Yeah, yeah, there's like a person skiing or something, at least that's what I see. But yeah, it was pretty fun. I might try and double down on that if I find time. So I found out people didn't really understand how to use that product. I just asked people and they're like, I don't really understand what I'm supposed to be clicking on. But as you kind of work on any type of product, definitely you can show it to your friends. You could try it out yourself.

Starting point is 00:50:18 But you will eventually want to show it to strangers, right? And as we talked about with Kevin in the marketing episode, right, eventually you're going to give your either app or website or whatever you're building to the public. And you're not going to be able to really look over their shoulder and find out what they were thinking about it. So, you know, definitely we talked about marketing and surveys and all of that. So there's a whole human element too. But another thing you want to do is you want to kind of gain insights from data. So ideally, you know, imagine if you're making like this app, for example, this visual IF, you know, I could, I mean, I didn't, I didn't implement this, but

Starting point is 00:51:00 you could imagine, you know, I could track where people are moving their mouse or what they're clicking on, or if they're clicking, how long they're spending on that site. And that could, you know, all go into some kind of report that would give me information. And then I could even go a step further and A, B test. I could try a new version of the site, see if it improves, right? The challenge is, you know, all of this data that you're going to get is going to be highly unstructured, right? So imagine, you know, it's going to start with these print logs that maybe you did when you were doing development. So, you know, print, you know, person clicked on a page or print the page ID or print, you know, person move the mouse to the bottom of the page, right?
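
As a rough sketch of the idea above, here is one way development-time print lines could be turned into per-event counts that you could then chart. The `key=value` log format and the names `parse_line` and `count_events` are made up for illustration; the episode doesn't describe an actual log format.

```python
from collections import Counter

# Hypothetical print-style log lines; the key=value format is an assumption.
RAW_LOGS = [
    "user=17 event=click page=intro",
    "user=17 event=click page=intro",
    "user=17 event=mouse_bottom page=intro",
    "user=42 event=click page=game",
    "this line is noise and does not parse",
]

def parse_line(line):
    """Turn 'key=value key=value' text into a dict, or None if malformed."""
    fields = {}
    for token in line.split():
        key, sep, value = token.partition("=")
        if not sep:
            return None  # not a key=value token, so treat the line as noise
        fields[key] = value
    # Only keep lines that carry all three fields we care about.
    return fields if {"user", "event", "page"} <= fields.keys() else None

def count_events(lines):
    """Count (page, event) pairs across all parseable lines."""
    counts = Counter()
    for line in lines:
        record = parse_line(line)
        if record is not None:
            counts[(record["page"], record["event"])] += 1
    return counts

counts = count_events(RAW_LOGS)
for (page, event), n in counts.most_common():
    print(page, event, n)
```

Counts like these are still a long way from "insight", but they are the kind of structured table that a graph, an A/B comparison, or a clustering step can start from.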

Starting point is 00:51:50 And you have to somehow turn that into something that you can look at, some type of graph that you could look at and say, oh, here's something that I can do to make things better. And so that's, at a high level, one of the main things that we want to do with unsupervised learning: take, you know, a lot of raw information. One of the biggest examples that is used repeatedly is Wikipedia. You know, take all of Wikipedia, and can you just learn something from having a computer read all of Wikipedia? So that's basically what unsupervised machine

Starting point is 00:52:22 learning is all about. And yeah, I think, Patrick, you brought this show topic up to the table. I think it's a great show topic and something that I am really passionate about. Yeah, I mean, I think, like Jason was saying, this is taking your data and sort of helping to structure it. And I made the joke about clustering earlier. Jason is a machine learning, let's say, practitioner, right? Like, that's his trade. I'm not. So in general I try not to do it, not because I don't know what it is or can't do it, but just because it comes with certain, I don't know, honestly, expectations. And so when I set out to do work, the engagement model I have with the data that I have in front of me and with the task at hand

Starting point is 00:53:10 is a bit different than Jason might have as a machine learning practitioner. At the same time, I think this area in parts has a lot of overlap between, and there's probably a better word, but sort of machine learning practitioners and sort of other folks. And so I think there are things like, for example, clustering, where there may be features of your data, or heck, there might even already be numbers that you have along multiple dimensions, you know, even just two or three dimensions, where if you clustered them together and looked at them, that would already be quite helpful. And you may say, well, that's not machine learning. You know, there's no neural network. There's no, you know, whatever. But I think that's okay. I think things where you're fitting a, you know, a regression to your data and trying to say,

Starting point is 00:54:00 you know, hey, look, there's numbers here, and I'm fitting a line to it. So I can think about what the next number would be, or between numbers or out past the last data point I have, right? I think those are kinds of things where you are overlapping in a lot of this and learning about those things and thinking about them. And it's another tool in the toolbox. We talk about that all the time. And so I'm pretty excited to talk about some of these things because I think a lot of them have value even to people who wouldn't call themselves machine learning practitioners, but anyone who has data, which turns out to be a lot of people. Most people have some amount of data that they're trying to work with, and ways of organizing, you know, cutting that data down, smoothing the data, thinking about the data, all those kinds of things. Yeah, right. Yeah. So I mean, imagine like you have a bunch of data around, you know, people who visit your website.
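
The "fitting a line" idea mentioned above can be sketched in a few lines of plain Python. This is ordinary least squares on made-up numbers (the `days` and `visitors` data are assumptions, not from the show), used once to estimate a value between data points and once to estimate one past the last data point.

```python
def fit_line(xs, ys):
    """Ordinary least squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Made-up data: visitor counts for days 1..5 (roughly linear by construction).
days = [1, 2, 3, 4, 5]
visitors = [12, 15, 19, 22, 24]

slope, intercept = fit_line(days, visitors)

def predict(x):
    """Evaluate the fitted line at x."""
    return slope * x + intercept

print(predict(2.5))  # between existing data points
print(predict(6))    # out past the last data point
```

Extrapolating past the last point leans entirely on the assumption that the trend stays linear, which is worth keeping in mind before trusting the second number.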

Starting point is 00:54:47 Right. And so you want to set up, you know, clusters, you want to set up kind of cohorts. For example, you might learn that there's somehow a lot of identity around age. So it's like the people visiting your website are either, you know, really young or really old for whatever reason. And so maybe they're coming for two different reasons. You have to figure that part out. But with clustering, looking at the centroids (the centroids are the centers of the clusters that you develop) can give you a lot of information. So for example, you might have a whole bunch of different features, and then you might do clustering, which basically says for each of the data points, and a data point would be sort of like a set of features. So maybe a data point is a person who's

Starting point is 00:55:38 come to play your game and a set of features about them. Like, how do you describe this person, right? And so then, after you're finished with a clustering algorithm, each person is going to be assigned to one cluster if it's a hard clustering. If it's a soft clustering, then they could be assigned to some mixture of clusters, right? But that's neither here nor there. So now you can look at these clusters that people are assigned to, and you could find the center of them. In other words, given this group of people who are all assigned to cluster A, you know, what is the center point there? So what is, like, the person who would be most aligned with cluster A, this hypothetical person who just perfectly lands in the middle of the cluster? And you say, okay, this is sort of an archetype. There's something unique about this group of people. You can also do this with faces if you're trying to do face recognition, or even if you're trying to do object recognition, you can do clustering on images and say, like, I have this huge bank of images and they're falling into

Starting point is 00:56:46 one of several categories. So I think, you know, clustering has been used for tons of different things. You can also then, you know, do machine learning on top of the clusters. What have you used clusters for in your work? Yeah, I think one of the things that came up recently, which I guess is just thinking about your data, is we were doing some processing, and sometimes certain configurations of input were causing the data to take a lot longer to process than other configurations. Sorry, speaking vaguely, but whatever. And so we were trying to kind of understand, like, is there a difference in the features of one? And so what you kind of alluded to is exactly right, which is, hey, we have a whole bunch of measurements, like the size of the input data. Let's just say it was text, right? Like, how many characters are there? How many lines are there? You know, how many punctuation marks would there be, right? Things which we could kind of look at and say, hey, is there something about this that's making the processing take longer or not? And once we sort of, you know, started plotting it out and saying, okay, hang on,

Starting point is 00:57:54 let's look at what the clusters are of these sort of easy inputs and hard inputs. And like, what? Oh, okay. Well, look, these hard inputs all have, you know, a lot of extra punctuation in them. And then, you know, realizing that the processing we were doing was going to cost a lot more when that happened, right? But it was vague enough that it's a bit difficult to sort of know that in advance and just look at your code and say, hey, actually, I see here we do all this extra work on punctuation. It was the sort of second order effect that was causing it. And so doing this clustering, and just looking at the results and saying, oh, look, these things are different than those things,

Starting point is 00:58:34 allowed us to kind of say, hey, up front, let's check for that and handle them specially, or maybe decompose them further or do something special. And so that's what we used it for. Yeah, that makes sense. I mean, one area where you see a lot of clustering is around log ingest and log reading. So imagine you have a website. The website has a MySQL database. It's got servers, backend servers.
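
Going back to the clustering discussion above (hard assignments, plus centroids as a kind of archetype for each group), here is a minimal sketch using k-means on one feature. k-means is just one choice; the hosts don't name a specific algorithm. The `ages` data and the starting centroids are made up, echoing the "really young or really old" visitor example.

```python
def kmeans_1d(values, centroids, iterations=10):
    """Plain k-means (Lloyd's algorithm) on 1-D data with hard assignments."""
    centroids = list(centroids)
    assignments = []
    for _ in range(iterations):
        # Assignment step: each value goes to its nearest centroid.
        assignments = [
            min(range(len(centroids)), key=lambda c: abs(v - centroids[c]))
            for v in values
        ]
        # Update step: move each centroid to the mean of its members.
        for c in range(len(centroids)):
            members = [v for v, a in zip(values, assignments) if a == c]
            if members:  # keep the old centroid if a cluster ends up empty
                centroids[c] = sum(members) / len(members)
    return centroids, assignments

# Made-up visitor ages with two obvious groups.
ages = [16, 17, 18, 19, 20, 64, 66, 68, 70, 72]

# Start one centroid at each extreme so the run is deterministic.
centroids, assignments = kmeans_1d(ages, centroids=[min(ages), max(ages)])

print(centroids)    # the "archetype" age of each cluster
print(assignments)  # which cluster each visitor was assigned to
```

With real data you would typically have many features per person rather than one, and the distance would be computed across all of them, but the two alternating steps stay the same.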

Starting point is 00:59:03 It's got the JavaScript on the client. And all these things are generating logs, right? Your server, your database is generating all these logs. Like, oh, I'm getting full up or, oh, the utilization is too high. Actually, to be honest, have you ever looked at a database log, like a MySQL log? No. I've never done either. I just assume it works. It shows that we're terrible

Starting point is 00:59:27 DBAs. But, you know, it's generating a ton of logs. And, you know, if your database goes down or all of a sudden it takes... Actually, there's a lot of popular websites that are only run by, like, 10 people, you know what I mean? I think Craigslist is famous for having just an extremely small staff for such a popular website. But eventually, you'll run into this where your site just doesn't load. And you're going to have to go step through all of these. So you look at the client, say, OK, the client's fine. You go to the server logs.

Starting point is 01:00:00 And it just says, you know, it's just waiting on the database access, wait database results. And you go to the database. It's like, yeah, utilization is 100 percent, and you end up having to do something. Right. So you're getting all these logs. A lot of it is from code that you haven't written. Right.

Starting point is 01:00:19 Because there are logs from programs that you're using, and you need some way to say, OK, can I separate the signal from the noise, right? Like, is this log actually interesting? And actually, a lot of these systems like Sentry and Bugsnag and these other systems use clustering. So what they'll do is they'll take every log line and they'll do what's called an embedding, which we can get to later, but they'll basically take every log line and turn it into a point in some space. So imagine some cube, but it's like a hypercube. It's like, you know, a 200-dimensional cube or something. Actually, we talked about it... Yeah, I've got the 200-dimensional cube, I'm picturing it in my mind. Isn't that like how Magnus Carlsen plays chess or

Starting point is 01:01:09 whatever, you know? But we talked to Edo Liberty about embeddings on that show. And so, yeah, you have this big cube and you've figured out a way to take these lines of log and put them in this cube, right? So if I have a log line that's printing every second, that's like, you know, things are good. You know, it's like 1901, things are good. 1902, things are good. That's going to look the same, right?
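
To make "turn every log line into a point in some space" a bit more concrete, here is a toy sketch. It is not how any real log system does it (those would typically use learned embeddings); it just counts tokens and compares lines with cosine similarity. Masking digits as `<num>` is a simplification added here, and `embed` and `cosine_similarity` are illustrative names.

```python
import math
import re
from collections import Counter

def embed(line):
    """Toy embedding: token counts, with every run of digits masked as <num>."""
    tokens = re.sub(r"\d+", "<num>", line.lower()).split()
    return Counter(tokens)

def cosine_similarity(a, b):
    """Cosine of the angle between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

ok_1 = embed("1901 things are good")
ok_2 = embed("1902 things are good")
odd = embed("1903 database utilization is 100 percent")

same = cosine_similarity(ok_1, ok_2)      # the two heartbeat lines
different = cosine_similarity(ok_1, odd)  # heartbeat vs. the unusual line

print(same, different)
```

The two heartbeat lines land on the same point, while the utilization line sits noticeably further away, which is the property a clustering step relies on.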

Starting point is 01:01:39 It's always going to say things are good and then some kind of date, right? And so since it's so similar, and even the date, the number, the timestamp is also kind of similar, those will likely end up close together in this space, right, once you've done this embedding. And so you can throw all these logs into that space and then do clustering. And chances are the, you know, things are good message will get its own cluster if there's so many of them, and then you could just throw them all away, right? You could also do things like say, OK, this line of log isn't even really near any of the clusters, so it must be something

Starting point is 01:02:18 pretty unique, pretty special. So maybe for this one you should send an alert or something like that. And that's called outlier detection. And it's a really hard problem, but there's a bunch of great libraries for outlier detection. There's PyOD. And it's all very related to clustering. So what are some other examples of unsupervised learning? Yeah, I think a lot of these words like unsupervised learning, reinforcement learning, a lot of them have become kind of really nebulous, right? As all of these fields have kind of like overflowed, right? But now the hot thing is sort of self-supervised

Starting point is 01:03:09 learning. And the idea with that is it's still unsupervised in the sense that you don't have a human... Actually, we should probably talk about that. So supervised learning is typically where you have a human in the loop. So imagine if I'm playing chess and I train some model to mimic my moves. So if I move the pawn, then I tell some model, hey, when you see this board, I want you to move the same pawn to the same place. And so it's supervised. I'm a supervisor, right? And it is just trying to mimic this, right? Now unsupervised would be a little different.

Starting point is 01:03:52 Unsupervised would be where, for example, you might train an autoencoder. So you might say, here's a picture of a chess board. I want this algorithm to embed that picture. So find a function that takes this picture of this chessboard and creates a point for that picture somewhere in this space. And then I want another function that takes that point and creates the picture again, right? And so you're going from the picture to the point back to the original picture. And so when you do this and you train this model, it ends up having to represent, you know, the essence of that picture in that point.

Starting point is 01:04:44 So, for example, let's say all of my pictures have the same chessboard and it's on a black table, right? And it's the same camera setup. It's like a tripod. So it's very reliable, very stationary. It's all pictures of this black table with this chessboard on it. So it can just recover the black table without needing any extra information, right? Because for every single point that we draw in that cube, when you go back to draw the picture, you're going to need that black table. And so that's where this really powerful compressive ability comes in. So, you know, the points now don't need to differentiate based on the table. They're all going to have the table in it. And so, you know,

Starting point is 01:05:32 if two points are close together, then that means that the two images they generate must be similar, even given the similarities that there are broadly, like they must be even hyperlocal. They must be similar. Otherwise, those two points will get pulled apart. And so the way the autoencoder works is, you know, you generate the chessboard. It's not going to look exactly right. So you have some error. And then you say, OK, you know, this pixel is like too dark or you drew a pawn here and

Starting point is 01:06:02 you really shouldn't have; it's empty. And so, you know, given that you know the right answer, you just tell the model, hey, here's the right answer, adjust yourself. And it will figure out how to use that volume, that embedding volume, in the best way, or in a good way, to be able to generate all of those pictures, not just one of them. Does that make sense? Yeah, I think so. I mean, that was pretty deep. But here, I guess, when we were talking about clustering, you don't necessarily have, like you mentioned there, the right answer. There's no way to necessarily feed back. So you might as a human tune something or do something like the number

Starting point is 01:06:42 of clusters, and there may be algorithms that do that, but there's really no necessarily right answer. When you're talking about this sort of autoencoding, it's still unsupervised, but you're kind of giving it a problem, constricting the amount of information that can be shared between sort of like the left half and the right half, and then trying to say, you need to simplify down to a representation and then reconstitute that representation back to the original, and then look and compare the two. So you have a well-defined metric for saying, like, hey, how well did you do at your task? And so you algorithmically are supervising it. But as a human, you're not sort of like at each interval labeling something or giving a behavior to emulate.
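The encode, decode, compare loop can be sketched with a deliberately tiny linear autoencoder. Everything here is an assumption for illustration: the four 2-D data points stand in for chessboard pictures, the 1-D code stands in for the embedding space, and plain gradient descent on squared reconstruction error stands in for whatever optimizer a real model would use.

```python
# Toy data: 2-D inputs that all lie on one line, so a 1-D code can describe them.
data = [(1.0, 2.0), (2.0, 4.0), (-1.0, -2.0), (0.5, 1.0)]

enc = [0.1, 0.1]  # encoder weights: 2-D input -> 1-D code (the "point")
dec = [0.1, 0.1]  # decoder weights: 1-D code -> 2-D reconstruction


def encode(x):
    return enc[0] * x[0] + enc[1] * x[1]


def decode(z):
    return (dec[0] * z, dec[1] * z)


def reconstruction_loss():
    """Mean squared difference between each input and its reconstruction."""
    total = 0.0
    for x in data:
        r = decode(encode(x))
        total += (r[0] - x[0]) ** 2 + (r[1] - x[1]) ** 2
    return total / len(data)


loss_before = reconstruction_loss()
learning_rate = 0.01

for _ in range(1000):
    grad_enc = [0.0, 0.0]
    grad_dec = [0.0, 0.0]
    for x in data:
        z = encode(x)
        r = decode(z)
        err = (r[0] - x[0], r[1] - x[1])  # the "right answer" is the input itself
        dz = 2 * (err[0] * dec[0] + err[1] * dec[1])  # d(loss)/d(code)
        for j in range(2):
            grad_dec[j] += 2 * err[j] * z / len(data)
            grad_enc[j] += dz * x[j] / len(data)
    for j in range(2):
        enc[j] -= learning_rate * grad_enc[j]
        dec[j] -= learning_rate * grad_dec[j]

loss_after = reconstruction_loss()
print(loss_before, loss_after)
```

No human labels appear anywhere: the training signal is only the gap between the input and its own reconstruction, which is the sense in which the algorithm supervises itself.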

Starting point is 01:07:24 Yeah, that's right. And the reason why all of this kind of comes together is, you know, clustering... imagine you're looking at a group of people, like you're in a helicopter, and you're looking down at a stadium full of people or something, or you're looking at a football game or something, right? There's little dots, like they're maybe the size of ants or something, running around on this football field. So when you cluster, you're going to be using sort of the geometry of the field, right? So if somebody is twice as far away, then that really has a big impact on whether they're going to be in that cluster or not with this other group of people, right? And so for all of clustering, you need to have a space that's pretty uniform. So like, for example, let's say you fed a bunch of features into some clustering.

Starting point is 01:08:13 And one of your features is person's age in milliseconds. And the other feature is person's height in meters, right? Well, like one is enormous, right? Your age in milliseconds, it's a huge number. And so the clustering algorithm will totally ignore the other feature because your height in meters is, you know, I don't know from, you know, I guess 0.5 to three or something. You know, it's such a small range.
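To make the scale problem concrete, here is a small sketch with made-up people, assuming plain Euclidean distance (which many clustering algorithms rely on) and z-score standardization as one common fix. The `standardize` helper and the sample values are hypothetical.

```python
import math
import statistics

MS_PER_YEAR = 365.25 * 24 * 60 * 60 * 1000

# Made-up rows: (age in milliseconds, height in meters).
people = [
    (30 * MS_PER_YEAR, 1.50),
    (30 * MS_PER_YEAR, 2.00),  # same age as the first person, 0.5 m taller
    (31 * MS_PER_YEAR, 1.50),  # same height as the first person, 1 year older
]


def standardize(rows):
    """Rescale every column to zero mean and unit standard deviation."""
    columns = list(zip(*rows))
    means = [statistics.fmean(col) for col in columns]
    stdevs = [statistics.pstdev(col) for col in columns]
    return [
        tuple((value - m) / s for value, m, s in zip(row, means, stdevs))
        for row in rows
    ]


raw_height_gap = math.dist(people[0], people[1])
raw_age_gap = math.dist(people[0], people[2])

scaled = standardize(people)
scaled_height_gap = math.dist(scaled[0], scaled[1])
scaled_age_gap = math.dist(scaled[0], scaled[2])

print(raw_height_gap, raw_age_gap)
print(scaled_height_gap, scaled_age_gap)
```

On the raw features the one-year age gap is billions of times larger than the half-meter height gap, so height contributes essentially nothing to the distance. After standardizing, both gaps are comparable.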

Starting point is 01:08:42 Actually, I guess there's nobody nine feet tall, but anyways, so your height... I'm trying to figure out, the tallest person in the world is what, eight foot? Anyway, so your range is tiny. It's like three units, right? But your age in milliseconds is enormous. And so the clustering algorithm will just cluster ages

Starting point is 01:08:59 and, you know, not pay attention to the other one. And so, you know, if you're trying to cluster images or text or some of these things, you quickly run into this problem where the thing isn't geometric. And so the clustering can't really take into account different dimensions in a way that's fair. And so the nice thing about this autoencoding is the way that the loss kind of propagates backwards from the correct chessboard to that latent space to the input chessboard, the way that those dots move and the way that the things kind of shake up ends up

Starting point is 01:09:43 creating like really nice spaces where all the dimensions have relatively the same importance. This is interesting. So yeah, so you're training both halves, but you may be taking, I guess you were calling it like the latent space in the middle, the encoded thing, and using it as input to other parts of your system

Starting point is 01:10:02 or sort of like clustering in that sort of more well-formed space so that you can say things about it, even if you never end up reconstituting. Like, there's really no reason for you to get back to the original chessboard. You had it as input. You could just use it. You didn't really need that part, but it helps you to get that middle part that you could then use to do clustering on. Yeah, exactly. So now let's imagine we have a bunch of people who go to your website, or we could even stick with the football analogy. We have a bunch of football players, right? And we have a bunch of statistics about them. And these statistics are all over the place. Some of them are important. Some of them aren't

Starting point is 01:10:42 important. The units are all different. And so if we just feed these players into some clustering algorithm, then it's going to have a really hard time. Maybe basketball. I know more about basketball. So basketball, people score a lot of points, but their height in feet, let's say in decimal feet, is going to be relatively small. So they might score, you know, 20, 30 points, but they're only like seven feet tall. And so you have different scales there as well, right? You know, or assists or rebounds, number of minutes played. And so all of these have different units, and even slight differences in units can really matter. You could do some type of, let's say, contrastive learning, which is a self-supervised approach. So you might say, here's a list of players who I felt played very similarly.

Starting point is 01:11:39 So I could come and say, okay, you know, Shaquille O'Neal and Dikembe Mutombo, they're both centers, really tall, strong people who are strong enough that they can just push their way through and dunk the basketball, right? Those people are very similar. So I'm going to pull these two people together. So, you know, whatever their features are, we're going to create sort of a point for these two people based on their features. And then we're going to say these two points need to be closer together. Then I'm going to take, you know, Shaquille O'Neal and like Anfernee Hardaway. And so Anfernee Hardaway is like a three-point shooter, like, from a basketball standpoint, a small person who goes and shoots shots from far away. So these people clearly are

Starting point is 01:12:25 far apart. You might even actually just use the positions, right? You might say, okay, all the centers should be close together. And then take two people who are from two different positions, they should be far apart. And so in this way, you're not, it's not supervised learning because you're not saying, okay, you know, Shaquille should be right here or, you know, Shaquille should make like this many points or something. You're basically saying, you know, these people should be closer together. These pairs should be close together. These pairs should be far apart. And it's contrastive learning. It's self-supervised if you can automate all of that without a human in the loop. Basketball is a weird example because at some point a human did decide

Starting point is 01:13:11 you should be a center, right? So maybe not the best example, but you could even imagine like doing contrastive learning on images. So you could say, here's a bunch of images that are on the same website. And because just by virtue of them being on the same website, they should be pulled together. And then here's two images from two different websites. They should be pushed apart. And so if you do this and you have a low learning rate, because that's going to be a weak signal, right? But if you do this and you have a ton of images and you've scraped a lot of the internet,

Starting point is 01:13:42 you'll end up with an embedding that's really powerful. And so contrastive learning and autoencoding, where you feed in the same thing that you're trying to predict, are two ways of generating really nice spaces that you can then do clustering and other things with. This is awesome. So we were sort of giving examples and sort of saying the algorithmic approach, I guess, to doing this. What are some applications of what people do with the... I mean, we talked about outlier detection for logs.
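The pull-together, push-apart idea from the basketball and website examples can be sketched as a pairwise loss. This is a simplified illustration, not how any particular system does it: the points are free 2-D coordinates rather than the output of a learned function of each player's stats, and the player labels, `MARGIN`, and `LEARNING_RATE` are made-up values.

```python
import math

# One free 2-D point per (hypothetical) player.
points = {
    "center_a": [0.0, 0.0],
    "center_b": [1.0, 0.0],
    "guard_a": [0.5, 0.2],
}
similar_pairs = [("center_a", "center_b")]
dissimilar_pairs = [("center_a", "guard_a"), ("center_b", "guard_a")]
MARGIN = 2.0          # dissimilar pairs are pushed until they are this far apart
LEARNING_RATE = 0.05


def dist(a, b):
    return math.dist(points[a], points[b])


def step():
    """One gradient step on the summed pairwise contrastive loss."""
    grads = {name: [0.0, 0.0] for name in points}
    for a, b in similar_pairs:
        # Loss is d^2, so the gradient pulls the two points together.
        for k in range(2):
            diff = points[a][k] - points[b][k]
            grads[a][k] += 2 * diff
            grads[b][k] -= 2 * diff
    for a, b in dissimilar_pairs:
        d = dist(a, b)
        if 0 < d < MARGIN:
            # Loss is (MARGIN - d)^2, so the gradient pushes the points apart.
            scale = -2 * (MARGIN - d) / d
            for k in range(2):
                diff = points[a][k] - points[b][k]
                grads[a][k] += scale * diff
                grads[b][k] -= scale * diff
    for name, grad in grads.items():
        for k in range(2):
            points[name][k] -= LEARNING_RATE * grad[k]


similar_before = dist("center_a", "center_b")
dissimilar_before = dist("center_a", "guard_a")
for _ in range(200):
    step()
similar_after = dist("center_a", "center_b")
dissimilar_after = dist("center_a", "guard_a")
print(similar_before, similar_after, dissimilar_before, dissimilar_after)
```

Note that nothing says where any point should be in absolute terms. The only signal is which pairs belong together and which do not, and pairs already farther apart than the margin are left alone.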

Starting point is 01:14:17 I think that was a good one. We were talking about classifying things. What are some other examples of applications of this process? Yeah, I mean, you know, all of the language processing is now pretty much done in this way. So, for example, you know, it used to be that if you wanted to train a model, let's say, to translate French to English, you know, you would have to pay people to, you know, manually translate tons and tons of things, like literally millions of sentences. And then you would train, you know, your model on these sentences. And you would have some translation that just goes from French to English, right? It's extremely expensive. Right. So now what they do is, you know, they will do what's called a word embedding.

Starting point is 01:15:11 So basically, there's a whole bunch of different ways to do this. So one way would be a self-supervised approach where you say, given all the words up to this word... So, you know, like, what was it called? Like the brown... What is that one that you see all the time? Brown fox jumps over the lazy dog? Yeah, that's it. OK, so you say, like... I didn't know this, that the reason that's a sentence is because for handwriting, it's like the shortest sentence that uses all the letters of the alphabet, or a very short sentence which uses all the letters of the alphabet. So it was a penmanship test.

Starting point is 01:15:47 I never knew that. I learned that like a week ago. What? Shut the front door. Wait a minute. And then I was like sitting there counting them all. Yeah. Nope.

Starting point is 01:15:55 Nope. Yep. They're all there. Wow. Oh my gosh. My mind is totally blown. I feel like that. Have you seen that video where the guy pretends to do a magic trick and he takes the straw

Starting point is 01:16:05 and he has his friend like put the straw behind the other guy's back and it blows his mind. Anyways. Yeah. That's me right now. So, wow, that's freaking awesome. Okay. So, so is what the quick brown fox or no, the quick, I thought the dog was brown. The brown fox jumps over the lazy dog.

Starting point is 01:16:23 I thought the dog was brown. Anyways. So let's say the quick brown fox. Any word probably works. So, you know, this algorithm will learn, you know, like you give it the quick brown and then it has to produce fox, right? Now, if you look at that in isolation, that's like almost impossible. But you give it a ton of these sentences. And, you know, in every sentence, you say, okay, here's the first word, predict the second one. Okay. Here's the first two words, predict the third one, predict the fourth

Starting point is 01:16:49 one. Right. And you give it, you know, all of Wikipedia or something. Right. And so, you know, yeah, I mean, for some of these subjects and objects, it's going to be really difficult. Like, the quick brown, it could be anything. But you're going to also see a lot of correlation. So you'll notice, like, whenever you see "of," maybe you see "the" afterwards very commonly. And so you'll actually learn a lot of structure from doing that, from what they call a forward model, right? And the awesome thing is it's effectively free. This is another thing. Because of Moore's law and because computers have become so cheap and so efficient, you know, it's really the people time that's the killer for a lot of these things. Like if you can eliminate the time that a human has to do something,

Starting point is 01:17:37 you're in really good shape. And so, you know, with these forward models, you just download Wikipedia. I mean, you could do this on your laptop right now and predict the next word. You don't have to pay any humans to rate any sentences or anything. So now you have this embedding, which says, given a part of a sentence, I have some point in some space based on what word's coming next. And so, you know, sentence fragments where the next word is going to be fox will all kind of be close together, right? It turns out now if you do that same translation problem, but you work with that embedded space instead of with whole sentences, you need, and it's been a while since I saw this, but I think it's like one one-thousandth of the data or something like that. I mean, it's extraordinary. I mean,

Starting point is 01:18:29 the difference is unbelievable. And so, you know, what used to take millions and millions of sentences, you know, now after like 10,000 sentences, you're done. And there's even models now where they've actually done this jointly with different languages. And so they embed like literally every language into the same space. And then all they have to do is train the second half of it, which does the translation part. Yeah, so all of natural language processing completely redone with self-supervised learning. Like it's massively changed that field. And I think even with image processing, you're starting to see a lot of interesting things. The image equivalent of this is where you basically cut a piece of an image out and you say, reconstruct that missing piece of the image. It's like, remember the magic eraser, that Adobe Photoshop thing that was really popular like 10 years ago? Content-aware? Yeah, yeah, yeah. So you could actually, yeah, you could erase a person and it'll fill the

Starting point is 01:19:39 behind them, right? So imagine, you know, you cut the person out, but your intent in this case isn't to literally cut them out of the picture. It's to see how good your reconstruction can be. And you immediately know what the algorithm did right and didn't do right. So in this case, you actually wanted to generate. So you wouldn't do something really difficult like cut out an entire person. You'd randomly cut out squares. And some of the time it would be impossible because you cut out a whole car or something. But most of the time you'll cut off parts of things, and you'll be able to, like, if you cut out one person's eye, you just copy their other eye or whatever. Yeah, or something. Yeah, exactly.
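A rough sketch of how such training pairs could be produced, assuming an image is just a grid of numbers. The `mask_random_square` and `patch_error` helpers are hypothetical, `None` is an arbitrary marker for a removed pixel, and the mean-fill baseline stands in for a real model only to show how a reconstruction would be scored.

```python
import random


def mask_random_square(image, size, rng):
    """Cut a random size-by-size square out of a copy of `image`.

    Returns the masked copy, the removed patch (the training target),
    and the (top, left) corner of the hole.
    """
    height, width = len(image), len(image[0])
    top = rng.randrange(height - size + 1)
    left = rng.randrange(width - size + 1)
    target = [row[left:left + size] for row in image[top:top + size]]
    masked = [list(row) for row in image]
    for r in range(top, top + size):
        for c in range(left, left + size):
            masked[r][c] = None  # None marks a removed pixel
    return masked, target, (top, left)


def patch_error(predicted, target):
    """Mean squared error between a predicted patch and the true patch."""
    diffs = [
        (p - t) ** 2
        for p_row, t_row in zip(predicted, target)
        for p, t in zip(p_row, t_row)
    ]
    return sum(diffs) / len(diffs)


rng = random.Random(0)
image = [[float(r + c) for c in range(6)] for r in range(6)]  # toy 6x6 gradient
masked, target, (top, left) = mask_random_square(image, 2, rng)

# Naive baseline "model": fill the hole with the mean of the visible pixels.
visible = [v for row in masked for v in row if v is not None]
mean_fill = sum(visible) / len(visible)
baseline = [[mean_fill] * 2 for _ in range(2)]
baseline_error = patch_error(baseline, target)
print(top, left, baseline_error)
```

Because the removed patch is kept, every masked image comes with its own right answer, so the error of any reconstruction is known immediately without a human labeling anything.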

Starting point is 01:20:20 and so same kind of thing so you you have this model that reconstructs things by putting them into this big described. And then they trained another model on images. And then they created another thing which said, I have captions for images. So I have like a picture of the quick brown fox jumping over the lazy dog. And then I have that caption. Those two points should be close together. And then I'm going to take captions that don't belong with their picture, mismatched caption picture pairs, and those points should be far

Starting point is 01:21:11 apart. And I'm going to take those input embeddings and now train, you know, another what's called a joint embedding that tries to unify or push apart those pairs. And that's how DALL-E works. So then, you know, when you go to OpenAI and you say, you know, astronaut eating ice cream on the moon, it's taking those three models, the language model, I guess it doesn't need the image model anymore, but it's taking the language model and, oh no, it does. It needs the language model, the image model, and this joint model. It's using all three of them to generate that picture of that astronaut on the moon. And the question is, is the Artemis capsule floating around the moon in the background

Starting point is 01:21:57 to tie this? We should just type Artemis into DALL-E and see if it's on the moon or not. I think, but they have exclusions for a lot of proper names and nouns and stuff, so I don't know how that works. Ah, really? Yeah, I used it a little bit. I found it to be, you know, really captivating. There's something powerful about that. I've always been a really big fan of Dalí, but other than this visual interactive fiction, I haven't found a practical use for it. Well, it feels good to do a duo episode. I know it's been a while, but going through the habit of the first few, well, many, many episodes of you and I doing this together. It feels good to do it again,

Starting point is 01:22:45 do our tools of the show, book of the show, news, and then this discussion about machine learning was a really good time. Yeah, definitely. I think these are all very accessible, approachable things. You can use SageMaker or other tools. You can train on all of Wikipedia without having to download it to your desktop if you don't want to. I think I saw, you know, training that model I just talked about run you like, like 30 bucks or something, which, you know, is the price of like going to the movies. So it's not it's not nothing. But it's also like, pretty amazing that for 30 bucks, you could train a model on, you know, the entire Wikipedia corpus, and it'll come out correct and everything. So, you know, it's a lot of fun, amazing times we're

Starting point is 01:23:32 living in. And I guess as a final thing, anything you build, you're going to need to collect some type of metric to understand the people who are using your product. And so this is a really good area for folks to brush up on. All right. And with that note, yeah, it's really awesome doing a duo episode. Looking forward to seeing this one come out, looking forward to seeing if we're right or not about our prediction, and really looking forward to your emails. We've been getting a ton of really great emails. So appreciate everybody out there. We do read them, even if Patrick has them marked as unread.

Starting point is 01:24:11 He has looked at the subject. I actually, I think both of us literally read every email that we get on Programming Throwdown. So we really appreciate your support and supporting us on Patreon and Audible. So thanks so much. Definitely subscribe if you're not subscribed to the show using whatever podcast catcher. We should be on all of them at this point. If we're not, let us know. And we will catch you all in two weeks. Music by Eric Barndollar. Programming Throwdown is distributed under a Creative Commons

Starting point is 01:24:57 Attribution-ShareAlike 2.0 license. You're free to share, copy, distribute, transmit the work, to remix, adapt the work, but you must provide an attribution to Patrick and I and share alike in kind.
