Programming Throwdown - Optimization

Episode Date: August 31, 2016

This show covers software optimization (how to make software run faster). Show notes: http://www.programmingthrowdown.com/2016/08/episode-57-optimization.html ★ Support this podcast on Patreon ★

Transcript
Starting point is 00:00:00 Programming Throwdown, Episode 57: Optimization. Take it away, Jason. Hey everyone, I wanted to start with a public service announcement. I've seen this so many times, and it just absolutely drives me up the wall. It's one of these things where, I don't know, maybe I'm becoming a crotchety old man or whatever. But I saw this, so I'm going to exaggerate a bit, but basically someone said, oh, there's a 1% chance of doing
Starting point is 00:00:45 this event, so I have to do it a hundred times to get it. And you see this a lot: like, oh, there's a 10% chance, so that means you have to do this 10 times. It just absolutely drives me up the wall. It doesn't work that way. If you have a 1% chance of doing something, it doesn't mean that you have a hundred percent chance if you try a hundred times. It's actually more complicated than that, though the math is pretty simple: if you have a 1% chance of doing something, then that means you have a 99% chance of failing.
Starting point is 00:01:27 So the chance that you succeed even once is 1 minus the chance that you fail every time, right? So if you have a 1% chance of success, you have a 99% chance of failure, and you take 0.99 to the 100th power. And I will put this into Wolfram Alpha; it is not 100%. Let's see what it says. This is really exciting radio right here.
Starting point is 00:02:00 I like that you called it radio. 0.36. So if you have a 1% chance and you make a hundred attempts, then it's 1 minus 0.36, so 0.64. So you have about a two in three chance of succeeding at least one time out of a hundred. And, you know, I've been looking a lot at polls. It's politics season, so everyone tries to be a statistician, and that's cool. I think this is one of the few times in the four-year cycle where people really care about statistics, which is cool. But don't make this
Starting point is 00:02:36 mistake; it drives statisticians crazy. That's a good point. And it should also be pointed out that, if you think about the formula as you were saying it, as you get closer, you start slowing down, right? You're taking off a percentage each time, but the amount you gain gets smaller. I'm saying this really poorly. Basically, you did it a hundred times and you got to a two-thirds chance, but if you do it a hundred more, you're not going to get to a hundred percent; you're just going to close two-thirds of the remaining gap, right? And so the more you do it, the more you approach a hundred percent chance, but you'll never actually reach a hundred percent except at infinity, because there's always going to be some chance that you fail every time. That possibility has to exist, right?
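The arithmetic the hosts are doing here is a one-liner. A quick sketch, using the 1%-chance, 100-attempt numbers from the discussion (the attempt counts are just the examples mentioned on air):

```python
def p_at_least_one_success(p, n):
    """Chance of succeeding at least once in n independent tries,
    each with success probability p: 1 minus (failing every single time)."""
    return 1 - (1 - p) ** n

print(p_at_least_one_success(0.01, 100))  # ~0.634, about a 2-in-3 chance
print(p_at_least_one_success(0.01, 200))  # ~0.866: closer, but never 100%
```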
Starting point is 00:03:25 And it's a non-zero possibility. That's right. So yeah, if you're curious about this thing, it's the binomial distribution. You can go on Wikipedia, learn all about it. It's pretty interesting. In general, like even if you're not, you know, really into stats,
Starting point is 00:03:40 just reading about the different distributions is pretty interesting, because each one comes from a different phenomenon in nature. They're all derived from people's experiences with the world, and just reading about them is actually pretty interesting. Every time I read about statistics, I'm reminded how bad I think everyone is at it. I guess even statisticians probably have to be very deliberate in their thinking when they hear things and what they mean, because I feel like humans are generally just kind of bad at odds. Yep, yeah, totally. The one that we've
Starting point is 00:04:19 talked about before, but for people for whom this might be their first show, the Monty Hall problem is always the best example of that, I think. Yeah, although, you know, that one is kind of confusing. So have we talked about it before? Should we give a brief description? We'll recap it. Someone actually emailed me today to say that when we refer to something we've said before, we should say it again, if that makes any sense.
Starting point is 00:04:42 So the Monty Hall problem is this. You have three doors. They tell you that one of the doors has a car; the other two doors have something else, like a goat or something. Yeah. So you pick a door, let's say door number one. Then what the host does is pick a door, but not the door you picked; that's the key point. If you say, I'm interested in door number one, then the host has to choose either door number two or door number three and open it for you. But he knows what's behind each door, so he will reveal a door with a goat. That's right, that's right. So he knows: if there are goats behind doors two and three, he'll pick one at random, fifty-fifty chance. If there's a car behind one of those doors, he'll pick the goat door
Starting point is 00:05:35 but the thing is, then he'll come to you and he'll say: you chose door number one, I showed you what's behind door number two; do you want to stick with your choice, or do you want to change? And we'll reveal what's behind your door, and you get whatever's behind your door. That's right. And so you should change. Well, so the question is: does it matter if you stay, or what is the best strategy, right? And the answer is that you should change your guess. And it seems like it doesn't make sense; why should you change it? But the key is that the host can't open the door you chose. And because he can't do that, he has to choose one of the other two doors, and he has to choose a goat door. Because of those two things, he's giving you information you didn't have before. It seems like he's choosing a door at random, but you are actually forcing his hand by picking a door in the beginning, not always, but sometimes, and the fact that you did that carries some information. So it's one of these statistical paradoxes, but it's super interesting. You should look it up.
Starting point is 00:06:51 It is really challenging. And even when I know the answer, I still don't find it intuitive. Like I have to reason about it. But the funny thing is the two strategies aren't even close. Like always staying or always switching. Always switching yields significantly better odds than always staying. Yep. And the more doors you have, the better your... I guess it diminishes. But yeah.
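Since the two strategies "aren't even close," a quick simulation of the game as just described makes the gap concrete (the trial count here is arbitrary):

```python
import random

def play(switch, n_doors=3):
    """One round of Monty Hall; returns True if the player wins the car."""
    car = random.randrange(n_doors)
    pick = random.randrange(n_doors)
    # Host opens a door that is neither the player's pick nor the car.
    opened = random.choice([d for d in range(n_doors) if d != pick and d != car])
    if switch:
        # Switch to a door that is neither the original pick nor the opened one.
        pick = random.choice([d for d in range(n_doors) if d != pick and d != opened])
    return pick == car

trials = 100_000
stay = sum(play(switch=False) for _ in range(trials)) / trials
swap = sum(play(switch=True) for _ in range(trials)) / trials
print(f"stay: {stay:.2f}, switch: {swap:.2f}")  # roughly 0.33 vs 0.67
```

Always staying wins about one time in three; always switching wins about two times in three.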
Starting point is 00:07:17 Right. That's right. I mean, as the number of doors goes to infinity, the fact that you forced him not to pick one of them becomes negligible. Right. Okay, so I was emailed as well, I actually think this was a week ago, by Abdullah. And he was saying that four years ago, so we're in the Olympics now in August, and four years ago was also Olympic time, I guess that would have been
Starting point is 00:07:52 in London. And Jason made a prediction about the 2016 Olympics. That's right. I predicted that we would be able to watch... what's it called? I forgot. There's a thing where you can watch videos over BitTorrent; it was pretty hot four years ago. But the idea is, think of Netflix, but all of the packets that comprise the video are just coming from random people on the internet. It would be a Netflix-like thing. I think it's called BitTorrent video,
Starting point is 00:08:23 but it's a Netflix-like thing. You go in, you just start watching a video, but it's all crowdsourced; no one's paying for the servers. And I kind of felt like the Olympics in 2016 would be served over one of these anonymous peer-to-peer kinds of things. And it didn't happen, actually. I don't think you can even watch the Olympics online without a cable subscription. I don't think so. I'm sure you can watch clips, and there are probably quote-unquote pirate streams, sure. Yeah, but you can't get an official Olympic feed, which is really kind of a shame, because the whole point of the Olympics is amateur athletics. Not anymore. Yeah, I kind of thought four years ago that it would go that way, but instead of peer-to-peer Olympic streams, we have all these
Starting point is 00:09:19 NBA superstars, which is, you know, cool too. Well, I thought I was going to blindside you, but apparently you had forewarning of this prediction, so that's okay. Maybe I'll try again. So my first news article is, and bear with me, this is going to be a little strange: "An Interactive Sudoku Zero-Knowledge Proof" is the title of the article. And this is something I didn't know about. I was like, wow, this is weird, what is this? And then I clicked on it, read about it, and I was like, whoa, this is awesome.
Starting point is 00:09:51 So if you... that sounded really surfer-ish, I apologize. If you've heard of zero-knowledge proofs before... I had kind of heard of them in passing, often associated with blockchain stuff, so Bitcoin and Ethereum and the like. A zero-knowledge proof is a way of proving you solved a problem. For instance, I solve a problem, and I want to communicate to Jason, and prove to him, that I have a solution to the problem, for him to know that I have the solution to the problem
Starting point is 00:10:28 without telling him the solution to the problem. So one example would be: if there was a really large number, and it was a composite number and I had factored it, or it was a prime number and I had proved it was prime, and I actually don't know how it would work with primes, I'm just making this part up. And I wanted to communicate to you, and prove to you, that I had actually done that work, but I didn't want to tell you what the answer was until you accepted my proof. This is kind of the way you would go about doing that. So in this example... and this article actually links to a previous article by the same author about zero-knowledge proofs more generally, so you should read that as well. But the zero-knowledge Sudoku one
Starting point is 00:11:15 was kind of cool because it was a very easy-to-grasp explanation. The other one goes through a graph coloring problem, which is a little more abstract. But basically... well, Sudoku is a graph coloring problem, right? If you told me it was, I would agree with you; if you asked me cold, I wouldn't know. Oh yeah, Sudoku is a graph coloring problem where you have nine colors, and you draw edges between all of the squares that can't be the same number, you know what I'm saying? Oh, okay. So it isn't a graph in the way it's presented, but you could represent it as a graph. I got it, yes. Yeah, right. I hadn't thought about it that way before, but I guess it makes sense. And so, I guess, the rules of Sudoku,
Starting point is 00:12:04 or, if you've played Sudoku before... what this article does is explain it, and then it actually has a little... I guess we would have called it an applet before, but applets aren't a thing anymore. Yeah, that's kind of sad. Actually, I don't even know; I guess because applets aren't a thing and Flash isn't a thing, it must be HTML or JavaScript. Yeah. And what this is: if you've played Sudoku before, this would be a way, given a puzzle with a few of the numbers filled in, where I solved it and filled in all the numbers correctly, and there's always a unique solution to a proper Sudoku puzzle, only a single unique solution, and I wanted to
Starting point is 00:12:43 communicate that to Jason, for him to know that I completed it without giving him the answer; this walks through how you would do it. And it has both sides: what my side would be as the solver, and what Jason's side would be as the checker, I think they call it; mine would be the prover, and I forget what the exact terms are. Anyway, check out the article; it's really informative. And it's something I didn't know anything about other than hearing the term in passing, and now I actually am very intrigued. I kind of want to dig into it a little more and understand it.
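For a flavor of what the prover and checker sides look like, here's a drastically simplified commit-and-reveal sketch. This is not the article's actual protocol — a real zero-knowledge version permutes the digits with a fresh random mapping each round (so the revealed cells leak nothing) and repeats over many random challenges — but it shows the two mechanical pieces: committing to every cell, then opening only the challenged ones:

```python
import hashlib
import secrets

def commit(value):
    """Bind to a value without revealing it; the opening is (value, nonce)."""
    nonce = secrets.token_hex(16)
    digest = hashlib.sha256(f"{nonce}:{value}".encode()).hexdigest()
    return digest, (value, nonce)

def prove_row(solution, row):
    """Prover: publish commitments for one row; keep openings to reveal on challenge."""
    commitments, openings = [], []
    for value in solution[row]:
        digest, opening = commit(value)
        commitments.append(digest)
        openings.append(opening)
    return commitments, openings

def verify_row(commitments, openings):
    """Checker: each opening must match its commitment, and the row must be 1..9."""
    for digest, (value, nonce) in zip(commitments, openings):
        if hashlib.sha256(f"{nonce}:{value}".encode()).hexdigest() != digest:
            return False
    return sorted(value for value, _ in openings) == list(range(1, 10))
```

The commitment step is what stops the prover from changing answers after seeing the challenge; the repetition and per-round permutation (omitted here) are what make the real protocol zero-knowledge.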
Starting point is 00:13:13 And so if you've ever been curious about it, or heard about it, or never heard about it before, I'd recommend this article. Either search "Interactive Sudoku Zero Knowledge Proof" or we'll link it in the show notes. Yeah, this is really cool. I mean, they literally have a thing where you can fill out a Sudoku board, and it'll show you how to do the verification and everything; it'll walk through it step by step. Yeah. And then
Starting point is 00:13:36 in the previous article they'd written, they do mention: if I wanted to use an escrow account on the blockchain, how would that work? How would I have something where I want Jason to know that what I have is legitimate, have him put money in escrow and send it to him, and basically have the blockchain itself do the brokering, so we don't have to rely on a third party? That was really interesting as well. Cool. Very cool. Yeah, my news is OpenStreetMap City Blocks. This is pretty cool. The idea is OpenStreetMap.
Starting point is 00:14:14 We may have talked about it before; I've actually used it for several projects. OpenStreetMap is, as the name suggests, an open-source map of all the streets, and it also does coastlines and forest boundaries and all that stuff. So imagine you have some cool project where you want to count the number of streets in your neighborhood or in your zip code or something like that. You can download the OpenStreetMap data set for your zip code.
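As a tiny sketch of that kind of project: an OSM extract is XML in which a street is a `way` element carrying a `highway` tag, so counting streets needs only the standard library (the file name in the comment is hypothetical):

```python
import xml.etree.ElementTree as ET

def count_streets(osm_xml):
    """Count OSM ways tagged as highways (roads, paths, etc.)
    and collect their names when present."""
    root = ET.fromstring(osm_xml)
    names = []
    count = 0
    for way in root.iter("way"):
        tags = {t.get("k"): t.get("v") for t in way.iter("tag")}
        if "highway" in tags:
            count += 1
            if "name" in tags:
                names.append(tags["name"])
    return count, names

# e.g., with a downloaded extract:
# count, names = count_streets(open("neighborhood.osm").read())
```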
Starting point is 00:14:43 And they even have a little map view where you can draw a little rectangle and crop out your neighborhood. You will end up with an XML file, and you can parse that file. They have a bunch of tools, and you can count: how many streets are in my neighborhood? How many traffic lights are in my neighborhood? You can even plot them on the screen
Starting point is 00:15:02 and stuff like that. So this guy used Python and GeoJSON, which... I actually hadn't heard of GeoJSON, but it's a JSON format with a spec for geodata. And he actually walks through, with Python code, how to determine the city blocks in an area. It's just one of these cool step-by-step posts, with good visualizations and tools and source code, on how to do any type of mapping-data work. So if you're into that, if that project sounds cool, check out what this guy has; it's pretty interesting. Working with geospatial data is always kind of cool, because it's inherently visual, or I feel like it is, so you always get to make a cool map or something at the end. Yep, yeah, totally. Also, we keep calling these news, but I think all of ours are links this week, or this month. Yeah. My next link has actually
Starting point is 00:16:04 been around for a while, apparently, but I had never seen it before, and I'm thoroughly fascinated; I'm adding it to the queue of things I need to work on, or want to play with in spare time. I will say I do have open on my computer the puzzle competition that we talked about last month, and I didn't get to any of the puzzles yet. I meant to do it before the show and I just didn't, so I still owe everyone my attempt at those puzzles. All right, cool. But these puzzles are called the Cryptopals crypto challenges. This is a group of puzzles around breaking cryptography. So everyone has kind of heard that, for instance, the SHA-1 hash is broken and you shouldn't use it, like it's not secure. But I would venture to say most of us don't know what
Starting point is 00:16:52 that really means. Yeah, I have no idea what that means. We also know, like... oh man, I always forget it, because now we're all on WPA, but what was it before WPA for Wi-Fi? WEP. WEP encryption is broken. And so this set of crypto challenges starts out with some programming tasks to learn, and then they build on that by having you break encryption using the tools you've built, or at least that's what I gather from kind of skimming it. And the kinds of attacks you're doing and analysis you're doing increase in complexity as you work through them, up to what seems like real-grade, actual encryption. So these aren't just toy cipher schemes; these are real things, with explanations about them. So I feel like
Starting point is 00:17:44 working through it, you would learn really well about the attacks, maybe even being able to do your own attacks, and also about the schemes themselves, because what better way to learn about an encryption scheme than by figuring out how to break it? So when they say broken, do they mean that they found some assumptions you can make where you don't have to brute-force as much? Or do they mean that computers are just fast enough that the keys aren't long enough or something? I think it means both. So in the first case, people might say that 32-bit RSA encryption, which no one uses, is no longer secure. And what they mean by that is that, with enough compute power, factoring a 32-bit encryption key is so fast that your message won't stay secure for very long, or long enough. In other words,
Starting point is 00:18:34 instead of lasting a hundred years, it would only be secure for... well, at 32 bits, a really short time. But that's kind of what they mean. In that first sense it's not broken so much as insecure. But there's also, as you pointed out, the case where you find a flaw, which means that even though you think you have X bits of safety, you don't, because there's some flaw that allows someone to exploit something and find a way around it. For instance, in the RSA algorithm, if someone figured out, oh, you have a 256-bit number, but I don't actually need to factor the 256-bit number, I only have to factor this 8-bit number instead, then that would be kind of broken. Got it, yeah, that makes sense. At least that's my understanding, and those are kind of the two ways.
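To make the "key too short" sense of broken concrete, a toy sketch: even naive trial division cracks a 32-bit modulus (here, the product of two 16-bit primes picked purely for illustration) essentially instantly, while the same loop is hopeless against a real 1024- or 2048-bit RSA modulus:

```python
def factor(n):
    """Trial division: return a nontrivial factorization of n if one exists.
    Instant for 32-bit numbers, which is exactly why such short keys are insecure."""
    d = 2
    while d * d <= n:
        if n % d == 0:
            return d, n // d
        d += 1
    return n, 1  # n is prime

# A 32-bit semiprime falls immediately:
print(factor(65521 * 65537))  # (65521, 65537)
```

Doubling the key length squares the work for this attack, which is why real RSA keys are hundreds of digits long and why "broken" for them usually means a structural flaw rather than brute force.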
Starting point is 00:19:28 Cool. Yeah, I'll have to check this out. I love things like this. Cool. Yeah, mine is pretty simple. So there's obviously a lot of programming languages, with new languages coming out all the time. And as languages first come out, I think we did this with Rust a long time ago, we usually cover them in something like this, because we don't really know what the impact is going to be. And this is one of these languages. It's called Simit.
Starting point is 00:19:55 It's a language for physical simulation. This is pretty cool. The idea is, it's all on one machine, but it handles these... so what's a good example of this? Let's take, for example, some type of water dynamics: you have droplets of water that are bouncing around in some physics engine, right? Well, you could start coding this yourself. You could say, oh, I'll have a water class, and the water class will have a velocity. And then all of a sudden you find, oh, I have 300,000 instances, and my program is just breaking apart from handling all these objects and virtual functions 300,000 times,
Starting point is 00:20:39 and you're updating it 60 times a second. That's where a lot of the usual programming principles kind of fall apart: at that huge scale, running that quickly. And so that's where you really need something that can both be good on the architecture side and also give you incredible performance. And that's where a lot of these kind of niche languages are important. So Simit runs on CPU or GPU. And the idea is, you kind of explain the dynamics for a particular particle, and I'm saying particle, but it could be any
Starting point is 00:21:18 you know, unit, right? And then you describe how that interacts with other ones; you sort of try to describe the envelope of these interactions. And then it breaks apart your simulation and tries to run it on as many cores as possible, on a GPU or CPU. It's kind of restricted: Simit won't let you, say, write a for loop and do individual things on each element. But as long as you can describe the dynamics of what you're building in Simit, it's insanely fast. It completely blows away MATLAB and R and all these other tools. Time for book of the show! Book of the show. My book of the show is a set of graphic novels, which
Starting point is 00:22:16 I found is the adult, like the old-man, way to say comic books. Mine's called The Manhattan Projects. It's pretty cool. The idea is, if everyone doesn't know, the Manhattan Project was the code name for the atomic bomb effort. So The Manhattan Projects is a set of comic books where you find out that the atomic bomb was just one of many Manhattan projects. And I'm not going to spoil too much, but, you know, a portal to an alien planet is another one. And there's a variety of these projects, and it kind of takes the whole military-secret thing, and scientists who are researchers but also military, dealing with
Starting point is 00:23:06 policy, you know, how you're building something so cool, but at the same time it's so dangerous; it kind of tackles that. It has some amazing quotes in it, really kind of funny and prophetic quotes. I read the whole thing; I think it's like 30 or 40 issues. It's really, really good, highly recommended. The art style is kind of interesting; it's pretty gritty. Yeah, definitely. I mean, that's actually one thing that we should probably mention: it's pretty violent. If you have kids in middle school, maybe this wouldn't be the right thing. Well, high school, probably fine. But yeah, it wouldn't be something you would give to your five-year-old son
Starting point is 00:23:46 or anything. Oh. Or daughter. But it's definitely for adults only. You know, it's got a lot of language, got a lot of violence, but it's really cool. It's a fantastic read. Yeah, I've never thought about that. I guess for our book recommendations, at least the fiction ones, we probably should have
Starting point is 00:24:04 some sort of rating. Kid-approved. Yeah, well, most of them have a rating. I'm trying to think back to all of my sci-fi recommendations and whether I've said anything... okay, it's all right, I can't worry about it. No, I mean, Amazon will tell you the rating right away. So my book of the week is Theft of Swords. This is interesting because I guess it's a volume; I don't know how they were originally released. I haven't looked into it, because I try not to read too much about a book
Starting point is 00:24:35 until I've finished it. But there have been two books within the book, and each one has been pretty self-contained. So I guess it's a volume, and there were kind of two stories in it. They were related, but you could have stopped halfway through the book and been happy it was a complete story. And I'm not quite done with it yet; I'm probably five percent away from the end of this book. But I'm still going to recommend it, unless I come back
Starting point is 00:25:02 next week or next month and say the ending was atrocious and totally not worth it. Other than that, I feel like it was a really good book. Cool. So this is a fantasy book about, oh, I always struggle not to give away too much, about two thieves who are kind
Starting point is 00:25:16 of working together and get caught up in... someone is trying to use them, but it turns out not to happen as that person wanted, and they get tied up in some political goings-on. It's kind of their story about being thieves in this world of political intrigue. It's kind of interesting. Cool.
Starting point is 00:25:35 Sounds awesome. So it's a lighthearted book, and I've been listening to it on Audible, and Audible is a sponsor of our show. So if you're interested in getting an Audible subscription, which I use to pass my drive to work, I've been a paying subscriber for, I don't even know how long now, several years. If you go to audibletrial.com/programmingthrowdown, you can get one month free, which is essentially one book. And after that, you know, there are the various plans, and if you don't like it, you can just cancel it and keep your book.
Starting point is 00:26:06 And that helps out the show. And I actually do recommend Audible. If you have a commute, people are always looking for ways to pass it, and I pass the time in my commute with it, and I definitely recommend Audible. I like it. Yeah, totally. I mean, the thing about these things: Audible doesn't have any special deal with us.
Starting point is 00:26:24 Like, we have: if you go to audibletrial.com/programmingthrowdown, you will help out the show; we have that. But you know, we're not shills. We only endorse things, books and services, that we actually use. So yeah, Audible is cool; check it out. So we also have Patreon. I promised everyone I would get t-shirts made so that you could order t-shirts. I'm totally slacking on that;
Starting point is 00:26:54 it's not that I'm not working on it; it's that I'm not happy with the logo, and I'm kind of a perfectionist. But once I have a good logo, we'll get some t-shirts. Are we going to get models for them? Because I feel like you and I modeling them will not sell them. Oh, good point.
Starting point is 00:27:11 Yeah, we need models. All right. Do you think we could get Michael Phelps to wear one of our t-shirts? That would be pretty awesome. The only thing is, he's got those brown spots from those suction-cup things. So, yeah. So we have Patreon. You go to patreon.com/programmingthrowdown,
Starting point is 00:27:31 and that is just like a donation; it keeps the show going. I was able to buy a microphone, so I no longer have a microphone that I got in a thrift shop, as someone accurately called me out on on Reddit. So it helps us get the equipment we need to keep the show going, and all the services and all that, so we appreciate that a lot. Tool of the show! My tool of the show is... so a few months
Starting point is 00:27:57 ago I recommended Byword, and I do like Byword, but it has this problem where it's OS X and iOS only, which is great until you have a machine like my home desktop, which is still running Windows. And Google Docs is pretty close to Byword, but my big criticism of Google Docs was that it's not Markdown, and so if I want to take something from Google Docs and post it on the web or do something else with it, it's not in a format that's very accessible. Well, someone wrote GDocs2MD, and I haven't tried it yet; I'm going to try it later on today.
Starting point is 00:28:40 But it will take any Google Doc, it's a little Chrome extension, and convert it to Markdown. And so, I've always been a big fan of Google Docs other than this one issue, and a lot of people are saying that this plugin works really well. So yeah, definitely,
Starting point is 00:29:00 if you have any notes, if you're trying to write a book or anything like that, or if you're taking notes in class or something, Google Docs is the way to go. It's truly universal. My tool of the show is not a game this time, which it normally is, but it's Scrapy, which is like "scrape" with a Y on the end; the PY at the end is for Python.
Starting point is 00:29:27 And this is a web crawler, or web scraper, for Python: a library to help you write Python programs that crawl the web. I'd seen this before and always said, oh, I want to learn that at some point. Again, into my giant heap of things to research it went; it switched from a queue to a heap at some point. But I ended up with a use case for this.
Starting point is 00:29:50 And I will caution that you should probably read the user license agreement, you know, respect whatever websites say about proper usage, and please don't take anything I say as legal advice. That was my disclaimer. But I had a use: I wanted to find out which realtor was selling the most property in my area. And, you know, you can go on the website, you can look in your zip code, you can see which realtors are selling houses. Selling houses, at least in the
Starting point is 00:30:26 United States, is something that needs to be disclosed for various reasons. And so you can see when houses are sold, that's kind of public knowledge, and actually how much they're sold for as well. It typically also lists the buyer and seller, as well as the buyer's and seller's agents that represented them. And I wanted to see who was selling a lot of houses in the area, because, you know, if I wanted to sell my house, I was just curious who would be someone good to contact. Yeah, totally, it makes sense. But this isn't information that they, for various reasons I can imagine, want to share, or they don't have it available. Yeah, they're afraid of like a winner-take-all thing, right?
Starting point is 00:31:06 Right. Yeah. And so it's available; you could do it by hand. You could enter your zip code or, you know, search by area, look at every one of the houses, keep notes of who was selling houses and for how much, and you could build this data up yourself. But that's what a web crawler does. So with a web crawler, you give it a web page and you tell it kind of what data to look for on that web page,
Starting point is 00:31:31 one example of which would be a bunch of links to houses that had sold in the area. And then it'll go crawl those pages, pull them down as well, and present to your program the code for each website, and then you can collect whatever information you want out of it. And so Scrapy is actually a very powerful tool, and I only needed the basic tutorial
Starting point is 00:31:53 to even get this working, so I'm sure I didn't even scratch the surface, but it supports all of these activities. So it knows how to go to a website. It knows how to get the data and pull it in. It uses XPath or CSS selectors for you to be able to look for specific tags in the website, to find a URL or to find parts of the DOM. And then you can parse those, do something with them, or say, oh hey, go scrape these other pages. And your code gets a very clean slice of the data. And then it has a way to output all that data to, for instance, JSON. And so what I did was, or a
Starting point is 00:32:31 friend did, I guess, I saw a friend who did this, put it all to a JSON file. And then once you had it in a JSON file, you had all the data you needed. You just needed to do it once. And then I could kind of slice and dice the data however my friend wanted. It could be sliced and diced and you could sort, you could see who sold total dollar value the most houses, what was the median price of houses they sold, all this kind of information. It was really powerful.
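Once the scrape is sitting in a JSON file, the slice-and-dice step needs nothing beyond the standard library. Here is a sketch with invented field names and numbers (the real file and schema aren't described on air):

```python
import json
from collections import defaultdict
from statistics import median

# Hypothetical records like the ones the crawl produced; the field
# names and values here are made up for illustration.
records_json = json.dumps([
    {"agent": "A. Smith", "price": 300000},
    {"agent": "A. Smith", "price": 350000},
    {"agent": "B. Jones", "price": 500000},
])

# Group sale prices by the listing agent.
by_agent = defaultdict(list)
for sale in json.loads(records_json):
    by_agent[sale["agent"]].append(sale["price"])

# Total dollar value and median sale price per agent.
totals = {agent: sum(prices) for agent, prices in by_agent.items()}
medians = {agent: median(prices) for agent, prices in by_agent.items()}

print(totals)   # {'A. Smith': 650000, 'B. Jones': 500000}
print(medians)  # {'A. Smith': 325000.0, 'B. Jones': 500000}
```

With pandas installed, `pd.read_json` plus a `groupby("agent")` gets you the same aggregates in a couple of lines, which is presumably what the pandas step mentioned here looked like.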
Starting point is 00:32:58 And it was actually really, really quick to get there. It didn't take very long for my friend to code this up because this tool is really good. So Scrapey, if you ever need to do something like that, very powerful. It has all sorts of advanced functionality in there about like creating a pipeline of data to go through and how to create objects that get improved as you're moving down the pipeline. And it does, for instance, often the same link may appear more than once in the web page. And it knows how to not kind of scrape that web page twice. So it'll do de-duping.
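Scrapy packages all of that up, but the crawl-and-de-dupe loop itself is simple enough to sketch by hand. This toy version walks an invented in-memory "site" instead of doing real HTTP fetches, and uses regexes where Scrapy would give you proper CSS/XPath selectors:

```python
import re

# Toy in-memory "website": page name -> HTML. Stands in for real HTTP
# fetches; all of the pages and markup are invented for illustration.
PAGES = {
    "index": '<a href="house1">h1</a> <a href="house2">h2</a> <a href="house1">h1</a>',
    "house1": '<p class="agent">A. Smith</p>',
    "house2": '<p class="agent">B. Jones</p>',
}

def crawl(start):
    seen, queue, agents = set(), [start], []
    while queue:
        page = queue.pop()
        if page in seen:          # de-duping: never fetch a page twice
            continue
        seen.add(page)
        html = PAGES[page]
        # "Selectors": a real crawler would use XPath/CSS; a regex
        # keeps this sketch dependency-free.
        queue.extend(re.findall(r'href="([^"]+)"', html))
        agents.extend(re.findall(r'class="agent">([^<]+)<', html))
    return sorted(agents)

print(crawl("index"))  # ['A. Smith', 'B. Jones']
```

Note that `house1` is linked twice from the index but only fetched once, which is the de-duping behavior described above.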
Starting point is 00:33:29 Oh, yeah. No clobber. Yeah. So I think you just gave the best endorsement ever for anyone who is in high school or college and trying to pick their major. I mean, obviously all the majors are great, but if you're in CS, look how cool your life is. You just automatically say, oh, I think I'm going to write a program to scrape the web and find the most competent realtor, and you have incredible authority and power at your fingertips. It's just amazing.
Starting point is 00:34:09 And the thing that's interesting as well is, once I wrote it, it was very easy to say, oh, what about for neighboring zip codes? What about for another state? And then you can start to say, oh, interesting, I wonder, if I had a list of zip codes, which zip codes have the highest-selling houses, or whatever. Yeah. I mean, if you wanted to buy in a stable neighborhood, you could literally see the trends of each neighborhood. And then, I mean, and I didn't get this far, but
Starting point is 00:34:33 you start thinking, and someone was like, oh, this is a great idea, you should publish that data somewhere. And I think for my friend to do that would definitely be against whatever terms of service cover the underlying data. So do not do that. For personal use, you know, you could probably have just done it by hand, like I said, so it doesn't feel that much different, but I'm not a lawyer. Like, your friend did nothing wrong. But yeah, if you publish it, and especially if you try to make money off of it, that's where it becomes problematic. I think, for instance, in this one it was the last five or six months of data that was available, but you could imagine running this
Starting point is 00:35:12 every month for the last month, and building up trends over time that would prove even more useful to you as the amount of data grew, like what month is the best to sell in. There are general trends like that, but you want it for a very specific area, a very specific tailored need. And so, to Jason's point, this is cool being in CS, but Scrapey seems really powerful. I think I'll probably come up with more uses in the future
Starting point is 00:35:35 and just wanted to give a shout out to that project. Cool. Yeah, I will definitely. I usually use Wget with dash R, but Scrapey sounds way better, so I will switch. Yeah, so, I mean, that's an interesting point. You could write all of what I just said by hand. It wouldn't be that hard, and you could cobble together a few libraries to even help you. But the awesome thing was twofold. One is that they already did all of that for me, so
Starting point is 00:36:01 the plumbing was all there. I banged this out in a total of maybe three hours, from having the idea to using pandas to actually slice the data in the way I wanted, and that was because I'd never used it before and I'm not a real big Python programmer. So that was really powerful. And they have tutorials devoted to this. If you're doing wget on your own, you're kind of on your own a little, but with Scrapey, I mean, this is a community of people devoted to doing these kinds of things, and so it's really easy to find answers. Very cool, very cool. All right, so our show topic is optimization, and there's obviously many components to that.
Starting point is 00:36:45 So we'll try to start as close to the metal as possible and then move all the way up the stack. So I guess the lowest level would be, you know, the instruction set and the architecture you're working with. Yeah, so I think, you know,
Starting point is 00:37:03 I'm more comfortable low level, so I think I'll probably talk a lot here in the beginning, and then we'll probably shift over to Jason as we move up the stack. All right. But exactly right. So, the instructions. You kind of framed it a little, but optimization is a really broad topic. We're going to try to give some buzzwords and keywords for you to look at, and some things to think about if you ever encounter needing to make your program or your software stack run faster. What are the kinds of things you should be thinking about?
Starting point is 00:37:33 And at every level, there's some amount of work that can be done. And different levels are appropriate for different kinds of projects, different parts of the life cycle of a project. And so we're going to cruise through this at a pretty high speed. But the first thing is the processor you're using,
Starting point is 00:37:51 the architecture, whether it's a microprocessor or an x86 processor or an ARM processor; the instructions available to you are going to be different. And if you have the ability to choose what kind of architecture you're running on, you may be able to get certain advantages. One of the things that's interesting, for instance, is the difference between a desktop Intel processor and a server Intel processor. If you look at them at a high level, the clock speed might be the same, the number of cores might be the same,
Starting point is 00:38:20 but the instruction sets available in the two can be vastly different, and the number of simultaneous floating-point operations supported can be very different between a home and a server part, and that explains a lot of the price difference between those two. And understanding which one you have and which one you're running on will help you optimize your program for the right thing. Yep. Yeah, when you look at megahertz, you're only looking at the speed at which your processor does one thing, but a processor could have half the megahertz and twice the bandwidth, and then, assuming you can parallelize your computation, they're equivalent, even though one looks a lot higher superficially. So I saw a great example of this. Someone said the
Starting point is 00:39:10 newest Intel processors, like high-end Intel processors, were pretty expensive. And so they actually bought Xeons, which are the server-class Intel processors, and did a comparison. And the Xeon processors crushed the newest Intel processor for the consumer market, kind of in the same price range. But the difference is, if you played a video game, the Xeon processors were actually pretty terrible, because they were all about doing lots of little things at once, which is what you would do in a server,
Starting point is 00:39:38 serving up many, many page requests. But in a game, you're only wanting to do one or two things as fast as possible. Yeah, and so that was an interesting trade-off. The other thing, which we've talked about several times so we'll skim over it, is GPUs, so getting into heterogeneous processing and whether you can use a GPU to offload work and optimize your program's runtime by running it there. And then, once you have a CPU or GPU architecture selected, you can start to talk about the compiler, which is producing code for a specific architecture, and there's all sorts of optimization settings in your compiler,
Starting point is 00:40:19 so you can optimize for size versus speed. And one of the interesting things, and we'll get to this in a second, is that you might say, well, why wouldn't I just do speed? Most of the time, we're just not really concerned about the size of our program. But if you look at how a processor works, it has to keep parts of your program in memory. And so if you have a certain portion of your code which is used over and over again, and the compiler tries to optimize it to run through those instructions as fast as possible, it might have a code size which is 25%
Starting point is 00:40:57 bigger than if it optimizes for size. And if the smaller size gets you under the limit of one of your caches, and you can keep your program loaded in cache instead of in main memory, then you actually might be faster by having it optimized for size than for speed, which is really kind of confusing. Wow, I had no idea that that could happen. Yeah, so we've had this happen a couple of times, and it's really kind of hard to predict, and it depends on exactly what processor you're running on.
Starting point is 00:41:27 But you kind of just got to try both. With optimizations on, the compiler will try to do a variety of tricks for you, like keeping your variables in registers instead of in memory. It can do things that are more complicated, like loop unrolling. So think about a for loop whose trip count is some constant, like you need to go through this loop two times or three times. If you write it in assembly as a loop, then you have a branch: at the top of the for loop, you have to first
Starting point is 00:42:03 evaluate some condition. If the condition is met, you need to execute the instructions below. Otherwise, you need to branch past the end of the for loop into the other code. But doing that comparison and branching are two things that cost the processor time: the branching because of pipelining, which we won't get into now, and the comparison because that's an instruction it has to perform. If you are always going to execute that loop three times, instead of having to evaluate whether the counter i is less than three or not, you could just run the code inside the loop three consecutive times, with an increasing counter value in between. And then you save the overhead of running the for-loop processing part. And so that's something the compiler can do to try to
Starting point is 00:42:51 optimize your code. And sometimes the compiler doesn't know that can be done, and you need to do it by hand. That's pretty extreme; I wouldn't recommend starting off there as an optimization step. But sometimes you may gain benefit by hand-unrolling your loops. Yeah, I saw some article a while back, and I feel like this is a very contrived example, but it said that in some cases Java was faster than C++ because it has this just-in-time compiler. And so at runtime, it could know that this for loop is only going to be executed 50 times, even though the 50 was a variable; by the time it got to the for loop, it knew that it was only going to be 50,
Starting point is 00:43:33 guaranteed, and so it unrolled the loop, which is something C++ couldn't do. I don't know how much truth there is to that today, and this is also an old article, but that does happen. And actually, I've met people who kind of didn't know this and it was shocking to them: the processor itself also will learn, in what's called branch prediction. Yeah. So if you have an if statement,
Starting point is 00:43:55 for instance: if the code has thrown an exception, do error handling; otherwise, continue on. The processor itself, as it's running your code, will learn that that branch is never taken. So it always could be taken, but it will optimistically assume that it won't be
Starting point is 00:44:18 taken. And then it'll start loading the pipeline with the code afterwards, instead of the code inside the exception. And so as you go through that loop more and more times, you'll get better speed. And then if you ever do throw an exception, it'll be slightly slower because the code, it wasn't anticipating that was going to happen. Yeah. And ironically, the way the processor does that is by estimating the binomial distribution, which is the first thing we talked about in the episode. So it goes backwards. And it says, based on how many times you did or didn't take this branch, I'm going to sort of reverse engineer the confidence bounds. And then when
Starting point is 00:44:59 it gets to a certain amount, then it swaps over the prediction. And this also has to do with what we were talking about before, keeping your code as compact as possible, especially when there's many, many iterations you need to do. So for instance, think about if you needed to do image processing and do some operation for every pixel in the image. If you keep that code very small, then the processor (not the compiler) has a better chance of learning about those branches. Yeah. And so then, I mentioned before, cache levels. People have heard things when processors are spec'd out, like L1, L2, L3 cache sizes, and then you have RAM size, and then you have disk.
Starting point is 00:45:53 So because of the speed of light, the amount of distance you need to travel actually becomes fairly significant if you have to go all the way out from your processor to main memory. So no matter how fast your RAM ever gets, there will still be a limit to how fast data can come into the processor, just because of how far it has to travel through the wires. Is it the speed of light, or is it the speed of the signal in the wire? Well, the speed of light sets a theoretical bound. Oh, okay, got it. Right, so even if you use fiber optics or whatever, if you're like a meter away from the processor, which is an extreme, then there's a set latency from when you decide you need something to the fastest you could get it from something
Starting point is 00:46:29 that far away. And when you're talking gigahertz, that actually matters, which was crazy when I first learned it. But leaving that aside, that's not the main problem. The biggest issue is that it is very expensive to have very, very fast memory and to put memory on the processor itself. So the closer you are to the main bits of the processor that are executing your code, the smaller the memory has to be, because there's a lot of things that need to be connected on the physical processor and there just isn't a lot of space. So then you have to go through extra sets of interfaces, out to other parts of the chip, to get to bigger levels of cache. And so the processor always wants to look in the closest cache, because that cache is fastest. If it's not
Starting point is 00:47:18 there, you have what's called a cache miss, and then you've got to go out to the next cache. Cache miss, go out to the next. Cache miss, go out to main memory. If it's not in main memory, it has to go all the way out to hard disk, which is terrible; it means it's going to take forever in processor terms. Yeah, I think each layer is like an order of magnitude, right? Or orders of magnitude, yeah. Oh, wow. So it's really bad, right? And so when we talk about optimizing, there's a lot of things you can do once you understand that that's how the processor works.
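You can get a feel for why access order matters with a toy cache model. Real caches add associativity, prefetching, and multiple levels, so treat this purely as an illustration of the hit/miss bookkeeping:

```python
from collections import OrderedDict

def misses(addresses, cache_lines=4, line_size=8):
    """Count misses in a toy LRU cache: `cache_lines` slots, each
    slot holding `line_size` consecutive addresses."""
    cache, count = OrderedDict(), 0
    for a in addresses:
        line = a // line_size              # neighbors share a cache line
        if line in cache:
            cache.move_to_end(line)        # refresh LRU position
        else:
            count += 1
            cache[line] = True
            if len(cache) > cache_lines:
                cache.popitem(last=False)  # evict least recently used
    return count

# Same 16x16 grid of "pixels", visited in two different orders.
rows = cols = 16
row_major = [r * cols + c for r in range(rows) for c in range(cols)]
col_major = [r * cols + c for c in range(cols) for r in range(rows)]

print(misses(row_major))  # 32: sequential access reuses each line
print(misses(col_major))  # 256: striding misses on every access
```

Walking the grid in row order touches each cache line eight times in a row; walking it in column order only comes back to a line after the model has already evicted it.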
Starting point is 00:47:43 The working set is one way to describe it: the amount of data that you're operating on. If you need to do lots of operations, doing them on a small set of data before moving to the next set of data is going to be faster. So think about it like, I have a sequence of operations I need to do on an image: first I want to Gaussian-blur all the pixels, then I want to run an edge detector, and you kind of stack up like five things. And if they were not dependent on each other, hypothetically for now, taking a single pixel through all of those steps is in some ways going to be faster than doing all of the pixels through the first step, then all of the pixels
Starting point is 00:48:26 through the second step, then all of the pixels through the third step. Because the size of your image is going to be really, really big, and if you start at the first pixel and you go through every single pixel, by the time you get to the last pixel, the first pixel isn't in memory anymore, or it's not in the closest cache anymore. Right, and a lot of optimization for memory access is built around that. And so if you look at how a matrix multiply is done, once you get to really big matrices, the naive way they teach you to multiply matrices by hand doesn't work that well in a computer setting, mostly because of how memory access patterns work. Memory comes in chunks that are clumped together; you get close-by locations for cheap, and then you operate in
Starting point is 00:49:11 a smaller working set. And so faster matrix multiply operations take advantage of that. Yeah, that makes sense. Yeah, with a lot of this stuff, the hope is that, you know, 1% of your code is 99% of the CPU usage. And so you don't have to super-optimize everything. But something like, for example, you said image processing: the part that manipulates the images at a low level has to be extremely fast.
Starting point is 00:49:46 That's right. And so there's a couple of things you point out there. First of all is making sure you identify the code that's worth spending this time to optimize. Yeah, we should do a pitch for an earlier episode: we had an episode on profiling, you know, finding out what parts of your code are taking the most time. And you should definitely check that out if you haven't already. So that's a great first thing you should always do.
Starting point is 00:50:03 Because once you start getting into these optimization techniques that I've talked about, your code often does become harder to read, because it's doing something slightly non-obvious, and so you don't want to go there first unless you have to. The second thing is using tools other people have written. So you were talking about linear algebra libraries, I think last episode or two episodes ago. Yeah, right. And so, using a linear algebra library, if you're going to do a matrix multiply, what you're hopefully getting is someone else who spent a ton of time figuring out the absolute fastest way to multiply two large matrices, because if you do it naively, you're probably not going to be as fast as them. Yeah, it'll make a huge difference.
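For a sense of what those libraries do under the hood, here's a naive triple loop next to a tiled (blocked) version that works on small sub-blocks so the working set stays cache-sized. In pure Python the payoff is invisible, so in practice you'd call numpy/BLAS rather than write this; it's only a sketch of the technique:

```python
def matmul_naive(A, B):
    # The by-hand method: row of A dot column of B.
    n = len(A)
    return [[sum(A[i][k] * B[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def matmul_blocked(A, B, bs=2):
    # Tiled multiply: iterate over bs x bs sub-blocks so each block of
    # A, B, and C is reused while it would still be sitting in cache.
    n = len(A)
    C = [[0] * n for _ in range(n)]
    for i0 in range(0, n, bs):
        for k0 in range(0, n, bs):
            for j0 in range(0, n, bs):
                for i in range(i0, min(i0 + bs, n)):
                    for k in range(k0, min(k0 + bs, n)):
                        aik = A[i][k]
                        for j in range(j0, min(j0 + bs, n)):
                            C[i][j] += aik * B[k][j]
    return C

A = [[1, 2], [3, 4]]
B = [[5, 6], [7, 8]]
print(matmul_naive(A, B))                        # [[19, 22], [43, 50]]
print(matmul_blocked(A, B) == matmul_naive(A, B))  # True
```

The block size would be tuned to the cache sizes of the actual machine, which is exactly the kind of work a tuned linear algebra library has already done for you.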
Starting point is 00:50:46 And so, identifying blocks of code that are common and trying to find other people who have good implementations of those. Yeah, totally. So let's say now you have a lot of data, and it's not fitting in memory or it's just super slow to access, and it's not a matter of speeding things up by unrolling loops or things like that; the reality is you just have a ton of data and you need to pull out a needle from a haystack. So this is where an index, or a reverse index, comes in. So, for example, a database: if you use SQLite or MySQL or one of these databases, you can store, you know, a billion entries, and then you say, you
Starting point is 00:51:33 know, I need Patrick Wheeler, and it just gives it to you instantly, right? But if you wrote a for loop, it would take forever. So what are they doing? Under the hood, anything that returns results in sublinear time, which means faster than if you wrote a for loop and had everything in memory, is using one of two things. No matter what database you use, it always boils down to one of two things, and that's either a tree or a hash. A tree just means you're dividing the data based on some criterion, and the criterion is set up in a way where, if you're on one side of it, then you know your answer isn't on the other side. So for example, you could make a tree just based on the letters of someone's name. So the tree could say, is the first letter of someone's name
Starting point is 00:52:31 less than or equal to M? And if it is, you go one direction. If it's not, you go a different direction. So in this case, if I'm looking for Patrick, I say, okay, the first letter is a P, so I know that anything on the left side of the tree can't be Patrick, and I just focus on the right side of the tree. And this is just assuming a binary tree. And so in this way, you've effectively searched all of the entries, but by making one decision, you've eliminated half of them; you just know that it's not any of those. And so there's B-trees,
Starting point is 00:53:11 there's binary trees. Those are different. There's red-black trees. Yeah. There's a bunch of different trees. There's R-trees, R-star trees. There's vantage point trees. There's a ton of different trees. And they're all designed to eliminate options as quickly as possible, which should mean instead of searching a million entries, you're searching five, right? Five, plus making another six decisions. So you've gone from a million things you have to do to 11. And so this is how everything works when it comes to these kinds of searches.
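The eliminate-half-per-decision idea doesn't need a full database. A sorted list plus binary search behaves like a balanced tree; a minimal sketch:

```python
import bisect

# A sorted list plays the role of a balanced tree: each comparison
# throws away half of the remaining entries.
names = sorted(["Alice", "Bob", "Jason", "Patrick", "Zoe"])

def contains(sorted_names, target):
    # bisect does O(log n) comparisons instead of scanning everything.
    i = bisect.bisect_left(sorted_names, target)
    return i < len(sorted_names) and sorted_names[i] == target

print(contains(names, "Patrick"))  # True
print(contains(names, "Quentin"))  # False
```

With a million sorted entries this is about 20 comparisons per lookup instead of a million, which is the same logarithmic win the tree structures above are built for.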
Starting point is 00:53:53 Like if you're on Google Maps and you put in an address, you get it back right away. It's because there's some kind of tree underneath. Another way to do it is hashing, which is really just kind of a form of clustering. So for example, I might say, okay, let's take everyone whose name starts with a P and put them in one file, or in one spot in memory. And everyone whose name starts with a J, I put in a different spot in memory. And so if someone's searching for Patrick,
Starting point is 00:54:26 I know to look at the P file, and if someone's searching for Jason, I look at the J file. So it's not a tree, it's really just these buckets, but it does kind of the same thing. And again, there's a billion different ways to do hashing and clustering. There's actually some really interesting, relatively recent work with neural networks and embeddings and clusterings; they're inventing new ways to do this all the time. But at the end of the day, it breaks down to one of those two things. So if you've ever tried to make your own Minecraft or done anything like that, you realize, how am I storing all this in memory? It's impossible.
Starting point is 00:55:07 And how did they do it? Well, they're almost certainly using either a tree or a hash. Yeah, and I think in the context of optimization, what you should be asking yourself is, if your code is running slow because you're trying to find something in a whole bunch of data, can you apply a tree or use hashing to eliminate
Starting point is 00:55:27 the bulk of the data you need to search through more quickly. Yeah, exactly. And that then gets into doing deferred computation and things like that. Like in the case of Minecraft, if you're not looking at something, then they can avoid processing those chunks until someone's looking at them. And so there's a whole bunch of sort of dead-reckoning-type processes that go on. Well, for other buzzwords, you have binary space partitioning
Starting point is 00:55:58 or things like that, which are also kind of a form of tree, where you're dividing up a world into parts so that you can say, if someone's looking forward, how do I eliminate everything that's behind them, because I know they can't be looking at it right now. Yep, yeah, exactly. Binary space partitioning is literally a binary tree where every decision is a hyperplane. And so if you have a certain area, you ask, does that area cross the hyperplane? If it does, then I have to look in both directions. But if it
Starting point is 00:56:32 doesn't cross the hyperplane, if the area you're looking at, like your view frustum, is only on one side of the hyperplane, then you can skip everything on the other side. So, yeah, there's indexing. Another thing is, going beyond just running on one machine, there's distributed computing, and actually there is some single-machine distributed computing, like multi-threading, multi-processing, thread pools, things like that. So for example, let's say you have a huge list of strings and you want to capitalize all of them. I mean, this is a very trivial example, but you could write a for loop and go from one to a billion, you know, s equals s dot uppercase, right? But that's pretty slow. You could also say, let me, you know,
Starting point is 00:57:29 take my data and divide it into two batches, you know, from zero to half a billion, and from half a billion to a billion, and have two threads working on them. That's okay, but that has another issue, which is if for some crazy reason the first 500 million strings are very tiny and the next 500 million are huge, then the first thread will finish much quicker and you'll just be waiting on the second thread forever. So you might say, okay, I'm going to make a separate thread for every entry. I'm going to make a billion threads. And you run that command and you have to restart your computer because it's just caught on fire, right? So thread pools are a way to handle that.
Starting point is 00:58:13 So you can give a thread pool, you know, a billion things to compute. And it will, you know, figure out the optimal number of threads, depending on how sophisticated it is, but some thread pools will figure out the optimal number of threads. Like if you have 16 cores and hyper-threading, they'll say, okay, you have 32 actual threads, and it'll assign all 32 threads to 32 strings, and as they finish, it will assign more and more and more.
Starting point is 00:58:44 So at any given time, you have 32 of these being processed, and you have a queue that could be huge. Now, eventually, when the queue is exhausted, it'll tell you we're all done. So that's a clear way to get some good benefit. I mean, there's obviously more complexity there, so you should only do it when you're working on code that takes a long time to execute. Another part beyond threading, kind of taking it to the next level, would be networking. So with threading, you're still limited by what your one computer can do.
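Python's standard library has this pattern built in as `concurrent.futures.ThreadPoolExecutor`; a minimal version of the capitalize-a-big-list example:

```python
from concurrent.futures import ThreadPoolExecutor

words = ["optimize", "profile", "measure", "repeat"] * 1000

# A fixed pool of workers pulls items off a shared queue as each one
# finishes, so no thread sits idle while another is overloaded, and
# you never try to spawn a billion threads at once.
with ThreadPoolExecutor(max_workers=8) as pool:
    upper = list(pool.map(str.upper, words))  # results stay in order

print(upper[0], len(upper))  # OPTIMIZE 4000
```

One caveat not mentioned on the show: because of CPython's global interpreter lock, pure-Python string work like this won't actually run in parallel; the pattern pays off for I/O-bound tasks, or with a process pool or native code for CPU-bound ones.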
Starting point is 00:59:21 But you might say, oh, I want to use, you know, a thousand computers. I want them all to be serving some website. And so once you get into networking, then you get more into load balancing, which is similar to the way a thread pool works; it's just kind of taking it to the networking level. And it requires a little bit more statistics, because there's more uncertainty. But, you know, load balancing is effectively saying, okay, I have 100 machines. Let me make sure all the machines are equally loaded. So if one machine is serving one request, but that request is really expensive for some reason, then they'll divide all the future requests among the other machines.
Starting point is 01:00:09 And load balancing is similar to thread pools, except you're constantly having to sort of poll all the machines and find out if they're healthy or not. There are also other kinds of networking tricks for optimization. There's reliable UDP. Typically, when you do things over the internet, you use TCP, and the TCP protocol guarantees that I get every packet and I get it in order. So if I want to send Patrick an email, I want him to get every letter in that email, and I want him to get them in order; I don't want him to get this weird jumble of letters. And so TCP seems
Starting point is 01:00:46 very reasonable for that. But you may not need the in-order part. There are often times where you're just sending a bunch of data to someone. Let's say I'm sending a file to Patrick. He doesn't need to get all of the bytes in order; he just needs to get all the bytes and put them in the right place as he gets them. And so that's an example where I could use reliable UDP and other kinds of protocols to get a huge speedup. So when I think about networking, the question is going to be: when you have computers put together and you start to have an issue, what do you start looking at? And when you get to the networking level, you really have to be careful about collecting a lot of really good statistics so that you know there is a problem, because often problems go undetected for a long time.
Starting point is 01:01:39 I'm pretty tongue-tied tonight. And so for networking, one of the things, like Jason was saying: a request comes in and it takes a long time, and you need to load balance. So what you would be looking for is, what parameters should I be watching to see if I need to add a load balancer, or change the algorithm my load balancer is using, or add new computers? And so you would want to, for instance, look at: how long do the longest one percent of queries take, or do 99% of queries take longer than this or that? Or, you know, basically, what is the histogram of response times? And you said some... oops, sorry, go ahead.
Starting point is 01:02:15 Oh, I should say, if you ever hear the terms P50 and P99, it's what Patrick's referring to. So P50 is the median latency for something, and P99 is, you know, the number that 99% of your requests have a latency less than. Right. So when someone specifies how good a response time they want, they often mean two or three things: they want, you know, 99% of requests to complete within 10 milliseconds, but 100% of requests, or nearly 100%, to complete within one second. So I know that occasionally something will happen and I can't complete within 10 milliseconds, or whatever I said, but I never want a request to take too, too long.
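Computing those percentiles from raw latency samples takes only a few lines; this uses the nearest-rank convention, which is one of several ways to define a percentile:

```python
def percentile(samples, p):
    """Nearest-rank percentile: the value that p percent of the
    samples fall at or below."""
    ordered = sorted(samples)
    k = max(0, int(round(p / 100 * len(ordered))) - 1)
    return ordered[k]

# Hypothetical request latencies in milliseconds: mostly fast,
# with a few slow outliers.
latencies = [5] * 97 + [40, 80, 900]

print(percentile(latencies, 50))   # 5   (P50, the median)
print(percentile(latencies, 99))   # 80  (P99)
print(percentile(latencies, 100))  # 900 (worst case)
```

Notice how the P50 completely hides the 900 ms outlier, which is exactly why people state both a P99-style target and a near-100% bound.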
Starting point is 01:03:00 And so then that helps define how you would do load balancing, what architecture for your network you need. What do you do if, you know, 500 milliseconds has gone by? Should you abort the request and start it again in your remaining time? Like, how do you deliver on what you're trying to guarantee? Yeah, and it gets really interesting once networking is involved. There was a case in a place I worked where it actually made sense for the load balancer to send two requests every time. So in other words, imagine that you need some data, and you literally ask the same system for the same piece of data twice, immediately, one after another, and then you pick the one that returns first. And the idea is, you know, you have so many resources, but you can't
Starting point is 01:03:53 control the latency. And so the only way you can do it is by making multiple calls at exactly the same time with exactly the same data and just picking the one that returns first. But this is an interesting optimization. Before, you were talking about hashing. And so I guess when you go to distribute a computation, the same optimization technique kind of becomes called partitioning, where you say requests for certain usernames go to this computer, other usernames go to this other computer, right? And you've partitioned the requests. But if your partitioning scheme is complex
Starting point is 01:04:32 or for various reasons, you may want the load balancer or the thing sitting out in front to issue requests to some or all of the computers behind it. And the first one who has an answer replies back. And then if the others don't ever end up or don't have the data necessary to respond, they just never respond.
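A minimal sketch of that "ask everyone, keep the first answer" idea, using Python's asyncio. The replica names are hypothetical and the network call is simulated with a sleep; a real system would issue RPCs and would also need to handle the case where no replica answers at all.

```python
import asyncio
import random

async def fetch_from(replica, key):
    # Stand-in for a real network call; latency is simulated.
    await asyncio.sleep(random.uniform(0.01, 0.05))
    return f"{key}@{replica}"

async def hedged_get(key, replicas):
    """Send the same request to every replica and keep the first answer."""
    tasks = [asyncio.create_task(fetch_from(r, key)) for r in replicas]
    done, pending = await asyncio.wait(
        tasks, return_when=asyncio.FIRST_COMPLETED)
    for task in pending:
        task.cancel()  # the slower copies are abandoned
    return done.pop().result()

result = asyncio.run(hedged_get("user42", ["replica-a", "replica-b"]))
```

The trade-off, as discussed above, is that you burn extra resources (every request is issued twice) in exchange for tighter tail latency.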
Starting point is 01:04:52 Yeah, I mean, you can do all sorts of interesting things. I mean, the machine behind the load balancer could know that it's taking too long and give up on the assumption that the load balancer will pick someone else. And yeah, it gets pretty kind of math heavy, which again is a good reason to use off-the-shelf components for this. Don't try and write your own load balancer. But I think it's also a really good reason to,
Starting point is 01:05:16 when you start designing a system, understand what your goals are, and also try to make sure to collect really good data so that you can analyze problems when they come up. Yeah, definitely. I mean, we've talked about this in the past, but we can't stress enough how important it is to collect just a ton of data. You don't want to be in a position where you have some problem and you have to start collecting data to fix it. That's bad news. The last piece of this, the top of the stack, is just algorithmic improvements. So if you haven't heard of this,
Starting point is 01:05:56 there's this thing called big O notation. It's just a fancy way of describing how a particular algorithm's computational time grows. So in other words, if something is n squared, then as the number of elements increases, the time it takes to get the answer you want grows with the square of that number (quadratically, not exponentially, but much faster than linearly). If something's order n, then that means it scales linearly. So, for example, if you want to loop through a list of strings and count the number of letters in each string, and you assume that the strings are pretty small, you might say that, oh, you know, the length of the string doesn't really matter. This is order n in terms of the number of
Starting point is 01:06:52 strings: as I add more strings that I have to count the letters of, the time increases linearly. If you're using, as you said before, trees and hashing and things like that, you can actually do sublinear computations. So if you want to find, for example, the 10 real estate agents closest to you, you can put all of your real estate agents into an R-tree or an R* tree. Now, putting them into the R* tree, that might take N squared, and that might take two days or something, right? Or N log N, I think, is what it takes.
Starting point is 01:07:31 But that might take a long time. But once you have that R* tree built, then you can look up the K nearest neighbors, the nearest realtors, in sublinear time. So even though there are 100 realtors, you only have to look at a few of them to know which two are the closest. And so, you know, whenever you start dealing with big values for N, when you're looking at a billion websites or a million strings or what have you, that's where this big O notation becomes really important, because it will tell you how the cost grows in the best case and the worst case.
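To make the "pay once to build, then query sublinearly" trade concrete, here's a one-dimensional stand-in for the R* tree idea, not an actual spatial index: sort the positions once (O(n log n)), then each nearest-neighbor query is a binary search (O(log n)) instead of a scan of all n realtors. The positions are made-up example data.

```python
import bisect

# Build step: pay O(n log n) once.
realtor_positions = sorted([12.0, 3.5, 27.0, 8.2, 19.9])

def nearest(positions, query):
    """Binary-search a sorted list for the value closest to query."""
    i = bisect.bisect_left(positions, query)
    # The nearest value is either just below or just at/above the
    # insertion point, so we only examine (at most) two candidates.
    candidates = positions[max(0, i - 1): i + 1]
    return min(candidates, key=lambda p: abs(p - query))

closest = nearest(realtor_positions, 10.0)  # 8.2 is nearer than 12.0
```

Real spatial indexes like R-trees generalize this to two or more dimensions, but the shape of the bargain is the same: an up-front build cost buys you cheap queries afterward.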
Starting point is 01:08:12 Strictly speaking, big O is an upper bound, so it's usually used to describe the worst case. There's also big Omega, which is a lower bound and can describe the best case, and big Theta, which is a tight bound. Amortized cost is a separate idea: the average cost per operation over a long sequence of operations. So, for example, let's say you're building this tree, and most of the time an insert takes logarithmic time, but every 10,000 inserts you have to do some big rebalancing that's expensive. That's where you would say the insert is amortized O(log N), because you might occasionally have to pay a big hit, and you should kind of design your system around that, but it's rare. So yeah, all of these different notations
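The classic illustration of amortized cost, simpler than a rebalancing tree but with the same shape of argument, is a growable array that doubles its capacity when full. Most appends are cheap; the occasional regrow copies everything, but spread over all appends the cost per append stays constant. This toy class just instruments the copying so you can see that.

```python
class GrowableArray:
    """Toy dynamic array: doubles capacity when full.

    Appends are usually O(1); a regrow is O(n), but the amortized
    cost per append is still O(1), the same style of argument as
    the rare expensive rebalance in a tree.
    """
    def __init__(self):
        self.capacity = 1
        self.size = 0
        self.slots = [None]
        self.copies = 0  # elements moved during regrows

    def append(self, value):
        if self.size == self.capacity:
            self.capacity *= 2                 # rare, expensive step
            new_slots = [None] * self.capacity
            new_slots[:self.size] = self.slots
            self.slots = new_slots
            self.copies += self.size
        self.slots[self.size] = value          # common, cheap step
        self.size += 1

arr = GrowableArray()
for i in range(1000):
    arr.append(i)
# Total elements ever copied stays under 2 per append on average.
```

After 1,000 appends the regrows have copied 1 + 2 + 4 + ... + 512 = 1,023 elements in total, fewer than two copies per append even though individual regrows were expensive.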
Starting point is 01:08:54 and algorithmic improvements, that's sort of a last-ditch effort. Like, if you have gone through all of these things and you realize you actually need a brand-new algorithm and need to do some research, then that's another opportunity. I mean, I don't want to be particularly contradictory, but about what you said at the end: if you need to invent some new algorithm, yes, that's probably a last-ditch effort. But I think actually reducing algorithmic complexity
Starting point is 01:09:22 is often your first step you should approach, which is... Oh, that's true. If I'm storing all of my... If I'm searching for something and I'm just storing it in an array and binary search doesn't work because my array is not sorted, then you should say, oh, maybe I should sort my array and use binary search. Or maybe I should use a tree. Or maybe a hash map would be appropriate here.
Starting point is 01:09:45 And the reason why I think that is kind of the first thing you should go to is because that code is still going to be pretty readable, and often pretty readable. And you'll end up not wasting your time trying to tweak your compiler only to find out that you need to switch to a different architecture, and you lose all your gains. Those are going to be the most portable changes and can often lead to code that outstrips any other optimization you might be able to do. Yeah, that's actually a really good point. So like a couple of things. One thing is definitely use off the shelf algorithms., don't really ever write your own sort or anything like that. But you're right.
Starting point is 01:10:27 I mean, if you have sort of 10 elements, you might just use, you know, a for loop and sort them really quickly, quickly in terms of effort. But then all of a sudden your requirements change and you have 1,000 elements, or 10,000, or a million elements. And then you're right. I mean, you could tweak the compiler all day long, but if your sort is, you know, the bubble sort that you wrote when you first started the project, then nothing is going to help you other than improving your algorithm. And I think one of the things about optimization that's difficult is the,
Starting point is 01:11:03 oh, I don't ever know how comfortable i am feeling about uh referring to something as the art of but the art of optimization is knowing when not to optimize or how much to optimize and it's something that's you can't often write down you just kind of have to use judgment but for instance it may be that you have implemented merge sort because quick sort a good implementation of quick sort isn't available on your system or whatever and you're debating whether or not you should use it you really should look at your use cases and say hey it turns out i'm normally only sorting 100 items though maybe it isn't i don't know what the breakdown is but often merge short actually at the end does, or quick sort at the end does merge sort because it performs slightly better on small data sets.
Starting point is 01:11:52 And so there's like some crossover point, right? this very exotic binary tree, because I think my binary tree might, you know, may not be exactly the best thing, is knowing when to invest that complexity and adding, you know, that to your code and potentially causing unintended consequences by having more complicated code than you did before, even if it does end up faster in some cases. Yep. One last thing about algorithm improvements.
Starting point is 01:12:24 A lot of what you'll do here isn't going to be inventing a new algorithm, maybe unless you're a professor or something, but it's actually going to be finding the right approximation. Like for example, if you have a list of numbers and they're streaming in,
Starting point is 01:12:39 maybe it's, you know, you're building your own load balancer and you have a list of latencies coming through. You could store all of them, and now you can find the average latency, and the P99 latency, and P50, all of that, as we talked about. But that's a lot of numbers. I mean, you're getting maybe 100 of these a second. You don't want to keep this enormous list of latencies, right?
Starting point is 01:13:03 So what you can do is approximate. You could, you know, maybe store the first 10,000 and then store some summary parameters about all the ones you've seen up until now. Or you could keep some kind of distribution estimator, a Kalman filter or something like that, where you're only storing two numbers, but you're just keeping those two numbers updated in some recursive way. And so especially as your application grows in size and popularity and what have you, a big part of it is figuring out where you can make approximations and what the trade-offs are.
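One concrete version of that "keep a couple of numbers updated recursively" idea is Welford's online algorithm, which tracks the running mean and variance of a latency stream without storing any of the samples. (This tracks mean and variance, not the exact P99; estimating percentiles online needs slightly fancier sketches.) The latency values are made up for illustration.

```python
class RunningStats:
    """Welford's online mean/variance: constant memory per stream."""
    def __init__(self):
        self.count = 0
        self.mean = 0.0
        self.m2 = 0.0  # running sum of squared deviations from the mean

    def update(self, x):
        self.count += 1
        delta = x - self.mean
        self.mean += delta / self.count
        self.m2 += delta * (x - self.mean)

    def variance(self):
        return self.m2 / self.count if self.count else 0.0

stats = RunningStats()
for latency_ms in [10.0, 12.0, 11.0, 13.0, 14.0]:
    stats.update(latency_ms)
# stats.mean is now 12.0 and stats.variance() is 2.0, computed
# without ever holding the whole list of latencies in memory.
```

At 100 samples a second this stays at a handful of floats forever, which is exactly the trade: you give up the ability to recompute arbitrary statistics later in exchange for bounded memory.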
Starting point is 01:13:45 Well, I think that's a wrap. Yeah. So I promise I will get t-shirts, or I will at least have sent myself a t-shirt that I'm happy with. I'm going to do that before I subject anyone to buying a t-shirt that may not actually look good when you get it. But I promise I will have a t-shirt prototype by next show. And I think we're actually going to have an interview next show.
Starting point is 01:14:10 It's not a guarantee, but it seems to be coming around that way. It's actually going to be really cool. I'm excited about this guest that we're going to have on the show. And yeah, if you have any questions, or if you've had any, like, wicked optimization stories, where you fit, you know, the whole world in a Cheerio or something, that'd be awesome. Post on Facebook, or reply to us on Twitter. And yeah, have fun, you guys. Yeah, we get a fair amount of user feedback and email.
Starting point is 01:14:40 So thank you guys and gals for doing that. We should have more featured like emails that get written to us we've had a lot of really good ones yeah maybe we'll do another mailbag or something or even just like people yeah like write like we should give something for people to write in about and then we should like read the best ones like in this case tell us your best
Starting point is 01:14:58 optimization story and then like we'll read it next time yeah that's true that's true if anyone has a cool optimization story we'll definitely we'll bring it up as our intro next time all right well thank y'all cool see you later the intro music is axo by biner pilot programming throwdown is distributed under a creative commons attribution share alike 2.0 license you're free to share copy distribute transmit the work to remix adapt the work but you must provide attribution to Patrick and I and share alike in kind
