Programming Throwdown - 169: HyperLogLog

Episode Date: November 27, 2023

Intro topic: Testing your car batteryNews/Links:Tech Layoffs still going onhttps://www.sfchronicle.com/tech/article/google-layoffs-california-companies-18465600.php Real-time dreamy Cloudsca...pes with Volumetric Raymarchinghttps://blog.maximeheckel.com/posts/real-time-cloudscapes-with-volumetric-raymarching/Robot Rascalshttps://en.wikipedia.org/wiki/Robot_Rascals Meta Quest 3 https://www.theverge.com/23906313/meta-quest-3-review-vr-mixed-reality-headsetBook of the ShowPatrick:HyperLogLog Paperhttps://static.googleusercontent.com/media/research.google.com/en//pubs/archive/40671.pdf Jason: Eureka! NVIDIA Research Breakthrough Puts New Spin on Robot Learning https://blogs.nvidia.com/blog/2023/10/20/eureka-robotics-research/ Patreon Plug https://www.patreon.com/programmingthrowdown?ty=hTool of the ShowPatrick: Techtonica: https://store.steampowered.com/app/1457320/Techtonica/ Jason:ESP32 development board: https://amzn.to/3Qpmb20 WEMOSTopic: HyperLogLogMotivationCardinality CountingLinearCountingHash + expectation of collision based on how fullBloom FilterLogLogUse first N bits as bucketUse max sequential 0s in each bucketAverageHyperLogLogHandle empty bucketsUse correction factor like linear counting for low counts (number of empty buckets) and high countsDistributingTransfer bucket counts ★ Support this podcast on Patreon ★

Transcript
Discussion (0)
Starting point is 00:00:00 programming throwdown episode 169 hyper log log take it away, Jason. Hey, everybody. So, Patrick, are all your cars electric? Any of your cars electric? Do you have an ICE car? Actually, I have one normal car and one plug-in hybrid we discussed. So it does have a battery, but it also has a regular 12-volt car battery. Oh, yeah, that's what I was going to ask you. So even the hybrids have regular car batteries uh the hybrids for sure i think even electric cars still often have like 12 volt sort of normal car batteries for a variety of purposes i don't know okay it does i mean it's probably better than trying to like convert you know from whatever voltage the the engine battery uses it looks
Starting point is 00:01:04 like here they do just a quick google says that yeah they have the normal lead acid 12 volt batteries so you know how do you know when your car battery is dead do you just drive until the battery stops working or what do you do i mean in an overall thing i guess yeah the car can't start so i don't know how that works in electric your accessory systems like when you sit in the car before you start it, it probably doesn't turn on. Yeah. So like in my experience until recently, I have never tested my car battery. And so basically what happened is I will just drive until the car won't start.
Starting point is 00:01:38 And I will have to go and desperately figure out how to replace the car battery. It ends up being this big thing. And it's just gotten worse as like we've gotten busier over the years. So it culminated when we were about to all go camping and the car didn't start. And we've had the same battery for a while. But part of it is, you know, you never totally know without a tester if it's if it's the battery or if you left the lights on or if there was some kind of glitch that caused the energy to drain like uh um we had the key in the car uh like not literally in the ignition but we had the key next to the car and i feel like just having the key so close to the car it somehow got confused and it thought we were in there or something um
Starting point is 00:02:27 because uh you know i'm actually so i guess to answer i'm i'm ordering a car battery tester which is only 20 bucks i thought it was going to be like hundreds of dollars because you know when they they anytime they do anything at the dealership or you call someone to do something it always seems like oh like this must be some amazing device that you know must be manufactured in a laboratory in the middle of the earth or something and so no it's like you can get a car tester or battery tester for like 20 bucks do you have to disconnect the battery not remove it but at least like disconnect the terminals like the rest of the electrical system for it to work because that's a
Starting point is 00:03:04 that's an annoying thing if you do yeah i don't think so right because as long as the car is not on it's not drawing it's not drawing any amps so it should it should be an insignificant draw so we're gonna find out i'm gonna get the car battery tester on tuesday tomorrow and um'm going to see what the instructions say. I'll let you know. I'm pretty sure, just putting my engineering hat on, that as long as the car is not on... Well, the car has to not be on and not even on accessory mode.
Starting point is 00:03:39 The keys have to be far away and everything. I could see that. But it's not a voltage thing right so the thing you're describing is can you pull enough amps temporarily to to turn the car over right so it's like measuring like rush current that it can deliver yeah so this $20 thing what it needs to do is draw enough amps for a certain amount of time and make sure the voltage doesn't drop right i think that's but it can also tell you it can actually tell you whether the battery like all the cells are just old or if you have one bad cell i don't know what you would do with that information but it could tell the
Starting point is 00:04:15 difference this is probably a voltage based thing knowing that all the batteries are made from the same chemicals or whatever so yeah that's right yeah i think if you have one dead cell then you would have you'd consistently pull a certain number of amps but at a lower voltage well you'd have one cell voltage drop missing so you like i don't know how many cells are in a 12 but like you would go to 9.3 and that's like all your cells are good except for one yeah versus if you just had like 10 instead of 12 then okay so that your your mission should you choose to accept it is to crack open the tester to look at the circuitry figure out what it's doing yeah after the last episode where i burnt
Starting point is 00:04:58 the raspberry pi and i literally have a imprint on my finger from the SD card. I don't know if I'm going to crack open the battery. But but yeah, I guess, you know, I'd be interested to hear what people out there do. So if you just drive your car until it doesn't turn on, if you have some regiment that works for you and you've never had this happen to you either end of the spectrum let us know um you know i just you know kind of like it didn't ruin our camping trip but it definitely set the camping trip back by like multiple hours so so i kind of feel like i needed to improve this area of my life so we'll see how it goes if you have a way of doing this less though actually patrick do you do anything or do you just drive the car until it doesn't turn on yes that one which is horrible i i think when i take it for its regular oil change i think one of the things in the checkpoint is they hook up their expensive
Starting point is 00:05:58 battery tester to it and then try to convince me to buy their battery um so that's been my i do regularly get the oil changes on schedule um i don't know if that actually actually is super critical but i for whatever reason i just choose to do it that way and so yeah i hope you did the same thing i get the all changes on schedule but i've never as far as i know i've never had a mechanic tell me i need to change the battery it's always just died oh man all right uh on to the news your first up all right so you know we never really talked about this uh this is a news article about you know google continuing to to do layoffs um but you know the particular article isn't what i want to talk about. I want to talk about tech layoffs in general. My view of it, there are layoffs.
Starting point is 00:06:50 There have been layoffs for, what, probably 18 months now. It kind of oscillates. So there's layoffs, and all of a sudden, there's even articles about companies laying people off and then desperately asking them to come back. It's a very volatile situation. You know, my overall take on this is that, you know, I think that it's a really complicated situation because on one hand, I do think people should choose degrees and majors and careers that have an addressable market, right? So if you get a degree, well, depending on what your goal is, but if your goal is to
Starting point is 00:07:35 enter industry and you get a degree in American history, that's going to be really difficult. I mean, there are, you know, you don't need to do what your degree is. But, you know, it's just making things a little bit more difficult. And so someone might say, well, with these all these tech layoffs, maybe we shouldn't get a major in engineering or computer science. I think my overall take of this is, is, you know, engineering is still an extremely strong discipline career-wise relative to the spectrum of degrees or things you could study. And so I wouldn't be really phased by this. If you're a high school student, college student, I would choose engineering if it's something you're interested in and not really be worried about the layoffs. I wouldn't make that a really big factor in which college major you're going to pick. What do you think, Patrick? Yeah, I agree. I mean, I would love to say that if you look back to other periods in time, like 2001 or or other the great financial crisis and that the layoffs were
Starting point is 00:08:45 completely merit-based but i mean i think the unfortunate truth is when a company goes to lay up when a company hires people hire against different reasons different performance targets they might be over specializing over generalizing hiring less quality than they need overpaying for whatever there's just so many variables that go in i think when a company goes to do layoffs it becomes very difficult to sort of like run some strict merit based thing because also you layoffs normally come from top down right normally the bottom people aren't doing the layoff so it's the top down that hey there's a target to hit and it but bottom up there's this like obfuscation that happens where everyone says their team is overperforming
Starting point is 00:09:27 because it's sort of a game theory thing right like you don't want to say oh i have a team full of underperformers uh and that's why my you know that sounds bad on the manager right so you get this like information problem and so i think when the layoffs happen it unfortunately it isn't always very fair um but to jason's point i do agree that I think when the layoffs happen, unfortunately, it isn't always very fair. But to Jason's point, I do agree that on a whole, I think the industry is fairly robust and there are other people hiring. But times are exuberant and times are lean.
Starting point is 00:09:55 And I think what often can happen is if you are in the wrong part of the hype cycle when you come out of college, you can have sort of wrong expectations like, oh, you know, I'm going to start out and make X dollars per year. And then if times get lean and you're sort of waiting for a better, better offer, you know, that matches what you heard. It's like, well, unfortunately, the situation is dynamic and like, you know, it's not necessarily true that those offers are still there. And so you got to kind of you know be aware
Starting point is 00:10:25 of that but but on the whole i think you know as an industry there's still growth here i i mean people kind of say which they've always said oh ai offshoring whatever there's always some threat to computer science or engineering you know disciplines but i don't i don't i'm not over worried about it myself and i would still encourage people who are interested in it to pursue it. I think there's lots of adjacent jobs you can do if you, you know, if things that really, really lean, there's plenty of other things you could apply the skills you learn in getting an engineering degree to. And so also, though, I will like give a shout out for the finance stuff we talked about
Starting point is 00:11:00 sometimes, which is the importance of having savings and not, you know, living paycheck to paycheck. And it can be hard when you live somewhere expensive, but making sure that if that does happen, where you have to sort of go a few months before you can land another job and have the freedom to not have to just take the first thing that comes across your desk, then I think that's a huge benefit and gives you a little bit more confidence to kind of seek out the thing you're looking for. But yeah, I agree. I'm not overly concerned about it. I think like this latest round that you're mentioning are a lot smaller. So the news likes to hype it up because everything's dramatic. But, you know, there have been serious cuts. I think the cuts
Starting point is 00:11:39 are tapering off. We're probably seeing the last of them. But you'll always see companies wind down, but new companies start up. And, you know, it's kind of an ebb and flow. And so as a whole, I think the health of the industry in terms of, you know, compensation and stuff is probably steady, probably not as exuberant as it was two, three years ago. But holding steady. Yeah, yeah, I agree with that. Yeah, I think you brought up a really good point, which is, you know, and I know we've talked about this before, but when you start out, let's say you're in college right now, you're looking for different careers. Your goal should be to learn as much as possible. And in fact, like most companies will have source control, will have peer reviews, will have like a lot of different disciplines that will be really important for you to pick up early.
Starting point is 00:12:35 And so you really can't, as far as I know, you really can't go wrong. Obviously, I've only had one first job as one human being. But you really can't go wrong with your, I've only had one first job as one human being. But you really can't go wrong with your first job. You can always find stuff to learn. There's just a huge expanse of things that you need to know to be a professional developer. And you will learn at least 70, 80% of that almost anywhere. I would say that maybe the place that's the most risky would be to join something extremely early. I heard a hilarious story. I don't think I've ever talked about this on a show. This was back when Java was really popular.
Starting point is 00:13:17 I had a buddy who worked at a really small company. It's only just a handful of engineers. Most of them were straight out of college. There wasn't really anybody who has experienced. And, you know, you know, Java used to have these things called jar files, right, which are basically zip files with some metadata that told you like, which Java file to run first, and I'm probably butchering that, but it's basically what it was. And so most people don't know that jar files under the hood are just zip files, but these folks did.
Starting point is 00:13:49 And so their continuous deployment strategy was to build the code on their computer, guess at which dot class files changed, based on which Java files they changed, and copy those into the zip, into the jar that was running on the deployment, like on the production server. So there was no source control.
Starting point is 00:14:14 Everyone just copied, you know, source files to each other over like Windows shares. And so when it was your turn to update the code, you would open up this jar file and start dropping Java class files in there. So that's a difficult situation to learn continuous deployment and source control and all of that. So that might be something I'd be concerned about.
Starting point is 00:14:42 But most medium to large size companies most you know medium to large size company actually all medium to large size companies and most small size businesses will be just fine all right well taking a hard hard turn there from the seriousness of that uh but but i agree agree with what you say uh my next news article is my continual stream of shader descriptions, which is in my... I've given up the belief that I'm going to build a killer game. So now I'm just like, I'm going to build a killer shader for making a cool visual demo. So this is real-time dreamy cloudscapes with volumetric ray marching. Ooh, that's a mouthful.
Starting point is 00:15:23 Check this link out. This is by, I'm guessing by the name of the blog, I think it's by someone named Maxime, Maxime Heckel is what I would guess there. But the, yeah, I think that's right. And they have a article on their blog about doing a couple of things. First of all, rendering clouds,
Starting point is 00:15:42 as the name says in real time. So, you know, doing it nice and fast. It's a beautiful thing if you check it out. But also interesting is if you sort of click around through the linked articles, they have a ton of really just interesting examples of doing shaders and how to kind of make them work. And so shader is a code that roughly runs on your GPU and, you know, sort of has instructions per pixel about what to render uh and it's source of a lot of like how do you call like computer generated art demo scene style things which we've alluded to before um but also this discussion of ray marching has come up
Starting point is 00:16:15 a few times into i feel like those things ebb and flow i think it's the algorithm just like honing in on me but like whatever um so i've seen a lot of youtube videos or whatever about ray marching recently and so it's on my like list of things to look into which is a way of sort of thinking about ray casting and you know understanding how close you are to something uh sign distance functions uh anyways tons of good quality things like a jumping off point so uh i one day aspire to make beautiful like like mellow screensaver like shaders like these articles I keep touting out for everyone. This is amazing. One PSA. Don't look at this in Firefox. It will cause cause all sorts of heartburn. But on Chrome, it works just fine. Do you do this, Patrick? I use different browsers on my work computer for personal and for work things that way it's like the most isolation isolated environments i could think of oh yeah i i do do that but uh it's sort of a more because it's a requirement rather than uh that's okay um
Starting point is 00:17:21 but yeah on chrome it works just fine. That is so cool. I wonder how many of these things are commodities. Like if you were to use Godot or Unity or one of these things, do they just have volumetric ray marching or is it too cutting edge? Oh, that's a good question. It's like you could probably find out implementations of it and you know pieces but i think this is one of those things that you kind of piece together the sort of almost algorithms into building the look of your game or your your you know demo or whatever so you may do things like compile various ways of doing water rendering and cloud rendering if you're doing
Starting point is 00:18:03 you know this kind of thing and so i think you might be able to pull down certain it's not like shaders i don't think that you know my understanding is it's not like pulling down a model like downloading a model and like you know rendering the model and it's rigged and you do animations like those things i think you have common utilities for the shaders are more like the sort of look and feel and so you can probably have a jumping off point with them but you sort of combine them in your own way to give your game its unique style um you know or realistic you know kind of thing but i think a lot of people make independence are pursuing a very stylistic approach so that kind of looks unique to them or whatever so um rather than you know hyper realistic is a really tough game to play in yeah that totally makes sense um all right my second news
Starting point is 00:18:46 is uh i've been diving into okay so what was your first computer patrick first thing with a keyboard it must have been one of those is that like an apple 2e okay yep yep we had those in school i had a commodore 64 and uh you know i also had an apple tv at school and i've just been going down memory lane on old like original games like the first few games ever played um one of them was this game called street beat where you ran around with a boom box uh trying to uh hit people in the face with music notes to make them dance. And so you actually, the way it worked was first you had to find a song and the songs were like in people's houses. So you'd see a glowing door. That was a sign that you, this person had made an awesome song you'd go in his house you
Starting point is 00:19:45 take his song and then you turn on your boom box and hit people in the face with music notes and once you've hit a certain number then the song was like you know i guess like proven to be good or something baked in or whatever it is and then you would take your your baked in song to your house your own personal house and score a point and so i don't remember how many points you have to get but um anyway so that was street beat um i came across one that i played with my my family when i was really little that kind of blew my mind and that it was really a interesting crossover that I don't think I've seen since called Robot Rascals. And so the way Robot Rascals worked is it was a card game and a computer game all in one. So the cards were your private information that no one else could see, and the computer was sort of the community, you know, information. And so you would take turns playing
Starting point is 00:20:47 this, this game on the computer, but you would have your private information. And, you know, you try to based on what other people were doing, try to guess at what their private information was so that you could exploit it. And I just, yeah, so many good memories of this, but I haven't really seen anything like that. There are some companion apps. So if you're playing Monopoly, for example, there's a companion app where you don't have to pass paper money around. There's things like that, but those are more assisted tools. I haven't really seen a board game app kind of you know crossover like like this um have you patrick have you seen anything like that yeah i mean there are a few so i think
Starting point is 00:21:32 you know i've played like a ticket to ride where you know like everyone has their tablet and you know the screen is rendering the common thing but you can also do pass and play or whatever where like the color oh i don't know how people anyway the color of the cards you have you know it shows at your turn there's also i think one for suro um wait hang on i don't know maybe i don't know saying that you're okay so what i'm saying is you have physical cards so so you're playing a physical card game but then the computer has like the other half of the game you know so like the computer doesn't know it's like combined like hybrid so it's not like you're playing the board game as a okay no i missed this okay so like you have like a deck of cards like holding in your physical
Starting point is 00:22:17 hand right right oh wait what how did and so you have to enter information reliably into the computer so yeah so like you could uh like basically like information reliably into the computer? So yeah, so you could basically tell the computer, I have a right to mine this rock. And the computer has no way of knowing that because it doesn't know physically what cards you have. So it just assumes when you say something, you're right. But yeah, you're also playing with other people. If you have a right to mine this rock
Starting point is 00:22:46 and someone else is just looking at that card, then he knows you don't have it. So you can't really cheat, per se. But yeah, you need the physical and the virtual in the same space. I think I have heard of something like that, but no, I've never played it. never, I never played it where like, yeah,
Starting point is 00:23:06 like the, the engine of the game can be very complex because the, all the accounting is done by the computer, but you still are playing a physical game. Right, right. Exactly. Yeah.
Starting point is 00:23:16 Yeah. And that's cool. Yeah. And this was like an old, old game, right? So it looks like you were saying it's like, I definitely played it as a very little child it came out uh 1986 so i was actually four when it came out i mean i played it i was probably
Starting point is 00:23:32 like eight when i played it but very cool wow that's nice yeah so folks out there if you're in game development this would be like a interesting area i don't know if it's sort of like just you know past its usefulness but i just i feel like there might be something cool there we'll see maybe someone could write in and say what some contemporary equivalent is well i mean an alternative to what you're proposing is you could go into virtual reality and the game could just be sitting on the table in front of you and if you wanted uh sort of a newly released virtual reality headset you could use the new meta quest 3. no i don't i don't but i just bring it up because i feel like it's one of those a couple like uh you know space travel uh random finance threads uh virtual reality. There's like a few things I feel like we keep weaving through our podcast over the years.
Starting point is 00:24:29 And so our podcast is old enough that we've seen some of these things mature. And so interesting that MetaQuest 3 is like a sort of, I guess it's the latest released hardware. Of course, you know, I think there's the Vision Pro, which people are talking about coming from Apple next year, I think is what they're saying but um for the meta quest 3 this is out you can buy it and the the sort of like it does a lot of things right like it just revs everything but the pass through where your cameras on the front that are sort of eye distance apart passing through to you on the inside so that you can do the mixed reality this is the thing
Starting point is 00:25:05 that you know everyone really wants to make happen is the augmented reality the mixed reality uh and so you are seeing which is kind of the funny thing the what happened when google glass was coming out which is like people sort of out in public using their headsets in weird places to you know record record stereo video yes but also to like be somewhere but also not be in that place and so it's sort of uh this interesting uh i guess you know zone between this being a real thing where you just put on glasses and see it because you're still wearing a headset people are very aware of what you're doing um but yeah i i know i do play still my i have a quest two i i want to call it an oculus quest but they deprecated that but to be clear it was like on the branding of
Starting point is 00:25:51 my box i feel like i still call it that uh anyways the quest two and i've really enjoyed it i feel like the they sort of figured some stuff out the games you know we play we keep playing and and we've talked about that before on the show you know getting exercise in the game and just generally how immersive it is i've been playing the i guess i could have made that my tool of the show i didn't oh well uh which is i expect you to die number three um which i told to someone and they looked at me very funny like what kind of horrible game are you playing it's like no no no it's like a light-hearted puzzle room uh so anyways i expect you to die number three i feel like you know people have kind of figured out the formula a bit for some of
Starting point is 00:26:25 these things that work and it may be your cup of tea or not, but excited to continue to see stuff coming out in this area. I don't know where it's going. I don't know if this is like the next big thing or not, but definitely excited to continue to see how it shapes up. And at least, you know, for me, you know being a casual uh person involved here uh you know enjoying some good games and entertainment as these things mature yeah i mean i'm one of those people too that plays the the oculus the the oculus quest to um at least once a week sometimes
Starting point is 00:26:58 even like multiple times a week and uh but you know something that like uh it's kind of interesting you know i i play basically the same two games that i've had almost since the beginning and so um you know i don't know how much of a commercial success it is um but but it creates a ton of value um and basically for me i just play boxing and sword fighting. And, you know, originally, I got it to where I was playing boxing on the harder and harder difficulty levels. But I realized two things. Number one, you know, I'm not really very, like, fast in terms of reaction time. And so number two is like, when you start whipping your head around, it's just not very comfortable. So I thought, okay, I can make this more difficult by adding weight, because i know boxers do this in real life right they have like really weighted
Starting point is 00:27:49 gloves when they train and stuff and so i got these uh wrist weights and i've been kind of just adding more and more weight and playing these games on the normal difficulty just with more weight um it is amazing like number one how much harder it is even with one pound of weight on each wrist you'll be amazed yeah you'll be amazed at like how slow you punch and how slow like you get your hands back in a guard and all that um and i got this is kind of ridiculous but i'm up to like five pounds of weight on each arm so i have these uh they're meant for your ankle, but I put them around my wrist and I have another one that has a thumb loop. And so I have these like basically
Starting point is 00:28:31 two weights on each arm and I'm doing boxing and it is like a heck of a workout. I mean, it's just, you'll be sweating bullets up. You do it. It's, it's a ton of fun. So you're ready for a street brawl is what i'm hearing yeah i'm talking to a friend of mine who does he's kind of a hobby blacksmith and he told me that um i always thought and i always thought that that swords were like 30 pounds uh but no like like swords you know in real life like a broad sword or whatever is only like six pounds and so um now you know granted there's even more of a of a cantilever effect because the sword's weight is you know not even close to your body but but but i'm starting to approach what it feels like to hold a real sword um and uh it's it's it's
Starting point is 00:29:21 a ton of fun so if you get stuck in one of those time vortexes or whatever, and you end up in medieval times, you'll be ready to go. You're ready to rock. Yeah, I think the challenge is, you know, I'm used to moving through people. I'm not used to people actually existing in a physical plane. So as long as it's a lightsaber, you're fine. Yeah, as long as nobody hits me either. Every plan is good until you get punched. Yeah, Mike Tyson, right?
Starting point is 00:29:48 Yeah. Yeah, random sidetrack. Okay, I wanted to go back to one thing you were saying, which I don't know if we talked about on the show, so I'll mention this just really quick, because we've talked about it before. But when Activision was on trial, that was earlier this year, I believe,
Starting point is 00:30:03 it came out that there are a million PlayStation users who literally only play Call of Duty. Literally nothing else. They have their PlayStation and they only play one game. And something like there was 6 million who spent 70% or more of their time playing a Call of Duty. And so you're like, I only play one or two games i'm the same i feel bad if i just keep playing the same game you know factoria um but uh you know if you
Starting point is 00:30:32 just sit there and play all your but actually i think you know this is quite normal is what is is what the statistics show is people just end up really liking one game and for very long periods of time they just play a single game. Yeah, that's wild. I mean, it's very hard for the industry to make money if they have to sell those units. I think they sell the units at a loss. And if someone only buys one game, it's just really difficult.
Starting point is 00:30:56 Yeah, no clue. All right, time for book of the show. All right, so I'm going to go fast because mine's a cop out. Mine is the paper we're about to talk about today. So we're talking about an algorithm, HyperLogLog. Stay tuned. And so my book of the show, which isn't a book, it's a paper.
Starting point is 00:31:11 And it's a paper by Google Research, which is where HyperLogLog was sort of written up. So check that out. If you're not a fan of reading papers, I know Jason reads a lot of papers, but I don't. I find them very difficult. I don't prefer them. I don't like the way that they're presented. I think they're pretentious, but whatever. Anyways, all the PhD people are coming after me soon. Anyways, but I find that they're not written for clarity, but written to sort of affect a certain style. This paper is no different, but there is some good stuff in here. There are, you know, good graphs.
Starting point is 00:31:46 And even just, I've just given up that you don't have to understand every line of a paper, or even like the notation. Just kind of like doing it once or twice. And if you sort of go through a couple of papers and people will often reinterpret a previous paper. So for papers that are sort of seminal,
Starting point is 00:32:01 you can kind of read a later paper and get a better description. And so anyways, don't feel bad if you're not a paper reader. You know, I would say as part of like, you know, some small percentage of your education, if you're not a sort of master student or PhD student or PhD graduate, still don't shy away from looking at papers every so often when they come up and just sort of like reading. You can occasionally glean really interesting tidbits or or insights not always there's a lot of junk uh you know anyways hyperlog log paper that's my that was a weird rant for a book of the show but
Starting point is 00:32:35 there we go no i think it was really important i mean we must be on the same wavelength because my book of the show is also a research paper and i didn't know this. I saw it in there and I thought it was an actual book. Yeah, I went into today's show thinking, man, I've read so many research papers but made zero progress on my Audible books. I'm just going to talk about a research paper and then here we are. I'll go pretty quick too
Starting point is 00:33:00 because I want to get to the content. But basically, NVIDvidia has this really interesting thing where they're using large language models to try to fabricate goals for this uh reinforcement learning robot so for example they'll ask chat gpt and but i'm just starting to get into this paper, so I'm not going to do it justice right now. But they'll basically ask ChatGBT for, like, what are certain things that we can do to show that we kind of have good ambulation? And then they'll kind of just keep riffing on that. So I do feel like one thing that's always missing with reinforcement learning is you have to train from scratch versus, you know, for example, if you're driving a car or writing something down or whatever, you're starting from this base of all this common
Starting point is 00:34:00 sense reasoning. So, okay, here's a good example. So good example. So we have the AI to play Atari. And so the way it works, if you want to play Pong, is you start off just moving the paddle randomly and losing. But then sometimes you just randomly hit the ball and you get a point. And so that creates a target. And then you start marching towards that target. This becomes a really big problem in something like an RPG.
Starting point is 00:34:32 So if you want to play a role-playing game with AI, you're going to have to like try every spell randomly, move around randomly because you're not drawing from a common sense reasoning bank like like you don't the ai doesn't know that well forests usually have more dangerous enemies than roads because roads are generally guarded or or fire works well if the enemy looks like he's made of ice like these are all things that you could just do on the first try but the the AI has to trial and error its way through.
Starting point is 00:35:11 And that's why you don't see AI playing Dragon Warrior, for example, because there's just too much common sense that has to be learned from scratch. And so this interested me because I feel like maybe we can somehow draw from the common sense that's been accumulated in something like chat gpt um or gpt4 i think there's something to that like imagine if you somehow just fed the visual and the textual content to some type of embedding and then trained on that then maybe you know like fire one and fire two would go into this embedding and you would just learn in general some things about fire the fire spell or whatever um so i feel like there's something there but it's it's very early days that's interesting so rather than like combinatorially exploring the space and finding out that like, if the attribute ice or whatever is attached to a enemy,
Starting point is 00:36:08 then use a fire spell. I guess I could use a Pokemon. Maybe it's a good example. Like, Oh, visually this thing looks like it's frozen or, you know, has an ice type associated with it.
Starting point is 00:36:19 Or I've memorized that. Like I should use a fire spell. Like you said, that's done that way, even though it's an abstract concept, there just rock paper scissors right but that transitive is encoded to you as a human because it maps to something you understand which is like you know if i have something made of wood something with fire will destroy it and would have a very hard time fighting something made out of fire and so you're sort of saying you could kind of like get that understanding by combining two approaches something that understands the context being given to the human and then the normal you know sort of yeah
Starting point is 00:36:56 interesting yeah exactly exactly like i mean of course you could well i don't know of course i mean you could theoretically cheat if somehow you could read the game's ram and figure out like okay this this character has this you know ice attribute right but like that's not how humans do it like humans are just looking at the picture and so to really play final fantasy or dragon warrior these other games barely like to show it that an ai that is really representative of solving the problem it has to be able yeah to just look at things and and infer based off what it knows about the universe
Starting point is 00:37:34 so i guess the equivalent would be if a computer designed a game that it thought was interesting based on the rules of the game and then presented to as a human you would be like i this is not fun this is just like some abstract concepts uh the ai would be okay you would be on a more equal footing but because game designers are humans designing for humans yeah you kind of you got to cross that chasm before or you put a computer on equal footing yeah i mean imagine like a poco they actually have these where it's like uh they randomize the game have you heard of these game randomizers yes yes yes yeah and so they you know imagine a game randomizer for pokemon where it just randomized all of the strengths and weaknesses of all the pokemon it would be just terrible because it's like oh this
Starting point is 00:38:22 the turtle that shoots water is actually you know know, a fire, you know, entity. And it would just it would just be really frustrating. Like, that's how the AI feels. Am I supposed to empathize and feel bad for the AI, Jason? Help me out here. Yeah, I mean, you know, it only makes billions of dollars, controls the stock market. I mean, you should feel bad for this thing. If you like takes like that,
Starting point is 00:38:52 you should subscribe to us on Patreon. If you're not an AI, you should subscribe to our Patreon. We really appreciate it. If you are an AI, you can also subscribe to our Patreon and fund this very important source of training
Starting point is 00:39:07 data that's right so yeah thanks to all the patreons we had a lot of people stick around for a very long time so it's always an encouragement and you know thankful for the support yeah definitely alright it's time for the tool of the show
Starting point is 00:39:23 well I'm returning to form and giving another game. This is, I mentioned Factorio earlier, this is in the style of, I guess it's closer to Satisfactory if you've played Satisfactory, which is, you know, there was a kind of a lot of stuff routed up around the Minecraft culture,
Starting point is 00:39:41 you know, sort of Terraria and several others. And I think we're seeing kind of the same thing with i don't know what you call them i guess factory building games i'm not sure what the official i think that's right like base building plus factory building uh and so tectonica is a new one i've been playing on my steam deck actually um many of these games like satisfactory actually don't really have very good sort of controller support and people do play it so someone's going to write in and be like, I use the touchpads for the 37 shortcuts
Starting point is 00:40:07 that you need. You know, it's not something you would sit down and expect to play on your Xbox. Recently, Factorio added that. That was actually when I picked it up and started playing on my Steam Deck. But I've been playing this one Tectonica. And so all these add a twist. So this one is 3D,
Starting point is 00:40:24 which adds an interesting dimension to your ability to one tectonica and so you know all these add a twist so this one is 3d which is uh you know adds an interesting dimension to your ability to uh you know build your factory and and how to align things because it's in a first person sort of 3d 3d thing uh and then the second thing is that uh you're underground so you're in caves and caves of course like limit how you can sprawl your factory even kind of more than normal. But you're also given, you know, drilling equipment, basically. So you can carve out, you know, paths to, you know, extend your factory line or add stuff. And, you know, you need plants to fuel some of your, you know, combustible uh like smelters and uh so you need to you know to hydroponic
Starting point is 00:41:06 growing because uh you know it's a pain to run around and mine mine all the mine gather harvest harvest yeah bring on the walls uh so anyways uh tectonica it's still early access so i guess in theory it's buggy i did i did actually hit one bug on my computer um but in general very playable i've not hit like some nasty crashes or anything. And a lot of these games, interestingly, are satisfactory. I think it's the same thing. Still technically early access,
Starting point is 00:41:33 early release, whatever it's called. And so it's not formally done yet, but very, very playable with beginning and end game and all of that. And so Tectonica also is nice because it has a... I mean, you may not like the story. It's not an rpg level story but there's definitely like a story to it about what's going on and sort of like accomplishments to work your way through that
Starting point is 00:41:54 that are sort of very specific to unlock a narrative that goes along with it so i'm enjoying it tectonic i checked out oh on pc very cool as far as i know i think xbox pc i don't know about playstation oh very cool check it out yeah i love factorio and satisfactory so this is right up my alley i'm gonna give it a shot all right my tool this show is uh the esp32 development board have you heard about this patrick yes you have okay i'm gonna say what I think it is. I literally just got one. I have it. I've actually two of them right here. I'll hold it up for the audience to not see because we don't record video. But actually, that was useful because I was wondering which development board you have. So I have a tiny, tiny one. Yeah, it's like the size, maybe the size of a stick of gum, basically. Okay.
Starting point is 00:42:48 And yeah, and it's, you know, the thing that caught my attention was that it has Wi-Fi. So it's, you know, the Raspberry Pi Pico W, I have a couple of those. Those also have Wi-Fi. Those were significantly more expensive. This is like extremely reasonably priced. I have a link to the actual one I bought and I bought, yeah, I bought two of these for $12.74 US. So it was like $6.20 a piece or $6.70
Starting point is 00:43:18 a piece. So yeah, very reasonable. And I haven't tried it yet. They're still in their vacuum sealed bag or static bag, whatever it is. But I definitely want to give it a shot. It supports Arduino and MicroPython, which is pretty cool. So you can use either of those. Yeah, the Raspberry Pi and Arduino, they only support their own thing. This one supports either. I'm generally a little leery of things that are just so open-ended
Starting point is 00:43:47 that maybe it won't do any of those well. The instructions that came with it walk you through how to set this up for Arduino, so I might just stick with that. But yeah, just the fact that I can get Wi-Fi in a small form factor is
Starting point is 00:44:03 really neat. I think next year what I might do is I have a one of my Eero, you know, Eero is a mesh Wi-Fi thing. One of my Eero capsules is pretty close to the front of the house. And so I'm thinking I can actually have robots in the trees next year for Halloween. And they'll get this wi-fi packet and when they get the green when they get some wi-fi packet they'll swing swing from the tree like some giant ghost or something so um that's my plan um uh last halloween which was you know just a week ago from when we're recording this um you know i i actually went down a step i i only did about half of it um things have just been really crazy for me so i didn't have a lot of time to invest in it but i'm gonna try and make
Starting point is 00:44:53 up for it next halloween i'm gonna bring back all the robots i didn't do this halloween and add some more in the trees if i can very cool yeah so i guess like the commentary i have there is the uh company i think it's expressive is develops a little uh chip that does like all the wi-fi and has i believe their arm cores um on them and so all of that to i guess like what jason is saying is it's kind of irrelevant unless you're doing embedded programming directly um but they're very, very cheap and common. And so a big community has sprung up around them. So Amazon's a good source if you want them quickly. If you are willing to wait for overseas from AliExpress,
Starting point is 00:45:36 you can get the original version, or at least the original one I came into, which is ESP8266, which is a 16-bit microcontroller, I believe. Again, maybe not important depending on what you're doing, if you're just sort of controlling a few servos and also has the library associated with it. And that one's like $2. Wow.
Starting point is 00:45:55 For an ESP8266 for like $3. So you can search like, they're kind of like all cloned, but Wemos, W-E-M-O-S, is a very common like dev board you can kind of look up and they have esp 8266 versions and esp 32 versions the esp 32 is a you know 32-bit processor faster so if you're needing to you know do more computation they have variations of it with you can attach a camera even and do some minimal image image processing on or use as webcam of course if you're just going to need a webcam it's better to normally just go buy one but then you normally you know pulling the data down and
Starting point is 00:46:32 doing processing rather than being able to handle it sort of device side itself and they all have you know we talked about that last time with the raspberry pi it has the sort of headers for doing i squared c or spi and know, a lot of example code. And like Jason mentioned, you can use the Arduino, I guess you'd call it like middleware, but basically like the libraries and the interfaces associated with that. They do need to be for like hardware stuff. They have to be ported to ESP32. So you can't run the sort of Arduino libraries directly in some cases if their software is fine but for you know things like i squared c handling for a specific something you know you may you may
Starting point is 00:47:11 need some variants but it's relatively straightforward much easier than sort of building your own pcb and doing it from scratch but they also support micropython as i mentioned and lua as well are pretty commonly seen as ways to sort of control these. And the big win, like Jason mentioned, which you don't normally see in a bog-standard Arduino, or at least not at this price for sure, is the ability to connect to your Wi-Fi network.
Starting point is 00:47:36 So you'll see a lot of home-controlled devices actually use these as well. And there's a lot of firmware floating out for reflashing sort of, you know, the light bulbs that screw into your roof will have one of these chips inside. And that's how they connect to your network so that you can control them from the app. And so I figured out that a lot of these run very, very similar chips. And so there's various from easy to hard ways of sort of flashing your own stuff onto them.
Starting point is 00:48:02 And in many cases, the other cool thing is rather than having it hooked up to your computer, there often is the ability to update firmware over the air. So you can just send a new update to your bat in the tree and change its behavior without having to bring it back inside and hook up cables. Oh, that is awesome. Yeah, I'm really excited about this. I'm going to have to...
Starting point is 00:48:23 I bought a... This is probably just really amateur to you to uh i bought a this is like uh probably just really like amateur to you but i bought a crimper i've never crimped anything before but i wanted to like basically have this thing control the servo but i need to have a wire come out of this thing and go into the servo and it can't't be just a DuPont cable because of the way this thing is set up. And so I actually bought a crimper. I'm going to try doing my first crimp at some point.
Starting point is 00:48:54 Yeah, I don't do a lot of crimping. I do a lot of soldering, but not too much crimping. And a correction, the 8266 was actually still 32 bits, just less powerful less purple devices do you have to buy like a thousand of them if you're buying it from aliexpress or can you still buy like five or ten you can buy like one or two yeah oh nice shipping has caught up it used to be a much better arbitrage or however you say that but now uh yeah normally you know they want
Starting point is 00:49:25 you to buy a few to basically amortize the cost of shipping but it's not like hundreds um you know if you're buying a tape a tape of them uh you know the actual chips you probably get to buy 500 or something but most people are buying a dev board so you're only buying a couple uh makes sense very cool all right well it's time to talk about HyperLogLog. So Jason has warned me that he has been willing to play foil to my explanation here. For more broad topics, I do have some other specific algorithms in mind. So if this goes okay, maybe we'll do some others in the same style. But let me kind of like set up and walk it through. And Jason, ask away the questions. Represent the average person out there or the everyone else or even me
Starting point is 00:50:15 because normally I'm the one asking you these questions. The algorithm here arrives from a difficulty that probably most people haven't really thought about. In fact, I really hadn't thought about it until it was sort of posed, which is how do you count the number of unique items in a set? And so if you sort of imagined the classical example, we'll just use this one. You know, everyone's pretty technically oriented here. Imagine you're a website and you don't want you know you see that visitor account right and your mom goes and visits your website and then she just like keeps refreshing and the counter just keeps going up yeah i have a popular website most people will sort of give
Starting point is 00:50:59 this number instead as you know unique visitors so you know you could say oh okay i'm gonna you know store a cookie and tell if you've been there before or or whatever right there's various ways but just imagine you know it's sort of being written to a log let's say the ip address or you know a unique identifier for a user and you want to count up how many of them there are so you know like the first first blush you know sort of like the naive answer is you sort of you look through your list of all the visitors and you keep track of which ones you've seen. And then every time you go to the next one, you look through the list. And if you see that user, you throw it away and then you count the size of the list at the end and then boom. And that that is correct. That actually will produce the correct answer, the so-called
Starting point is 00:51:45 cardinality, the number of unique visitors that was in your set of consideration. And so you may say, well, that's pretty obvious to improve. If this was a programming interview, that would be a suitable first answer. And then your interviewer would normally up the challenge a bit, which is ask you, well, how efficient is that? Could you do better? And so of course you can do better than searching through a list for unique items and you could use a tree, right? That would be one option. And the other option is you could use a hash map, right? So you could have a hash map of all the unique identifiers and then every time you, you know, get a unique id in you hash it or maybe it's already you know you randomized enough but normally you want a really really rigorous hash
Starting point is 00:52:31 and so you mix up all the bits you get essentially a random number but a repeatable one and you go to a set of buckets and you is that one present or not if it's present you throw it away and if it's not you insert it and we've talked throw it away. And if it's not, you insert it. And we've talked about hash charts on the show before. If not, you can go listen to them if you don't know what they are. And so at the end, you count how many entries you have. And this will also work. Yeah, I mean, the limit of what I,
Starting point is 00:52:58 off the top of my head, would do just to scale up is have a distributed hash table where basically you do the hash and then each, it's like if the hash starts with a one, send it to computer one. If the hash starts with a two, send it to computer two. And so then you have a computer for each possible value of the first digit of the hash.
Starting point is 00:53:21 And so then now if there's, I don't know, 56 different characters for the first digit of the hash and so then you know now like if there's i don't know 56 different you know characters for the first digit of the hash then like you have 56 computers and you get 56 way parallelism um but yeah i guess we're gonna hear about way even better than that no no no no so so this is great so you know i i had a same kind of like thought process in my head. And I think that works great for live, right? So if these if you wanted to sort of like, what I was mentioning, you know, these visitors are coming in and you sort of counting them, but it doesn't work like after the effect, or sort of analytic purposes. So imagine you like I was mentioning, you have this is a log and you want to cut it by day, by month, by, you know, people that came from Europe, by people who ended up buying something, by whatever, you know, these kind of filters. and so they have a distributed system for querying the column score and and this was developed
Starting point is 00:54:26 because they went and looked and people doing analytic queries or i forget what it said in there i won't quote it because i'll get it wrong but a very large number of count distinct so you write this sort of sql query and you say hey i want the distinct number of these but they're over arbitrary input so what you say will work jason but the issue is sort of like that's like if you only really want you know a handful of Jason, but the issue is sort of like, that's like, if you only really want, you know, a handful of them live, and you're not sort of like infinitely slicing them later. So not not not an off approach, we'll come back in a second, I think you're heading in a good direction. But the sort of first, there's many algorithms have been done to sort of solve this. And the paper even references one and to be clear,
Starting point is 00:55:02 they even recommend this one, if you're sort of not going to have very many. If you don't believe that there's a lot of unique items, this one will work. And I want to talk about this one because it references something else. And we already kind of mentioned hashing unique identifier and then trying to put it in a bucket.
Starting point is 00:55:20 So this first one is linear counting. So if you hash it and you put it in a bucket, but rather than in a normal hash it and you put it in a bucket, but rather than in a normal hash map, you would store the identifier. And one way is store a linked list of identifiers. So the first thing in the bucket is the head node. And then it points to the next thing and you have a linked list, right?
Starting point is 00:55:38 Until you sort of grow the hash map number of buckets. In this case, what we're going to do is actually only store a single bit and not store the actual ID. So you just insert the value. And if it goes in a bucket, you just mark that bucket as having seen something, you know, Boolean equals true, and all the other buckets are Boolean equals false. So now you have an issue is when you have a hash collision, you're losing some information, right? You're losing the information that there were actually, potentially, it was the same item,
Starting point is 00:56:07 but potentially it was two different items. And so you are getting this sort of probabilistic structure. But this is going to turn into an advantage and the rest of this is going to talk about this. And the reason why I think this is an interesting topic for the show is my background before kind of coming across a few of these was always like,
Starting point is 00:56:24 you would definitely not want a probabilistic answer you would want an exact answer uh and so it would not be acceptable to be even off by one right that's an off by one error and so you yeah exactly what you have but if you're willing to give up an exact answer the design space widens dramatically, not only for the number of solutions you can have, but also for things like being able to tune your trade off, because you can say how reliable can I trade space or processing for more reliability. And an example that you see a lot recently with machine learning is approximate K nearest neighbors. So we've talked about k nearest neighbors
Starting point is 00:57:05 before i think maybe if not maybe that's a good one to talk about i think we have yeah okay okay good um but what if you were willing to accept an approximate answer then you can kind of approach it differently so this first one that i'm mentioning is called linear accounting so you you hash it you keep track and in cases, when you produce this random number, and you go to insert it in a bucket, you're going to collide. And like I mentioned, sometimes that's you want it, you want to throw away because it's the same ID, and every one of the same IDs will collide. But sometimes another ID could probabilistically collide. And that's the second bit of this is using statistics to say, when you're done done don't just count the number of buckets that have
Starting point is 00:57:46 an entry in them which would be an approximation but you can do better which is look how many of the buckets like the percentage full your hash map is and use that to estimate what the likelihood is that you may have seen extra insertions that miss and so this this is like this like clever unlock right for me was um we don't know how many of the same item and how many you had duplicate uh hashes or portions of the hashes that you used are actually different ids but if your you know hash map is only 10 full only 10 only 10% of the buckets had values, that's, it's pretty unlikely, like, you can just say 10 of the buckets had items. So I'm just going to guess 10. But if you know, 90% of your buckets, you had 10 buckets, and nine of them
Starting point is 00:58:38 had it, then the chance that the next thing being inserted being random, 90% chance that it's going to collide, right? And so you sort of take that into account. Now there is some balance there, right? If you had 10 buckets and 10 million unique items, you're going to get like a terrible guess, right? Because you fill it up and then everything will collide and you can't, the information is lost. So you do need to have an understanding of kind of the approximate number for this example, the sort of approximate number of items that you would have. So this one's called linear counting. And when I was reading about this, the thing that comes up here, which I don't know if you've used these before, Jason,
Starting point is 00:59:16 bloom filter, have you ever used a bloom filter before? I've used it, I don't remember exactly how it works. Okay, so this is very similar to a bloom filter in a bloom filter uh the idea is you're trying to have us what things are members of the set and you query the bloom filter and say via the same mechanism basically you hash it you look in the bucket and if you normally do a couple different buckets um and you say if every one of those buckets has an item in it, then the thing I'm looking for, I'm going to do a more expensive query or check or actually go to the database.
Starting point is 00:59:52 But if any of the things that should be present aren't, then I know with 100% certainty that this item is not in my set. So if I'm looking up the username Jason, and I do this operation, and it comes back that one of the buckets that should have had a bit set don't, then I know Jason isn't in, he's not a user. So I don't actually have to query my backend database. And the advantage is... So let me see if I understand this right.
Starting point is 01:00:17 So the idea is like, I mean, you just created an example. So there would be sort of a bit for every letter and position pair. And so it's like, you know, if you don't have the, so JSON is J-A-S-O-N. If you don't have the, you know, A at position two bit, or you don't have the O at position four bit, if you're using any of these, then you know that the word JSON just can't be there. But if you have, you know, J1, A2, S3, O3 o4 and 5 then like you're still not 100 sure you have to do some extra step but but like you're you're confident enough that you've eliminated so many other things that's still like a huge time save yes so if you took each of those like j in the
Starting point is 01:01:01 first position a in the second position and you you know basically hash them and then put them in the bucket and the probability of all those things you said right the probability of false positive is controllable if you say i think i'm gonna have this many true entries i want to use this many bits and there's calculators online for sort of fine tuning your parameters and if you have lots and lots and lots of buckets, you can decrease your false positives. But if you, you know, shrink it down, the really nice thing is, say, you know, on the edge node, you can't store that you have, you know, 10 million usernames, but you could store a megabyte or whatever.
Starting point is 01:01:38 And in that megabyte, you could store, you know, this Bloom filter and give you a reduction of... And it's only worth it if you think there's lots of users going to try to log on who don't actually have usernames in the database or whatever you're trying to do. But if they do, then you can sort of not have to go to the backend.
Starting point is 01:01:58 So if you imagine for something like a spatial index and you're trying to query for an area on your computer, like your video game, you're looking into this portion of space, and most of the portions of space are empty, and you just have entries of where there are things, then you could filter out the expensive sort of collision detection and all of the other things by basically knowing up front which things are present or not. Now, there are other ways of handling that as well, but you're right, it doesn't give you a 100% answer. But it prevents you from doing an expense query by filtering out many of them. Got it. And that's a bloom filter. Or did what is it called a bloom
Starting point is 01:02:33 filter? Do you know? That's a great question. I think because you do this multiple hashes, so you don't just do a single entry into a bucket, you normally bloom I would, this is my imagination, because it blooms out and you normally bloom. I would, this is my imagination because it blooms out and you normally do a couple different buckets by taking portions of it. Okay. I have a really funny, true answer. So, so the, well, that might be true, but also it's called it because it was created by Burton Howard Bloom. That's a true story in 1970. But, but i mean uh that's uh um that's really interesting man i can't believe that you know that this this uh someone in 1970 was thinking about this problem that just blows my mind you know so this one was the first this bloom filter was the first one i
Starting point is 01:03:21 came across that was this it's okay to be wrong. And in fact, it's just a tunable parameter. This is sort of weird. Sometimes you get this if you do like DSP work or filtering or something, you know, or you talk about floating point math. But for something that was like a data structure, a data structure that's like meant to be somewhat broken, it's like intentional that it sometimes gets the wrong answer. It's by design.
Starting point is 01:03:46 So Bloom filter and linear counting, at least in my mind, pretty similar. So moving on from linear counting to the next step, we're going to go to the tail end of HyperLogLog and talk about LogLog. So LogLog does something
Starting point is 01:03:58 really interesting that if we were talking about the Bloom filters taking, you know, Jason was talking about taking the J and hashing that and taking the A and hashing that, right? So instead, if you just hashed JSON, the string JSON, you have a really good hash function. You could just use the first couple bits,
Starting point is 01:04:17 the first however many bits you want to determine, you know, the bucket that you're going to go in. And then instead of scoring just a single bit of I had something in this bucket or not, you go to that bucket and you look at the rest of the hash. Say your hash was 32 bits and you had five buckets.
Starting point is 01:04:40 So, oh, did I pick bad numbers? 27 bits left. And those remaining 27 bits, you could either use the first or the last. It doesn't really matter. Count the number of sequential zeros. Now, this part was really, really weird to me. I didn't understand this.
Starting point is 01:04:52 So take the number of, let's say starting from the left to the right, count the number of zeros in a row that are in the next 27 bits. And then instead of storing one or zero in the bucket, store the number of zeros so if the first digit was a one you would put zero if the first digit was a zero and then a one you would put one because there's one zero but you could have four zeros in a row and you would store that
Starting point is 01:05:18 in the bucket now if you were like me that number of leading zeros in a row. Okay, got it. The number of leading zeros in a row. And the clever probabilistic thing about this is that if you imagine that there... So your hash needs to be very good. So it needs to have a 50% chance of a zero and a 50% chance of a one in every bit position or as close to that as you can manage. Then what you would say is if you're going
Starting point is 01:05:46 to look at a stream of unique ids you would expect that the length of sequential zeros that you would get is a function of the number of items you've looked at the length okay let me say if if i said you're only going to play the game once you're going to generate one hash and look at it and i asked you to derive like what what is what would you be the expected number of zeros you would see it's much more likely to see one zero or two zeros in a row than to see 16 zeros in a row on the very first time yeah it's like it's like a coin toss right it's like how many times can you toss the coin and get all heads well like as the number of tosses go up that that probability goes down that's exactly the key to this so if you look at a given bucket and let's
Starting point is 01:06:37 say you only kept one bucket that's fine you could just have one bucket and you keep track of this then at the end of seeing all of the items you look at what the sort of like longest run of zeros you now and you say that i i believe that i've seen one over two to that many number of zeros that's the probability so then i've seen two to that many number of zeros items is my guess so yeah let me let me see if i can can rephrase it, paraphrase it, see if I have it right. So you flip a coin. There's a 50-50 chance you get heads. If you get tails, you're done.
Starting point is 01:07:16 And, you know, that's a zero. You get zero points. If you get heads, that's a point. And then you try again. You know, you keep going until you keep accumulating points or you're out, right. And so, you know, 50% of the time, you'll get you'll get zero points. 75% of the time, you'll get zero or one point. And so it just keeps going up like this, this hyperbolic ramp. And so like, you know, 99% of the time, you're going to get less than 10 points or something, right. And so what that means is, what you're saying is, you're going to get less than 10 points or something. And so what that means is what you're saying is if you did get 10 points ever,
Starting point is 01:07:51 then there's probably like 99% of the other trials there that you just didn't record because all you did was record the best score. It's like if you go to the arcade and the high score on pac-man like maybe they just rebooted like reflashed all the machines yep yeah and the high score of pac-man or actually this happened to me where um you know i'm obviously not super mr athletic or anything but i went to chucky cheese with my kids and i always play the football game you have have you seen this at chucky cheese you I always play the football game. You have you have you seen this at Chuck E. Cheese, you have to throw the football through the holes. Yeah. So I played the football
Starting point is 01:08:29 game. And I won the high score of the day. And I felt like for a moment, like, wow, like I should have like, you know, played in the NFL or something. And then like five minutes later, like some guy behind me, like, you know, got the high score of the day. That's because they just reflash the machines, right? So the high score, you know, there wasn't a lot of samples. know got the high score of the day that's because they just reflash the machines right so the high score you know there wasn't a lot of samples uh and so the high score was easy to beat um conversely if you go to like uh you know some other arcade or if you go later on in the day the high score will be super high and you can't even get to it and so you're using the high score as a approximation for how many people have played the game yes that's the yes that's the key and so i guess we didn't talk about that part here but you you caught on
Starting point is 01:09:10 to it which is you want to record the maximum you've seen so in this bucket you record the maximum and as you said as the number of attempts goes up and unlike your example where it's skill based every one of these should be random so right it's just like slot machine and it's the whatever anyways and so yeah so the higher that number that you've recorded you're going to make your guests go up now you could get really lucky unlucky however you call it and they vary you see only a single number and that number happens to have you know 100 leading, we said 32 bits, it has 27 zeros in it. And so you think you've seen two to the 27 items, but you really only ever seen one. And so this is, this is not good, right? This, this would make you have a very, you know, a high chance of getting an error because of this one problem. And so the second thing, which I started off with there,
Starting point is 01:10:07 is by having multiple buckets and then averaging them together, you take the first few bits, the first five, and you choose your bucket. And you use the last 27 and you do this thing I'm mentioning that we've talked about recording the high score.
Starting point is 01:10:19 And then by averaging all of the buckets together, if you've only seen one, sure, you're going to have this really large number in a single bucket, but all the other buckets are going to seen one, sure, you're going to have this really large number in a single bucket, but all the other buckets are going to say zero. And so you're going to pull it back down
Starting point is 01:10:29 to try to correct. And we're setting up for the second thing here in a minute, but this is called, I guess, like log log. I don't see a ton of evidence that this was really rolled out or used because there's some pretty easy ways to improve it here we'll talk about in a second. But this idea of recording multiple max scores
Starting point is 01:10:46 and then looking at the end and taking an average allows you to sort of approximate it in a very small way. So the other thing we didn't mention with log log and linear counting in both cases is you kind of measured, well, in linear counting, you measure this amount and log-log. You have these coefficients as well because you just sort of know on average you're likely to kind of see these things.
Starting point is 01:11:13 So you run some simulations and you kind of tune some constants to accommodate for kind of like edge effects, I guess you would sort of call it. So hyperlog-log improves on log-log. Well, one thing actually before before you jump into hyper log log i want to double click on something you said earlier which is you know i also went through this where um you know i felt like even just hashing i kind of felt like you
Starting point is 01:11:39 know how do you trust the hashing you know and, and even like, what happens if your data set is biased in a way where, you know, because you're maybe your data set all starts with HTTP colon that now like your hash function isn't pure and you know, you have more collisions or even like Mac addresses, you know, I'm pretty sure Mac addresses are, well, actually I, I don't have no idea, but I'm assuming that there's some kind of randomness there there's no way all you are pre-determined it's by like it's hardware vendor
Starting point is 01:12:10 sets the first few bytes okay got it but then within a hardware vendor at some point there's probably a collision right or are there there can't be like a global count of mac addresses while they're the factories like churning out tons of these. Yeah, maybe there is. I think there's not supposed to be, but probably there is. But, oh yeah. So whenever I see hashes or all these things, I used to feel like, well, this isn't kind of pure.
Starting point is 01:12:41 It's not going to work out. And how do you even know if you have a problem, right? And it turns out that, yeah, at scale, you know, you can't do anything very precisely. And so what you do instead is you try to measure, you know, the bounds statistically of some issue. And then you're constantly correlating between other things. Like, for example, maybe there's, you know, a one out of 10,000 chance that you overestimate your unique visitors by 10x, right? And so, and so one out of 10,000, you know, that means, you know, Google has been around for like 15 years or something so it's probably like a 50 50 chances it happened once right but what they're doing is they're
Starting point is 01:13:29 constantly correlating that with other things so you know if the viewer count goes up by 10x but the revenue doesn't move well then you know that might be an error that day in the viewer count and so you're constantly trying to cross correlate between different data sets, different signals, and then also auto regress, you know, correlate across time for a signal. And so it really just becomes about reducing the error and trying to reduce the bias and the variance of these signals. And so when you start thinking about it in those terms, then it becomes okay to kind of say, well, yeah, all of these errors are there. We're just going to measure them all in gross and try and get that measurement to be as accurate as possible. And so as they're making changes to
Starting point is 01:14:20 these filters, they're comparing the correlation to, again, just using our example, to revenue. And they're seeing, oh, the new value is much more correlated to revenue, which is what I would expect. As unique visitors go up, revenue goes up, and all of that. So you can actually measure all of these things and get better in an incremental way, even though it feels like a very helpless situation. That's a great call out. I will say also, I think a lot of these techniques we're talking about time and a place. So like your financials, you wouldn't want to do this. You probably do a different sort of way of the expensive query, but you can't maybe do that in real time. And then as you said, not only comparing it with other metrics, but even just also running, maybe like a second method that is less reliable or different parameters and checking that as well. So maybe it's less accurate, but it gives you a coarser bound, right? And says, you know, hey, at least I know that like, if I get a 10x measurement, I need to throw it out,
Starting point is 01:15:29 like that's, that's going to be wrong. So it can give me an order of magnitude kind of kind of swag. Yep, makes sense. Okay, so moving from log log to hyper log log, it's not a big jump. So the first one is for any buckets. So it's basically what happens with, you know, kind of very high numbers, like when you're starting to get very dense filled buckets, or very low numbers where you have a lot of empty buckets. In the case of very low numbers, we have a lot of empty buckets. You may consider, instead of taking the max counts and doing the average, where, like I mentioned, you have five buckets, you happen to get one that was 27 zeros, you're going to way overestimate it. If you have a lot of empty buckets, instead, maybe just estimate the probability of having that many empty buckets so what is the chance that you saw you know in the case your estimate there would be very high like a million or something rather than estimating a million what is the chance that you saw a million
Starting point is 01:16:20 numbers and had nine empty buckets this is going to be very low. And so you use that for sort of some corrections, as well as very dense, just like in linear counting, we're talking about the more buckets sort of like have bigger numbers in them, the more chance that you're sort of seeing these collisions and you're sort of seeing a different behavior of the numbers. And so handle those as well. And a lot of that, you just end up running by either computing the statistics or running simulations
Starting point is 01:16:50 or sort of handling that rather than kind of a single formula that you run, you kind of have a sort of set of, if you meet this condition or that condition or the other condition, apply a slightly different formula, but roughly the same thing. And that gives you hyperlog log.
Starting point is 01:17:08 So this hyperlog log allows you to perform these analytics, get an approximate answer. I think the number that I was sort of seeing in the paper is like for the majority of cases with these corrections, you see like 2% error. So if that is important, like in the case of maybe financials or something, you don't want to use it. But for the case of maybe financials or something, you don't want
Starting point is 01:17:25 to use it. But for the case of like counting how many visitors in a day, you know, if you don't want to be lying to people, if it's like a court thing, so like you wouldn't run it. But if it's just, you know, keeping track from day to day or hour to hour, as an example, you could run something like this and know that you're going to be generally close most of the time and of course a lot of those uh things are tunable no very cool it makes a ton of sense the final trick and the sort of like second thing that kind of like i don't know melted my mind a little uh and there's a variety of these problems and this one's going to get to jason's example so here we are at the very end and you foreshadowed by talking about, you know, sending it to different computers and having each of the computers handle it. So let's imagine we chop up our logs and send all of our log, you know, one tenth of each of our logs to 10 computers. And each of the 10 computers does the thing we just described with hyperlog log. What do you do at the end? Well, you can't sum the 10 results, right? Because if you sum the 10 results, now if you divided them like Jason said, where I looked at their username and sent them to a
Starting point is 01:18:32 computer based on their username, you could. But if I just chopped the first 10% of the logs, the second 10%, and I sent each of those, some people will have two different computers will have seen the same user potentially. And so you don't want to just add up, hey, each of them, so 10 times whatever number each of them, just kind of add them up. That won't work. That's not a good estimate. You need to dedupe them again. Well, the sort of mind-blowing thing, or at least for me,
Starting point is 01:18:57 is if you look at those buckets we talked about, and everyone's running the same algorithm, you can just transmit the final counts of the buckets and just run the same max operation. So if you run the same max operation across all of the buckets, and then do your final calculation, that actually works perfectly fine and is incredibly cheap. So if computer one says my first bucket has eight zeros, and computer two says, oh, my first bucket only has five zeros, then your final count for the first bucket is eight zeros. And it's the same answer you would have gotten
Starting point is 01:19:29 if you had sent all of the queries to only a single computer because of this, you know, counting, keeping only the max. So this max for individual computers is, if you take the max across all computers, the same as if they had seen them because everything- Oh, right. This was not, this is like, what?
Starting point is 01:19:49 No, this isn't going to work. It was my first inclination here. But it's this aggregating by Macs is the unlock. Because you're aggregating them all together by Macs, you can just run it at the end across all the computers and you have it distributed as many ways as you want but wait what was wrong with your idea of saying well just just adding up all of the
Starting point is 01:20:14 oh because there's duplicates across the computer so imagine the entire log one million entries is all username jason and i send each to 10 different computers each computer is going to estimate one let's say and then my final result say 10 the answer is actually only one right right so again for something like visits to google you may know on general how many times a given person visits in a day maybe you could do this via other means but if you're using this for an analytical engine that doesn't know how likely repeats are or not then this is a very difficult you know thing to solve a priori with sort of like specialized knowledge yeah that makes sense um yeah that's totally mind-blowing you know another thing that this benefits from is that each bit doubles the count. And so the difference between seven and eight bits is so large that if one says five bits and the other says eight bits, by throwing away the five-bit one, you're not throwing away that much because it's as i said it's it's uh it's
Starting point is 01:21:25 hyper logarithmic or whatever its contribution is irrelevant yeah right wow that is super cool i guess a couple things slipped over so one is that um moving from log log to hyper log log you also move from taking an average to tape to taking a harmonic mean so you take the harmonic mean of the buckets which helps now what is the harmonic mean harmonic mean is one over two to the n time and then you sum those up for each bucket and then you multiply by n at the end um and so that's used for rates uh so i try to look it up as well it's sort of it's like one step past my intuition about probability so I'll keep I'll keep noodling on that one and maybe have to get back to you
Starting point is 01:22:10 but they see it by moving from the sort of normal averaging that we would think about you know just sum them all up and divide by the count and instead moving to this harmonic mean instead they see improvements there and again I mentioned these coefficient factors and you would run simulations and so they give estimates for what they see for these
Starting point is 01:22:28 correction factors for high, low offset, these kinds of things. So definitely read the paper. But I would also say that day to day life, are you going to implement this? Yeah, don't. I mean, maybe if you have good, but one, we mentioned a bunch of things along the way here. Bloom filters, probabilistic counting, knowing to go reach for one of these if you don't need an exact answer. Second thing is a lot of databases have already like Redis and other things
Starting point is 01:22:54 have already implemented this under the hood for these kinds of operations. So you're likely already using it. And it's just for me, one of those mental like aha moments where the chance that this specific solution is what I need to reach for is very low. But understanding a bit more about the space I feel helps me sort of know when to go to the Internet to search for something I don't know. Yeah, so I mean, as far as using it, there must be some way in sql to say like approximate count distinct oh interesting i didn't
Starting point is 01:23:28 look at so every sql is different so let's let's see postgres this is this is fascinating awesome type postgres oh man same wavelength all right wavelength i mean it looks like there's a uh some add-ons you can kind of do to to kind of distribute a distinct count with hyperlog log on postgres i see i see a lot of articles about this so uh again yeah it looks like uh yeah i agree it looks like a plug-in so oh yeah this is so you install this plug-in in postgres and then you do hll underscore cardinality and so uh but i remember i think this is sawzall uh i remember there was some i do remember in my life typing select a prox count distinct like i did like that for some reason it is like just radiating through my mind so there's definitely some system that has it built in. Yeah, I mean, I think they're like probably in Hadoop
Starting point is 01:24:29 or this sort of very distributed processing of things. So unlike when you have it on a single computer, but you have it across multiple computers, doing large analytics, column data. I would imagine like Apache Arrow or something. Yeah, exactly. PySpark and and pi spark have a prox count distinct that's where i've done it um that's amazing i had no idea that that was
Starting point is 01:24:54 doing that under the hood that's like totally mind-blowing i don't know so so maybe right in if you uh like this detailed singular algorithm exploration. I mean, I guess we touched on a bunch of stuff. So maybe it's not that dissimilar as I thought it might be in my head. But Jason had to play along here. So he's at least acting like he walked away understanding this a bit more. No, this is amazing. I definitely learned a ton. No, we should definitely do more of this. Yeah, please write in, let us know, plus or minus. But I actually learned a ton. This is really satisfying. Another thing is, you know,
Starting point is 01:25:31 you know, when people get a CS degree and they learn about quicksort and merge sort and all of these things, they might say to themselves, well, like, you know, okay, like we figured out how to sort, you know, we're done. Like why even teach this to me? Like, why are you teaching me a one-liner in any modern programming language? And, and what you've kind of shown folks here today is that like, this doesn't end,
Starting point is 01:25:54 you know, like sometimes you need probabilistic things. Sometimes you're working in very limited environments. You know, there's all sorts of different types of counts and things you do and so and so computer science you can do it all day as your job and and like literally the science part of computer science and and there's a lot of fun stuff well i think that brings us to the end of another episode 169 wow. Wow. Yeah, pretty wild. Thanks everyone for sticking around. Yeah, we should definitely do something for our long-term patrons. I'm gonna do some digging on my end,
Starting point is 01:26:37 figure out if I can get that number of months subscribed and all of that. But thank you so much for supporting us all of these years. You've been able to continue to grow the community. I actually, I'm considering using some of the community money to buy Twitter ads. It does feel like our Twitter presence is growing pretty good organically. And so, you know, at the end of the day, you know, we try to get as many folks interested in programming computer science as possible. We couldn't do that without all of your help. And so we do put, you know, every dollar that you donate towards trying to accomplish
Starting point is 01:27:16 that goal. Yes. Thank you to all the people who have stuck with us for, oh, I always forget how long it's been, but a very long time, including Jason. Thank you, Jason, for all your hard work. Thank you. Someone commented asking, we did two shows on Bitcoin. One was episode seven or something, and another show a few years later.
Starting point is 01:27:39 And they were asking if we bought Bitcoin at those times. And I'm sad to say the answer is no, we did not buy Bitcoin at I think they said 50 cents, or $500. We didn't buy Bitcoin at either of those prices. I didn't. But after that show, I did install I got some Bitcoin out of a faucet. But unfortunately, it was like point oh one Bitcoin or something. So that was the only bitcoin i had left from that so yeah we we close we we answered the that's definitely the most asked question i think i think about once a month we get asked that someone asked it on discord last week but i do get a lot of email hey are you guys did you guys buy a bunch of bitcoin. We didn't do that. All right. That wound has been sufficiently salted.
Starting point is 01:28:31 But thank you so much, folks. It's awesome that we have such an outpouring of support. A bunch of folks asking for different languages. We will definitely get to that. But I also want us to cover algorithms. This was a really exciting kind of new branch
Starting point is 01:28:46 for the podcast. And so thanks, Patrick, for doing the background homework on this. All right. All right. Catch everyone later. music by eric barn dollar programming throwdown is distributed under a creative commons attribution share alike 2.0 license you're free to share copy distribute transmit the work, to remix and adapt the work, but you must provide attribution to Patrick and I
Starting point is 01:29:27 and share alike in kind.

There aren't comments yet for this episode. Click on any sentence in the transcript to leave a comment.