Programming Throwdown - 169: HyperLogLog
Episode Date: November 27, 2023Intro topic: Testing your car batteryNews/Links:Tech Layoffs still going onhttps://www.sfchronicle.com/tech/article/google-layoffs-california-companies-18465600.php Real-time dreamy Cloudsca...pes with Volumetric Raymarchinghttps://blog.maximeheckel.com/posts/real-time-cloudscapes-with-volumetric-raymarching/Robot Rascalshttps://en.wikipedia.org/wiki/Robot_Rascals Meta Quest 3 https://www.theverge.com/23906313/meta-quest-3-review-vr-mixed-reality-headsetBook of the ShowPatrick:HyperLogLog Paperhttps://static.googleusercontent.com/media/research.google.com/en//pubs/archive/40671.pdf Jason: Eureka! NVIDIA Research Breakthrough Puts New Spin on Robot Learning https://blogs.nvidia.com/blog/2023/10/20/eureka-robotics-research/ Patreon Plug https://www.patreon.com/programmingthrowdown?ty=hTool of the ShowPatrick: Techtonica: https://store.steampowered.com/app/1457320/Techtonica/ Jason:ESP32 development board: https://amzn.to/3Qpmb20 WEMOSTopic: HyperLogLogMotivationCardinality CountingLinearCountingHash + expectation of collision based on how fullBloom FilterLogLogUse first N bits as bucketUse max sequential 0s in each bucketAverageHyperLogLogHandle empty bucketsUse correction factor like linear counting for low counts (number of empty buckets) and high countsDistributingTransfer bucket counts ★ Support this podcast on Patreon ★
Transcript
Discussion (0)
programming throwdown episode 169 hyper log log take it away, Jason. Hey, everybody.
So, Patrick, are all your cars electric? Any of your cars electric? Do you have an ICE car?
Actually, I have one normal car and one plug-in hybrid we discussed.
So it does have a battery, but it also has a regular 12-volt car battery.
Oh, yeah, that's what I was going to ask you.
So even the hybrids have regular car batteries uh the hybrids for sure i think even electric cars still often have like 12 volt sort of normal car
batteries for a variety of purposes i don't know okay it does i mean it's probably better than
trying to like convert you know from whatever voltage the the engine battery uses it looks
like here they do just a quick
google says that yeah they have the normal lead acid 12 volt batteries so you know how do you know
when your car battery is dead do you just drive until the battery stops working or what do you do
i mean in an overall thing i guess yeah the car can't start so i don't know how that works in
electric your accessory systems like when you sit in the car before you start it, it probably doesn't turn on.
Yeah.
So like in my experience until recently, I have never tested my car battery.
And so basically what happened is I will just drive until the car won't start.
And I will have to go and desperately figure out how to replace the car battery.
It ends up being this big thing.
And it's just gotten worse as like we've gotten busier over the years. So it culminated when we
were about to all go camping and the car didn't start. And we've had the same battery for a while.
But part of it is, you know, you never totally know without a tester if it's if it's the battery or if you left the lights on
or if there was some kind of glitch that caused the energy to drain like uh um we had the key in
the car uh like not literally in the ignition but we had the key next to the car and i feel
like just having the key so close to the car it somehow got confused and it thought we were in there or something um
because uh you know i'm actually so i guess to answer i'm i'm ordering a car battery tester
which is only 20 bucks i thought it was going to be like hundreds of dollars because you know when
they they anytime they do anything at the dealership or you call someone to do something
it always seems like
oh like this must be some amazing device that you know must be manufactured in a laboratory
in the middle of the earth or something and so no it's like you can get a car tester or battery
tester for like 20 bucks do you have to disconnect the battery not remove it but at least like
disconnect the terminals like the rest of the electrical system for it to work because that's a
that's an annoying thing if you do yeah i don't think so
right because as long as the car is not on it's not drawing it's not drawing any amps
so it should it should be an insignificant draw so we're gonna find out i'm gonna get the car
battery tester on tuesday tomorrow and um'm going to see what the instructions say.
I'll let you know.
I'm pretty sure, just putting my engineering hat on,
that as long as the car is not on...
Well, the car has to not be on and not even on accessory mode.
The keys have to be far away and everything.
I could see that.
But it's not a voltage thing right so the thing
you're describing is can you pull enough amps temporarily to to turn the car over right so
it's like measuring like rush current that it can deliver yeah so this $20 thing what it needs to do
is draw enough amps for a certain amount of time and make sure the voltage doesn't drop right i think that's but it can also
tell you it can actually tell you whether the battery like all the cells are just old or if
you have one bad cell i don't know what you would do with that information but it could tell the
difference this is probably a voltage based thing knowing that all the batteries are made from the
same chemicals or whatever so yeah that's right yeah i think if you have one
dead cell then you would have you'd consistently pull a certain number of amps but at a lower
voltage well you'd have one cell voltage drop missing so you like i don't know how many cells
are in a 12 but like you would go to 9.3 and that's like all your cells are good except for
one yeah versus if you just had like 10 instead of 12
then okay so that your your mission should you choose to accept it is to crack open the tester
to look at the circuitry figure out what it's doing yeah after the last episode where i burnt
the raspberry pi and i literally have a imprint on my finger from the SD card. I don't know if I'm going to crack open the battery.
But but yeah, I guess, you know, I'd be interested to hear what people out there do.
So if you just drive your car until it doesn't turn on, if you have some regiment that works for you and you've never had this happen to you either end of the spectrum let us know um you know
i just you know kind of like it didn't ruin our camping trip but it definitely set the camping
trip back by like multiple hours so so i kind of feel like i needed to improve this area of
my life so we'll see how it goes if you have a way of doing this less though actually patrick do you do anything or do you just drive
the car until it doesn't turn on yes that one which is horrible i i think when i take it for
its regular oil change i think one of the things in the checkpoint is they hook up their expensive
battery tester to it and then try to convince me to buy their battery um so that's been my i do
regularly get the oil changes on
schedule um i don't know if that actually actually is super critical but i for whatever reason i just
choose to do it that way and so yeah i hope you did the same thing i get the all changes on schedule
but i've never as far as i know i've never had a mechanic tell me i need to change the battery it's always just died oh man all right uh on to the news your first up all right so you know we never really talked
about this uh this is a news article about you know google continuing to to do layoffs
um but you know the particular article isn't what i want to talk about. I want to talk about tech layoffs in general.
My view of it, there are layoffs.
There have been layoffs for, what, probably 18 months now.
It kind of oscillates.
So there's layoffs, and all of a sudden, there's even articles about companies laying people off and then desperately asking them to come back.
It's a very volatile situation.
You know, my overall take on this is that, you know,
I think that it's a really complicated situation because on one hand,
I do think people should choose degrees and majors and careers that have an addressable market,
right? So if you get a degree, well, depending on what your goal is, but if your goal is to
enter industry and you get a degree in American history, that's going to be really difficult. I mean, there are, you know, you don't need to do what your degree is. But, you know, it's just making things a little bit more difficult.
And so someone might say, well, with these all these tech layoffs, maybe we shouldn't
get a major in engineering or computer science. I think my overall take of this is, is, you know,
engineering is still an extremely strong discipline career-wise relative to the spectrum of degrees or things you could study.
And so I wouldn't be really phased by this.
If you're a high school student, college student, I would choose engineering if it's something you're interested in and not really be worried about the layoffs. I wouldn't make that a really big factor in which college major you're going to
pick. What do you think, Patrick? Yeah, I agree. I mean, I would love to say that
if you look back to other periods in time, like 2001 or or other the great financial crisis and that the layoffs were
completely merit-based but i mean i think the unfortunate truth is when a company goes to lay
up when a company hires people hire against different reasons different performance targets
they might be over specializing over generalizing hiring less quality than they need overpaying for
whatever there's just so many variables that go in
i think when a company goes to do layoffs it becomes very difficult to sort of like run some
strict merit based thing because also you layoffs normally come from top down right normally the
bottom people aren't doing the layoff so it's the top down that hey there's a target to hit
and it but bottom up there's this like obfuscation that happens where everyone says their team is overperforming
because it's sort of a game theory thing right
like you don't want to say oh i have a team full of underperformers
uh and that's why my you know that sounds bad on the manager right so you
get this like information problem and so i think when
the layoffs happen it unfortunately it isn't always very
fair um but to jason's point i do agree that I think when the layoffs happen, unfortunately, it isn't always very fair.
But to Jason's point, I do agree that on a whole, I think the industry is fairly robust and there are other people hiring.
But times are exuberant and times are lean.
And I think what often can happen is if you are in the wrong part of the hype cycle when
you come out of college, you can have sort of wrong expectations like, oh, you know, I'm going to start out and make X dollars
per year.
And then if times get lean and you're sort of waiting for a better, better offer, you
know, that matches what you heard.
It's like, well, unfortunately, the situation is dynamic and like, you know, it's not necessarily
true that those offers are still there.
And so you got to kind of you know be aware
of that but but on the whole i think you know as an industry there's still growth here i i mean
people kind of say which they've always said oh ai offshoring whatever there's always some threat to
computer science or engineering you know disciplines but i don't i don't i'm not
over worried about it myself and i would still encourage people who are interested in it to pursue it.
I think there's lots of adjacent jobs you can do if you, you know, if things that really,
really lean, there's plenty of other things you could apply the skills you learn in getting
an engineering degree to.
And so also, though, I will like give a shout out for the finance stuff we talked about
sometimes, which is the importance of having savings and not, you know, living paycheck
to paycheck. And it can be hard when you live somewhere expensive, but making sure that if
that does happen, where you have to sort of go a few months before you can land another job and
have the freedom to not have to just take the first thing that comes across your desk, then I
think that's a huge benefit and gives you a little bit more confidence to kind of
seek out the thing you're looking for. But yeah, I agree. I'm not overly concerned about it. I think
like this latest round that you're mentioning are a lot smaller. So the news likes to hype it up
because everything's dramatic. But, you know, there have been serious cuts. I think the cuts
are tapering off. We're probably seeing the last of them. But you'll always see companies wind down, but new companies start up.
And, you know, it's kind of an ebb and flow.
And so as a whole, I think the health of the industry in terms of, you know, compensation
and stuff is probably steady, probably not as exuberant as it was two, three years ago.
But holding steady.
Yeah, yeah, I agree with that. Yeah, I think you brought up a really good point, which is, you know, and I know we've talked about this before, but when you start out, let's say you're in college right now, you're looking for different careers.
Your goal should be to learn as much as possible. And in fact, like most companies will have source control, will have peer reviews,
will have like a lot of different disciplines that will be really important for you to pick up early.
And so you really can't, as far as I know, you really can't go wrong. Obviously, I've only had
one first job as one human being. But you really can't go wrong with your, I've only had one first job as one human being. But you really
can't go wrong with your first job. You can always find stuff to learn. There's just a huge
expanse of things that you need to know to be a professional developer. And you will learn
at least 70, 80% of that almost anywhere. I would say that maybe the place that's the most risky would be to join something extremely early.
I heard a hilarious story.
I don't think I've ever talked about this on a show.
This was back when Java was really popular.
I had a buddy who worked at a really small company.
It's only just a handful of engineers.
Most of them were straight out of college.
There wasn't really anybody who has experienced. And, you know, you know, Java used to have
these things called jar files, right, which are basically zip files with some metadata
that told you like, which Java file to run first, and I'm probably butchering that, but
it's basically what it was. And so most people don't know that jar files under the hood
are just zip files, but these folks did.
And so their continuous deployment strategy
was to build the code on their computer,
guess at which dot class files changed,
based on which Java files they changed,
and copy those into the zip, into the jar
that was running on the deployment,
like on the production server.
So there was no source control.
Everyone just copied, you know, source files to each other
over like Windows shares.
And so when it was your turn to update the code,
you would open up this jar file
and start dropping Java class files in there.
So that's a difficult situation to learn
continuous deployment and source control and all of that.
So that might be something I'd be concerned about.
But most medium to large size companies most you know medium to large size company actually
all medium to large size companies and most small size businesses will be just fine
all right well taking a hard hard turn there from the seriousness of that uh but but i agree
agree with what you say uh my next news article is my continual stream of shader descriptions, which is in my...
I've given up the belief that I'm going to build a killer game.
So now I'm just like, I'm going to build a killer shader for making a cool visual demo.
So this is real-time dreamy cloudscapes with volumetric ray marching.
Ooh, that's a mouthful.
Check this link out.
This is by, I'm guessing by the name of the blog,
I think it's by someone named Maxime,
Maxime Heckel is what I would guess there.
But the, yeah, I think that's right.
And they have a article on their blog
about doing a couple of things.
First of all, rendering clouds,
as the name says in real time.
So, you know, doing it nice and fast. It's a beautiful thing if you check it out. But also interesting is
if you sort of click around through the linked articles, they have a ton of really just
interesting examples of doing shaders and how to kind of make them work. And so shader is
a code that roughly runs on your GPU and, you know, sort of has instructions per pixel about
what to
render uh and it's source of a lot of like how do you call like computer generated art demo scene
style things which we've alluded to before um but also this discussion of ray marching has come up
a few times into i feel like those things ebb and flow i think it's the algorithm just like honing
in on me but like whatever um so i've seen a lot of youtube videos or whatever about ray marching recently and so it's on my like list of things to look into which is a way of
sort of thinking about ray casting and you know understanding how close you are to something
uh sign distance functions uh anyways tons of good quality things like a jumping off point
so uh i one day aspire to make beautiful like like mellow screensaver like shaders like these articles I keep touting out for everyone.
This is amazing. One PSA. Don't look at this in Firefox. It will cause cause all sorts of heartburn. But on Chrome, it works just fine. Do you do this, Patrick? I use different browsers on my work computer for personal and for work things
that way it's like the most isolation isolated environments i could think of oh yeah i i do do
that but uh it's sort of a more because it's a requirement rather than uh that's okay um
but yeah on chrome it works just fine. That is so cool.
I wonder how many of these things are commodities.
Like if you were to use Godot or Unity or one of these things,
do they just have volumetric ray marching or is it too cutting edge?
Oh, that's a good question.
It's like you could probably find out implementations of it and you know pieces but i think this is one of those things that you kind of piece together the sort of almost
algorithms into building the look of your game or your your you know demo or whatever so you may do
things like compile various ways of doing water rendering and cloud rendering if you're doing
you know this kind of thing and so i think you might be able to pull down certain it's not like shaders i don't think
that you know my understanding is it's not like pulling down a model like downloading a model and
like you know rendering the model and it's rigged and you do animations like those things i think
you have common utilities for the shaders are more like the sort of look and feel and so you can
probably have a jumping off point with them but you sort of combine them in your own way to give your game its unique style um you know or realistic you know
kind of thing but i think a lot of people make independence are pursuing a very stylistic
approach so that kind of looks unique to them or whatever so um rather than you know hyper
realistic is a really tough game to play in yeah that totally makes sense um all right my second news
is uh i've been diving into okay so what was your first computer patrick first thing with a keyboard
it must have been one of those is that like an apple 2e okay yep yep we had those in school i had a commodore 64 and uh you know i also had an apple
tv at school and i've just been going down memory lane on old like original games like the first few
games ever played um one of them was this game called street beat where you ran around with a
boom box uh trying to uh hit people in the face with music notes to make them
dance. And so you actually, the way it worked was first you had to find a song and the songs were
like in people's houses. So you'd see a glowing door. That was a sign that you,
this person had made an awesome song you'd go in his house you
take his song and then you turn on your boom box and hit people in the face with music notes
and once you've hit a certain number then the song was like you know i guess like proven to be good
or something baked in or whatever it is and then you would take your your baked in song to your house your own personal house
and score a point and so i don't remember how many points you have to get but um anyway so that was
street beat um i came across one that i played with my my family when i was really little that
kind of blew my mind and that it was really a interesting crossover that I don't think I've seen since called Robot Rascals. And so the way Robot Rascals worked is it was a card game and a computer game
all in one. So the cards were your private information that no one else could see,
and the computer was sort of the community, you know, information. And so you would take turns playing
this, this game on the computer, but you would have your private information. And, you know,
you try to based on what other people were doing, try to guess at what their private information was
so that you could exploit it. And I just, yeah, so many good memories of this, but I haven't really seen anything like that.
There are some companion apps.
So if you're playing Monopoly, for example, there's a companion app where you don't have to pass paper money around.
There's things like that, but those are more assisted tools.
I haven't really seen a board game app kind of you know crossover like like this
um have you patrick have you seen anything like that yeah i mean there are a few so i think
you know i've played like a ticket to ride where you know like everyone has their tablet and
you know the screen is rendering the common thing but you can also do pass and play or whatever
where like the color oh i don't know how people anyway the color of the cards you have you know it shows at your turn there's also i think one for
suro um wait hang on i don't know maybe i don't know saying that you're okay so what i'm saying
is you have physical cards so so you're playing a physical card game but then the computer has
like the other half of the game you know so like the computer
doesn't know it's like combined like hybrid so it's not like you're playing the board game
as a okay no i missed this okay so like you have like a deck of cards like holding in your physical
hand right right oh wait what how did and so you have to enter information reliably into the
computer so yeah so like you could uh like basically like information reliably into the computer? So yeah, so you could basically tell the computer,
I have a right to mine this rock.
And the computer has no way of knowing that
because it doesn't know physically what cards you have.
So it just assumes when you say something, you're right.
But yeah, you're also playing with other people.
If you have a right to mine this rock
and someone else is just looking at that card,
then he knows you don't have it.
So you can't really cheat, per se.
But yeah, you need the physical and the virtual
in the same space.
I think I have heard of something like that,
but no, I've never played it. never, I never played it where like,
yeah,
like the,
the engine of the game can be very complex because the,
all the accounting is done by the computer,
but you still are playing a physical game.
Right,
right.
Exactly.
Yeah.
Yeah.
And that's cool.
Yeah.
And this was like an old,
old game,
right?
So it looks like you were saying it's like,
I definitely played it as a very little child it came out uh 1986 so i was actually four when it came out i mean i played it i was probably
like eight when i played it but very cool wow that's nice yeah so folks out there if you're in
game development this would be like a interesting area i don't know if it's sort of like just you
know past its usefulness but i just i feel like there might be something cool there we'll see
maybe someone could write in and say what some contemporary equivalent is well i mean an
alternative to what you're proposing is you could go into virtual reality and the game could just
be sitting on the table in front of you and if you wanted uh sort of a newly released virtual reality headset you could use the new meta quest 3.
no i don't i don't but i just bring it up because i feel like it's one of those a couple like
uh you know space travel uh random finance threads uh virtual reality. There's like a few things I feel like we keep weaving through our podcast over the years.
And so our podcast is old enough that we've seen some of these things mature.
And so interesting that MetaQuest 3 is like a sort of, I guess it's the latest released
hardware.
Of course, you know, I think there's the Vision Pro, which people are talking about coming
from Apple next year, I think is what they're saying but um for the meta quest 3 this is out
you can buy it and the the sort of like it does a lot of things right like it just revs everything
but the pass through where your cameras on the front that are sort of eye distance apart passing
through to you on the inside so that you can do the mixed reality this is the thing
that you know everyone really wants to make happen is the augmented reality the mixed reality uh and
so you are seeing which is kind of the funny thing the what happened when google glass was coming out
which is like people sort of out in public using their headsets in weird places to you know record
record stereo video yes but also to like be somewhere but also not
be in that place and so it's sort of uh this interesting uh i guess you know zone between
this being a real thing where you just put on glasses and see it because you're still wearing
a headset people are very aware of what you're doing um but yeah i i know i do play still my i have a quest two i i want to
call it an oculus quest but they deprecated that but to be clear it was like on the branding of
my box i feel like i still call it that uh anyways the quest two and i've really enjoyed it i feel
like the they sort of figured some stuff out the games you know we play we keep playing and and
we've talked about that before on the show you know getting exercise in the game and just generally
how immersive it is i've been playing the i guess i could have made that my
tool of the show i didn't oh well uh which is i expect you to die number three um which i told
to someone and they looked at me very funny like what kind of horrible game are you playing it's
like no no no it's like a light-hearted puzzle room uh so anyways i expect you to die number
three i feel like you know people have kind of figured out the formula a bit for some of
these things that work and it may be your cup of tea or not, but excited to continue
to see stuff coming out in this area.
I don't know where it's going.
I don't know if this is like the next big thing or not, but definitely excited to continue
to see how it shapes up.
And at least, you know, for me, you know being a casual uh person involved here uh you know
enjoying some good games and entertainment as these things mature yeah i mean i'm one of those
people too that plays the the oculus the the oculus quest to um at least once a week sometimes
even like multiple times a week and uh but you know something that like uh it's kind of interesting you know i i
play basically the same two games that i've had almost since the beginning and so um you know i
don't know how much of a commercial success it is um but but it creates a ton of value um and
basically for me i just play boxing and sword fighting. And, you know, originally, I got it to where I was
playing boxing on the harder and harder difficulty levels. But I realized two things. Number one,
you know, I'm not really very, like, fast in terms of reaction time. And so number two is like,
when you start whipping your head around, it's just not very comfortable. So I thought, okay,
I can make this more difficult by adding weight, because i know boxers do this in real life right they have like really weighted
gloves when they train and stuff and so i got these uh wrist weights and i've been kind of
just adding more and more weight and playing these games on the normal difficulty just with more
weight um it is amazing like number one how much harder it is even with
one pound of weight on each wrist you'll be amazed yeah you'll be amazed at like how slow
you punch and how slow like you get your hands back in a guard and all that um and i got this
is kind of ridiculous but i'm up to like five pounds of weight on each arm so i have these uh
they're meant for your ankle, but I put them
around my wrist and I have another one that has a thumb loop. And so I have these like basically
two weights on each arm and I'm doing boxing and it is like a heck of a workout. I mean,
it's just, you'll be sweating bullets up. You do it. It's, it's a ton of fun.
So you're ready for a street brawl is what i'm hearing yeah i'm talking
to a friend of mine who does he's kind of a hobby blacksmith and he told me that um i always thought
and i always thought that that swords were like 30 pounds uh but no like like swords you know in
real life like a broad sword or whatever is only like six pounds and so um now you know granted there's even more
of a of a cantilever effect because the sword's weight is you know not even close to your body
but but but i'm starting to approach what it feels like to hold a real sword um and uh it's it's it's
a ton of fun so if you get stuck in one of those time vortexes or whatever, and you end up in medieval times, you'll be ready to go.
You're ready to rock.
Yeah, I think the challenge is, you know, I'm used to moving through people.
I'm not used to people actually existing in a physical plane.
So as long as it's a lightsaber, you're fine.
Yeah, as long as nobody hits me either.
Every plan is good until you get punched.
Yeah, Mike Tyson, right?
Yeah.
Yeah, random sidetrack.
Okay, I wanted to go back to one thing you were saying,
which I don't know if we talked about on the show,
so I'll mention this just really quick,
because we've talked about it before.
But when Activision was on trial,
that was earlier this year, I believe,
it came out that there are a million PlayStation users
who literally only play Call of Duty.
Literally nothing else.
They have their PlayStation and they only play one game.
And something like there was 6 million who spent 70%
or more of their time playing a Call of Duty.
And so you're like, I only play one or two games i'm the
same i feel bad if i just keep playing the same game you know factoria um but uh you know if you
just sit there and play all your but actually i think you know this is quite normal is what is
is what the statistics show is people just end up really liking one game and for very long periods
of time they just play a single game.
Yeah, that's wild.
I mean, it's very hard for the industry to make money if they have to sell those units.
I think they sell the units at a loss.
And if someone only buys one game,
it's just really difficult.
Yeah, no clue.
All right, time for book of the show.
All right, so I'm going to go fast
because mine's a cop out.
Mine is the paper we're about to talk about today.
So we're talking about an algorithm, HyperLogLog.
Stay tuned.
And so my book of the show, which isn't a book, it's a paper.
And it's a paper by Google Research, which is where HyperLogLog was sort of written up.
So check that out.
If you're not a fan of reading papers, I know Jason reads a lot of papers, but I don't.
I find them very difficult.
I don't prefer them. I don't like the way that they're presented. I think they're pretentious,
but whatever. Anyways, all the PhD people are coming after me soon. Anyways, but I find that
they're not written for clarity, but written to sort of affect a certain style. This paper is
no different, but there is some good stuff in here. There are, you know, good graphs.
And even just, I've just given up
that you don't have to understand
every line of a paper,
or even like the notation.
Just kind of like doing it once or twice.
And if you sort of go through a couple of papers
and people will often reinterpret a previous paper.
So for papers that are sort of seminal,
you can kind of read a later paper
and get a better description.
And so anyways, don't feel bad if you're not a paper reader.
You know, I would say as part of like, you know, some small percentage of your education,
if you're not a sort of master student or PhD student or PhD graduate, still don't shy
away from looking at papers every so often when they come up and just sort of like reading.
You can occasionally glean really interesting tidbits or or insights not always there's a lot of junk
uh you know anyways hyperlog log paper that's my that was a weird rant for a book of the show but
there we go no i think it was really important i mean we must be on the same wavelength because
my book of the show is also a research paper and i didn't know this. I saw it in there and I thought it was an actual book.
Yeah, I went into today's show thinking,
man, I've read so many research papers
but made zero progress on my Audible books.
I'm just going to talk about a research paper
and then here we are.
I'll go pretty quick too
because I want to get to the content.
But basically, NVIDvidia has this really interesting
thing where they're using large language models to try to fabricate goals for this uh reinforcement
learning robot so for example they'll ask chat gpt and but i'm just starting to get into this paper, so I'm not going to do it justice right now.
But they'll basically ask ChatGBT for, like, what are certain things that we can do to show that we kind of have good ambulation?
And then they'll kind of just keep riffing on that. So I do feel like one thing that's always missing with reinforcement learning
is you have to train from scratch versus, you know, for example, if you're driving a car or
writing something down or whatever, you're starting from this base of all this common
sense reasoning. So, okay, here's a good example. So good example. So we have the AI to play Atari.
And so the way it works, if you want to play Pong,
is you start off just moving the paddle randomly and losing.
But then sometimes you just randomly hit the ball
and you get a point.
And so that creates a target. And then you start marching towards that target.
This becomes a really big problem
in something like an RPG.
So if you want to play a role-playing game with AI,
you're going to have to like try every spell randomly,
move around randomly
because you're not drawing
from a common sense reasoning bank like like you don't the ai
doesn't know that well forests usually have more dangerous enemies than roads because roads are
generally guarded or or fire works well if the enemy looks like he's made of ice like these are
all things that you could just do on the first try but the the AI has to trial and error its way through.
And that's why you don't see AI playing Dragon Warrior, for example, because there's just too much common sense that has to be learned from scratch.
And so this interested me because I feel like maybe we can somehow draw from the common sense that's been accumulated in something
like chat gpt um or gpt4 i think there's something to that like imagine if you somehow just fed the
visual and the textual content to some type of embedding and then trained on that then maybe
you know like fire one and fire two
would go into this embedding and you would just learn in general some things about fire
the fire spell or whatever um so i feel like there's something there but it's it's very early
days that's interesting so rather than like combinatorially exploring the space and finding out that like, if the attribute ice or whatever is attached to a enemy,
then use a fire spell.
I guess I could use a Pokemon.
Maybe it's a good example.
Like,
Oh,
visually this thing looks like it's frozen or,
you know,
has an ice type associated with it.
Or I've memorized that.
Like I should use a fire spell.
Like you said,
that's done that way,
even though it's an abstract concept, there just rock paper scissors right but that transitive is encoded to you as a human
because it maps to something you understand which is like you know if i have something made of wood
something with fire will destroy it and would have a very hard time fighting something made out of fire and so you're sort of saying you could kind of like get that understanding by combining two approaches something that
understands the context being given to the human and then the normal you know sort of yeah
interesting yeah exactly exactly like i mean of course you could well i don't know of course i
mean you could theoretically cheat if somehow you could
read the game's ram and figure out like okay this this character has this you know ice attribute
right but like that's not how humans do it like humans are just looking at the picture
and so to really play final fantasy or dragon warrior these other games barely like to show
it that an ai that is
really representative of solving the problem it has to be able yeah to just look at things
and and infer based off what it knows about the universe
so i guess the equivalent would be if a computer designed a game that it thought was interesting
based on the rules of the game and then presented to as a human you would be like i this is not fun this is just like some abstract concepts
uh the ai would be okay you would be on a more equal footing but because game designers are
humans designing for humans yeah you kind of you got to cross that chasm before or you put a
computer on equal footing yeah i mean imagine like a poco they actually have these
where it's like uh they randomize the game have you heard of these game randomizers yes yes yes
yeah and so they you know imagine a game randomizer for pokemon where it just randomized all of the
strengths and weaknesses of all the pokemon it would be just terrible because it's like oh this
the turtle that shoots water is actually you know know, a fire, you know, entity.
And it would just it would just be really frustrating.
Like, that's how the AI feels.
Am I supposed to empathize and feel bad for the AI, Jason?
Help me out here.
Yeah, I mean, you know, it only makes billions of dollars, controls the stock market.
I mean, you should feel bad for this thing.
If you like takes like that,
you should subscribe to us on Patreon.
If you're not an AI,
you should subscribe to our Patreon.
We really appreciate it.
If you are an AI,
you can also subscribe to our Patreon
and fund this
very important source of training
data that's right
so yeah thanks to all the patreons
we had a lot of people stick around for a very
long time so it's always an
encouragement and you know
thankful for the support
yeah definitely
alright it's time for the tool of the show
well I'm returning to form and giving another game.
This is, I mentioned Factorio earlier,
this is in the style of,
I guess it's closer to Satisfactory
if you've played Satisfactory,
which is, you know,
there was a kind of a lot of stuff
routed up around the Minecraft culture,
you know, sort of Terraria and several others.
And I think we're seeing kind
of the same thing with i don't know what you call them i guess factory building games i'm not sure
what the official i think that's right like base building plus factory building uh and so tectonica
is a new one i've been playing on my steam deck actually um many of these games like satisfactory
actually don't really have very good sort of controller support and people do play it so
someone's going to write in and be like, I use
the touchpads for the 37 shortcuts
that you need.
You know, it's not something you would sit down
and expect to play on your Xbox.
Recently, Factorio added that.
That was actually when I picked it up and
started playing on my Steam Deck. But I've been playing this one
Tectonica. And so all these add a
twist. So this one is 3D,
which adds an interesting dimension to your ability to one tectonica and so you know all these add a twist so this one is 3d which is uh you know
adds an interesting dimension to your ability to uh you know build your factory and and how to align
things because it's in a first person sort of 3d 3d thing uh and then the second thing is that uh
you're underground so you're in caves and caves of course like limit how you can sprawl your factory
even kind of more than normal.
But you're also given, you know, drilling equipment, basically.
So you can carve out, you know, paths to, you know, extend your factory line or add stuff.
And, you know, you need plants to fuel some of your, you know, combustible uh like smelters and uh so you need to you know to hydroponic
growing because uh you know it's a pain to run around and mine mine all the mine gather harvest
harvest yeah bring on the walls uh so anyways uh tectonica it's still early access so i guess in
theory it's buggy i did i did actually hit one bug on my computer um but in general very playable
i've not hit like some nasty crashes or anything.
And a lot of these games, interestingly,
are satisfactory.
I think it's the same thing.
Still technically early access,
early release, whatever it's called.
And so it's not formally done yet,
but very, very playable
with beginning and end game and all of that.
And so Tectonica also is nice
because it has a...
I mean, you may not like the story. It's not an rpg level story but there's definitely like a story to it
about what's going on and sort of like accomplishments to work your way through that
that are sort of very specific to unlock a narrative that goes along with it so i'm enjoying
it tectonic i checked out oh on pc very cool as far as i know i think xbox pc i don't know about playstation
oh very cool check it out yeah i love factorio and satisfactory so this is right up my alley
i'm gonna give it a shot all right my tool this show is uh the esp32 development board have you
heard about this patrick yes you have okay i'm gonna say what I think it is. I literally just got one. I have it. I've actually two of them right here. I'll
hold it up for the audience to not see because we don't record video. But actually, that was useful
because I was wondering which development board you have. So I have a tiny, tiny one. Yeah,
it's like the size, maybe the size of a stick of gum, basically. Okay.
And yeah, and it's, you know,
the thing that caught my attention was that it has Wi-Fi.
So it's, you know, the Raspberry Pi Pico W,
I have a couple of those.
Those also have Wi-Fi.
Those were significantly more expensive.
This is like extremely reasonably priced.
I have a link to the actual one I bought and I bought, yeah, I bought two of these for $12.74 US. So it was like $6.20 a piece or $6.70
a piece. So yeah, very reasonable. And I haven't tried it yet. They're still in their vacuum sealed bag or static bag, whatever it is.
But I definitely want to give it a shot.
It supports Arduino and MicroPython, which is pretty cool.
So you can use either of those.
Yeah, the Raspberry Pi and Arduino, they only support their own thing.
This one supports either.
I'm generally a little leery of things that
are just so open-ended
that maybe it won't do any of those well.
The instructions
that came with it
walk you through how to set this up
for Arduino, so I might just stick with that.
But
yeah, just the fact that I can get
Wi-Fi in a small form factor is
really neat. I think next year what I might do is I have a one of my Eero, you know, Eero is a mesh Wi-Fi thing.
One of my Eero capsules is pretty close to the front of the house.
And so I'm thinking I can actually have robots in the trees next year for Halloween.
And they'll get this wi-fi packet and when they get
the green when they get some wi-fi packet they'll swing swing from the tree like some giant ghost
or something so um that's my plan um uh last halloween which was you know just a week ago
from when we're recording this um you know i i actually went down a step i i only did about half of it um things have just
been really crazy for me so i didn't have a lot of time to invest in it but i'm gonna try and make
up for it next halloween i'm gonna bring back all the robots i didn't do this halloween and add some
more in the trees if i can very cool yeah so i guess like the commentary i have there is the uh company i think it's
expressive is develops a little uh chip that does like all the wi-fi and has i believe their arm
cores um on them and so all of that to i guess like what jason is saying is it's kind of irrelevant
unless you're doing embedded programming directly um but they're very, very cheap and common.
And so a big community has sprung up around them.
So Amazon's a good source if you want them quickly.
If you are willing to wait for overseas from AliExpress,
you can get the original version,
or at least the original one I came into,
which is ESP8266, which is a 16-bit microcontroller, I believe.
Again, maybe not important depending on what you're doing,
if you're just sort of controlling a few servos
and also has the library associated with it.
And that one's like $2.
Wow.
For an ESP8266 for like $3.
So you can search like, they're kind of like all cloned,
but Wemos, W-E-M-O-S, is a very common like dev board you can kind of look up and they
have esp 8266 versions and esp 32 versions the esp 32 is a you know 32-bit processor faster so
if you're needing to you know do more computation they have variations of it with you can attach a
camera even and do some minimal image image processing on or use as webcam of course if
you're just going to need a webcam
it's better to normally just go buy one but then you normally you know pulling the data down and
doing processing rather than being able to handle it sort of device side itself and they all have
you know we talked about that last time with the raspberry pi it has the sort of headers for doing
i squared c or spi and know, a lot of example code.
And like Jason mentioned, you can use the Arduino, I guess you'd call it like middleware,
but basically like the libraries and the interfaces associated with that. They do need to be for like
hardware stuff. They have to be ported to ESP32. So you can't run the sort of Arduino libraries
directly in some cases if their software is fine
but for you know things like i squared c handling for a specific something you know you may you may
need some variants but it's relatively straightforward much easier than sort of
building your own pcb and doing it from scratch but they also support micropython as i mentioned
and lua as well are pretty commonly seen as ways to sort of
control these. And the big win, like Jason mentioned,
which you don't normally see in a
bog-standard Arduino, or at least not
at this price for sure, is the ability to connect
to your Wi-Fi network.
So you'll see a lot of home-controlled devices
actually use these as well.
And there's a lot of firmware floating out for reflashing
sort of, you know, the light bulbs
that screw into your roof will have one of these chips inside.
And that's how they connect to your network so that you can control them from the app.
And so I figured out that a lot of these run very, very similar chips.
And so there's various from easy to hard ways of sort of flashing your own stuff onto them.
And in many cases, the other cool thing is rather than having it hooked up to your computer,
there often is the ability to update firmware over the air.
So you can just send a new update to your bat in the tree
and change its behavior
without having to bring it back inside and hook up cables.
Oh, that is awesome.
Yeah, I'm really excited about this.
I'm going to have to...
I bought a...
This is probably just really amateur to you to uh i bought a this is like uh probably just
really like amateur to you but i bought a crimper i've never crimped anything before
but i wanted to like basically have this thing control the servo but i need to have a wire come
out of this thing and go into the servo and it can't't be just a DuPont cable because of the way
this thing is set up.
And so I actually bought a crimper.
I'm going to try doing my first crimp at some point.
Yeah, I don't do a lot of crimping.
I do a lot of soldering, but not too much crimping.
And a correction, the 8266 was actually still 32 bits,
just less powerful less
purple devices do you have to buy like a thousand of them if you're buying it from
aliexpress or can you still buy like five or ten you can buy like one or two yeah oh nice
shipping has caught up it used to be a much better arbitrage or however you say that but now
uh yeah normally you know they want
you to buy a few to basically amortize the cost of shipping but it's not like hundreds um you know
if you're buying a tape a tape of them uh you know the actual chips you probably get to buy
500 or something but most people are buying a dev board so you're only buying a couple
uh makes sense very cool all right well it's time to talk about HyperLogLog. So Jason has warned me that he has been willing to play foil to my explanation here.
For more broad topics, I do have some other specific algorithms in mind. So if this goes okay, maybe we'll do some others in the same style.
But let me kind of like set up and walk it through.
And Jason, ask away the questions.
Represent the average person out there or the everyone else or even me
because normally I'm the one asking you these questions.
The algorithm here arrives from a difficulty
that probably most people haven't really thought about.
In fact, I really hadn't thought about it until it was sort of posed, which is how do you count the number of unique items in a set?
And so if you sort of imagined the classical example, we'll just use this one.
You know, everyone's pretty technically oriented here. Imagine you're a website and you don't want you know you see that
visitor account right and your mom goes and visits your website and then she just like keeps refreshing
and the counter just keeps going up yeah i have a popular website most people will sort of give
this number instead as you know unique visitors so you know you could say oh okay i'm gonna you know store a
cookie and tell if you've been there before or or whatever right there's various ways but just
imagine you know it's sort of being written to a log let's say the ip address or you know a unique
identifier for a user and you want to count up how many of them there are so you know like the first
first blush you know sort of like the naive answer is you sort of you look through your list of all the visitors and you keep track of which ones you've seen.
And then every time you go to the next one, you look through the list.
And if you see that user, you throw it away and then you count the size of the list at the end and then boom.
And that that is correct. That actually will produce the correct answer, the so-called
cardinality, the number of unique visitors that was in your set of consideration. And so you may
say, well, that's pretty obvious to improve. If this was a programming interview, that would be
a suitable first answer. And then your interviewer would normally up the challenge a bit, which is
ask you, well, how efficient is that? Could you do better? And so of course you can do better than searching through a
list for unique items and you could use a tree, right? That would be one option. And the other
option is you could use a hash map, right? So you could have a hash map of all the unique identifiers
and then every time you, you know, get a unique id in you hash it or maybe
it's already you know you randomized enough but normally you want a really really rigorous hash
and so you mix up all the bits you get essentially a random number but a repeatable one and you go
to a set of buckets and you is that one present or not if it's present you throw it away and if
it's not you insert it and we've talked throw it away. And if it's not, you insert it. And we've talked about hash charts on the show before.
If not, you can go listen to them
if you don't know what they are.
And so at the end, you count how many entries you have.
And this will also work.
Yeah, I mean, the limit of what I,
off the top of my head, would do just to scale up
is have a distributed hash table
where basically you do the hash and then each,
it's like if the hash starts with a one,
send it to computer one.
If the hash starts with a two, send it to computer two.
And so then you have a computer for each possible value
of the first digit of the hash.
And so then now if there's, I don't know, 56 different characters for the first digit of the hash and so then you know now like if there's i don't know 56 different you know
characters for the first digit of the hash then like you have 56 computers and you get 56 way
parallelism um but yeah i guess we're gonna hear about way even better than that no no no no so so
this is great so you know i i had a same kind of like thought process in my head. And I think that
works great for live, right? So if these if you wanted to sort of like, what I was mentioning,
you know, these visitors are coming in and you sort of counting them, but it doesn't work like
after the effect, or sort of analytic purposes. So imagine you like I was mentioning, you have
this is a log and you want to cut it by day, by month, by, you know, people that came from Europe, by people who ended up buying something, by whatever, you know, these kind of filters. and so they have a distributed system for querying the column score and and this was developed
because they went and looked and people doing analytic queries or i forget what it said in
there i won't quote it because i'll get it wrong but a very large number of count distinct so you
write this sort of sql query and you say hey i want the distinct number of these but they're
over arbitrary input so what you say will work jason but the issue is sort of like that's like
if you only really want you know a handful of Jason, but the issue is sort of like, that's like, if you only really want, you know, a handful of them live, and you're not sort of
like infinitely slicing them later. So not not not an off approach, we'll come back in a second,
I think you're heading in a good direction. But the sort of first, there's many algorithms have
been done to sort of solve this. And the paper even references one and to be clear,
they even recommend this one, if you're sort of not going to have very many.
If you don't believe that there's a lot of unique items,
this one will work.
And I want to talk about this one
because it references something else.
And we already kind of mentioned
hashing unique identifier
and then trying to put it in a bucket.
So this first one is linear counting.
So if you hash it and you put it in a bucket,
but rather than in a normal hash it and you put it in a bucket,
but rather than in a normal hash map,
you would store the identifier.
And one way is store a linked list of identifiers.
So the first thing in the bucket is the head node. And then it points to the next thing
and you have a linked list, right?
Until you sort of grow the hash map number of buckets.
In this case, what we're going to do
is actually only store a single bit
and not store the actual ID. So you just insert the value. And if it goes in a bucket, you just
mark that bucket as having seen something, you know, Boolean equals true, and all the other
buckets are Boolean equals false. So now you have an issue is when you have a hash collision,
you're losing some information, right? You're losing the information that there were actually,
potentially, it was the same item,
but potentially it was two different items.
And so you are getting this sort of probabilistic structure.
But this is going to turn into an advantage
and the rest of this is going to talk about this.
And the reason why I think this is an interesting topic
for the show is my background
before kind of coming across a few of these
was always like,
you would
definitely not want a probabilistic answer you would want an exact answer uh and so it would
not be acceptable to be even off by one right that's an off by one error and so you yeah exactly
what you have but if you're willing to give up an exact answer the design space widens dramatically, not only for the number of
solutions you can have, but also for things like being able to tune your trade off, because you
can say how reliable can I trade space or processing for more reliability. And an example
that you see a lot recently with machine learning is approximate K nearest neighbors. So we've
talked about k nearest neighbors
before, I think. Maybe if not, maybe that's a good one to talk about. I think we have. Yeah, okay, good. But what if you were willing to accept an approximate answer? Then you can kind of approach it differently. So this first one that I'm mentioning is called linear counting. So you hash it, you keep track, and in some cases, when you produce this random number
and you go to insert it in a bucket, you're going to collide. And like I mentioned, sometimes that's what you want, you want to throw it away because it's the same ID, and every one of the same IDs will
collide. But sometimes another ID could probabilistically collide. And that's the second
bit of this: using statistics to say, when you're done, don't just count the number of buckets that have an entry in them, which would be an approximation, but you can do better. Which is: look at how many of the buckets are full, the percentage full your hash map is, and use that to estimate what the likelihood is that you may have seen extra insertions that you missed. And so this was like the clever unlock for me:
we don't know how many were the same item and how many duplicate hashes, or portions of the hashes that you used, were actually different IDs. But if your, you know, hash map is only 10% full, only 10% of the buckets had values, it's pretty unlikely you collided, so you can just say, 10% of the buckets had items, so I'm just going to guess that many. But if 90% of your buckets are full, say you had 10 buckets and nine of them had something, then the next thing being inserted, being random, has a 90% chance that it's going to collide, right?
And so you sort of take that into account. Now there is some balance there, right? If you had 10 buckets and 10 million unique items, you're going to get like a terrible guess, right?
Because you fill it up and then everything will collide and you can't, the information is lost.
So you do need to have an understanding, for this example, of the sort of approximate number of items that you would have. So this one's called linear counting.
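To make that concrete, here's a minimal sketch of the kind of linear counter being described. The bucket count, the hash choice, and all the names here are illustrative assumptions, not anything from the episode or the paper:

```python
import hashlib
import math

class LinearCounter:
    """Toy linear counting: one bit per bucket plus a fill-based correction."""

    def __init__(self, num_buckets):
        self.num_buckets = num_buckets
        self.bits = [False] * num_buckets

    def add(self, item):
        # Hash the item and set a single bit; duplicates land on the same bit.
        h = int.from_bytes(hashlib.sha1(item.encode()).digest()[:8], "big")
        self.bits[h % self.num_buckets] = True

    def estimate(self):
        # With V = fraction of buckets still empty, the standard
        # linear-counting estimate is -m * ln(V), which accounts for
        # the collisions that a raw count of set bits would miss.
        empty = self.bits.count(False)
        if empty == 0:
            return float("inf")  # saturated: the estimate breaks down here
        return -self.num_buckets * math.log(empty / self.num_buckets)

counter = LinearCounter(1024)
for visitor in ["alice", "bob", "alice", "carol"]:
    counter.add(visitor)
print(round(counter.estimate()))  # roughly 3
```

The "10% full, just guess 10%" intuition from a moment ago is the small-count limit of that -m * ln(V) formula; as the structure fills up, the logarithm inflates the raw bit count to compensate for collisions.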
And when I was reading about this, the thing that comes up here, which I don't know if you've used these before, Jason, is
a Bloom filter. Have you ever used a Bloom filter before? I've used it, I don't remember exactly how it works. Okay, so this is very similar to a Bloom filter. In a Bloom filter, the idea is you're trying to know what things are members of a set. And you query the Bloom filter via the same mechanism: basically, you hash it, you look in the bucket, and you normally do a couple different buckets, and you say, if every one of those buckets has an item in it,
then the thing I'm looking for,
I'm going to do a more expensive query
or check or actually go to the database.
But if any of the things that should be present aren't,
then I know with 100% certainty
that this item is not in my set.
So if I'm looking up the username Jason,
and I do this operation, and it comes back that one of the buckets that should have had a bit set doesn't, then I know Jason isn't in, he's not a user.
So I don't actually have to query my backend database.
And the advantage is...
So let me see if I understand this right.
So the idea is, like, I mean, you just created an example. So there would be sort of a bit for every letter-and-position pair. And so it's like, you know, Jason is J-A-S-O-N. If you don't have the, you know, A-at-position-two bit, or you don't have the O-at-position-four bit, if you're using any of these, then you know that the word Jason just can't be there. But if you have, you know, J1, A2, S3, O4, and N5, then, like, you're still not 100% sure, you have to do some extra step. But, like, you're confident enough that you've eliminated so many other things that it's still, like, a huge time save. Yes. So if you took each of those, like J in the first position, A in the second position, and you, you know, basically hash them and then put them in the buckets, then all those things you said are right. The probability of a false positive is controllable: if you say, I think I'm going to have this many true entries, I want to use this many bits, and there's calculators online for sort of fine-tuning your parameters. And if you have lots and lots and lots of buckets, you can decrease your false positives.
But if you, you know, shrink it down,
the really nice thing is, say, you know, on the edge node,
you can't store that you have, you know, 10 million usernames,
but you could store a megabyte or whatever.
And in that megabyte, you could store, you know,
this Bloom filter and give you a reduction of...
And it's only worth it if you think
there's lots of users going to try to log on
who don't actually have usernames in the database
or whatever you're trying to do.
But if they do, then you can sort of
not have to go to the backend.
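A minimal Bloom filter along those lines might look like the following sketch. The sizes and the double-hashing scheme here are illustrative assumptions, not a prescription:

```python
import hashlib

class BloomFilter:
    """Toy Bloom filter: k bit positions per item in a shared bit array."""

    def __init__(self, num_bits, num_hashes):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = [False] * num_bits

    def _positions(self, item):
        # Derive k positions from two hash halves (the usual double-hashing trick).
        digest = hashlib.sha256(item.encode()).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big")
        return [(h1 + i * h2) % self.num_bits for i in range(self.num_hashes)]

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos] = True

    def might_contain(self, item):
        # False means definitely absent; True means "probably present",
        # which is when you would pay for the real backend lookup.
        return all(self.bits[pos] for pos in self._positions(item))

users = BloomFilter(num_bits=8192, num_hashes=4)
users.add("jason")
print(users.might_contain("jason"))    # True
print(users.might_contain("patrick"))  # False, with high probability
```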
So if you imagine for something like a spatial index
and you're trying to query for an area on your computer,
like your video game, you're looking into this portion of space, and most of the
portions of space are empty, and you just have entries of where there are things, then you could
filter out the expensive sort of collision detection and all of the other things by basically
knowing up front which things are present or not. Now, there are other ways of handling that as well,
but you're right, it doesn't give you a 100% answer. But it prevents you from doing an expensive query by filtering out many of them. Got it. And that's a Bloom filter. Oh, and why is it called a Bloom filter? Do you know? That's a great question. I think because you do these multiple hashes, so you don't just do a single entry into a bucket. This is my imagination: because it blooms out, and you normally do a couple different buckets by taking portions of it. Okay. I have a really funny, true answer. So, well, that might be true, but also it's called that because it was created by Burton Howard Bloom. That's a true story, in 1970. But, I mean, that's really interesting, man. I can't believe that, you know, someone in 1970 was thinking about this problem. That just blows my mind, you know. So this Bloom filter was the first one I came across that was like this: it's okay to be wrong.
And in fact, it's just a tunable parameter.
This is sort of weird.
Sometimes you get this if you do like DSP work or filtering or something, you know,
or you talk about floating point math.
But for something that was like a data structure, a data structure that's like meant to be somewhat
broken, it's like intentional that it sometimes gets the wrong answer.
It's by design.
So Bloom filter and linear counting,
at least in my mind, pretty similar.
So moving on from linear counting
to the next step,
we're going to go to the tail end
of HyperLogLog
and talk about LogLog.
So LogLog does something
really interesting
that if we were talking
about the Bloom filters taking,
you know, Jason was talking about taking the J
and hashing that and taking the A and hashing that, right?
So instead, suppose you just hashed Jason, the whole string Jason, and you have a really good hash function.
You could just use the first couple bits,
the first however many bits you want
to determine, you know, the bucket that you're going to go in.
And then instead of storing just a single bit
of I had something in this bucket or not,
you go to that bucket
and you look at the rest of the hash.
Say your hash was 32 bits, and you used the first five bits to pick your bucket, so 32 buckets. That leaves 27 bits.
And those remaining 27 bits,
you could either use the first or the last.
It doesn't really matter.
Count the number of sequential zeros.
Now, this part was really, really weird to me.
I didn't understand this.
So take the number of,
let's say starting from the left to the right,
count the number of zeros in a row
that are in the next 27 bits.
And then instead of storing one or zero in the bucket, store the number of zeros. So if the first digit was a one, you would put zero. If the first digit was a zero and then a one, you would put one, because there's one zero. But you could have four zeros in a row, and you would store that in the bucket. The number of leading zeros in a row? Okay, got it. The number of leading zeros in a row.
And the clever probabilistic thing about this
is that if you imagine that there...
So your hash needs to be very good.
So it needs to have a 50% chance of a zero
and a 50% chance of a one in every bit position
or as close to that as you can manage.
Then what you would say is, if you're going to look at a stream of unique IDs, you would expect that the longest run of sequential zeros that you see is a function of the number of items you've looked at. The length? Okay, let me say it differently. If I said you're only going to play the game once, you're going to generate one hash and look at it, and I asked you to derive, like, what would be the expected number of zeros you would see: it's much more likely to see one zero or two zeros in a row than to see 16 zeros in a row on the very first try. Yeah, it's like a coin toss, right? It's like, how many times can you toss the coin and get all heads? Well, as the number of tosses goes up, that probability goes down. That's exactly the key to this. So if you look at a given bucket, and let's say you only kept one bucket, that's fine, you could just have one bucket and you keep track of this, then at the end of seeing all of the items, you look at the sort of longest run of zeros you saw. And you say, I believe the probability of that run is one over two to that many zeros. So then, I've seen about two to that many items is my guess. So, yeah, let me see if I can rephrase it, paraphrase it, see if I have it right.
So you flip a coin.
There's a 50-50 chance you get heads.
If you get tails, you're done.
And, you know, that's a zero.
You get zero points.
If you get heads, that's a point.
And then you try again.
You know, you keep going, accumulating points, until you're out, right? And so, you know, 50% of the time you'll get zero points. 75% of the time you'll get zero or one point. And so it just keeps going up like this hyperbolic ramp. And so, like, you know, 99% of the time you're going to get less than 10 points or something, right? And so what that means is, what you're saying is, if you did get 10 points ever, then there's probably, like, 99% of other trials there that you just didn't record, because all you did was record the best score.
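In symbols, the intuition being traded back and forth here is just the tail of a geometric distribution. With fair coin-flip bits,

```latex
P(\text{at least } k \text{ leading zeros}) = 2^{-k},
\qquad
\mathbb{E}\bigl[\max \text{ run over } n \text{ hashes}\bigr] \approx \log_2 n
```

so the best run you've ever seen is, on average, about the base-two logarithm of how many hashes have gone by.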
It's like if you go to the arcade and the high score on Pac-Man, like, maybe they just rebooted, like, reflashed all the machines. Yep. Yeah, and the high score of Pac-Man, or actually, this happened to me, where, you know, I'm obviously not super Mr. Athletic or anything, but I went to Chuck E. Cheese with my kids, and I always play the football game. Have you seen this at Chuck E. Cheese? You have to throw the football through the holes. Yeah. So I played the football game, and I won the high score of the day. And I felt like, for a moment, like, wow, I should have, you know, played in the NFL or something. And then, like, five minutes later, some guy behind me, like, you know, got the high score of the day. That's because they just reflashed the machines, right? So the high score, you know, there wasn't a lot of samples, and so the high score was easy to beat. Conversely, if you go to, like, you know, some other arcade, or if you go later on in the day, the high score will be super high and you can't even get to it. And so you're using the high score as an approximation for how many people have played the game.
Yes, that's the key. And so, I guess we didn't talk about that part here, but you caught on to it, which is, you want to record the maximum you've seen. So in this bucket, you record the maximum. And as you said, as the number of attempts goes up, and unlike your example where it's skill-based, every one of these should be random, right, it's just like a slot machine, whatever. Anyways, so yeah, the higher that number that you've recorded, the more you're going to make your guess go up. Now, you could get really lucky, or unlucky, however you call it, and the very first number you see happens to have, you know, a lot of leading zeros. We said 32 bits, so it could have 27 zeros in it. And so you think you've seen
two to the 27 items, but you've really only ever seen one. And so this is not good, right? This would make you have a very, you know, a high chance of getting an error because of this one
problem. And so the second thing, which I started off with there,
is by having multiple buckets and then averaging them together,
you take the first few bits,
the first five,
and you choose your bucket.
And you use the last 27
and you do this thing I'm mentioning
that we've talked about
recording the high score.
And then by averaging
all of the buckets together,
if you've only seen one item, sure, you're going to have this really large number in a single bucket, but all the other buckets are going to say zero.
And so you're going to pull it back down
to try to correct.
And we're setting up for the second thing here in a minute,
but this is called, I guess, like log log.
I don't see a ton of evidence
that this was really rolled out or used
because there's some pretty easy ways to improve it that we'll talk about in a second.
But this idea of recording multiple max scores
and then looking at the end and taking an average
allows you to sort of approximate it in a very small way.
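A sketch of that LogLog bookkeeping, using the 5-bits-of-bucket, 27-bits-of-payload split from earlier. The hash choice is an assumption, and the bias-correction constant is deliberately omitted, so treat the estimate as illustrative:

```python
import hashlib

NUM_BUCKET_BITS = 5                 # first 5 bits pick one of 32 buckets
NUM_BUCKETS = 1 << NUM_BUCKET_BITS
REMAINING_BITS = 27                 # rest of the 32-bit hash

buckets = [0] * NUM_BUCKETS

def leading_zeros(value, width):
    """Count leading zeros of `value` written as a `width`-bit number."""
    count = 0
    for i in range(width - 1, -1, -1):
        if value & (1 << i):
            break
        count += 1
    return count

def add(item):
    h = int.from_bytes(hashlib.sha1(item.encode()).digest()[:4], "big")  # 32 bits
    bucket = h >> REMAINING_BITS               # top 5 bits choose the bucket
    rest = h & ((1 << REMAINING_BITS) - 1)     # bottom 27 bits are the "coin flips"
    # Keep only the high score: the longest zero run this bucket has seen.
    buckets[bucket] = max(buckets[bucket], leading_zeros(rest, REMAINING_BITS))

def rough_estimate():
    # Plain averaging, as in LogLog. The real algorithm multiplies by a
    # bias-correction constant, and empty buckets skew this badly, which
    # is exactly what HyperLogLog's corrections address next.
    mean_run = sum(buckets) / NUM_BUCKETS
    return NUM_BUCKETS * (2 ** mean_run)
```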
So the other thing we didn't mention, with LogLog and with linear counting too, is that in both cases, beyond the quantity you measured, you have these coefficients as well, because you just sort of know on average you're likely to kind of see these things.
So you run some simulations
and you kind of tune some constants
to accommodate for kind of like edge effects,
I guess you would sort of call it.
So HyperLogLog improves on LogLog. Well, one thing actually, before you jump into HyperLogLog, I want to double-click on something you said earlier, which is, you know, I
also went through this, where, you know, I felt like, even just hashing, I kind of felt like, you know, how do you trust the hashing? And even, like, what happens if your data set is biased in a way where, you know, maybe your data set all starts with HTTP colon, so that now, like, your hash function isn't pure and, you know,
you have more collisions. Or even, like, MAC addresses, you know, I'm pretty sure MAC addresses are, well, actually, I have no idea, but I'm assuming that there's some kind of randomness there. There's no way all... They are pre-determined. The hardware vendor sets the first few bytes. Okay, got it. But then within a hardware vendor, at some point there's probably a collision, right? Or are there? There can't be, like, a global count of MAC addresses while the factories are, like, churning out tons of these.
Yeah, maybe there is.
I think there's not supposed to be, but probably there is.
But, oh yeah.
So whenever I see hashes or all these things,
I used to feel like, well, this isn't kind of pure.
It's not going to work out.
And how do you even know if you have a problem,
right? And it turns out that, yeah, at scale, you know, you can't do anything very precisely.
And so what you do instead is you try to measure, you know, the bounds statistically of some issue.
And then you're constantly correlating between other things. Like, for
example, maybe there's, you know, a one out of 10,000 chance that you overestimate your unique
visitors by 10x, right? And so, one out of 10,000, you know, that means, you know, Google has been around for, like, 15 years or something, so it's probably, like, a 50-50 chance it's happened once, right? But what they're doing is they're
constantly correlating that with other things. So, you know, if the viewer count goes up by 10x but the revenue doesn't move, well, then you know that might be an error that day in the viewer count. And
so you're constantly trying to cross correlate between different
data sets, different signals, and then also auto regress, you know, correlate across time for a
signal. And so it really just becomes about reducing the error and trying to reduce the
bias and the variance of these signals. And so when you start thinking about it in those terms, then it becomes okay to kind of
say, well, yeah, all of these errors are there. We're just going to measure them all in gross
and try and get that measurement to be as accurate as possible. And so as they're making changes to
these filters, they're comparing the correlation to, again, just using our example,
to revenue. And they're seeing, oh, the new value is much more correlated to revenue,
which is what I would expect. As unique visitors go up, revenue goes up, and all of that. So you
can actually measure all of these things and get better in an incremental way, even though it feels like a very helpless situation.
That's a great callout. I will say also, I think a lot of these techniques we're talking about have a time and a place. So, like, your financials, you wouldn't want to do this. You'd probably do the expensive query in a different sort of way, but maybe you can't do that in real time. And then, as you said, not only comparing it with other metrics, but also running maybe a second method that is less reliable, or with different parameters, and checking that as well. So maybe it's less accurate, but it gives you a coarser bound, right? And says, you know, hey, at least I know that if I get a 10x measurement, I need to throw it out, like, that's going to be wrong. So it can give me an order-of-magnitude kind of swag.
Yep, makes sense.
Okay, so moving from LogLog to HyperLogLog, it's not a big jump. So the first change is for empty buckets. It's basically what happens with, you know, kind of very high numbers, like when you're starting to get very densely filled buckets, or very low numbers, where you have a lot of empty buckets. In the case of very low numbers, you have a lot of empty buckets. You may consider, instead of taking the max counts and doing the average, where, like I mentioned, if you have five buckets and you happen to get one that was 27 zeros, you're going to way overestimate it.
If you have a lot of empty buckets, instead, maybe just estimate based on the probability of having that many empty buckets. So, in a case where your max-based estimate would be very high, like a million or something, rather than estimating a million, ask: what is the chance that you saw a million numbers and still had nine empty buckets? This is going to be very low. And so you use that for sort of some corrections. As well as the very dense case: just like in linear counting, the more buckets that sort of have bigger numbers in them, the more chance that you're seeing these collisions, and you're seeing a different behavior of the numbers. And so you handle those as well.
And a lot of that, you just end up working out by either computing the statistics or running simulations. So,
rather than kind of a single formula that you run,
you kind of have a sort of set of,
if you meet this condition or that condition
or the other condition,
apply a slightly different formula,
but roughly the same thing.
And that gives you HyperLogLog.
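Sketched as code, those range corrections look roughly like this. The thresholds follow the shape of the rules in the original paper, but treat the exact constants as assumptions and check the paper before relying on them:

```python
import math

def corrected_estimate(raw_estimate, num_buckets, num_empty_buckets, hash_bits=32):
    """Apply HyperLogLog-style small- and large-range corrections."""
    m = num_buckets
    hash_space = 2.0 ** hash_bits
    # Small range: empty buckets carry better information than max runs,
    # so fall back to a linear-counting style estimate.
    if raw_estimate <= 2.5 * m and num_empty_buckets > 0:
        return m * math.log(m / num_empty_buckets)
    # Large range: estimates approaching 2^32 collide with the hash width.
    if raw_estimate > hash_space / 30.0:
        return -hash_space * math.log(1.0 - raw_estimate / hash_space)
    return raw_estimate
```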
So HyperLogLog allows you to perform these analytics,
get an approximate answer.
I think the number that I was sort of seeing in the paper
is like for the majority of cases with these corrections,
you see like 2% error.
So if that is important,
like in the case of maybe financials or something,
you don't want to use it. But for the case of maybe financials or something, you don't want
to use it. But for the case of like counting how many visitors in a day, you know, if you don't
want to be lying to people, if it's like a court thing, so like you wouldn't run it. But if it's
just, you know, keeping track from day to day or hour to hour, as an example, you could run something
like this and know that you're going to be generally close most of the time. And of course, a lot of those things are tunable. No, very cool. It makes a ton of sense. So, the final trick, and the sort of second thing that kind of, I don't know, melted my mind a little, and there's a variety of these problems, and this one's going to get to Jason's example. So here we are at the very end, and you foreshadowed it by talking about, you know, sending it to different computers and having each of the computers handle it. So let's imagine we chop up our logs and send one tenth of our logs to each of 10 computers. And each of the 10 computers does the thing we just described with HyperLogLog. What do you do at the end? Well, you can't sum the 10 results, right? Because if you sum the 10 results,
now if you divided them like Jason said, where I looked at their username and sent them to a
computer based on their username, you could. But if I just chopped the first 10% of the logs,
the second 10%, and I sent each of those, then two different computers will potentially have seen the same user. And so you don't want to just add them up, hey, 10 times whatever number each of them saw, just kind of add them up.
That won't work. That's not a good estimate.
You need to dedupe them again.
Well, the sort of mind-blowing thing, or at least for me,
is if you look at those buckets we talked about,
and everyone's running the same algorithm,
you can just transmit the final counts of the buckets
and just run the same max operation. So if you run the same max operation across all of the buckets,
and then do your final calculation, that actually works perfectly fine and is incredibly cheap.
So if computer one says my first bucket has eight zeros, and computer two says,
oh, my first bucket only has five zeros, then your final count for the first bucket is eight zeros.
And it's the same answer you would have gotten
if you had sent all of the queries to only a single computer
because of this, you know, counting, keeping only the max.
So this max for individual computers is,
if you take the max across all computers,
the same as if they had seen them because everything-
Oh, right.
This was, like, what? No, this isn't going to work. That was my first inclination here. But this aggregating by maxes is the unlock. Because you're aggregating them all together by maxes,
you can just run it at the end
across all the computers
and you have it distributed as many ways as you want
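That merge step is tiny in code; a sketch, assuming each machine ships over its bucket array of max zero-runs:

```python
def merge(bucket_arrays):
    """Combine per-machine HyperLogLog buckets by element-wise max.

    The result is exactly the bucket array a single machine would have
    built had it seen every item itself, which is why it doesn't matter
    how the log was split up.
    """
    merged = [0] * len(bucket_arrays[0])
    for buckets in bucket_arrays:
        for i, run_length in enumerate(buckets):
            merged[i] = max(merged[i], run_length)
    return merged

# Computer one saw 8 zeros in its first bucket, computer two saw 5:
print(merge([[8, 2], [5, 3]]))  # [8, 3]
```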
But wait, what was wrong with your idea of saying, well, just adding up all of the... Oh, because there's duplicates across the computers. So imagine the entire log, one million entries, is all username Jason, and I send a tenth to each of 10 different computers. Each computer is going to estimate one, let's say, and then my final result would say 10. The answer is actually only one. Right, right. So again, for something like visits to Google, you may know in general how many times a given person visits in a day, so maybe you could do this via other means. But if you're using this for an analytical engine that doesn't know how likely repeats are or not, then this is a very difficult, you know, thing to solve a priori with sort of specialized knowledge. Yeah, that makes sense. Um, yeah, that's totally mind-blowing. You know, another thing that this benefits from is that each bit doubles the count.
And so the difference between seven and eight bits is so large that if one says five bits and the other says eight bits, by throwing away the five-bit one, you're not throwing away that much, because, as I said, it's, uh, hyper-logarithmic or whatever. Its contribution is irrelevant. Yeah, right. Wow, that is super cool.
I guess a couple things slipped over. So one is that, moving from LogLog to HyperLogLog, you also move from taking an average to taking a harmonic mean. So you take the harmonic mean of the buckets, which helps. Now, what is the harmonic mean? For each bucket you take one over two to the max count, you sum those up across all the buckets, take the reciprocal of that sum, and then you multiply by the number of buckets at the end. And it's typically used for rates. So I tried to look it up as well. It's sort of, like, one step past my intuition about probability, so I'll keep noodling on that one and maybe have to get back to you.
But by moving from the sort of normal averaging that we would think about, you know, just sum them all up and divide by the count, and instead moving to this harmonic mean, they see improvements there. And again, I mentioned these coefficient factors, and you would run simulations, and so they give estimates for what they see for these correction factors, for high and low offsets, these kinds of things. So definitely read the paper.
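For reference, the paper's raw estimator puts those pieces together roughly like this, where m is the bucket count, M_j the max run recorded in bucket j, and alpha_m the simulation-derived correction constant:

```latex
E \;=\; \alpha_m \, m^2 \left( \sum_{j=1}^{m} 2^{-M_j} \right)^{-1},
\qquad
\alpha_m \approx \frac{0.7213}{1 + 1.079/m} \quad (\text{large } m)
```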
But I would also say, in day-to-day life, are you going to implement this? Yeah, don't. I mean, maybe if you have a good reason. But one, we mentioned a bunch of things along the way here:
Bloom filters, probabilistic counting,
knowing to go reach for one of these
if you don't need an exact answer.
Second thing is, a lot of databases, like Redis and other things, have already implemented this under the hood for these kinds of operations.
So you're likely already using it.
And it's just for me,
one of those mental like aha moments
where the chance that this specific solution is what I need to reach for is very low.
But understanding a bit more about the space I feel helps me sort of know when to go to the Internet to search for something I don't know.
Yeah, so, I mean, as far as using it, there must be some way in SQL to say, like, approximate count distinct. Oh, interesting, I didn't look. So every SQL is different, so let's see. Postgres. This is fascinating. Awesome. Type Postgres. Oh man, same wavelength. All right. I mean, it looks like there are some add-ons you can use to kind of do a distributed distinct count with HyperLogLog on Postgres. I see a lot of articles about this. So again, yeah, I agree, it looks like a plug-in. So you install this plug-in in Postgres and then you do hll underscore cardinality. But I remember, I think this was Sawzall, I do remember at some point in my life typing SELECT APPROX_COUNT_DISTINCT, like I did that for some reason, it is just radiating through my mind. So there's definitely some system that has it built in. Yeah, I mean, I think it's probably in Hadoop or this sort of very distributed processing of things. So not when you have it on a single computer, but when you have it across multiple computers, doing large analytics, column data. I would imagine, like, Apache Arrow or something. Yeah, exactly. PySpark. And PySpark has an approx_count_distinct. That's where I've done it. That's amazing, I had no idea that that was doing that under the hood. That's, like, totally mind-blowing.
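If you want to try what Jason is remembering, PySpark exposes this directly; a quick sketch, with a made-up column name:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("hll-demo").getOrCreate()
df = spark.createDataFrame(
    [("alice",), ("bob",), ("alice",), ("carol",)], ["username"]
)

# rsd is the target relative standard deviation; under the hood Spark
# runs a HyperLogLog++ sketch instead of an exact, shuffle-heavy count.
df.select(F.approx_count_distinct("username", rsd=0.02)).show()
```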
I don't know. So maybe write in if you like this detailed, singular algorithm exploration. I mean,
I guess we touched on a bunch of stuff, so maybe it's not as dissimilar as I thought it might be in my head. But Jason had to play along here. So he's at least acting like he walked away
understanding this a bit more. No, this is amazing. I definitely learned a ton. No,
we should definitely do more of this. Yeah, please write in, let us know, plus or minus. But I actually learned a ton.
This is really satisfying.
Another thing is, you know,
you know, when people get a CS degree
and they learn about quicksort and merge sort
and all of these things,
they might say to themselves,
well, like, you know, okay,
like we figured out how to sort, you know, we're done.
Like why even teach this to me? Like, why are you teaching me a one-liner in any modern programming
language? And what you've kind of shown folks here today is that, like, this doesn't end,
you know, like sometimes you need probabilistic things. Sometimes you're working in very limited
environments. You know, there's all sorts of different types of counts and things you do. And so computer science, you can do it all day as your job, and it's, like, literally the science part of computer science. And there's a lot of fun stuff. Well, I think that brings us to the end of another episode. 169, wow. Yeah, pretty wild.
Thanks everyone for sticking around.
Yeah, we should definitely do something
for our long-term patrons.
I'm gonna do some digging on my end,
figure out if I can get that number of months subscribed
and all of that.
But thank you so much for supporting us all of these
years. You've helped us continue to grow the community. I actually, I'm considering using some of the community money to buy Twitter ads. It does feel like our Twitter presence is growing pretty well organically. And so, you know, at the end of the day, you know, we try to get as many folks
interested in programming and computer science as possible. We couldn't do that without all of
your help. And so we do put, you know, every dollar that you donate towards trying to accomplish
that goal. Yes. Thank you to all the people who have stuck with us for, oh, I always forget how
long it's been, but a very long time,
including Jason.
Thank you, Jason, for all your hard work.
Thank you.
Someone commented asking, we did two shows on Bitcoin.
One was episode seven or something,
and another show a few years later.
And they were asking if we bought Bitcoin at those times.
And I'm sad to say the answer is no, we did not buy
Bitcoin at, I think they said, 50 cents or $500. We didn't buy Bitcoin at either of those prices.
I didn't. But after that show, I did get some Bitcoin out of a faucet. But unfortunately, it was like 0.01 Bitcoin or something. So that was the only Bitcoin I had from that. So yeah, we answered it. That's definitely the most-asked question. I think about once a month we get asked that. Someone asked it on Discord last week, but I do get a lot of email: hey, did you guys buy a bunch of Bitcoin? We didn't do that. All right. That wound
has been sufficiently salted.
But
thank you so much, folks.
It's awesome
that we have such an outpouring of support.
A bunch of folks asking for different languages.
We will definitely get to that.
But I also want us to cover algorithms.
This was a really exciting kind of new branch
for the podcast.
And so thanks, Patrick,
for doing the background homework on this.
All right.
All right.
Catch everyone later. Music by Eric Barndollar. Programming Throwdown is distributed under a Creative Commons Attribution-ShareAlike 2.0 license. You're free to share, copy, distribute, and transmit the work, and to remix and adapt the work, but you must provide attribution to Patrick and I, and share alike in kind.