Coding Blocks - Designing Data-Intensive Applications – Secondary Indexes, Rebalancing, Routing
Episode Date: November 22, 2021
We wrap up the discussion on partitioning from our collective favorite book, Designing Data-Intensive Applications, while Allen is properly substituted, Michael can't stop thinking about Kafka, and Joe doesn't live in the real sunshine state.
Transcript
Discussion (0)
You're listening to Coding Blocks, episode 172.
Subscribe to us on iTunes, Spotify, Stitcher, wherever you like to find your podcast apps.
I surely hope we're there.
Or I should say, not find your podcast apps, but where you like to find your podcasts using
whatever app.
You know what?
We'll just, whatever.
Forget it.
Just go on.
We can't have nice things, Jay-Z.
You start off.
I don't think Gen Z calls them apps anymore. Whoa. They call them dillies? That's weird. Anyway, this is codingblocks.net, where you find show notes, example discussions, and, uh, links to all our dillies. I don't know why that's so funny to me, but it just is. It's so much funnier. I'm surprised you didn't know. I mean, of course I did. But, uh, yeah, so, uh, where did you leave off with the dillies? And you talked about the, uh, Twitter at Coding Blocks. Yep. Or, uh, head to www.codingblocks.net and find all our social links there at the top of the page. And with that... Oh, I'm Joe Zack.
I'm Allen Underwood. And I am Michael Outlaw.
This episode is sponsored by Datadog,
the cloud-scale monitoring and analytics platform that unifies metrics,
traces, and logs so you can identify and resolve performance
issues quickly.
And Linode.
Simplify your infrastructure and cut your cloud bills in half with Linode's Linux virtual machines.
I'm going to get some flack about that.
It's worth it. It always is. That's the thing: if you're going to risk not being able to participate in one of the episodes, then you have to know that the other two-thirds of the show must fill in for you, and that is just a requirement. That's right. So throughout the show, we'll be, um, speaking as Allen, and we won't be explicitly calling it out.
So you're just going to have to know when we're representing his viewpoint.
It's going to be very confusing. We're sorry.
Sorry, not sorry.
We'll make up for it with all the dub-dub-dubs, as we say, in any URL for the remainder of the show. That's right. That was a right-there reference.
Okay.
All right.
Well, today we're continuing on with our favorite book,
Designing Data-Intensive Applications,
and we're going to finish up the chapter on partitioning,
which is amazing.
This is a chapter in just two episodes.
It's a record.
Well, yeah, but you want to hear a record, though? Technically, if you were to look at the physical version of this book, we have covered less than half of it. Yeah, it's crazy. Even after this episode, we will have still covered less than half of it. You could take that as a bad sign, like, wow, we're really slow at reviewing this book, but you could also take it as: that's how full of great material this book is, that we've gotten less than half the way through.
Yeah, it's dense. It is. I don't know, cheese? Like, what's something that's dense and full of goodness? Like, I don't know, cheddar? I mean, sure, I'll take that. Havarti? Yes. For some reason I kept wanting to think of cakes, but then I'm like, well, not really. I don't know, a good cheesecake? Well, good cheese. I was thinking, for some reason... do you guys... yeah, you have Publix down there. So I was thinking of the Publix, uh, buttercream icing cakes, you know, for some reason, and I'm like, well, that doesn't really meet your dense criteria. But oh my gosh, now I can't think of anything else. Yeah, it's good.
And now
the listeners are like, wait a minute, I got to pull over.
I knew I needed to go to the grocery store
for some reason. This is what happens
when Alan's not around.
This whole show will be about guitars and
mountain bikes and somehow will fit
in partitioning.
All right.
And Metallica.
Yeah.
And Metallica.
All right.
Well, last episode, we talked about data partitioning.
This is how you can split up your data set when you've got too much data to fit on a node or you have performance requirements where it just makes sense to split up data so you can do more work in parallel or find it faster. And we talked about two different partitioning strategies,
basically using key ranges like 0 through 100 go over here
and 100 through 200 go over there,
which is nice when you have homogeneous data with well-balanced keys.
And we talked about hashing,
which is a way of kind of distributing things based on the key
that we use a little bit of randomness there to hopefully spread things out more so you can avoid hotspots or places where your data is just uneven and causes unnecessary strain on parts of your system when you've got other parts of your system that are just bored to death. And so this week, this episode, we're going to be talking about the rest of the partitioning chapter, which primarily focuses on secondary indexes, rebalancing partitions and data, and also routing.
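To make those two strategies concrete, here's a minimal sketch in Python. The range size, partition count, and key names are invented for illustration; they aren't from the book or any particular database.

```python
import hashlib

# Key-range partitioning: contiguous key ranges map to the same partition,
# e.g. keys 0-99 go to partition 0, 100-199 to partition 1, and so on.
def range_partition(key: int, range_size: int = 100) -> int:
    return key // range_size

# Hash partitioning: a stable hash of the key, modulo the partition count,
# scatters nearby keys to different partitions to help avoid hotspots.
def hash_partition(key: str, num_partitions: int = 8) -> int:
    digest = hashlib.md5(key.encode()).hexdigest()
    return int(digest, 16) % num_partitions

print(range_partition(42), range_partition(150))               # neighbors stay together
print(hash_partition("user:123"), hash_partition("user:124"))  # neighbors scatter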
But first, we've got a little bit of news.
Yeah, so I guess we're going to be doing the game jam, Jamuary, again. So have we settled on a date? I think I saw that you had created the game jam invite, but I don't recall if it had a date specified. Is it January? Uh, it is currently scheduled for, uh, January, the month, and I did keep the same dates as last year, so I didn't have to update the artwork. And it happened to fall on a weekend again, so I was like, oh... I don't know where those assets are. I don't know where that, uh, PSD or whatever is. Um, so currently it's scheduled for the 21st to the 24th, but I haven't thought a lot about those dates, so I haven't looked to see if there's anything else major going on. But currently that's when it's planned, and, I don't know, no one's complained about it, so I guess that's when it is. Yeah, we're still looking for a theme for it, though. So I did see an email went out with the survey for theme ideas. I read some of those theme ideas; we got some good stuff coming in. Um, but I don't know that you've picked a final one, or,
I guess... I assume we're going to do it the same way as what we did last year? Yeah, where we'll take everybody's ideas and then let everybody vote on the top favorites, and, you know, go with that. Yeah, so we got that from Super Good Dave, who, uh, saw other game jams doing that. Basically, what we do is we gather up, uh, ideas for themes. So if you've got one, email us or, uh, tweet at us,
and we'll add it to the list
and we'll do a couple rounds of voting.
Or hit me up on Slack, at Allen.
Yep, there you go.
And let them know we'll get it on the list
and then eventually we'll do a couple different rounds of voting.
And it's not going to be annoying.
We'll probably just...
I forget how we did it last time,
but whatever we did last time is how we'll wean it down.
And the final theme will be announced basically right at the start of the jam, and that's just to get your creative juices flowing. There's something about having a theme that either makes it inspiring or helps you get through that blank-page problem. Yeah, it puts some boundaries on it, so you're not left to think about the entire world of possibilities; you're scoped down to a set of things.
Yeah, I really like that.
Last year's theme was everything is broken, and we got a lot of broken games, which was really funny.
And it worked out really well, especially if you're making a lot of bugs like I do.
How about there's a ticket for that?
Hey, that's a great theme.
Let me add that to the list here.
Well, I was thinking mine was about tickets. Yeah, it was, and that was a great game. Uh, I'd done it in Angular. So yeah, looking forward to that. And I did just finish up the Create with Code course, right? Um, that was a Unity thing, one of their free courses that I mentioned, where I worked through five game prototypes. So I'm going for Unity this year. I'm really excited about it. I don't know what I'm going to do this year, because last year, as you mentioned, I did the Angular one, and I thought about maybe doing another Angular kind of approach, for the simplicity of not having to worry about things to install, and it being easy for other people to use. But it definitely didn't have the polish of, uh, the other games, for sure. Some of them are so good, man. Just in terms of polish, like sounds and art and the gameplay. There were some serious game developers that were, uh, part of that game
jam, for sure. Absolutely, I can tell, man. Um, I never finished, uh, this one. I need to go back, because I figured out, or I got some help in the comments about, the one where you're wandering through the woods, with the, uh, polygonal trees. Yeah, I remember that one. Yeah, it was really cool. So yeah, we're looking forward to doing that again. We're going to do a couple rounds of voting. Go ahead and sign up; we'll have a link to the game jam in the show notes here, and we'll be talking about it a few more times, so, uh, keep an eye out for that. And it's just fun to keep up with the, um, the theme voting and stuff. So if you've got one that you like a lot, let us know.
And one more thing.
Yeah.
Uh, so I got a new MacBook Pro, holla! It showed up almost a month early from when they originally projected it.
Look at the big dog over here.
Yeah.
Super excited.
So it does have a notch, which I, uh, instantly learned to just not see.
It doesn't bother me at all.
Oh, really?
Like you just immediately.
Does your phone currently have that?
No.
Oh, okay.
No, it does have a little dot for the camera, though,
which also I didn't even know until I looked at it.
So I just have learned to ignore it.
Oh, really?
It's got a little hole, a little circle.
And so, like, you have screen above it and below it and beside it. Neat. Yep. And I'm loving it. I haven't done a whole lot, but I am able to run Unity, which has been nice, because I've been locked out of that; you know, it was just too hard to even open a blank project on my 2013 laptop. And, uh, yeah, I'm looking forward to being able to do some more couch-type stuff there. I've already been doing some experiments and tutorials, just kind of while watching TV in the evenings, whatever. So it's really exciting, and I've just been loving it. I haven't had any problems with it. Um, there's a few things that I've, uh, installed the beta for, because, uh, they didn't have a full-on Apple Silicon version, but the Intel versions also work. You know, I guess they do some sort of emulation there. So I haven't run into problems. Yep.
Yep.
Did you run it through like a geek bench?
No.
No?
I should. I'm always curious.
Any new piece of hardware, I'm always like, I just want to know, like, how did it do?
Yeah, I should compare it to the geek bench I did like eight years ago.
Maybe pour one out for that old machine. Yeah, yeah, yeah. Put Ubuntu on it. Yeah, I still use my old hardware, though, so I don't know that I want to know what your Geekbench score is. Yeah. Or maybe, I don't know, man, it's a great year to upgrade to an MBP, it's looking like. All right, so let's talk about partitioning. I do want to, uh, address one thing, though. So you mentioned at the start that the data being too, uh, big to fit on one machine, so scalability, is one reason. Um, you also mentioned performance was another reason, and I don't know that availability was really talked about. Oh, yeah. Um, well, you didn't mention it, but we definitely did talk about replication, uh, in the past. But as I was prepping for the show, I was, you know, doing some searching out there on the interwebs, as you do.
And I came across this one site called Interview Grid, and they were talking about the key benefits of partitioning.
And they included one, a fourth reason, that I kind of take issue with, but we hadn't ever discussed it.
Do you think you might know what it is?
Data retention?
Security. Oh, okay. The reason why I take issue with that is, well, I mean, yeah, I guess you could say it's kind of like a benefit, but I kind of think of that as a reason why you would do it. So, later, they mentioned the different strategies for partitioning.
And so they talk about horizontal versus vertical partitioning.
But then the other one that they mentioned was functional partitioning. And I was like, well, yeah, I mean, you can almost view security as a functional type of partitioning. Like, hey, all my sensitive data is going to be off over here. But then I'm like, is that really a partition? Or would you just think of that as a different table, or even a different technology, right? Does that count as partitioning? Like, I don't know. I kind of take issue with that.
Yeah, that's weird. I definitely hadn't considered that at all. And, um, you know, I guess if you had it in different physical, you know, co-location centers, and somebody steals the computer, then having your data partitioned and split up is good. Or if you have things keyed differently. But it still seems like, in all the cases we talked about, your clients would be able to access everything; they weren't limited to any partitions. But maybe, I guess, if you partitioned by,
like if you had a multi-tenant solution
and you partitioned your data by tenant,
then you could encrypt those partitions differently.
In that case, I see the security argument is just different.
Okay, I'll buy that then.
So security, if you were to partition,
because we did talk about in the last episode,
we used Fortune 500 companies as
an example and like how you could like implement i didn't i don't know that i caught it out
explicitly but kind of implicitly like some of the discussion uh i assumed that like there might
be some row level security there and so i think i'd said something like you know if you had it
partitioned by tenant you did a select star star, you might only see things for your specific – for that login as that customer, you would only see the data for yourself or your company.
Yeah, that's cool.
So, okay.
I guess I could buy that then.
All right.
So they won me over.
All right.
So security is a fourth benefit of partitioning
Yep, I'll allow it. Uh, all right. Yeah, so last episode we talked about, obviously, key-range partitioning and key-hash partitioning, hashing to deterministically figure out where data should land based on the key. And the key is the really important part here, because if you said, hey, I've got
user
ID 123 and I'm partitioned by
users, then we would know where
to go look. But that only helps
you in the case that you're trying to look up a specific
key. It doesn't
help you if we say, I want to find users
named outlaw or
users who are
in goods, who owe us money or something like that.
There's no help at all for that.
And so in those cases, you have to look at every single row and every single partition
in order to find that because there's no other help that you get.
All you know is that the data is partitioned. And the solution for that is what we call secondary indexes, with the primary index just being how things are physically located, uh, on the partition. Yeah, so to carry on with the example that we gave last time, I think I had mentioned an example where, um, you're looking for a specific car, right? And the example that I mentioned was, oh, you're looking for Lamborghinis, and depending on how your index was done, it might be more difficult to find that.
And so we had talked about like the idea of like an encyclopedia where,
you know,
you could go straight to the L portion of the index for Lamborghini.
But if you were looking for a specific one, you know, then you could have it by license plate. And if your index was only by
license plate, then it was like you would have to look at every license plate in order to figure it
out. But in this case, with a secondary index, your primary key could still be the license plate
or, you know, VIN depending on your use case,
but let's go with license plate.
And then, uh, based on the secondary index, you might say, hey, give me all the license plates for the Lamborghinis. You could go and find all the Lamborghini instances based on that secondary index.
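Here's a toy sketch of that idea in Python, with invented plates and makes: the license plate stays the primary key, and the secondary index maps a make back to the plates that match.

```python
# Primary store: license plate -> car record.
cars = {
    "ABC-123": {"make": "Lamborghini", "model": "Huracan"},
    "XYZ-999": {"make": "Honda", "model": "Civic"},
    "LMB-777": {"make": "Lamborghini", "model": "Aventador"},
}

# Secondary index: make -> set of license plates (the primary keys).
by_make: dict[str, set[str]] = {}
for plate, car in cars.items():
    by_make.setdefault(car["make"], set()).add(plate)

# Without the index you'd scan every row; with it, two cheap lookups.
lambo_plates = by_make.get("Lamborghini", set())
print([cars[plate] for plate in sorted(lambo_plates)])
```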
Yeah.
I like that.
Um,
I kind of came up with a contrived example here about partitioning credit card transactions, um, and I use that throughout the rest of the show notes. I'm wondering if it's worth switching; I like the car thing. Well, the book also used cars. Um, they did; they went with colors. I was just trying to relate it back to what we were talking about in the past, but we can definitely go with the credit card thing. It's fine. I was just trying to tie the two episodes together. Yeah, that was, uh, smart. I should finish listening to that episode; I wasn't there last week.
Uh, but anyway, I mean, I like the credit card one too, because we also talked about e-commerce stuff. So it's fine. All right, well, let's stick with it. So, in this system that I kind of contrived up here, imagine that you have a system where you're partitioning credit card transactions by hash of the date.
And by date, I mean the day, like November 11th or whatever it is.
All the transactions for the 11th go here.
All the transactions for the 12th go there. All the transactions for the 13th go somewhere else. And when I tell you that I need to sum up all the transactions for last week, well, that's really easy.
We go look at seven partitions and we just go through and sum all that data, which is really nice.
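Here's what that contrived scheme might look like as a Python sketch; the amounts and dates are made up. Summing last week is exactly seven partition reads.

```python
from collections import defaultdict
from datetime import date, timedelta

partitions = defaultdict(list)  # one "partition" per day: date -> amounts

def write(txn_date: date, amount: float) -> None:
    partitions[txn_date].append(amount)

write(date(2021, 11, 11), 19.99)
write(date(2021, 11, 12), 5.00)
write(date(2021, 11, 12), 42.50)

def sum_last_week(today: date) -> float:
    # Visit exactly the seven daily partitions before `today`.
    return sum(sum(partitions[today - timedelta(days=n)]) for n in range(1, 8))

print(sum_last_week(date(2021, 11, 15)))  # 67.49
```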
It also means you're going to have a hot skewed partition.
Absolutely.
Yeah.
Yeah.
So that's ripe for hotspotting issues.
That's a prime example of where we say we mostly care about the most recent data, and so we're constantly pinging the most recent partitions, and the other partitions are dormant. So there's definitely problems with our partitioning scheme here.
Yeah.
And that's a great example of it. 'Cause if you were to think of an Amazon or something large like that, if they were to partition on date, then all of their customers, as they're buying new stuff, are always writing to the same partition. Yep.
Yeah.
There's some benefits to... we've talked about being able to do hot and cold indexes. So you can basically say, the last seven days, those partitions are always going to be located on the best hardware, and the other stuff we're going to put on, like, cool or cold storage, maybe spinning disks or something cheaper, and I'll only use it for reporting or whatever. I think they actually get into something similar to that later on in this portion of the book, where they were talking about how you could have, um, physically different hardware even for your partitions. And some technologies, like Mongo and Elasticsearch, come to mind, I believe, and I think there might have been another one that was called out, where it'll specifically allow you to say, hey, this bit of hardware can handle more load, so it'll handle more partitions than these other nodes. And so you could have it to where, you know, maybe those are the beefier machines that have, like, uh, ridiculous SSD RAIDs, versus, you know, the other one might be tape archive or something. Yeah, absolutely.
And it's great for data retention. So in Elasticsearch, they call it index lifecycle management, ILM, policies. So you can say the last 30 days goes on this pool of servers, the next 60 days goes on this pool, and then after that it goes to, you know, kind of the worst hardware, and then after a year it gets deleted. So the data in that case is partitioned by date, and it just kind of marches through: it starts out on the hot servers, eventually moves to cold, eventually moves to really cold, and eventually disappears, which is really nice because, you know, certain government regulations require that sort of thing. But yeah, absolutely, that's the case where you're just embracing the hot spotting, uh, you know, for good or bad.
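For reference, an ILM policy like the one described has roughly this shape, written here as a Python dict. The phase ages and the `data` allocation attribute are from-memory placeholders, so treat this as an approximation and check the Elasticsearch docs before copying anything.

```python
# Roughly: hot for the first 30 days, warm until 90, cold after that,
# deleted after a year. Approximate shape only; not a verified config.
ilm_policy = {
    "policy": {
        "phases": {
            "hot": {"min_age": "0ms", "actions": {}},
            "warm": {
                "min_age": "30d",
                "actions": {"allocate": {"require": {"data": "warm"}}},
            },
            "cold": {
                "min_age": "90d",
                "actions": {"allocate": {"require": {"data": "cold"}}},
            },
            "delete": {"min_age": "365d", "actions": {"delete": {}}},
        }
    }
}
```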
But in that situation, what if I ask you to count all the transactions for a particular credit card? Uh, like, how many Visa versus how many Amex? Well, I was thinking, like, Outlaw calls and says, hey, uh, I've got a fraudulent, uh, transaction on my credit card. Can you tell me all my other transactions so we can figure out if they're legit or not?
Yeah, but because you keyed on the date, then you can't go and find all of Michael's transactions.
Unless you have a secondary index.
Right.
So if you imagine we've got 365 partitions, that's an expensive query. So if that's the kind of thing you're doing often, then that's pretty awful. Especially if you consider, you know, maybe in a year I've only had 20 transactions or something, so you're finding these needles in a whole, whole bunch of haystacks. Yeah. And they referred to... I mean, we talked about it last time, the idea that there might be reasons why you do want to have to query all the nodes for some piece of data, right? And some of these technologies actually, you know, take advantage of that. And there was a term for it here that we didn't even hit on last time, called scatter-gather, where your query does get scattered across all of the different nodes, and then you gather that up.
And in the last episode, I think Allen referred to it. He was using the Elasticsearch hierarchy, or, you know, architectural hierarchy, as the example, where we were talking about... I forget the Elasticsearch terminology, but you would know, Joe... it was like, there'd be the master node, and then underneath it there would be, you know, whatever the other data nodes are, and the master node would be responsible for spreading that query out across the different nodes, and then it would gather up the results and package that up as the return. Yeah, I think there are four or five different node types in Elastic. So master is the one that does the routing, and data has the data. There's ingestion, and I think there's something else. But, um, yeah, each node can have different roles, so you can have ones that are just for routing, or ones just for data, based on, you know, your use cases there, which is really cool. And we're going to get into routing later in this chapter.
Yeah, so you could do the scatter-gather technique to grab these credit cards across the 365 partitions, which, by the way, assumes that your partitioning strategy here was always going to produce the same result for a given day regardless of year. Yep. But when you said partition by date, I assumed year was part of it too, so you might actually have more. Yeah, it's 365 times however many years old your company is, right? Yeah, imagine seven years. So yeah, a seven-year, uh, retention policy. Uh, so there's thousands of partitions to check, every single record, and I mean, it's just a ton of work. And if one of those partitions is unavailable, say a node's down, or maybe some other operation is happening that we'll have some examples of coming up here in a little bit, and so it's slower to get that data, then basically you are held back to whichever one's slowest, or else you're going to get inaccurate results.
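Here's a small scatter-gather sketch in Python, with invented partition contents: the query fans out to every partition, the results get merged, and the overall latency is bounded by the slowest partition.

```python
from concurrent.futures import ThreadPoolExecutor

# Three toy partitions; most hold nothing for the card we care about.
partitions = [
    [{"card": "4111", "amount": 10.0}],
    [{"card": "4111", "amount": 25.0}, {"card": "5500", "amount": 7.0}],
    [],
]

def query_partition(rows: list, card: str) -> list:
    # Each node filters its own rows (or consults its local index).
    return [row for row in rows if row["card"] == card]

with ThreadPoolExecutor() as pool:                          # scatter
    futures = [pool.submit(query_partition, p, "4111") for p in partitions]
    matches = [row for f in futures for row in f.result()]  # gather

print(matches)  # both 4111 transactions, merged from two partitions
```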
Yeah, and you'd hope that your company is successful, too. So the first year or two, that stuff might be easier to search because the indexes are relatively smaller. But those last couple of years, you know, your business is booming, so those indexes take a while to search.
Yeah.
So you don't want to have to do these 2,400 searches or, you know, whatever it might be.
Right.
And the chances that your customer has had that credit card for seven years, too, is probably pretty low.
It's probably, you know, you probably only really care about, you know, if you're talking about fraud, maybe you only care about the first year anyway.
And so, you know, you can probably trim it down in other ways.
But yeah, so the solution here, though, the way to make this better is to keep secondary indexes, which keep some metadata about our data that helps us keep track of it.
So I mentioned the user.
We might want to keep a secondary data structure or even a secondary system that says, hey, this user shows up in these partitions. And so now, instead of looking at thousands, maybe I'm only looking at 11 or some much smaller number, which is really nice.
The example that they gave in the book for this, too... I always forget the term for it. Because whenever we think of the word index in regards to reading or something like that, you always think of the thing in the back of the book. But the thing in the back of the book that's called an index is, in programming terminology, uh, like a reverse index, if I remember right, is the term. Yeah. But obviously they never refer to it like, hey, go to the back of the book and look in the reverse index. But really, that's what the secondary index is. It's basically like the back of your book, where it's like, look for the keyword Lamborghini, and here's all the primary keys that have it. Or, going with your credit card example, look for Michael, and here's all of the transactions where he used that credit card.
Yep.
Pages 113 through 115.
Pages 211.
Pages 700.
Yeah.
So that secondary index just points back to the primary key, or at least that's the example that they gave here. If you don't have that index, you've got to read the whole book every time you want to find something. Yeah, every time. Um, I mean, I don't know that the author, uh, really got into the underlying storage mechanism for this particular case of the secondary index, but in the examples, at least in the pictures that he showed, he used the primary key as what the secondary key was pointing to. So I was just assuming that it would, you know, stay consistent with that. Yeah, because these are really basic. Like, we know
relational databases are kind of famous for keeping a lot of statistics about their indexes.
So they can kind of route things in a smart way.
And you can bet they're not doing just simple lookups on all these.
So definitely the data structures that we're talking here are very simple.
So you can just kind of imagine that things get kind of tweaked and taken on from there.
And this whole chapter is really kind of biased towards NoSQL, I thought.
And they talk a little bit at the end about kind of
more complex data warehousing.
Huh.
Like, key-value lookups were a heavy emphasis here. Okay. I didn't really take that away, but I can definitely see why you would say that. Yeah, I'll try to sell you on it later.
Because this builds up into relational databases.
We'll talk about how things break down a little bit coming up.
Okay.
Yeah, and then there's future chapters on it, of course, because of course there are, right?
Right.
That's the whole deal with this book.
So one other thing I just kind of wanted to point out that's kind of fun: a lot of times, you can answer your query just by looking at the index alone.
Sometimes you just want to count things or just want to report.
And so you don't even need to look up the data.
You can just look at the index and see, okay, you know, Outlaw's got 20 transactions here over the course of the last seven years.
And that's enough for reporting; for showing a bar graph, all we need to know is, uh, you know, that count. It's cool. Yeah, for, like, an OLAP-type system where you just want to do the aggregates. Yep. So, like, how many Lamborghinis are registered? How much purchasing does Michael actually do? Yep. Those types of things, yeah.
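A tiny sketch of answering straight from the index, with made-up keys: the count comes from the index entry alone, and the underlying rows never get fetched.

```python
from collections import defaultdict

# Secondary index: customer -> list of transaction primary keys.
by_customer = defaultdict(list)
by_customer["michael"].extend(["txn-001", "txn-042", "txn-107"])

# "How many transactions does Michael have?" The index alone answers it;
# we never go load txn-001, txn-042, or txn-107 themselves.
print(len(by_customer["michael"]))  # 3
```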
I like it.
That was cool.
And I want to mention, too, secondary indexes are just complicated, so it's hard. Like, HBase and Voldemort avoid them entirely, whereas something like a search engine is a data structure that specializes in them. And there are some pros and cons, trade-offs, that we're going to be getting into here.
Yeah, they actually talked about how, um, I think in Elasticsearch specifically, anything that was going to have a secondary index was specifically called a field in the document, if I recall correctly. Yeah, absolutely. They were putting Elastic terminology on top of this, uh, discussion. Yeah, it's really cool. We've talked about how other chapters were kind of like how to build Kafka; this chapter kind of felt like how to build Elasticsearch to me, which is really cool. See, okay, because now, going back to your NoSQL comment: a lot of times while reading, and I even mentioned it in the last episode, too, I kept coming at this thinking from a Kafka point of view.
And oh, yeah.
And in fact, I'll save it for now, but I'll tease it that there's going to be a part where it gets even more Kafka ish.
Yeah, I know exactly what you're talking about.
And I called it out, too.
So, yeah, that's a really good point too.
I could definitely see that.
Just with the one thing that Kafka really suffers on is there's no secondary indexes.
So there's not a way to say, oh, I also care about the user ID.
So all you get is things split up by partition.
And in fact, you can't ever make updates.
You can't delete data once it gets into those partitions.
So it's read only.
So there's some tradeoffs there.
Like that's why Kafka is not, you know, it's not a database.
It doesn't work well as a database because you can't retrieve data except by the way it's partitioned.
Yeah.
And there's advantages to that, definitely.
And some cons, too.
I can't tease it any longer.
I'll go ahead and say like it was related to the routing.
Oh, okay.
That's not the part I thought.
Interesting.
I thought you were going to talk about
the number of partitions versus nodes.
No.
Well, we did talk about that last episode somewhat, where I had mentioned... because a lot of this... and I asked Allen this question, too, last episode: you know, when you're reading this, do you often have something that you already know in your head that you're kind of relating the information to? And so you're coming at this from an Elastic kind of point of view, and between the two of us, you know a lot more about Elasticsearch than I do. So you're reading this and you're very heavily Elastic-focused, whereas I was reading this and I was very much thinking Kafka in my mind, so I'm relating everything to Kafka. And I'm sure we were both relating it back to all of our experiences with, you know, SQL-related databases and whatnot. But yeah. So, um, definitely the routing section, I was like, oh, this is absolutely heavily Kafka. Yeah. You know, I keep wondering, maybe these episodes are terrible for people that are only working with, you know, relational databases, which is how I worked for most of my career, only with relational databases. But for me, working with, you know, Elasticsearch, Kafka, and a relational database, kind of in daily situations...
It's so great to like, I'll read a sentence in the book that says something.
I'm like, oh, so that's why it's like this, or that's why it has this limitation,
or that's why we use this system for this use case.
Yeah, absolutely.
When you said that, I was like, no, that kind of makes me sad,
because I really want to hope that if you were working only in a single type of technology, to your point exactly, that you would know, yeah, but I keep running into this problem when I try to use this SQL server to do this.
And why is that?
And then this book can help to expose you to why those problems are happening, and why these other technologies exist, and the problems that they solve. And, uh, you know, I had this thought: when did this book come out? This book is only a couple years old, right? Um, if I remember right... I don't remember the exact date. Oh, 2017 is the first edition. So this book is like four years old, or four and a half. That's right, 'cause it was March of 2017. So it's about to be five years old. And in that short time, though, I think this book is going to stand the test of time; it is going to become one of our classic Clean Architecture, Pragmatic Programmer, you know, Refactoring kind of books.
Like, you know, it'll be up there as one of the greats, with, like, the Gang of Four. Because, while Martin Kleppmann does talk a lot about specific database technologies, everything is so generic and applicable to whatever the platform is that it just helps you understand, and he does a really good job of explaining some of this.
And, you know, you think about how, as a society, we keep working to tackle a complex problem until we make it so easy and trivial, and we have libraries to abstract it, to where it's no longer a thing, right? And so, you know, in the seventies, apparently, a lot of great technology was invented, as we've learned during the course of this podcast. Uh, you know, there were ideas documented back then that were way too far ahead of their time, right, getting into the routing and partitioning kind of problems that they're talking about here. They had documented a bunch of stuff back then that even this book has covered, right? But, you know, now we don't worry about that. I just think that this book is going to stand the test of time, you know, until we get to the point where we no longer care about routing and indexes and secondary indexes. And I don't see that coming for quite a long time. Not in my lifetime. So I think it's going to stand the test of time.
Yeah, I totally agree. Like, you know, we see those articles: top 10, uh, programming books every programmer should have, whatever. I think this one should be in the top five of all those lists. Like I was going to say, it's not there now. Yeah, and that's where I agree: it should be. But this book is still kind of new, so it's still kind of saturating. But I think, like, 10 years from now, we'll still be seeing this book in those lists. I mean, you know, the Gang of Four book
has stood the test of time, and it's only had one printing. It's still the original printing, if I remember correctly, of that book, so apparently they didn't make a single typo. So kudos to them; um, I couldn't have done that. But, um, you know, you think about that, and how relevant the information that they put in there still is. And yet we have so much technology today to abstract the things that we create and whatnot, and yet the patterns that they described there are still very much a part of our regular world. Right. And still very important.
So the concepts that are in this book, I think, are really key. So, going back to your point: even if you were in a single database technology, you know, that's fine, but these are things and concepts that you should still know. It would help you to understand the underpinnings of that single technology, period. Even if you never did use something like a document database or, you know, something like a Kafka or whatever; if you stayed in a relational database, for example.
Yeah, I totally agree.
You know, a lot of people kind of push back on design patterns sometimes,
or at least you'll see it on Twitter, specifically the book.
But to me, the argument is kind of that the language features have gotten better,
so you don't need to really be programming these anymore.
But good luck using modern JavaScript frameworks without observers.
And good luck using Java without builders or factories.
These things do kind of become baked into the languages and frameworks,
but that doesn't mean it's not worth learning about because that's how you make new ones.
It's how you build those languages and frameworks
and also how you use them.
So yeah, I think that book is also top three.
What were we talking about? The two main strategies for secondary indexes, right? Uh, so, document-based partitioning and term-based partitioning. And term-based is really kind of an evolution of document-based, so let's start with document. So, you remember our example there: we talked about partitioning our credit card transactions by date, and also the example you gave of the encyclopedia. Imagine if we kept a secondary data structure along with each partition. It's kind of like having an index in the back of each book in your encyclopedia set. So you go through and pick up, like, Aardvark to Automobile, and you look in the back for Aardvark, and it'll tell you page one, page seven, or whatever.
That's a very different solution from having a centralized kind of almost like table of contents or like one big index that tracks across all the references.
Oh, you know what? That never even dawned on me. But yeah, I do like this example, because then, technically, every encyclopedia does have a secondary index along with it. That is the document one, right? Yeah, that's pretty cool. No, it would be the term-based one. I'm sorry. But it rides along with it. So with term-based, it's like you'd have one book in the set where all it does is just point to different areas.
So with term-based is like you'd have like one book in the set that all it does is just point to different areas.
And, you know, obviously in the encyclopedia, it's already organized alphabetically.
But, you know, there's certain themes, like, you could look in the index and see, like, the 1920s.
1920s are referenced for the Great Depression, but also Prohibition and also, I don't know, Spanish-American War.
And so those are different places where, you know, you want to go look, and those are going to be in different actual books.
But in the books themselves, you could also flip it back and see if there's anything about the 1920s.
And that would be what we call document-based partitioning.
It's been so long since I picked up a physical encyclopedia that maybe they don't actually have a reverse index in the back of it at all. But really, one of the key differences here was that with document-based partitioning, like you said, whatever node has the partition on it, the secondary index is also beside it; it's with it. So you have the performance gain of, now, when you want to do that read, it's all right there on the same thing. Versus the term-based partitioning, which was also kind of referred to as a global, uh, type of index, where... think of it as, going back to your Elastic terminology there, the master node: it knows that, hey, for this secondary index, I want to look for, um, you know, transactions based on this credit card, and it knows all of the partitions that it needs to hit to run that query. And so it might get spanned across, say, 12 different nodes to do that, and that's where that scatter-gather comes back. Um, so you have the complexity there. So that was the downside of the term-based versus the document-based. Yeah, absolutely. So let me just kind of, um, peel it back a little bit. So, specifically talking about document-based partitioning,
we said each node now keeps track of its own indexes. So when we query for users, instead of going and looking through every single row in every single partition and comparing, is this my user ID? Is this my user ID? Is this my user ID? Now we go to each partition and say, hey, do you have any user ID one-two-three? And: no, no, no, yes, no, no, no, yes. And the ones that say yes are the ones that end up getting queried. So it's much faster than looking at every single document in those databases, but you still have to talk to every single partition to ask if it has that, uh, secondary index entry. And counting is easy, because, like we talked about before, we can just go and say, hey, node for, uh, 2021, uh, October 25th, do you have any user ID, uh, you know, 123? And it says, yes, I've got three of them. Okay, great, that's all I need from you. Let's go check the 26th, the 27th, the 28th, the 29th. So we can count that up really easily, and this is a much better solution than obviously looking at every single record.
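Here's a minimal Python sketch of that document-partitioned (local index) flow, with invented data: every partition keeps its own user-to-rows index, so a query still asks every partition, but each one answers cheaply.

```python
class Partition:
    def __init__(self) -> None:
        self.rows = []         # the actual transactions
        self.local_index = {}  # user_id -> list of row offsets in this partition

    def insert(self, user_id: int, txn: dict) -> None:
        self.rows.append(txn)
        self.local_index.setdefault(user_id, []).append(len(self.rows) - 1)

    def lookup(self, user_id: int) -> list:
        return [self.rows[i] for i in self.local_index.get(user_id, [])]

partitions = [Partition() for _ in range(4)]  # e.g. one per day
partitions[0].insert(123, {"amount": 9.99})
partitions[2].insert(123, {"amount": 4.50})

# Scatter to every partition; the yes/no answer comes from the local index.
print([txn for p in partitions for txn in p.lookup(123)])
```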
But, oh, you know what's fun?
I forgot about this.
So remember Big O?
So we talked about, yeah, we probably talked about that very early on.
But it's the difference between a log-of-n search, where we can basically go to each index and ask, do you exist? And, assuming that the list is ordered by the keys, we can easily look those up in log-of-n time. But if we have to look at every data point, then we're basically looking at each record, which is O(n), where n is the number of documents in your entire data set. So the bigger the data set, the bigger the savings. Yeah, because O(log n) is significantly smaller. If you go back to the Big O cheat sheet, you know, O(n) was very linear, right? Because it's whatever your count is. But the log of that would be a significantly smaller number. So, you know, you could really save some time there. Like, on the Big O cheat sheet, it was almost a flat line.
Yeah.
And so it's really great because the bigger the numbers get, the more efficient.
So let's imagine you've got a data set of 1 million records.
If you need to check 1 million records, you've got to do 1 million comparisons.
If you want to do a log-n-based lookup of a million records, you have to do 20 comparisons. Actually, 19 point something, and that's the max; you're doing max 20 comparisons. So imagine if we go to 10 million. Wait, what did I do wrong? Because if I do a log of 1 million, I got an answer of 6. Uh, did you do log base 2 or log base 10? Oh, good point, good point. Yeah, sorry, I should specify. So log base two of 10 million is... well, I don't know how to do that in my head. And apparently, in Google, log base two of 10 million is 23. Yeah. So with 10 times more data, you have to do 23 point something comparisons.
So three more searches.
Yeah.
Then I shouldn't even say three more searches, three more comparisons.
Oh, right.
Looking at the data.
Yeah.
So, yeah.
I mean, it's just a huge savings.
Yeah.
So that's what I mean when I say that like log of N is almost like the flat line when
you look at the big O cheat sheet because, you know, you get to really big numbers for
N and it's still an extremely small number, relatively speaking.
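The numbers from that exchange, worked out: a binary search over a sorted index of n keys needs at most ceil(log2 n) comparisons.

```python
import math

for n in (1_000_000, 10_000_000):
    print(f"{n:>10,}  log2 = {math.log2(n):.2f}  max comparisons = {math.ceil(math.log2(n))}")
# 1,000,000   log2 = 19.93  max comparisons = 20
# 10,000,000  log2 = 23.25  max comparisons = 24
```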
It's basically the only way we can work with huge amounts of data.
If it's going to be O of N
or bigger, then you pretty much can't do it, at least not in real time.
So that's how when you go to Google and you see like, oh,
17.5 million results and it happened in less than a second.
That's how they did it.
The only way that's possible.
And the downside is we do have to take a small performance hit whenever we insert new items.
So if we're writing a new credit card transaction, we also have to go and we have to write this into the index.
But if you're querying by that secondary index, uh, then that's fine; you know, that's going to be worth it. And there's also a downside from an, um, availability kind of point of view. Yeah, because it is right beside that partition, so, you know, you would have to replicate it out as well.
partition isn't available we can't respond to your query so if one of my you know two thousand
or whatever we said seven years you know so like two three thousand um one of my partitions is unavailable for some reason i can't answer
your question at all which is really fragile so so it's a you know it's a matter of query
performance in case you know i have to wait for whichever slows the one that slows but also my
availability is taking a hit which stinks uh one thing you know we did mention is nice about this is with data retention when the data drops
off in case of local indexes done there's no cleanup you just drop the whole index it's
sorry the whole partition and the indexes go with it if we did have our indexes centrally
located and we dropped you know older data which is every day we'd be dropping a partition, we have to go and we have to kind of
scrub that directory of that partition saying this doesn't apply anymore. It just takes a little bit
longer. Yeah. So just to elaborate on that: if you were only going to retain 90 days' worth of data, and you wanted to drop the partition for the day, and you didn't use the document-based partitioning, you'd have to go back to that global, uh, that term-based secondary indexing scheme, and remove every record from it, and there could be thousands for whatever that day is, depending on how successful your e-commerce shop is. Right. Which, let's face it,
it's going to be extremely successful because you've read this book.
That's right. Totally, totally. Uh, that skill transfers. Yeah. So that was pretty much it for document-based partitioning. In that case, we've had these local indexes on the partition. We take a little bit of a hit on writing, but a huge performance boost on querying. So let's take that one step further for term-based partitioning, which is the evolution of it. And like we've been talking about, we've been kind of blurring the lines there a little bit. But what it does is it takes those local indexes, gets rid of them, and it keeps a global index, so that clients can go and talk to this global index.
And then it'll tell them which partitions to go to.
And so if our user only has 20 transactions over seven years and those 20 transactions only happen on 11 different partitions, then we only need to go query 11 partitions.
So it's a lot less stress on the system.
It's a whole lot less network traffic.
It's going to be faster
because we don't have to wait
on whatever is the slowest of 2000.
Now we're waiting on whatever is the slowest of 11,
which is really great.
And it's a whole lot less fragile
because we've got this,
presumably distributed system now storing this much smaller data set, which is going to be really fast to query.
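A sketch of that term-partitioned (global) index in Python, with invented card numbers and partition IDs: one lookup tells you the handful of partitions worth querying.

```python
# Global secondary index: term (card number) -> partitions holding matches.
global_index = {
    "4111": {3, 17, 204},  # this card only ever appears in these partitions
    "5500": {9},
}

def partitions_to_query(card: str) -> set[int]:
    return global_index.get(card, set())

# Instead of scattering across thousands of partitions, query just these.
print(partitions_to_query("4111"))  # {3, 17, 204}
```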
Yeah, they don't really talk about how that index might be replicated and spread across nodes, because it kind of got super meta at this point. It was like, well, okay, so now we're going to create another index of our indexes, and you might want to have a keying mechanism for that, and spread that across partitions to increase performance and availability and scalability and whatnot. But maybe it would be okay, because these are smaller. I don't know, but it did get very meta at that point. Yeah, it's basically like you've got a database for your data, and now you need a database for your indexes. But it's smaller, so it's a little bit easier. So, uh, you know, that's what you've got going for you. I did think about,
as I was reading this part of the book, though... imagine, just for a challenge, you decided, hey, I'm going to write my own database, right? Because, if you remember, that's where this book starts, right? It's like, hey, let's just write a simple key-value store, uh, to a text file, and, you know, we'll just have some simple bash functions to, uh, read and write data to it, and start from there. Right. And, you know, imagine the fun types of challenges that you would get into as you started to scale this thing out, and specifically things like this. And then I got to thinking about how complex it would be. Because, I mean, really, props to the authors and the developers that work on these technologies. I started thinking about just the file I/O management of these systems. Like: you can't write to this index at the moment, because I need to put a lock on it to prevent you from writing to it, because maybe I'm trying to split it, in the case of, like, automatic partitioning systems or whatnot. Just the file I/O management alone is complex enough. Now think about, um, you know, the different data structures we've talked about, like SSTables and LSM trees and whatnot, and, like, okay, now I need to read this into memory, and where do you go for it? And, you know, some of the file limitations: I want to say maybe it was HBase that had a 10-gig limit on partitions before it would automatically split, but it also had the capability of combining those files back together if they fell below that. So they made the analogy... I say they, but I mean Martin Kleppmann... he made the analogy of the B-tree, and how B-trees can work to where they can split things out and then collapse them back, depending on what's going on there. Thinking of how you would write the file management aspect of all of this alone, it was kind of mind-blowing, kind of thinking, well, this is why it's taken us decades to get here.
Right.
Yeah.
Oh, yeah.
And yeah, it's just the low level details are so hard to get right and so important to get 100% correct.
Oh, yeah.
And you're going to take that and you're going to replicate that data to multiple nodes and you're going to split it up into different partitions.
You're going to split those partitions, rebalance.
And eventually, I think the peak is kind of like distributed transactions across multiple partitions and nodes.
And it's just amazing it works at all. One of the things that also came to mind, too, as they're talking about all of this, um, you know, when we're talking about the partitions and the replications: we talked about how one of the key advantages is that you want to have multiple nodes so that they can serve different parts of the data. Because, one, you know, we talked about scalability, but also availability, and the data might not even be able to fit on one machine or whatever. But when you talk about repartitioning some of the data, you know, on the fly or whatnot, whether it be to expand or collapse the partition based on the need,
and if you needed to reassign the partition to another node... one of the things that came to mind was from a DB2 world. Oracle had a similar concept, but I'm more familiar with it from a DB2 world, although the name eludes me at the moment: their high-availability solution. One of the things that you could do was have a SAN, a storage area network, where it's all fiber-based, right? So all of this data sits on a disk array that you are accessing over fiber, and then the servers are literally sharing the same underlying disk. And so transitioning from one node to the next doesn't necessarily have to be all that expensive. So that type of idea was really dependent on, well, what's your physical architecture? And we'll get more to it in the routing section, um, because physical architecture, really, it's that whole joke about, uh, just turtles all the way down, you know, kind of thing. Because it really started to matter: well, okay, if all your partitions are physically separate, you know, because maybe it's an AWS or, uh, Google Cloud or Azure kind of world where you don't control the hardware... but if you are the Googles or the Amazons or the Microsofts and you can control that hardware, then, you know, you can have some greater, um, efficiencies, because you could localize that better, you know? And so your use cases can vary. Does that make sense?
Yeah, it opens things up. So, you know, if you imagine I had a really, uh, good network... S3 and Google Cloud Storage and Azure Blob Storage are basically, you know, kind of what that is. They went and created these dynamically shiftable storage options that store what you want, and they keep track of it.
Yeah. So in the world where you can't control your storage, then, you know, you have a different set of needs. But when you can control that, like if you were to be all on a SAN, then, you know, you can have some improvements there with how you could go from one node to the next. But you also lose, uh, some of the benefits in the way of, like, uh, protection. I don't know that we really talked about that. We talked about it from, well, I guess retention? Or no, not retention. It would kind of be under the availability area. 'Cause if you lost that one data center... and that kind of assumes that if you have access to that SAN, then you're all in the same data center, or else you couldn't be on that same SAN, you know, just because of the limitations of the fiber networks.
But, yeah. Yeah, it's turtles all the way down. It so is.
So, to take it back to term-based partitioning: basically, you know, it's just what you mentioned. So we've got the global index now, and the benefit is that you've taken this huge benefit of searchability from document-based partitioning, and now you've made another huge leap forward in performance. Because now, instead of, uh, us having to go to each partition, which, you know, if you have thousands of them, uh, is really slow, and you're network-dependent, and you've got this kind of fragility built in, now you can just go to the smaller system and say, you know, hey, where's my stuff? And you reduce that right out of the gate to, you know, potentially a much smaller number of operations that you need to complete. But the downside is that there's overhead in keeping track of those indexes that's much higher than the document-based partitioning, because, like we said, if things get modified, we have to go update the index. If data drops out because of retention, we need to go modify the index. Every time we insert new data, we've got to modify the indexes. And this is this one smaller system, but it's really high traffic. And it's a database of your database. You want to delete that? You know, you've got to go update this other database. Yeah, exactly.
And just by having, you know, a separate index alone, you've got some kind of like an async
problem built in there.
So I need to insert my document, but also need to go insert the metadata about my document.
And, you know, depending on how many secondary indexes I have, you know, there could be multiple writes that need to happen in different spots just in my indexes alone.
So now every single insert into my database is now, I don't know, 10, 11 inserts. Once you add in replication, you know, it's a multiple of that. So each insert is now, I don't know, 20, 30 operations that need to be written across some number of nodes.
And that's not all going to happen at the exact same instant.
And so that's why when we talk about some of the stuff,
like you're kind of looking at basically,
you know, you're talking about eventual consistency
just on a single write,
which is part of the reason I kind of thought of this chapter
as being closely associated with like NoSQL type things.
Yeah. I know that in the last episode, Alan and I had debated whether or not you think of these indexes as another database. And so we kind of talked about that. And even in the book, I'm trying to find it now, there was a little blurb in one of the footnotes where they did make the point of referring to it that way: yeah, you could think of these as little mini databases within your database. Even though that wasn't how I had traditionally thought of it. That's one of the things we talked about in the last episode: I'd always thought of these as just different tables. But I do get it, because of all the complexity around it and the management and operation around maintaining it.
Now, DynamoDB does say that you can generally expect a fraction of a second for the inserts to happen to both the data and the indexes. But you can imagine that it would kind of stink if you were inserting data and something went wrong and it took minutes: if I look up the data one way, it works, but if I look it up another way, it doesn't. And calling the secondary indexes an optimization is misleading, because it's not just about making it faster. If that secondary index hasn't been written yet, you won't find the data. So it's not just making it faster; it makes it findable at all.
Well, that depends on how you're searching for it, though, right?
Well, I assume you're trusting the indexes.
What I mean, though, is if it was written to the primary index and you're only looking for orders for today, then you'll find it. But if you were to look for orders by credit card, by a specific number, then yeah, maybe that secondary index hasn't been updated yet for whatever reason, and now you can't find it. And we've all been in that situation where we're like, I don't understand: I can find this data this way, but if I use this other method, it doesn't show up yet.
Right, right. Which is kind of funny. I think in the past we've also talked about, and only because you brought up DynamoDB it reminds me of eventual consistency, how you could go to Reddit and submit a post, and then you refresh the page and you're like, wait, where did it go? Yeah. You know? I mean, it's not exactly the same as what you're talking about, but it just kind of came to mind.
Yeah, it's kind of funny to think of. At least I think about writing the data as: either it's there or it's not. And so we talked about one of the strategies, read your own writes. We say, hey, I'm going to like this post on Reddit, but don't show it, or don't let the client return, until we verify that we can look that like up and that we've got it in our database. So it comes back and says, okay, we got your like. And then maybe you refresh the page somewhere else and it doesn't show up, because it hasn't made it into that secondary index that keeps track of those likes by user, or by post, or by date, or however else they do it. And so it shows up here on one page for your user and not there on another. It's just kind of confusing.
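A minimal sketch of that read-your-own-writes idea, assuming a hypothetical client where db.put writes the record and db.find_by_user reads back through a possibly stale secondary index. Neither is a real library call:

```python
import time

def like_post(db, user, post, timeout=2.0, interval=0.05):
    db.put(user=user, post=post)              # hypothetical primary write
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if post in db.find_by_user(user):     # read back through the index
            return True                       # safe to tell the client "liked!"
        time.sleep(interval)                  # index hasn't caught up yet
    return False                              # surface the lag instead of lying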
Today's episode of Coding Blocks is sponsored by Datadog, the monitoring and analytics platform for cloud-scale infrastructure and applications.
Datadog's machine learning alerts, customizable dashboards, and 450-plus vendor-backed integrations
make it easy to unify disparate data sources and pivot between correlated metrics and events
for faster
troubleshooting. By combining metrics, traces, and logs in one place, you can easily improve
your application's performance. And I can't emphasize enough the value of combining all of that in one place, because we're talking about indexes and databases as it relates to this episode, specifically partitions and whatnot, right? And Elasticsearch keeps coming up in the course of our conversation. And of course, of those 450-plus integrations that Datadog has, Elasticsearch is one of them. Not only that,
but they actually have like the single pane of glass view that you could have of your Elasticsearch
cluster to know like, hey, what's the current rate of your queries and the current rate of indexing?
So you want to know about the search and index
and performance of your cluster.
What about resource saturation and errors?
All that with JVM heap and garbage collection metrics,
network collection, single pane of glass,
all of that for your Elasticsearch cluster.
And that's just one example
of the many technologies that they cover. Yeah, if you want to make great decisions about partitioning and nodes, you've got to have the data, and this is the way to do it. It's fantastic, and it looks beautiful. But I don't
actually want to talk about that because Dash just happened. And I was just browsing YouTube.
And as far as I can tell, all the talks for Dash have been uploaded. And so start with the keynote and then dive into some really great stuff there.
I was just watching the Ask an SRE videos.
Looks like there are two of them.
And I had only watched one of them.
So I'm excited to check out the other one for North America.
They've got panels, case studies, a bunch of sessions on security and incident response, supply chain attacks.
They're just really great stuff.
And it's all up on YouTube.
So you can be watching or listening to that while you're doing the dishes or, you know,
hanging out.
So give that a shot, and then go try Datadog free by starting a 14-day trial.
And you also get a free T-shirt once you install the agent.
Yep.
So visit datadoghq.com slash codingblocks.
That's datadoghq.com slash coding blocks to see how you can unify your monitoring today.
All right.
So I don't even know how we do this anymore.
Am I supposed to?
Because I kind of gave up on the late night.
Hey, listener.
But then even in the last episode, in the background, Alan was still doing it. So I don't know.
Maybe I'm supposed to like, hey, listener,
if you wouldn't mind leaving us a review,
you need to head to www.codingblocks.net slash review
where you can find some helpful
links and all your
late night coding favorites.
Morgan Freeman. That was great. At first, I couldn't tell if you were doing an Al Bundy or a Morgan Freeman, but then, yeah, I mean, it's obvious.
Well, that's how bad my late
night radio DJ voice is then because it was supposed to be, you know, I was thinking of like the smooth, sweet sounds of W-J-Z-Z.
No, I mean, that was straight up Morgan Freeman, just perfect.
Now that I'm trying to like, you say that, I'm trying to think of like a Morgan Freeman thing, you know, or a catchphrase, and I'm like coming up blank.
Yeah. So, all right. Well, so once again, we will not have a survey for you to answer. Because, you know, wouldn't it be a little bit odd to only have one answer? And it would be super odd if you still lost, even to yourself. That would be uncomfortable.
I could do it.
I would feel bad for you. I don't want to put you through that. That wouldn't be fair to you.
I guess we're kind of in my favorite portion of the show.
Survey says... Wait, wait.
Can you read this survey as Samuel L. Jackson?
Wait, Samuel L. Jackson?
Yeah. Well, I might have to drop way too many f-bombs to do it. All right, we're going to have to put an explicit tag on this show if I do it that way.
Say what? What? No, I'm not going to. All right, so, well, okay, fine. How about I ask you this first, then: do you know why the new Kindle screen is textured to look like paper?
No, I don't.
So you feel right at home.
Oh, that's terrible, Micro G. Actually, that was dad jokes. That was the Dad Jokes API. Yeah. All right. Okay, so for this episode's survey, we ask: how many different data storage technologies do you use for your day job? Now, this is important: it's the ones that you use, not the ones that are available in your company, just the ones that you use. So your choices are: just the one, it's our hammer; or, two to three, it's a quaint little data pipeline; or, four or more, oh my God, why do we have so many; or, none, keep your data crap out of my CSS. Or, you know, whatever your front-end choice of technology is. I'm assuming front-end.
Yep.
This episode is sponsored by Linode.
Simplify your infrastructure and cut your cloud bills in half with Linode's Linux virtual machines.
Develop, deploy, and scale your modern applications faster and easier. Whether you're
developing a personal project or managing larger workloads, you deserve simple, affordable,
and accessible cloud computing solutions. You can get started on Linode today with $100 in free
credit for listeners of Coding Blocks. You can find the details at linode.com slash codingblocks.
And Linode has data centers around the world with the same simple and consistent pricing regardless of location.
And I don't know if you've ever looked at their pricing calculator.
I was just looking at their website.
But because the experience for Linode is so simplified compared to other cloud vendors, there are really only a couple of boxes and sliders to deal with. It's really crazy, because their pricing is so reliable and affordable and predictable. It's literally: how much memory do you have, how many nodes do you have, and you've got a little slider here for transfer, and that's it. That's an incredible experience compared to some of the other things I've seen, where it's just impossible to know how much you're going to be spending.
And it's such a relief to know.
And with $100 extra credit, you can see just how much you can stretch that.
I mean, I was able to run a three-node Kubernetes cluster for months for less than $100.
Yeah.
And when we say that you could use that $100 to go towards even personal projects,
we're not kidding. What if... Now, hear me out, Joe. Hear me out because this one's crazy. You
ready? What if you just wanted to have your own CS Go server? Oh, I didn't even think about that.
Yeah. So you can go... I'm not kidding. I'm not making this up. You could go to Linode,
go to the marketplace, and there's a whole slew of things that you could just easily one click, add to your cluster,
add to your environment, right? Including CS:GO. But I know you're thinking, yeah, but we've been talking about databases and whatnot. Okay, fine: MongoDB is in there as well. Postgres is in there as well. You could have your traditional databases and play around with partitioning and learn about that, if that's what you want to do.
But it's time to go have some fun too and play some games. And that $100 can go a really long
way. Like Joe said, that pricing calculator, it makes it super easy to see what your cost is
going to be. Plus their costs are so small. So like you can really stretch
that hundred dollars out. So it's, you know, what was the word I'm trying to look for here? Value. The bang for the buck that you're going to get here is extremely high. So you can choose the data center nearest to you. You can receive 24/7/365 human support with no tiers or handoffs, regardless of your plan size. Because haven't you always hated it: you call tech support at the time that you need tech support, and instead of being able to get to that person, you keep getting bounced around from one person to the next, or you can't even get to a person at all because you keep going through an automated system. With Linode, it's human support, with no tiers or handoffs, regardless of your plan size.
You can choose shared or dedicated compute instances,
or you can use your $100 in credit
for S3-compatible object storage,
managed Kubernetes, and more.
Yep, if it runs on Linux, it runs on Linode. Visit linode.com slash coding blocks. Again,
that's linode.com slash coding blocks and click on the create free account button to get started.
All right. So let's talk a little bit about rebalancing, because we don't always get things right the first time, and things sometimes change. Like if you have to add more nodes because you've got CPU or RAM problems, or maybe you've got lower traffic than you used to and you need to save some money. Sometimes nodes just go down and you need to be able to recover from that. Those are all examples where you might need to repartition your data. And basically, no matter how you do it, there are a couple of goals that you generally want to aim for. The first is that, in most cases, you want to distribute the load equally-ish. And I didn't say distribute the data equally-ish, because, as we mentioned, there are different strategies for that. And sometimes you want to embrace hotspotting by having different hardware. Or you could have mismatched nodes, like different node pools, where you're hooking up whatever hardware you've got in the closet. So you want to make sure the load is distributed equally, to keep consistent query times.
Also, you probably want to keep the database operational during the rebalance. This is where we start getting heavily into where I was thinking of Kafka, as we get into rebalancing and routing types of conversations. Because Kafka especially can be a bit of a beast with this particular type of problem. Depending on the technology, repartitioning your data may be a click-of-a-button thing, like, hey, it's done. Or it may be a huge effort to go and rekey all your data so that it gets into a new partition. Because Kafka specifically, if you were to just add on another partition, is going to say, great, but I'm still keeping all the old data over here, because that's where it is. You would have to specifically reread and rekey data to move it around.
Yeah.
Absolutely.
And there's a couple different approaches and some pros and cons that we're going to get into.
But there's one way that you absolutely should not try to basically partition your data, which is you should not be hashing by the number of nodes, like the number of computers involved.
This part was so great.
Yeah, I loved it.
Yeah.
I hadn't considered it before.
I don't think I've ever seen it partitioned that way,
but number of nodes changes, right?
Sometimes you'll add another.
Sometimes one will go down either temporarily or permanently.
And every time you do that, if you're hashing by the number of nodes, then you're going
to be moving data around a ton.
And I didn't even realize how much it was until I looked at some of the examples in the book.
And it said, you know, obviously if you're going from one partition to two, at a minimum,
you're, you know, moving 50% of your data, right?
So what I didn't understand is, even if you have, say, 100 nodes and you've got a key, I'm just going to pick one of 1000, and you drop a node, so you go down to 99: well, 1000 mod 99 is 10, as opposed to 0, so you've got to move that record. What if we went to 102 nodes? Well, that hashes to node 82, so now that key of 1000 needs to go to 82. We go to 103 nodes? Guess what, it's moving again. So you are constantly juggling a ton of data, way more than you would think, every time you change the number of nodes, which is a ton of work.
And all you need is for one part of your network to be slow to respond for the rest of the nodes to think, oh, we just lost a node, time to repartition the data. Yep. And so, yeah, you can have what they call the stampeding cattle, or however I've heard it called, but basically the problem just blows out of proportion, because now it starts adding nodes really quickly because it sees the system is not performant. Now the rebalancing has gotten out of control, and it's just taking forever across all these nodes and moving all this data, which causes the need for more nodes.
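You can see that churn for yourself in a couple of lines of Python. The key 1000 is the one from the example above:

```python
key = 1000
for nodes in (100, 99, 102, 103):
    print(f"{nodes:>3} nodes -> key {key} lives on node {key % nodes}")
# 100 nodes -> node 0; 99 -> node 10; 102 -> node 82; 103 -> node 73

# And it's not just one unlucky key. Count how many of 100,000 keys relocate
# when a single node out of 100 is lost:
moved = sum(1 for k in range(100_000) if k % 100 != k % 99)
print(f"{moved / 100_000:.0%} of keys move going from 100 to 99 nodes")  # ~99%
```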
I hadn't heard of that term before.
Is that really what it's
called? I probably got it wrong.
Stampeding herd?
Huh. Interesting.
I mean, I could totally see it, though.
It definitely spirals out of control fast
So, yeah, I can't remember now. Oh, but you know, if you remember, audience, leave a comment, because we're giving away copies of the book. I forgot to mention that. Yeah, which means I need to do the giveaway from the last episode. All right, so I wrote that down. There's some cool name for it, and we've talked about it on the show before. I'm sure if you heard it, you'd know it.
But yeah, so that is the one thing that the book cautions you about very strongly: do not do this, because you're tying yourself to the number of nodes, and you're going to be moving data around a lot more often than you might think. It's counterintuitive.
Well, the thing, the thing is that I had never thought about partitioning it that way,
only because I guess I never had to think about it.
But when I read that, I was like,
well, I could definitely see why somebody might be tempted
to want to partition it that way.
Yeah, right? Like, you know: well, hey, I have 10 nodes, let me just partition it evenly across my 10 nodes. So whatever your key is, I'll just mod that, and that's where you fall. But like you said, as your nodes come and go, it would really cause problems for your data. You'd constantly have to move it around.
And that's where that SAN idea was coming into play here, returning to that.
I was like, well, I guess it wouldn't be too bad if you were able to live in a world where everything could share the underlying disk structure.
And because it's all fiber, then it could be super fast.
But that's not the real world, more often than not.
Yeah, luckily I've never fallen into this.
But I can see myself being in a case with Kafka where, let's say, I've started a small project with Kafka.
And let's just say, for example purposes, I've got one broker.
I can see myself having one partition for my topic that I care about.
And then I add another broker.
Well, now I should go ahead and add another partition.
Adding partitions in Kafka, changing those partitions is painful.
You basically have to kind of rerun everything through.
It's slow and it's a problem.
And you've got to pick when to swap over.
And you've got data coming in in the meantime.
And it's just a pain.
It's actually not even advisable, by the way. If you really wanted to do that, the more advisable way to do it in Kafka would be to set up an entirely separate topic that is partitioned the way you want, and then just parallel your writes to both topics, and then swap over to the new one as you can. Because otherwise you'd have to reread the original topic from the beginning of time to rekey it, you know, to get the data spread across the partitions the way you would want it.
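Here's a rough sketch of that parallel-writes migration using the kafka-python client. The topic names, the broker address, and the rekey() function are assumptions for illustration, and a real cutover also needs a backfill job and consumer coordination:

```python
from kafka import KafkaProducer  # pip install kafka-python

producer = KafkaProducer(bootstrap_servers="localhost:9092")

def rekey(event: dict) -> bytes:
    # Hypothetical: derive whatever new partitioning key you actually want.
    return event["customer_id"].encode()

def dual_write(event: dict, payload: bytes):
    # The old topic keeps serving existing consumers untouched...
    producer.send("orders", key=event["order_id"].encode(), value=payload)
    # ...while the new, correctly partitioned topic fills up in parallel.
    producer.send("orders-v2", key=rekey(event), value=payload)
```

A separate job replays the old topic from the beginning into the new one; once it catches up, consumers cut over and the old topic can eventually be retired.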
And even the idea that I mentioned with the SAN, where that kind of falls apart is that it only works if you're able to take the partition as-is. But in this rebalancing portion of the book, go back to your example, I think you used HBase or Couchbase, I forget which one you said, where you went from one to two nodes and you were literally rewriting 50 percent of the data, because you're physically moving the data from one partition to another partition. Even if you did have access to the same disks, you're still moving it to a different part of the disk. So you'd still have the I/O expense there.
That's a really good point.
Yeah.
And I should mention that when you go from two nodes to three, it's not just 30% or so of the data that's moving. You have to move that chunk, but then you're also rejiggering the data on the nodes you just dropped it from, in order to reclaim that space. You're touching 100% of the records, even if you're only writing tombstones to 30% of them. You know what I mean? Because you're constantly moving bits around. Yep. Yeah, absolutely. You can't just leave these giant holes of zombies; you have to fill that stuff in, or otherwise you don't get the benefits of the other kinds of lookups. And depending on the underlying technology, you might have compaction that would eventually kick in based on the amount of data, which would likely just result in rewriting the entire smaller portion of it. So that's where you get into that 100% thing. Yeah, it really grows out of control depending on what you're trying to do.
And it was surprising, too, that some of the systems they talked about here actually have this capability as a built-in feature. I think I alluded to it before. I think it was HBase, maybe, that could dynamically expand or contract the partitions based on a file size limitation.
Yeah, absolutely. HBase and RethinkDB were the examples, and apparently MongoDB has an option for it, which is really cool. Yeah, sorry, yeah, no, it was a typo.
So, going back to the Kafka example real quick, I wanted to point out that if you care about the ordering of the events, you have to fully copy that data to the new topic before taking anything new into the new topic. So it's almost like you're going to have some downtime there. And there's no automatic way to do it, because Kafka knows that this is a terrible operation and they don't want you to do it. But I can see myself naively getting into that situation where I had one node, so I have one partition, why not? Then I add another node. Well, crap, I've got to do that expensive operation. Fine. And we add a third one. Now I'm realizing the error of my ways.
But is there something I could have done to prevent that situation?
And the answer is yes.
The widely accepted solution to that problem is to have a number of partitions that is greater than your number of nodes. How many partitions are you going to have? How many nodes are you going to end up with? That's a hard problem, because you have to play a guessing game: you have to guess at how many nodes you could feasibly have in some imaginable future, and then come up with a number of partitions that's greater than that. And there's a huge benefit here, which is that you don't have to move individual records when things move; you move whole partitions. So in the case where we have one node, and let's say it's got 10 partitions on it for a single topic, or a database table, or a collection, whatever: when we add a second node, we move half of our partitions there, which is much easier than going record by record, because we've got these things already split off. And so we can move those partitions, and it's a much, much easier move.
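A toy Python sketch of that fixed-partitions idea. The node names and counts are made up; the point is that whole partitions move, and the key-to-partition mapping never changes:

```python
assignment = {p: "node-0" for p in range(10)}      # partition -> owning node

def add_node(assignment, new_node, all_nodes):
    target = len(assignment) // len(all_nodes)     # aim for an even share
    moved = []
    for p, owner in sorted(assignment.items()):
        if len(moved) == target:
            break
        if owner != new_node:
            assignment[p] = new_node               # ship the whole partition
            moved.append(p)
    return moved

print(add_node(assignment, "node-1", ["node-0", "node-1"]))
# [0, 1, 2, 3, 4] -- five partitions move; the other five are never touched,
# and hash(key) % 10 still points every key at the same partition as before.
```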
You know, we were talking earlier about what the value of this book might be if you only ever stayed in one technology. And I can think back to some of my earlier experiences with Kafka, prior to reading this book, when it would come time to repartition things: why is this so complicated? I don't understand why it doesn't just support this. Why is this such a hassle? And now I'm like, yep, totally get it.
Totally makes sense to me now. Yeah, I remember us running into the problem and just being kind of mad at it. Like, what do you mean you don't have a way for me to change this? That's such an obvious feature that anyone would want. And then when you start thinking it through, it's like, oh, there's actually a bunch of trade-offs, and there are some decisions where you almost need a manual process, because there are things that they can't do
for you. Yeah, and it's kind of like your comment a moment ago that made me think about this, because you'd said they're trying to prevent you from doing things. So, kind of going full circle, right, like our "I is for interface" episode, and those being like guardrails. You could extract the same concept to whatever technology you pick. In this case, Kafka, for example: that's your guardrail, and they're trying to prevent you from doing certain operations that they've deemed to be inefficient for their use case. Right. And so they're like, no, we don't support that; if you want to do it, you're going to have to go through some pain to make it happen. Yep, yeah, absolutely.
And so, you know, that example that we started with, where we said we've got seven years' retention for credit card transactions (sorry, I'm typing along, updating my notes here). So, with seven years' partitioning, if you're going from 10 nodes to 11: okay, you're going to have to go through those partitions and figure out the roughly 10% of them that are going to need to move. But that other 90% of the partitions, you don't have to touch them, you don't have to think about them; they're out of the equation. If we didn't have this case where the number of partitions was higher than the nodes, we'd be looking at every piece of data, which is just awful.
Yeah. And what if you have more nodes than partitions, right? What if you had thousands of nodes in that case?
Well, I mean, would it even have to be that extreme? It could just be that you have three nodes and two partitions.
Yeah, then you're not using all your nodes. Yeah, absolutely. You're going to have nodes that just aren't doing anything; they have no data on them. Yeah, just a waste of money. Yep. And so they say most vendors don't actually support it, and it's not even sensible in most cases. Kafka, you can absolutely do that.
Oh, yeah.
And there's a couple of different cases where consumers, producers, you care about partitions.
And the only thing I can say there is that sometimes you'll want to over-provision.
And so it kind of avoids a cold start problem.
And one kind of use case I thought of here is if you're launching a big video game,
it's a big online video game.
Say Call of Duty 2022 is coming out.
And you know that you're going to be going from zero users to a million users for an hour.
That can be a really rough scale problem.
And so you can do some kind of paper-napkin math and say, you know what, let's go ahead and pre-provision 100 nodes and just have them ready, and we'll pre-provision a certain number of partitions that is greater than that number of nodes, so that when people start coming online, we can start balancing that stuff out better. So it avoids that kind of problem. And I skipped ahead there. The downside to that pre-provisioning, though, is that it flies smack in the face of horizontal scalability, right? Like, if you put your Kubernetes hat on and you want elasticity: if I don't need 100 nodes, then I don't want to have to pre-provision them.
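For flavor, here's what that paper-napkin math might look like in Python. Every number below is an invented assumption for the example, not a real capacity plan:

```python
peak_users = 1_000_000              # launch-hour guess
events_per_user_sec = 2
avg_event_bytes = 1_000
replication = 3

ingress = peak_users * events_per_user_sec * avg_event_bytes * replication
per_node_budget = 75 * 1024**2      # say a node comfortably handles ~75 MB/s

nodes = -(-ingress // per_node_budget)   # ceiling division
partitions = nodes * 4                   # headroom so partitions > nodes
print(f"{ingress / 1024**2:,.0f} MB/s -> ~{nodes} nodes, {partitions} partitions")
# ~5,722 MB/s -> ~77 nodes, 308 partitions
```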
But specifically to pick on Kafka for a moment: the things that I've read about Kafka provisioning say the advice is to architect the system for the next two years. Don't think about your needs today, but what your needs are going to be in two years. That's how you need to architect your topics and brokers and replication factor and the number of partitions. So you have to really think through your use case, think through where you expect to be within two years, and that's the way you should size your Kafka environment. So, on that cold start problem you mentioned: you're definitely heavily front-loading it to avoid future problems, because Kafka has those limitations around, oh gosh, I need to repartition things.
Yeah, yeah, absolutely. And it kind of flies in the face of all this stuff we spend so much time talking about, like this ability to dynamically scale things. This is a case where it's, whoa, whoa, step back a little bit; cold starts are serious. You want to put some thought into these systems ahead of time, which is unfortunate, but after we talk about this stuff, it makes a lot more sense. I mean, serving data is so much harder than serving a website, or like HTML, because anytime you have to deal with state... just think back to everything we've ever talked about with unit tests, for example. Anything that doesn't involve state is super simple, right? It's super simple to write unit tests for. It's super simple to scale, because you're like, oh, well, if I have 1,000 of them, then it's going to be 1,000 times more performant. But anytime you have to deal with state, which is what all data is, it's infinitely more difficult to work with.
Yeah, and there's been talk about scale-to-zero systems. Like we talk about serverless functions and things like that. Sometimes they'll talk about basically having a service that's free until you need it, and then you pay for the usage. And that's really hard to do. I think a lot of people don't realize how hard it is, because you need to start somewhere, and it takes time to get up to speed. So going from zero to thousands of requests per second, that's really tough, and you're probably going to lose a lot of those first requests. You just can't accelerate that fast, because there's overhead. And then it's easy to accelerate too much, and now you're spending too much money. So scaling all the way down is really tough, and scaling fast is tough. Yeah, scaling is tough. Well, so why don't we just
have a million partitions, right? We just create way more than we need, and we never have to worry about it. Well, the downside there is that there's overhead. Yeah, and we kind of talked about that on the last episode. Again, I keep picking on Kafka here, because I keep putting on my Kafka hat as I read through this book. But the specific example that I gave last time was that for each partition within that directory, you actually get at least two files. So then, depending on your operating system, there's a limitation on the number of open files you can have, which in Linux you can actually configure. But then there's the overhead of just having the file handle open itself; there's some memory overhead related to that. So you can run into problems if you try to have those million partitions. There's an amplification factor to consider, because each partition is actually a couple of files. Yeah, absolutely. And we mentioned, too, that if you're rebalancing, it's really nice to have more partitions than nodes, because you only have to look at subsets of your data. On the flip side, imagine the alternative extreme: effectively no partitioning, where each record is its own partition. That's the extreme other end of it, and it's the same as looking at every piece of data. You've completely mitigated the benefits of partitioning by having one partition for every record. So you've got to try to find that sweet spot, and it depends on your use cases, obviously. But I just want to point out that you can't just pick a huge number, go with it, and be safe.
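A back-of-the-envelope version of that overhead argument. The two-files-per-partition figure comes from the discussion above, while the per-handle memory cost is an illustrative guess:

```python
partitions = 1_000_000
files_per_partition = 2          # at least: a log segment plus its index
bytes_per_open_file = 1_024      # assumed kernel/process bookkeeping cost

open_files = partitions * files_per_partition
overhead_gb = open_files * bytes_per_open_file / 1024**3
print(f"{open_files:,} open files, ~{overhead_gb:.1f} GB of bookkeeping")
# 2,000,000 open files -- compare that to `ulimit -n` on a stock Linux box.
```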
Yeah, it really helps to know your use case, know your data, and go from there. I'm not sure how this works from an elastic scaling perspective, but I've never thought about it this way in relation to SQL Server or Postgres or anything like that. Specifically in the Kafka world, though, you actually want to think about it in terms of bytes. You want to think about it in terms of the size of what you're going to be transmitting back and forth and what latency is going to be required for that transfer, and then you size it based on that. But I've never thought about sizing a SQL Server or a Postgres or anything else like that. Have you?
No, definitely not.
I kind of wonder if part of that is because Kafka views things as chunks of data. As it writes stuff in, it writes it to contiguous space on disk, so it's almost like a big buffer, and it keeps track of where the offsets are for each individual message. It's got a separate pointer; it says, I know message one starts here, message two starts there. So it throws that stuff in as a big blob and then figures it out later when you ask for it.
Kind of reminds me of the deli counter at a Publix or something like that, where you go and order a pound of cheese or a pound of meat, and they go through and slice, slice, slice. The data is stored together as that big hunk, and then you come and say, well, hey, give me a pound, or give me 10 slices, or whatever, and it can do that. Now I want some deli cheese, thanks.
I know. And you know, that is really how the consumers work in Kafka, too.
It's funny. This is a total tangent, but with traditional queuing systems, you might say, hey, give me a record and I'll do something with it. With Kafka, you generally tell it to pull based on a batching system. You say, hey, give me either some number of messages or some amount of time, and then I'll stop. So the consumer will say, I'll take a thousand messages or ten seconds, whichever comes first, and then it'll get that big chunk of data, because, like we said, Kafka stores stuff in these kinds of chunks. And it'll just slice off as much as you want, subdividing as necessary based on the offsets. And then, there you go. There's your data.
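That batch-style consumption looks roughly like this with the kafka-python client. The topic, broker address, and group ID are placeholders:

```python
from kafka import KafkaConsumer  # pip install kafka-python

consumer = KafkaConsumer("orders", bootstrap_servers="localhost:9092",
                         group_id="demo")

while True:
    # "A thousand messages or ten seconds, whichever comes first."
    batch = consumer.poll(timeout_ms=10_000, max_records=1_000)
    for topic_partition, records in batch.items():
        for record in records:
            print(record.offset, record.value)   # process this slice of the log
```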
But that's a big part of why Kafka works so well, and why it works so well in low-latency environments: it's really good at dealing with chunks of data. And because of that, they think of it in terms of data size much more than actual files, because it's not a whole bunch of little files on disk; it's like one big one. Which goes back to what I was talking about with the sizing.
Yeah, absolutely. I ended up creating a whole, I called it the Kafka calculator, in order to figure out how to size a given environment based on the number of connections in and out, and clients, and expected data size, and what kind of latency you want, et cetera, et cetera. And I'm not the only person that has done something like that for their particular usage of Kafka. Because at the time, when I was trying to work on sizing Kafka, there were other people that had done similar things. It is very specific to the size of the data that you're going to use, and, to your point, to wanting to serve it in contiguous blocks.
Yeah, that was really great. There were definitely assumptions that were wrong, and if we didn't have that calculator, it would have been really hard to know how you ended up with the numbers you did. So if you come back and say, oh, you know what, my average message size is actually seven kilobytes, not five: it's not going to change things by a factor of seven over five. It's going to be much more dramatic than that. And it was crazy to see just how much the numbers would change based on one piece of information.
Yeah.
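Here's a toy version of that sensitivity, with made-up throughput and budget numbers. A real calculator, layering in latency targets and per-broker partition caps, reacts even more sharply than this linear sketch:

```python
def size_cluster(msgs_per_sec, msg_kb, replication=3, node_mb_per_sec=50):
    mb_per_sec = msgs_per_sec * msg_kb * replication / 1024
    nodes = -(-int(mb_per_sec) // node_mb_per_sec)   # ceiling division
    return mb_per_sec, nodes

for kb in (5, 7):
    mb, nodes = size_cluster(100_000, kb)
    print(f"{kb} KB messages -> {mb:,.0f} MB/s -> {nodes} nodes")
# 5 KB -> ~1,465 MB/s -> 30 nodes;  7 KB -> ~2,051 MB/s -> 41 nodes
```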
And depending on what your storage technology is, they'll have a recommendation, like, hey, any given node should be responsible for no more than this amount of data or throughput. And that's definitely true in Kafka. They have recommendations for topics and partitions to be spread out so that no broker is responsible for more than a certain number of partitions in an ideal situation.
And you can start to see it, especially if you were to live in a world where you're creating partitions, or, I'm sorry, topics, on the fly. Maybe like, hey, Joe just signed up for my new service, so I'm going to go spin up these five topics related to Joe. And then your service becomes wildly popular, and you're spinning up five topics for every new user that signs up, and it can grow out of control fast. That's where this whole file handle thing comes into play. Because now you're at the scale of Facebook, with three or four billion concurrent users a day logging in, and it can get out of control quickly if you don't think about it up front.
Yeah, and even just adding five topics is a lot of work. So you get a new customer, you add five topics; say they've got 10 partitions in each topic, which is a fairly normal number, and then a replication factor of three, so each one of those partitions is replicated three times. Well, we just added 150... wait, did I do that right? It's going to be five times ten times three. Yeah, so 150 partitions added to our system, which, whoo, that adds up fast, right? Just for each customer. Yep.
All right, well, what about rebalancing? So we've talked about hotspotting a little bit, and we've talked a few times about HBase and how it's really good at doing it automatically, dynamically. I skipped a section, I know. I mean, kind of. We've talked about dynamic partitioning already; we've hinted at it. Because trying to determine the number of partitions up front can be difficult, some systems will allow you to do it dynamically. I think we mentioned HBase and RethinkDB as a couple of examples of that.
Yeah, and I wanted to point out that it's really important to know that there's no magic algorithm they're using to split your load up super intelligently. They're just kind of slicing it up in chunks. So you add a new node, and it says, all right, you get half, and I'll move that stuff over. And it's really nice that you don't have to think about it; it scales really well. Cassandra has an interesting way of doing partitioning as well, because it's got the whole leaderless architecture that deals really well with adding new nodes and being available even when nodes disappear. It's a different use case there, but it handles things really well. But there's no magic. Ultimately, it's doing some kind of dumb math in order to figure that out.
Well, which might work against you, though.
If all it's doing is splitting based on, well, the file size has now reached some maximum, so I'm going to give half to this one and half to the other one: the half of the data that the new node gets could be the half that's most heavily requested, if all it did was simply divide it in half.
Right. Yeah, and the cold start problem is really bad there, too. Like, you launch Call of Duty 2022, and HBase starts out at zero, and then it fills up really quickly, and it splits in two, and then those fill up, and it just keeps going.
They did talk about that specific problem in this portion of the book, though: even in the automatic, dynamic partitioning world, you would probably want to go ahead and start out with some minimum number of partitions that you thought would be reasonable, so that you don't get into that stampede problem you referred to earlier.
Yep, absolutely. So, yeah, we just talked about systems that automatically rebalance, and we talked about the downside there. It's really nice if you have workloads that are well homogenized or known, where things don't grow too quickly and you don't have hotspotting issues. But there are situations where you really want to protect your system. Like maybe you don't want to rebalance until off hours, because you don't want to cause unnecessary strain. Or, like we talked about, that video game launch where you want to scale up ahead of time, with stuff over-provisioned for a period of time because you expect a really quick flood. So there are considerations about whether you even want it to be automatic or not. And some systems, I mentioned Couchbase, Riak, and Voldemort, will suggest the partition assignment, but they won't apply it. They kind of wait for you to hit the enter key, in order to make sure it happens at a good time and that you're around in case something bad starts happening, like nodes starting to report as unresponsive because they're rebalancing. You don't want to end up in a situation where things go off the rails.
Yeah.
Now we're getting into the really fun part, though, where it's turtles all the way down: the request routing part. So I'm going to throw this out there, and then you tell me, as we go through it, whether this is kind of what you were thinking of. Basically, we're talking about, at the largest level, service discovery. But I was thinking about this from a Kubernetes point of view, where you have the Kubernetes service that could be the front; it could be the load balancer across multiple of these things, right? But then any one of those things might then be the front for its given nodes or partitions. Because if the service is pointing to, say, five brokers, each one of those brokers is like, oh, well, here are the nodes; it's doing its own little service discovery within it. So that's where it was turtles all the way down, because there's the physical network layer, and then once you get to a particular node, how you get down to an individual file system. This whole request routing concept. So with all that said, just keep that in the back of your mind as we
get into this section. Yeah, so we talked about having a database for your database, and the database for your indexes; now we're talking about having a database of where all that stuff is.
Yeah, and then maybe having something like a ZooKeeper in the background to keep track of that communication.
See, and we all thought we were being clever when we made a meme out of Xzibit. It turned out he was really onto something. He knew all of this well before we ever did.
Yeah. He knows. He knew.
Yeah.
And I remember one of the first things that I was really frustrated with when we first got started with Kafka. It's like: I want to get started with Kafka, you download a Docker Compose file, and it's got like eight different systems. This is terrible; this is such a bad database. And now I understand more about it. It's like, oh, actually, it's leveraging these really smart systems that do these really important things. And it's kind of cool to be able to see those moving parts and to be able to scale them independently. So it kind of exposes more of the gears than a traditional relational database; you can see the parts of the engine. Now that I know more about it, it's really cool to see all that stuff being there.
Yeah.
It's not so cool when you don't want to be the mechanic, though.
Well, yeah.
I just wanted the engine.
Why do I need to know all this stuff?
I'm just trying to hello world here.
Right.
Yep.
Yeah.
So, yeah, this is kind of an instance of a more general problem that you'll see in distributed systems, called service discovery: how do I know when new things are coming in and out? If services can scale themselves up and down independently, and users are doing deploys, and systems are just crashing, how do I know, as a client, what the heck is going on and who I should even be talking to?
And so, specifically referring to partitioning, there are a couple of different ways to solve this. The first is that nodes keep track of each other, which is almost like an anarchist, kind of hippie view of the world here. You just talk to any node, and that node knows where else you need to go. So you can imagine each node is responsible for doing routing: as long as you know one, that one can eventually get you to the others, because it just kind of propagates in a nice little mesh. And that's how Cassandra works, in a leaderless situation where it's just very, very resilient and highly available. So that works out really well. And eventually the clients can figure out more information about those nodes and keep track of them. And if a client can't get to one, well, it can try another one in the list, and it'll just make it happen somehow.
Of course, I'm going to go back to my canonical reference for this, because it made me wonder if that's what Kafka is moving to. Kafka, and I'm skipping ahead here a little bit, has traditionally relied on another Apache project, called ZooKeeper, to handle that type of request: knowing what node you needed to go to for a given partition. I haven't looked at it recently to see if it was implemented yet, but I know that Kafka had an open issue they were planning to iterate towards, where they were going to remove their dependency on ZooKeeper. And I don't know if they were planning to go to this Cassandra kind of model, where every broker would know, so you could just connect to any one of them and it'd be like, oh no, you need to go talk to my friend over here; go talk to broker number one.
Hey, I haven't looked really closely at how they're doing that. I know it's ultimately stored in Kafka, so it's kind of Kafka on Kafka there. But yeah, I don't
know what's happening there.
Yeah. I know originally, when I first started working with Elastic and just experimenting, like pre-Docker world, if you wanted to spin up two nodes locally, you'd have to tell them about each other. So you'd have a hosts file entry, for example, on your computer, and it was like elastic-one and elastic-two. And then in the configuration for elastic-one, you'd say, hey, there's also another node named elastic-two; and for elastic-two, you'd have to tell it about elastic-one. And if you add a third, guess what? You have to go in and update that configuration and restart the Elasticsearch service, so now it knows about all three. That's how you would bring things online, and I don't know if it still works that way. Like, I know the Kubernetes operators or whatever kind of just do that stuff in the background now. Or whether a service comes up and goes out and looks: hey, y'all, I'm scouring your network. I don't know how that works, but like it was just doing an nmap of who responds on what port. Which sounds crazy; it's probably not doing that. It seems almost dangerous.
Yeah, you'd hope it doesn't do it that way. That sounds like a huge security flaw.
Yeah. I worked on a feature for backup software at one point that we called resource discovery. It was basically the same thing: it would go out and look for systems that it knew how to back up. Thinking back, though, it's crazy: we would just look through everything we could see on your network for, I don't know, Active Directory or SQL Servers or Exchange servers, and then try to set up backups on them. It's like, oh, you should probably not have your network open to any software that can just go in and start messing with that data. But yeah, that was fun. Oh,
yeah. So the first approach we talked about was the nodes just knowing each other. The second one is the centralized routing service that the clients know about, and that's like the bootstrap service I mentioned. As long as you know how to get to the bootstrap service, the bootstrap service knows how to get everywhere else. Sometimes this is called out explicitly, like we mentioned with Kafka, where you would typically configure literally a list of bootstrap servers; and sometimes it's hidden behind something else. I think Mongo is one of the ones that has this other system set aside that does it, that they kind of mask away. But when you put together your connection string for getting to the database, that's the address that you use.
I was thinking of the master node in Elastic in that case.
Yep, absolutely. So if you have one master or multiple masters, either way, that's what the client talks to. The client should not be directly talking to any data nodes.
Yeah. Unless the master said, these are the ones you need to know about.
Well, even then, though, the master doesn't route you to the data node directly; the master abstracts that, right? I'm not sure. I don't know. I think it kind of hides it from you, so I don't
know. Okay. Yeah, that's a good question. I don't know. I'm so bad with networking stuff, too. Even load balancers: I feel like I've looked up how load balancers work a dozen times, but I still think about it like, so wait, when I talk to the load balancer, does it funnel the traffic through itself and back out, or does it tell me about the IP? But then how does that work? My brain explodes.
Well, that's why, as I said it there, I was like, well, it can't do that; it has to throw you over to the node, right? Because otherwise that one master becomes the bottleneck for all I/O in and out of that storage, out of that index, right? Which can't be. That sounds awful. But maybe I'm wrong and it totally does work that way.
I don't know.
Let us know.
You might win a book.
Yeah.
Yeah.
Enter a comment: you can go to codingblocks.net slash episode172, leave us a comment, and you'll be entered for a chance to win a copy of the book.
Yeah.
I'm taking another note here: load balancers, networking.
We should do an episode on that so I know what the heck I'm talking about.
Yeah.
Networking in general.
Yeah, exactly.
Talk about turtles all the way down, man.
Oh, gosh.
Yeah, no kidding.
Let's get into all the different layers of networking. Do you know the different layers, the seven OSI layers?
Well, I didn't mean the numbers.
Oh yeah, there's like the application layer. That's what you're talking about, right? Like application, the hardware layer... yeah, I might be able to figure it out, network translation, but I doubt I'll be able to get all seven.
Well, you started off pretty good, though: you knew that there were seven, so, you know, kudos.
Well, I don't know how any of it works. It's one of those things that I just happened to hear once, and for some reason my brain just remembered it. But I've never actually used it, and I don't know how to put it into practice or make use of that information.
Yeah, definitely. It's one of those things where every time I hear it, I'm like, okay, yeah, that sounds cool, that makes sense, and I want to try to remember it. But I never use it, and because I don't put it into practice, I always forget it.
Yeah, exactly. Like, what were the seven layers again? And when you tell me, does layer five really never come into play? Or, you know, whatever.
Exactly. It's stored in my brain right next to the Pokémon theme song, where if I start singing it, I can do it. Or the alphabet: I can sing the song, but if you want me to just use the letters, I don't know.
Okay, so I'm not the only one who, when trying to think of a particular part of the alphabet, instinctively replays the song in their head. L-M-N-O-P, people.
Yeah, exactly.
Okay.
Or if you've ever tried to: someone asks you for a password, or you're telling someone a password that you type often, and you don't have a keyboard. So you're like, it's the index finger on H, and the next one is... You almost have to type it out to remember what the password is.
So we've talked about a couple of ways to solve it: nodes keeping track of each other, and the centralized routing service. There is one other, which is the client being aware of every single node and how the data is partitioned, which sounds crazy. I don't know any system that actually does that, but you can see how it would work. Presumably the clients all just keep track of it, and so when a new system comes online, someone has to go out and say, hey, client, here's another system that you need to talk to; just keep that in your back pocket for when you need it. And whenever one gets removed or added, someone goes out to all the clients and lets them know, like, you know, that's been removed.
I wonder if that happens more in something like a Splunk, where it's not necessarily a database, but someone has to go in and actually configure these nodes to talk to, you know, almost like agents that talk to different services. I don't know. I was trying to think through an example of that and just coming up blank, an example where the client knows.
Well, okay. Okay, I guess I do remember now. As I was reading this book, the one thing that did kind of come to mind for this (gosh, I've got to stop talking about Kafka) is that the client does kind of know, based on the key, like, oh, this is where that's going to go.
Yeah.
The producer has to know and the consumer has to know.
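That "knowing" is just a deterministic rule both sides agree on, something like this sketch, where a sum of bytes stands in for a stable hash function (Kafka's real default is murmur2, and the partition count here is invented):

```python
NUM_PARTITIONS = 12

def partition_for(key: bytes) -> int:
    return sum(key) % NUM_PARTITIONS   # stand-in for a stable hash function

print(partition_for(b"customer-42"))   # the producer computes this at send time
print(partition_for(b"customer-42"))   # any other client gets the same answer
```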
Yeah, there is some awareness there. But the way you said it, though, made me think of a totally different world. Like, you bring in a new one, and you've got to reach back out to the clients and say, oh hey, clients, we've added a new one.
Yeah, definitely something you would not want to do if changes happen a lot. But I thought maybe DNS would be an example. You're bringing a new DNS server online, or you're changing your DNS servers, and you would probably go into each node or agent or computer, add the new systems, and then eventually, after the transition, you'd remove the old ones. But that, like, happens never. And hopefully you're not... I don't know.
I guess you do use IP addresses for DNS servers,
right?
So,
because otherwise you'd be in trouble.
You had to like look up your DNS servers via DNS,
but yeah,
off the rails.
So yeah,
that's the only thing I can think of.
And the book doesn't give any examples of it.
But no matter which way you go, partitioning changes need to be applied somehow, and it's really difficult to get right, even though it sounds like, you know, just another database. And this is where ZooKeeper comes in, which is a really popular system for keeping track of configuration. It's really resilient, it's used with a ton of other Apache systems, and it's used all over the place for doing things like this, or for coordinating. But you don't normally talk to ZooKeeper directly. Like, it's not your routing service, but it's the thing that's in charge of updating the routing service, or that the routing service will poll for changes. And so it's just a really resilient key-value store that you'll see used in a lot of projects.
And like we mentioned, Kafka uses it, Druid and Solr use it, and MongoDB has something similar. So it's a really popular tool, and it's really basic, but apparently it's really hard to get right. One thing that's really important about it: if it's storing changes about what other nodes and services are available, it also has to keep track of its own, because it's not like it can defer to some other system. It's the authority. So if one ZooKeeper node goes down, it needs to still be able to function. And, you know, we've talked about different problems with network partitions and bad things that can happen; ZooKeeper really needs to be able to weather all of those storms. It's like the single point of failure for a lot of these systems. So then you end up with a ZooKeeper cluster to keep up with. It really does become, like, a whole other cluster to maintain: an index to tell you how to get to your main index, and what part of that cluster you need to go to to read it.
Yeah. It's funny, the turtles all the way down. It's like, I don't really trust my cluster, so I'm going to come up with another cluster that'll keep track of the nodes that are in that cluster, in case the first one goes down. Xzibit comes to mind. He's like, yo dawg, I heard you like clusters.
Yep. Yeah, absolutely. So yeah, the idea is that they keep the functionality, the feature set, really limited, and just really, really focus on being as close to perfect as possible. And so that ZooKeeper keeps the zoo in line.
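As a rough idea of what "the routing service polls ZooKeeper for changes" can look like in code, here's a minimal sketch using the kazoo Python client. The ensemble addresses and the /cluster/nodes path are made up, and it assumes each live node registers an ephemeral child znode under that path:

```python
from kazoo.client import KazooClient

# Made-up ensemble addresses; in practice these come from config.
zk = KazooClient(hosts="zk1:2181,zk2:2181,zk3:2181")
zk.start()

# Assumes each live node registers an ephemeral child under this path.
@zk.ChildrenWatch("/cluster/nodes")
def on_membership_change(children):
    # ZooKeeper re-invokes this every time a node joins or leaves --
    # this is the routing tier learning about changes automatically,
    # rather than someone walking around updating every client by hand.
    print("current nodes:", sorted(children))
```

Because the children are ephemeral, a crashed node's entry disappears when its session expires, which is exactly the "its own authority" property discussed above.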
And so Cassandra and Riak use a gossip protocol. We mentioned those keeping track of each other. They kind of do this thing where they're constantly chatting to each other, saying, like, hey, do you know I'm a node? You're a node. Do you know any other nodes? They're kind of playing Go Fish or something. Had you heard of this gossip protocol?
A little bit. When, you know, DataStax came on the show a while back, we got kind of a tour of how they work, and of Cassandra, and so we spent a little bit of time looking at it then. I read a little bit about it, and I thought it sounds cool. That's where, you know, I mentioned that it feels almost hippie-ish, with, like, leaderless replication. I don't know, the whole spirit of Cassandra is just really cool and different to me. So I'd read a little bit about it then.
Yeah, it's just not a term you come across a lot. So I really liked it, because I was like, well, it truly is what it is.
Because it's like, you know, yeah, I heard that there's this other node over here called node 3. But by the time you go to get to node 3, it could already be down. So that's why it's not, like, authoritative. It's just, it might still be there, I don't know.
Yeah, it's pretty funny to think about these nodes kind of chatting over playing cards or whatever. It's like, oh hey, Alice, haven't seen you in a while, how's it going? And she's like, all right, you know, I just met a new friend named Fred the other day. But oh, Nathan... Nathan's not around anymore. And it's like, okay, well, thanks for that update, I'll see you later.
Wow, you took that to a dark place.
Yeah, that's how it goes. Like, haven't heard from Nathan in a while. It's like, oh yeah, me either.
Wow. Your examples, man. Yeah, we'll take him off the list, I guess.
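A toy version of that card-table conversation, sketched out: each node picks a peer it knows and they swap membership lists. Real gossip protocols (Cassandra's, Riak's) also exchange heartbeats and version numbers so that stale entries, the Nathans of the world, can age out; this simulation is invented purely for illustration:

```python
import random

# Each node starts off knowing only itself and one neighbor.
membership = {
    "alice": {"alice", "bob"},
    "bob":   {"bob", "fred"},
    "fred":  {"fred", "alice"},
}

def gossip_round():
    for node in list(membership):
        peers = [p for p in membership[node] if p != node]
        peer = random.choice(peers) if peers else node
        # Swap rumors: both sides end up with the union of what they knew.
        merged = membership[node] | membership[peer]
        membership[node] = set(merged)
        membership[peer] = set(merged)

for _ in range(3):
    gossip_round()

print(membership["alice"])  # eventually everyone has heard of everyone
```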
Do you have enough sunshine in your life? Like, you're in the Sunshine State, right? Oh wait, no, that's Colorado.
Is it Colorado? Well, they get the most sunshine, right? What's the Florida tagline?
It's the Sunshine State.
Oh, is it? Yeah. But it rains a ton here, so I've always had issue with that. There's actually a home video of me at, like, 11, thinking I'm funny, talking about how much rain Florida gets.
There's that Jerry Seinfeld joke, something about, like, 330 days a year of sunshine or something like that, versus Seattle, which gets 30.
Yeah, so I looked up the nickname for each state. Apparently there's some official nickname. Colorado they call the Centennial State, and Florida they call the Sunshine State, even though Colorado gets more sunshine. It's not fair.
Yeah, but they get the cold that you don't have to deal with, so there you go.
Yeah. Florida officially adopted the nickname in 1970, but apparently one of the suggested search terms for Google is, what's the real sunshine state?
That's funny.
So yeah, the gossip protocol. And then, as we already mentioned, Elasticsearch has different roles that nodes can have. You mentioned there was the master, there's the data, the ingestion, and then a routing node.
Yep.
And from there, there's just one other thing I wanted to hit on real quickly, which is parallel query execution. So far, when we've been talking about things, you know, primary indexes, looking things up by key, or secondary indexes, looking things up by these secondary data stores, those kind of fit well with the NoSQL paradigm. But that's far from the only game in town. And so the book brings up massively parallel processing relational databases, MPP. These are relational databases that handle really complex joining situations, filtering and aggregations, excluding this or excluding that. And really, joins are the big thing to me there. So how does that work? That's where we've got our old friend the query optimizer, who's responsible for taking these big queries and breaking them into stages that can be run independently, and then ultimately going through and joining those stages together and filtering to get the final results.
But when you think about it that way, you're like, well, hey, all this is really doing is taking that complex expression and generating a bunch of individual stages, just like the ones we've talked about: ones that use primary indexes and secondary indexes, or, you know, if those aren't available, full table scans going through every record in the partition. Well, hey, we just described everything we talked about. So you can kind of think about these massively parallel processing relational databases as everything we've talked about today, but just up one level, where it's able to break queries down into these components and then ultimately put them back together. And the really cool thing there is that, in addition to breaking them down, these are all independent stages, so we can run them at the same time. If we take a query where you're joining a few things and filtering and sorting and only grabbing these three columns or whatever, it might break that down into twelve stages, and maybe the first four can be run in parallel. Then we do something with those results to bring them back together, and then we can run the other eight stages in parallel from there, and then put those two pieces of information together. So there's a lot of work going on underneath the covers of those query optimizers in order to do that. But ultimately, it's just doing the same kind of querying and filtering that we've spent this episode talking about, which is just really cool to think about.
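Here's a stripped-down sketch of that fan-out/fan-in shape. This is not a real planner, just the idea: two independent "scan" stages run at the same time, and the join stage waits on both. The tables and predicates are invented:

```python
from concurrent.futures import ThreadPoolExecutor

# Invented data standing in for two partitioned tables.
orders = [(1, "ada", 30), (2, "bob", 15), (3, "ada", 99)]
users = [("ada", "US"), ("bob", "CA")]

def scan_orders():
    # Stage 1: a filter scan, independent of everything else.
    return [o for o in orders if o[2] > 20]

def scan_users():
    # Stage 2: build a join lookup, also independent.
    return {name: country for name, country in users}

with ThreadPoolExecutor() as pool:
    f1 = pool.submit(scan_orders)  # stages 1 and 2 run...
    f2 = pool.submit(scan_users)   # ...at the same time
    big_orders, country_of = f1.result(), f2.result()

# Stage 3: the join, which had to wait for both results.
print([(oid, who, country_of[who]) for oid, who, _ in big_orders])
```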
Yeah.
At this portion of the book, they only give a brief overview of parallel query execution. They basically say it's a really complex topic, and they get into it in more detail in a later chapter. So, you know.
Yep.
Makes me wonder if there's a book just on, like, query optimizers. We talked about NoSQL, and one of the advantages we said of relational databases is that you don't have to know as much about your data and its use cases, because the query optimizer is responsible for figuring out how to get your data back in the most efficient way, based on the statistics and indexes it has. And so it frees you up to just focus on the shape of the data, modeling the real world, where NoSQL is kind of the opposite: you really have to know your use cases, because you're the one designing these smaller pieces. So it's just kind of cool to think about the query optimizer as this big brain with a whole bunch of, I don't know, spider legs or something, that's able to do all this work and put it all together for you, so you don't have to think about it as a human or a query developer.
Well, I mean, you say that, but I know we've definitely been in situations, specifically in, like, SQL-related worlds, where you're looking at the execution plans. Just like, hey, why is my query not working as well as I want it to? And you look at those execution plans, and... you can create some really hairy queries, you know, where the execution plans are like, wait, what? I can't even see this whole execution plan on one screen, and I've got this giant widescreen monitor, and I still can't see it.
Yeah, there's a couple of tricks there. Sometimes SQL Server had a problem with AND statements or OR statements, where it was more efficient sometimes to do a UNION of two queries than it was to do, like, an OR statement, which seems totally crazy, but we saw it in practice a lot. Just kind of a defect; hopefully they fixed that. So it always helps to know that you can't trust it fully. But in practice, those query optimizers do a great job about getting that stuff out of your face.
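To make that rewrite concrete, here's the shape of it: an OR predicate versus the equivalent UNION of two simpler queries. The performance difference described above was a SQL Server quirk; SQLite is used below only so the sketch runs anywhere, and the table is invented:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE people (name TEXT, city TEXT, age INT)")
con.executemany("INSERT INTO people VALUES (?, ?, ?)",
                [("ann", "tampa", 40), ("bo", "denver", 25), ("cy", "tampa", 25)])

# The OR form -- the one that sometimes defeated the optimizer:
or_rows = con.execute(
    "SELECT name FROM people WHERE city = 'tampa' OR age > 30").fetchall()

# The equivalent UNION form -- two index-friendly queries, deduplicated:
union_rows = con.execute(
    "SELECT name FROM people WHERE city = 'tampa' "
    "UNION "
    "SELECT name FROM people WHERE age > 30").fetchall()

assert sorted(or_rows) == sorted(union_rows)  # same answer, different plan
```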
Alright, well, yes, that's it for chapter six. We are still not halfway through the book, but we hope you're enjoying it, and I know that we certainly are. Collectively, between the three of us, I think we all agree that this is by far our favorite book so far, and we hope that you will agree once you give it a read. So if you'd like to have a chance to own this book, you can head to www.codingblocks.net slash episode 172 and leave us a comment for a chance to win a copy. It'll definitely be in the show notes as one of the resources we like for this episode. And with that, we head into my favorite portion of the show.
It's...
Oh, no, wait.
This is Alan's favorite portion of the show.
I'm sorry.
Yeah.
Well, you said that wrong.
Oh, you're right, Alan.
Thank you.
Yeah.
So this is Alan's favorite portion of the show.
It's the tip of the week.
All right.
And so I've got some terminal-related goodies here. First, I think we've given Oh My Zsh as the tip of the week before, I believe, right?
Yeah, pretty sure.
Yeah. So just a quick recap. It's basically a really nice plugin framework for Zsh, which is the default shell in macOS and probably the most popular shell at this point. Oh My Zsh is a really nice way of organizing things, with plugins, a really nice ecosystem, and a really nice user experience. I was thinking about it as almost like a brew for your shell. So you can add, like, autocomplete for kubectl or Docker, or maybe show the Git branch you're in and how many files are different. Just really nice things like that, quality-of-life improvements for your actual terminal. So that's a popular tool for that.
Well, Bobby, a friend of the show, had a really cool theme that he told me about called Powerlevel10k, and we'll have a link here. This thing is the best. And calling it a theme is misleading, because it makes you think, like, colors, right? But for me, it brings in some really nice fonts, which give you some really cool icons, and it's also really good at supporting plugins. So, for example, you know, I mentioned Git, being able to see your Git history and changes. One of my favorite things is, if I start typing k, for kubectl, which I have an alias for for Kubernetes, it will automatically pop up the context I'm in, so it knows I'm trying to do something in Kubernetes. Kubectl does the same thing and will say, hey, this is the context you're in, so if you were to run this command, this is where you'd be running it, which is kind of a common pain point. And it's got really nice visual customizations too. My favorite thing, the reason I had to switch, is that all that stuff I mentioned, like showing the Git diffs, the branch you're on, the path you're in, the user you're logged in as, and the context, you can put all that on a line above where you're typing. Man, I hate when half your terminal is taken up by your username and the directory you're in. And if you put the Git branch in there, it's a considerable portion of your terminal width, potentially. So any commands that you end up typing end up going off screen or whatever. It's just annoying if you've got that terminal running in a smaller window.
And another problem I've had with shell commands: sometimes I'll run something, and if there's a lot of output, you scroll up and it can be kind of hard to see where you ran the command, if you've run it more than once, you know, to see where the output of one command starts and the next begins. And so by having this nice-looking information break between commands, it makes it really easy to scroll, scroll, scroll: there's where I ran this command, and there's the beginning of my output for this run. Which has been really nice.
And it's got, like, a wizard that walks you through everything, and it's really easy to reconfigure. So you basically start this thing running, it runs a command called, like, init or something, powerlevel init, and it'll say, hey, do you see this icon? Yes, no. Okay, would you like to install this font? Y for yes. Okay, next, do you like this style? This style? Do you want this stuff on the next line? Do you like this kind of connector? Do you want to show the time? Things like that. It just kind of walks through and asks. And so it was just a breeze to set up, and it was so nice, for Kubernetes, to be able to see your context easily.
Very cool. Yeah, we've mentioned Oh My Zsh back in episode 138 as a tip of the week, and then in episode 149, when we talked about our favorite tools, somebody, I want to say it was Alan, mentioned it there. So yeah.
Yeah, I know it's really popular. And I should mention, too, so I mentioned Zsh: that's not the only way to get this theme. I didn't realize this, but apparently you can just get it through Homebrew or Arch Linux, or you can do a manual setup. So it'll set that stuff up for you in, you know, normal non-Zsh environments.
Yeah, it is crazy how Zsh has taken over, because, you know, forever it seemed like the battle was between Bash and KornShell, like, the two most popular that I recall. And now in recent years, especially after Apple changed the default from Bash to Zsh, like, that was the end of the war.
Yeah.
Oh, and I do have one other one. So in Windows, I believe this happens by default: when you install VS Code, it adds itself to your path, so that you can be in a directory in a terminal, type code dot, and it'll open up VS Code in that directory.
That's really nice.
Yeah. Same with, say, if you go in your Downloads folder and you just want to open up a text file you downloaded from the internet and view it in VS Code: instead of going through File, Open and browsing to that location, if you're already there, you can just do code and the file name, and it's going to open it for you. You couldn't do that in macOS, because it doesn't automatically add itself to your path. So I googled how to add VS Code to my path, because I wanted to be able to do a lot more stuff in the terminal on Mac, like say code and a file name and just open up the JSON file or whatever I want to see, and have all the nice stuff I use Code for. Well, I googled it, and actually the recommended way to add VS Code to your path is to do it through VS Code. So if you do the Command-P and just start typing add to path, it'll prompt you, and one of the actions you can do is add VS Code to your path. You just hit Enter, and it just does it.
And it was so nice. It was like, wow, I've been annoyed by this thing for so long, and it was such an easy fix.
It's, um... it's Shift there, right? Like, Command-Shift-P to bring up the...
Oh yeah, you're right. I always forget. Yeah, Command-Shift-P. And then for Windows, it's Control-Shift-P.
Code is so nice.
Yeah. And I believe it's Command-Shift-P, yeah. I just thought it was cool.
But if I remember right, doesn't it prompt you after a fresh install, like, as soon as you open it up for the first time? It's like, hey, do you want me to also add this to your path? And you're like, I mean, unless you want to feel pain because you don't like yourself, then you would press no. But otherwise... Maybe I just missed it.
Oh, that sounds more likely, because I'd hate to think you needed, like, you know, some mental health day or whatever. Pretty much any time a system asks me if I also want to do something because it makes my life better, I'm like, no.
Like... no, maybe we need to get you some help, then.
Location services? No.
Oh, well, I don't know. With mobile apps I'm definitely more like that. You know, what about, would you like to enable notifications for this website? No. No. Block. Block your ability to know my location, and I don't want your notifications. And, yes, I will be better for it, thank you.
Now, one more for you. What about when you're using an app and it's like, hey, do you like using this app?
Oh, yeah. I hate this. I hate this.
Yeah, because if I say yes, you're going to try and take me to the freaking App Store. I'm like, nope. I hate your app. I hate it.
No, I always view it the other way around.
Really? Oh, you say no, and they ask for feedback?
Yeah, they want to know what you didn't like about it so they can correct it. If you do like it, they want you to leave a review so they can get the five stars. But if you don't like it, then they want to know why.
How do they know that you left a review? Do they know? I feel like if I say yes and finally go review... and you should review things, because it's really awesome and helps them out a lot. But if you leave a review... how do they know? And do they keep prompting you to leave a review even though you've left one?
I don't know. I guess maybe, if they ask, would you like to go to the App Store to review us, you just say yes and then you do it, because it's really helpful, and then they don't bother you anymore. Or maybe you say yes and don't bother to actually leave the review, just go back to the home screen, and they've already written a bit to their log to say, yep, he left the review.
Yeah. But really, he didn't, because you were a jerk.
But I would do that. But maybe we're biased. I don't know.
All right.
So for the tip of the week, I've got a couple of fun ones here for you. The first one was mentioned in our Slack a long time ago; I don't even remember who mentioned it, so I'm sorry I can't give credit where credit's due. So, you know, as a parent, you want to teach your kids things, right? And this is a way to teach your kids all about the joys of Kafka as a bedtime story. It's called Gently Down the Stream, and it's a very well-made website. You can just flip through it and teach your kids all about Kafka, how it works, the joys of Kafka, and whatnot. It really is written as a children's book, with little otters swimming, and the trees in the background and everything. It's quite impressive that this person put this together for his girls.
Yeah.
So there's your first fun one.
And then another fun one that was making its way around the interwebs, which I also saw in our Slack. And if you're not already on our Slack, by the way, you should definitely join: you can go to www.codingblocks.net slash slack, or at the top of the page there's a link for Join Our Slack.
But this one was making its way around the interwebs. The article's name is Lesser Known Postgres Features, and there are all sorts of things in here, things you didn't know you could do with Postgres, like find overlapping ranges, generate a unique ID without an extension, keep a separate history file per database, or do multi-line quoting in your SQL statements. There are probably a dozen and a half or two dozen different things in here that you can do.
Like, one of them was temporary views. So if, maybe in your procedure or query, you want to do something and you want the view to be available so you can reuse it over and over, and maybe a view would work better for your needs than a CTE, but at the same time, if your transaction fails, you don't want to leave any real schema around. So: temporary views. And there's a whole bunch more in there: ways to produce pivot tables, how to prevent setting an auto-generated key, how to grant permissions on specific columns, or how to find the value of a sequence without advancing it. All sorts of great features there.
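Here's a quick sketch of the temporary-view idea. The CREATE TEMPORARY VIEW syntax below also happens to work in SQLite, which is used here only so the example runs without a Postgres server; the table and data are invented:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE sales (region TEXT, amount INT)")
con.executemany("INSERT INTO sales VALUES (?, ?)",
                [("east", 10), ("east", 5), ("west", 7)])

# Session-scoped view: reusable across many statements, unlike a CTE,
# and it vanishes with the connection -- no real schema left behind.
con.execute("CREATE TEMPORARY VIEW east_sales AS "
            "SELECT amount FROM sales WHERE region = 'east'")

print(con.execute("SELECT SUM(amount) FROM east_sales").fetchone())  # (15,)
print(con.execute("SELECT COUNT(*) FROM east_sales").fetchone())     # (2,)
```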
We'll have links to both of those things in the show notes. And with that, if you haven't already subscribed to us, you can find us on iTunes, Spotify, Stitcher, wherever you like to find your podcasts. We certainly hope that we're there. If we're not, I'm not sure how you found us, but hey, let us know, and we will figure that out. And, like I asked earlier, if you, dear listener, would please find it in your heart to head to www.codingblocks.net slash review...
Every time, the late-night DJ voice just cracks me up.
Anyway, we would greatly appreciate it if you would leave us a review, if you haven't already.
All right.
Thank you, Morgan Freeman.
And while you're at it, head to codingblocks.net and check out our show notes, examples, discussion, and a whole lot more. Feedback, questions, and rants can be sent to our Slack, and you can get there from codingblocks.net slash slack. It's easy to go ahead and just click a link to join.
And you can follow us on Twitter at codingblocks
or head over to codingblocks.net
and find all our clippies
or what did I call them?
Oh, dillies.
Dillies. Yeah, see, I'm so old.
Find all our dillies at the top of the page.