Coding Blocks - Designing Data-Intensive Applications – Secondary Indexes, Rebalancing, Routing

Episode Date: November 22, 2021

We wrap up the discussion on partitioning from our collective favorite book, Designing Data-Intensive Applications, while Allen is properly substituted, Michael can't stop thinking about Kafka, and Joe doesn't live in the real sunshine state.

Transcript
Discussion (0)
Starting point is 00:00:00 You're listening to Coding Blocks, episode 172. Subscribe to us on iTunes, Spotify, Stitcher, wherever you like to find your podcast apps. I surely hope we're there. Or I should say, not find your podcast apps, but where you like to find your podcasts using whatever app. You know what? We'll just, whatever. Forget it.
Starting point is 00:00:21 Just go on. We can't have nice things, Jay-Z. You start off. I don't think Gen Z calls them apps anymore. Whoa. They call them dillies. That's weird. Anyway, this is codingblocks.net, where you find show notes, example discussions, and, uh, links to all our dillies. I don't know why that's so funny to me, but it just is. It's so much funnier. I'm surprised you didn't know. I mean, of course I did. But, uh, yeah, so, uh, where did you leave off with the dillies? And you talked about the, uh, Twitter, at Coding Blocks. Yep. Or, uh, head to www.codingblocks.net and find all our social links there at the top of the page. And with that... Oh, I'm Joe Zack.
Starting point is 00:01:08 I'm Alan Underwood. And I am Michael Outlaw. This episode is sponsored by Datadog, the cloud-scale monitoring and analytics platform that unifies metrics, traces, and logs so you can identify and resolve performance issues quickly. And Linode. Simplify your infrastructure and cut your cloud bills in half with Linode's Linux virtual machines. I'm going to get some flack about that.
Starting point is 00:01:44 It's worth it. It always is is that's the thing like if you're going to risk not being able to participate on one of the episodes then you have to know that the other two-thirds of the of the show must fill in for you and that is just a requirement that's right so throughout the show we'll be um speaking as alan and we won't be explicitly calling out. So you're just going to have to know when we're representing his viewpoint. It's going to be very confusing. We're sorry. Sorry, not sorry. We'll make up for it with all the dub W's, dubs, as we say, any URL for the remainder of the show.
Starting point is 00:02:23 That's right. That, that was right there. Reference. Okay. All right. Well, today we're continuing on with our favorite book, Designing Data-Intensive Applications, and we're going to finish up the chapter on partitioning, which is amazing.
Starting point is 00:02:37 This is a chapter in just two episodes. It's a record. Well, yeah, but you want to hear record though like technically if you if you were to like look at the physical version of this book we have covered less than half of it yeah it's crazy even after this episode we will have still covered less than half of it that is how like you could take that as a bad sign like wow we're really slow at like reviewing this book but also you could just take it like that's how full of content of great material this book is that we've gotten less than half the way through yeah it's it's dense it is i don't know cheese like what's something that's dense and full
Starting point is 00:03:17 of goodness like i don't know cheddar i mean sure i'll take that havarti yes for some reason i kept wanting to think of like cakes but then i'm like well not really i don't know a good cheesecake well good cheese i was thinking like for some reason like do you guys yeah you have public's down there so i was thinking of like the public's uh buttercream icing cakes you know for some reason and i'm like well that doesn't really meet your dense criteria. But oh my gosh, now I can't think of anything else. Yeah, it's good. And now the listeners are like, wait a minute, I got to pull over.
Starting point is 00:03:53 I knew I needed to go to the grocery store for some reason. This is what happens when Alan's not around. This whole show will be about guitars and mountain bikes and somehow will fit in partitioning. All right. And Metallica.
Starting point is 00:04:07 Yeah. And Metallica. All right. Well, last episode, we talked about data partitioning. This is how you can split up your data set when you've got too much data to fit on a node or you have performance requirements where it just makes sense to split up data so you can do more work in parallel or find it faster. And we talked about two different partitioning strategies, basically using key ranges like 0 through 100 go over here and 100 through 200 go over there, which is nice when you have homogenous data with well-balanced.
Starting point is 00:04:38 And we talked about hashing, which is a way of kind of distributing things based on the key that we use a little bit of randomness there to hopefully spread things out more so you can avoid hotspots or places where your data is just uneven and causes unnecessary strain on parts of your system when you've got other parts of your system that are just bored to death. And so this week, this episode, we're going to be talking about the rest of the partitioning chapter, which primarily focuses on secondary indexes, rebalancing partitions and data, and also routing. But first, we've got a little bit of news. Yeah, so I guess we're going to be doing the game to jamuary again so have we set on a date i think i saw that you had you had created that uh the game jam invite but i didn't notice i don't recall if it had a date specified is it january uh it is currently uh scheduled for jam uh january the month and i did keep the same dates as last year so i didn't have to update
Starting point is 00:05:45 the artwork and it happened to fall on a weekend again so i was like oh so i don't know where those assets i don't know where that you know p and that uh pd whatever psd is um so currently it's scheduled for the 21st to 24th but i haven't thought a lot about those dates so i haven't looked to see if there's anything else major going on but currently that's when it's planned and i think uh i think i don't know no one's complained about it so i guess that's when it is yeah we're still looking for a theme for it though so i did see an email went out with the survey for theme ideas i read some of those theme ideas we got some good stuff coming in um but i don't know that you've picked a final or i guess
Starting point is 00:06:25 i assume we're going to do the same way as what we did last year yeah where we'll take everybody's and then let everybody vote on the the top favorites and you know go with that yeah so we got that from a super good dave who uh saw that with other game james doing that basically what we do is we gather up uh ideas for themes so if you got one email us or uh tweet at us and we'll add it to the list and we'll do a couple rounds of voting. Or hit me up on Slack, at Alan. Yep, there you go.
Starting point is 00:06:53 And let them know we'll get it on the list and then eventually we'll do a couple different rounds of voting. And it's not going to be annoying. We'll probably just... I forget how we did it last time, but whatever we did last time is how we'll wean it down. And the final theme will be announced basically right at the start of the jam and that's just to get your kind of creative juices flowing there's something about having a theme that kind of either makes it inspiring or it will get through that blank page problem yeah
Starting point is 00:07:18 it puts some it puts some boundaries on it so like you're not left to think about like the entire world of possibilities you're like scoped down to a set of things. Yeah, I really like that. Last year's theme was everything is broken, and we got a lot of broken games, which was really funny. And it worked out really well, especially if you're making a lot of bugs like I do. How about there's a ticket for that? Hey, that's a great theme. Let me add that to the list here.
Starting point is 00:07:46 Well, I was thinking mine was about tickets yeah it was and it was like that was a great game uh i'll done an angular so yeah so looking forward to that and i did just finish up the create with code course right um that was a something on unity one of their free courses that i mentioned where i worked through five game prototypes so i'm going for unity this year i'm really excited about it i don't know what i'm going to do this year because like last year as you mentioned i did the angular one and i thought about like maybe doing another angular kind of approach because from the simplicity of like not having to worry about like things to install and it being easy for other people to use but it definitely didn't have the polish of uh the other games for sure some of them are so good man
Starting point is 00:08:33 like just like in terms of polish like sounds and art and the gameplay like there were some there's some serious game developers that were uh part of that game jam for sure absolutely i can tell them the um i i never finished uh this one i need to go back because i figured out the or i got some help in the comments about the one where you're wandering through the woods so like these uh polygonal trees yeah i remember that one yeah it was really cool so yeah so uh yeah we're looking forward to doing that again we're going to do a couple rounds of voting go ahead and sign up we'll have a link to the game jam in the show notes here and we'll be talking about it a few more times so uh keep an eye out
Starting point is 00:09:13 for that and it's just fun to keep up with the um the uh what's called the theme voting and stuff and so if you've got one that you like a lot before it let's know. And one more thing. Yeah. Uh, so I got a new Mac book pro holla showed up almost a month early from when they originally projected it. Look at the big dog over here. Yeah.
Starting point is 00:09:38 Super excited. So it does have a notch, which I, uh, instantly learned to just not see. It doesn't bother me at all. Oh, really? Like you just immediately.
Starting point is 00:09:47 Does your phone currently have that? No. Oh, okay. No, it does have a little dot for the camera, though, which also I didn't even know until I looked at it. So I just have learned to ignore it. Oh, really? It's got a little hole, a little circle.
Starting point is 00:10:02 And so like you have screen above it and below it and beside it neat yep and i'm loving it i haven't done a whole lot i am able to run unity which has been nice because i haven't i've been locked out of that uh you know it was just too hard to even open a blank project on my 2013 laptop and uh yeah i'm looking forward to being able to do some more couch type stuff there and i've already been doing some experiments and tutorials just kind of while watching tv in the evenings whatever so it's really exciting and i've just been loving it so i haven't had any problems with it um there's a few things that i've used uh installed the beta for because uh they didn't have a like apple silicon
Starting point is 00:10:36 full-on version for but the intel versions also work you know i guess they do some sort of emulation there that's so i haven't run into problems. Yep. Yep. Did you run it through like a geek bench? No. No? I shouldn't. You're not – I'm always curious. Any new piece of hardware, I'm always like, I just want to know, like, how did it do?
Starting point is 00:10:58 Yeah, I should compare it to the geek bench I did like eight years ago. Maybe pour one out for that old machine yeah yeah yeah put ubuntu on it yeah i still yeah i still use my old hardware though so i don't know that i want to know what your geekbench score is yeah or maybe i don't know man it's a great year to upgrade to mbp look it's looking like yeah all right so let's talk about partitioning i i do want to uh address one thing though so you mentioned at the start about uh it being like because the data is too uh big to fit on one machine so like scalability um you also mentioned performance was another reason and i don't know that availability was really talked about oh yeah um you know well but but in you or at least you didn't mention it but we definitely did talk about like replication uh in in the past but i i was uh as i was prepping
Starting point is 00:12:01 for the show i was you know doing some searching out there on the interwebs, as you do. And I came across this one site called Interview Grid, and they were talking about the key benefits of partitioning. And they included one, a fourth reason, that I kind of take issue with, but we hadn't ever discussed it. Do you think you might know what it is? Data retention? Security. Oh. it do you think you might know what it is data retention security oh okay i kind of the reason why i take issue with that that was kind of like well i mean yeah i guess you could say it's kind of like a benefit but it also is like i kind of think of that as like a reason why you would do it. So like later they, they mentioned the different strategies for partitioning.
Starting point is 00:12:50 And so they talk about like horizontal, horizontal versus vertical partitioning. But then the other one that they mentioned was functional partitioning. And I was like, well, yeah, I mean, you can almost view like security as like a functional type of partitioning.
Starting point is 00:13:03 Like, Hey, all my sensitive data is going to be off over here. But then I'm like, is it really, is that really a partition? Or would you just think of that as like a different table or even a different technology?
Starting point is 00:13:15 Right. Does that count as partitioning? Like, I don't know. I kind of take issue with that. Yeah, that's weird. I definitely hadn't considered that at all.
Starting point is 00:13:23 And, um, I, you know, I guess if you had it in at all and um you know i guess if you had it in different physical data you know like co-location centers and somebody steals the computer then having your data partitioned and split up is good or if you have things like key differently but it still seems like in all the cases we talked about your clients would be able
Starting point is 00:13:40 to access they weren't limited to any partitions. But maybe, I guess, if you partitioned by, like if you had a multi-tenant solution and you partitioned your data by tenant, then you could encrypt those partitions differently. In that case, I see the security argument is just different. Okay, I'll buy that then. So security, if you were to partition, because we did talk about in the last episode,
Starting point is 00:14:03 we used Fortune 500 companies as an example and like how you could like implement i didn't i don't know that i caught it out explicitly but kind of implicitly like some of the discussion uh i assumed that like there might be some row level security there and so i think i'd said something like you know if you had it partitioned by tenant you did a select star star, you might only see things for your specific – for that login as that customer, you would only see the data for yourself or your company. Yeah, that's cool. So, okay. I guess I could buy that then.
Starting point is 00:14:40 All right. So they won me over. All right. So security is a fourth benefit of partitioning yep i'll allow it uh all right yeah so last episode we talked about obviously key range partitioning and key hashing partition hashing to deterministically figure out where data should land based on the key and the key is the really important part here because if you said, hey, I've got user
Starting point is 00:15:07 ID 123 and I'm partitioned by users, then we would know where to go look. But that only helps you in the case that you're trying to look up a specific key. It doesn't help you if we say, I want to find users named outlaw or users who are
Starting point is 00:15:23 in goods, who owe us money or something like that. There's no help at all for that. And so in those cases, you have to look at every single row and every single partition in order to find that because there's no other help that you get. All you know that the data is partitioned. And the solution for that is what we call secondary indexes with the first you know the primary index is just being how things are physically located uh on the partition yeah so i think like in to to carry on with the example that we gave last time i think i had mentioned an example where like
Starting point is 00:15:58 um you want to look for you're looking for a specific car right and in the example that i mentioned was like oh well you're looking for a specific car, right? And in the example that I mentioned was like, oh, where you're looking for Lamborghinis and depending on like how your index was done, you wouldn't know how to necessarily, it might be more difficult to find that. And so we had talked about like the idea of like an encyclopedia where, you know, you could go straight to the L portion of the index for Lamborghini. But if you were looking for a specific one, you know, then you could have it by license plate. And if your index was only by
Starting point is 00:16:32 license plate, then it was like you would have to look at every license plate in order to figure it out. But in this case, with a secondary index, your primary key could still be the license plate or, you know, VIN depending on your use case, but let's go with license plate. And then, uh, based on the secondary index, you might say like,
Starting point is 00:16:52 Hey, give me all the license plates for the Lamborghinis. Like you could go and find all the Lamborghinis instances based on that secondary index. Yeah. I like that. Um, I kind of come up with the contrived example here about
Starting point is 00:17:05 partitioning credit card transactions um and i use that throughout the rest of the show notes i'm wondering if it's worth switching i like the car thing well the book also used cars um they did they went on colors i was just trying to relate it back to what we were talking about in the past but we can definitely go with the credit card thing it's fine i i just trying to relate it back to what we were talking about in the past but we can definitely go with the credit card thing it's fine i was just trying to like tie the two the two episodes together yeah that was uh smart i should finish listening to that episode i wasn't there last week uh but anyway i mean i like the credit card too because we also talked about e-commerce stuff too so it's fine all right well let's stick with So some typing. So in this system that I kind of contrived up here, imagine that you have a system where you're participating in credit card transactions by hash of the date. And by date, I mean the day, like November 11th or whatever it is.
Starting point is 00:17:57 All the transactions for the 11th go here. All the transactions for the 12th go there. All the transactions for the 12th go there. All the transactions for the 13th go somewhere else. And when I tell you that I need to sum up all the transactions for last week, well, that's really easy. We go look at seven partitions and we just go through and sum all that data, which is really nice. It also means you're going to have a hot skewed partition. Absolutely. Yeah. Yeah.
Starting point is 00:18:22 So that's ripe for hotspotting issues. That's a prime example of where we say like, we mostly care about most, you know, recent data. And so we're constantly pinging the most, the most recent partitions and the other partitions are dormant. So there's definitely problems with our partition partitioning scheme here.
Starting point is 00:18:39 Yeah. And that's like a great example of it. Cause, cause if you were to think of like an Amazon or something like large like that, and if they were to partition on date and so like all of their customers as they're buying new stuff are always writing to the same partition. Yep.
Starting point is 00:18:53 Yeah. There's some benefits to like, we've talked about being able to do like hot and cold indexes. So you can basically say like, well, the last seven days, those partitions are always going to be located on the best hardware. And the other stuff we're going to put that on like a cool or cold storage
Starting point is 00:19:07 that's like maybe spinning discs or something cheaper and i'll only use it for like reporting or whatever i think they actually get into something similar to that later on in this portion of the book where they were talking about you could have um like physically different hardware even for your partitions and like some technologies like mongo and elastic search come to mind i believe and i think there might have been another one that was called out where it'll it'll specifically allow you to say like hey this bit of hardware can handle more load so it'll handle more partitions than this other set of these other nodes and so you could have it to where like you know maybe those are the beefier machines that have like uh ridiculous ssd raids versus you know the other one might be tape archive or something yeah absolutely
Starting point is 00:20:00 and it's great for data retention so elastic Elasticsearch, they call it, it's part of their index lifecycle management ILM policies. So you can say like last 30 days goes on this pool of servers, next 60 days goes on this pool. And then after that, it goes to, you know, like the kind of the worst hardware. And then after a year, it gets deleted. So the data in that case is partitioned by date and it just kind of marches through so it starts out on the hot servers eventually moves to cold eventually moves to really cold and eventually disappears which is really nice for you know certain like government regulations require that sort of thing but yeah absolutely that's the case where you're like absolutely just embracing the hot spotting uh you know for for good or bad uh but
Starting point is 00:20:46 in that situation what if i ask you to count all the transactions for a particular credit card uh like how many visa versus how many amex well i was thinking like outlaw calls and says hey uh i've got a fraudulent uh transaction on my credit card can you tell me all my other transactions so we can figure out if they're legit or not? Yeah, but because you keyed on the date, then you can't go and find all of Michael's transactions. Unless you have a secondary index. Right. So if you imagine we've got 356 partitions. That's an expensive query. So if that's the kind of thing you're doing often, then that's pretty awful. Especially if you consider, you know, maybe, maybe in a year I've only had 20 transactions or something. So you're finding these needles and a whole lot of a whole, whole bunch of haystacks. Yeah. And they referred to, I mean, we talked about it last time about the idea of like where you,
Starting point is 00:21:48 there might be reasons why you do want to have to query all the nodes for something, some piece of data. Right. And, and some of the, these technologies actually, you know,
Starting point is 00:22:00 take advantage of that. And, and there was a term for it here that I didn't even, that we didn't even hit on the last time that was called scatter gather, where, where your query does get scattered across all of the different nodes. And then you gather that up. And, and in the last episode, I think Alan referred to it. He was using the elastic search hierarchy or, you know, like architectural hierarchy as the,
Starting point is 00:22:27 as the example where we were talking about, like, um, I don't know that I forget the, the elastic search terminology, but you would know Joe where it was like, there'd be the master node and then underneath it, there would be,
Starting point is 00:22:38 you know, whatever the other data nodes are in the master node would be required, would be responsible for spreading that query out across the different nodes and then it would gather up the results and then package that up as the return yeah i think there's like four or five different node types in elastic so master is the one that does the routing and data has the data it's got ingestion and i think there's something else but um yeah it's going each node can have different roles so you can have ones that are just for routing or ones just for data based on you know your your use cases there which is really cool so and we're going to get into
Starting point is 00:23:14 routing later in this chapter yeah so scatter gather is so you could do the scatter gather technique to to grab these credit cards across the 365 partitions which by the way that assumes that your partitioning strategy here was like going to always produce the same result for a given day regardless of year yep but when you said partition by date i assumed year was part of it too so you might actually have more yeah it's it's 365 times however many years old your company is right yeah imagine seven years trans so yeah seven years uh retention policy uh so there's thousands of partitions to check every single record and i mean it's just a ton of work and if one of those partitions is unveiled say a node's down or maybe some other operation that we'll have some examples coming up here in
Starting point is 00:24:04 a little bit is happening. And so it's slower to get that data. Then basically you are held back to whichever one's slowest or else you're going to get inaccurate results. Yeah, and you'd hope that your company is successful too. So the first year or two, that stuff might be easier to search because the indexes are relatively smaller. But those last couple of years, your business is booming. So those indexes take relatively smaller. But those last few, you know, those last couple of years, you know, your business is booming. So those indexes take a while to search. Yeah. So you don't want to have to do these 2,400 searches or, you know, whatever it might be.
Starting point is 00:24:32 Right. And the chances that your customer has had that credit card for seven years, too, is probably pretty low. It's probably, you know, you probably only really care about, you know, if you're talking about fraud, maybe you only care about the first year anyway. And so, you know, you can probably trim it down in other ways. But yeah, so the solution here, though, the way to make this better is to keep secondary indexes, which keep some metadata about our data that helps us keep track of it. So I mentioned the user. We might want to keep a secondary data structure or even a secondary system that says, hey, this user shows up in these partitions. And so now, instead of looking at thousands, maybe I'm only looking at 11 or some much smaller number, which is really nice. The example that they gave in the book for this, too, I always forget the term for it.
Starting point is 00:25:17 Because whenever we think of the word index in regards to reading or something like like that you always think of the thing in the back of the book but the thing in the back of the book that's called an index in you know our in in in programming terminology is not uh it's like a reverse index if i remember right is the term yeah and so but but obviously they never refer to like hey go to the back of the book and look in the reverse index. But really, that's what the secondary index is. It's basically like the back of your book where it's like look for the keyword of Lamborghini. And it's like, oh, well, here's all the primary keys that have – or here's all of the – going with your credit card example, look for Michael, and here's all of the transactions that have his, where he used that credit card. Yep. Pages 113 through 115.
Starting point is 00:26:15 Pages 211. Pages 700. Yeah. So that secondary index just points back to the primary key that's used used or at least an example that they gave here underline you don't have that if you don't have that index you've got to read the whole book every time you want to find it yeah every time you want to find it yeah the um i mean i don't know that the author uh really got into like the underlying storage mechanism for this particular case of the secondary index but i but the examples that you know at least in the pictures that he showed
Starting point is 00:26:51 he used the primary key as what the secondary key was pointing to so i was just assuming that it would you know stay consistent with that yeah because they're really basic like we know relational databases are kind of famous for keeping a lot of statistics about their indexes. So they can kind of route things in a smart way. And you can bet they're not doing just simple lookups on all these. So definitely the data structures that we're talking here are very simple. So you can just kind of imagine that things get kind of tweaked and taken on from there. And this whole chapter is really kind of biased towards NoSQL, I thought.
Starting point is 00:27:22 And they talk a little bit at the end about kind of more complex data warehousing. Huh. Like key value lookups was like a heavy emphasis here. Okay. I don't... Okay.
Starting point is 00:27:40 I didn't really take away that thing, but I can definitely see why you would say that. Yeah, I can definitely see why you would say that. Yeah, I'll try to tell you on it later. Because this builds up into relational databases. We'll talk about how things break down a little bit coming up. Okay. Yeah, and then there's future chapters on it, of course, because of course there are, right?
Starting point is 00:27:59 Right. That's the whole deal with this book. So one other thing I just kind of wanted to point out that's kind of fun like a lot of times you can answer your query just by looking at the index alone. Sometimes you just want to count things or just want to report. And so you don't even need to look up the data. You can just look at the index and see, okay, you know, Outlaw's got 20 transactions here over the course of the last seven years. And that's enough for reporting, for showing a bar graph is like all we need to know is that uh you know that count in order to kind of show things it's cool yeah for like a uh an olap type system where you just want to like do the aggregates yep so like how many how
Starting point is 00:28:37 many lamborghinis are registered how many how much purchasing does mich actually do? Yep. Those types of things, yeah. I like it. That was cool. And I want to mention, too, secondary indexes are just complicated, so it's hard. Like HBase and Voldemort avoid them entirely, whereas something like a search engine
Starting point is 00:28:58 is something, a data structure that specializes it, and there's some pros and cons to trade-offs that we're going to be getting into here. Yeah, they actually talked about how in, um like i think in elastic search like it was specifically if anything that was going to be a have a secondary index uh that thing was specifically called a field in the document if i recall correctly yeah absolutely they were like putting like elastic terminology on top of this uh discussion yeah it's really cool we've talked about how other chapters were kind of like how to build kafka this chapter it kind of felt like how to build elastic search to me which is really
Starting point is 00:29:36 cool see okay because now going back to your no sequel comment like a lot of times while reading in in you know even mentioned it in the last episode, too, like I kept coming at this thinking from a Kafka point of view. And oh, yeah. And in fact, I'll save it for now, but I'll tease it that there's going to be a part where it gets even more Kafka ish. Yeah, I know exactly what you're talking about. And I called it out, too. So, yeah, that's a really good point too. I could definitely see that.
Starting point is 00:30:06 Just with the one thing that Kafka really suffers on is there's no secondary indexes. So there's not a way to say, oh, I also care about the user ID. So all you get is things split up by partition. And in fact, you can't ever make updates. You can't delete data once it gets into those partitions. So it's read only. So there's some tradeoffs there. Like that's why Kafka is not, you know, it's not a database.
Starting point is 00:30:30 It doesn't work well as a database because you can't retrieve data except by the way it's partitioned. Yeah. And there's advantages to that, definitely. And some cons, too. I can't tease it any longer. I'll go ahead and say like it was related to the routing. Oh, okay. That's not the part I thought.
Starting point is 00:30:50 Interesting. I thought you were going to talk about the number of partitions versus nodes. No. Well, we did talk about that last episode somewhat where I had mentioned, because a lot of this, and I asked Alan this question too last episode,
Starting point is 00:31:04 was like, when you know i and i asked alan this question too last episode was like you know when you're reading this you know do you often like have something that you already know in your head that you're like kind of relating the information to and so like you're coming at this from an elastic kind of point of view and between the two of us like you know a lot more about elastic search than i do and so like you're reading this you're like very heavily elastic focus whereas like i was reading this and i was very much thinking like Kafka in my mind, like, you know, so I'm relating everything to Kafka or like, you know, and I'm sure we were both relating it back to like all of our experiences with, you know, SQL related databases and whatnot. But yeah. So,
Starting point is 00:31:39 um, uh, definitely the, the routing section heavily, i definitely was like oh this is absolutely heavily kafka yeah you know i keep wondering like you know maybe these episodes are terrible for people that are only working with you know relational databases which is how i worked for most of my career was only with relational databases but for me working with you know elastic surgery kafka and a relational database kind of in daily situations. It's so great to like, I'll read a sentence in the book that says something. I'm like, oh, so that's why it's like this, or that's why it has this limitation, or that's why we use this system for this use case.
Starting point is 00:32:17 Yeah, absolutely. When you said that, I was like, no, that kind of makes me sad, because I really want to hope that if you were working only in a single type of technology, to your point exactly, that you would know, yeah, but I keep running into this problem when I try to use this SQL server to do this. And why is that? And then this book can help to expose you to why those problems are happening and like why these other technologies exist and like the problems that they solve and uh you know i had this thought that like this book when when did this book come out like this book is only like a couple years old right um so if i remember right i don't remember the exact date. Oh, 2017 is the first edition. So this, this book is like four years old or four and a half. It's right. Cause yeah,
Starting point is 00:33:11 cause it was March of 2017. So it's about to be about to be five years old. And in that short, short time though, I think that this book is going to stand the test of time of being like one of the, it is going to become one of our classic clean architecture, pragmatic programmers, you know, refactoring kind of books. Like, you know, it'll be up there as like one of the greats, you know, with like the gang of four that we're going to like, because there's so much they do talk. Martin Kleppman does talk a lot about like specific database technologies, but everything is so generic and applicable to, you know, whatever the platform is that it just helps to like understand. And he does a really good job of explaining some of this. And that like like you know
Starting point is 00:34:05 you think about um like we keep getting things to where as a society we keep working to to tackle a complex problem until we make it to where it's so easy and trivial and we have libraries to abstract it to where it's no longer a thing right and. And so, you know, in the seventies where like a lot of great technology apparently was invented. So we've learned during the course of this podcast, uh, you know, there were things that, that, that it was, this was, it was way too far advanced, right. To, to like get into like the routing and partitioning and kind of problems that they're talking about here. But, you know, they had documented a bunch of stuff back then that even this book has covered, right?
Starting point is 00:34:47 But, you know, now we don't worry about that. But I just think that, like, this book is going to, like, stand the test of time, you know, until we get to the point to where we no longer care about routing and indexes and secondary indexes. But I'm like, I won't see that coming for like quite a long time. Not in my lifetime. Like I think it'll be,
Starting point is 00:35:07 so I think it's going to stay this time. Yeah, I totally agree. Like, you know, we see those articles like top 10, uh, programming books,
Starting point is 00:35:14 every programmer should have, whatever. I think this one should be in like the top five of all those lists. Like I was going to say, it's not there now. Yeah, that was, that's where I agree.
Starting point is 00:35:21 It should be. But I think like, you know, we're, we'll like this book is still kind of new. So it's still kind of saturating but i think like 10 years from now we'll still be seeing this this book in those lists i mean you know the gang of four book has stood the test of time and it's only had one printing it's still the original printing if i remember correctly of that book so apparently they didn't make a single typo so kudos to them um i couldn't have done that but um you know you think about that
Starting point is 00:35:52 and how relevant it still is the information that they put in there still is but yet we have so much technology technology today to abstract like things that we create and whatnot and yet the patterns that they described there are still very much a part of our regular world. Right. And still very important. So, so the concepts that are in this book,
Starting point is 00:36:13 I think are like really key. So going back to your point about, even if you were in a single database technology, you know, that's fine, but you should still, these are things and concepts that you should still know. Like even it would help you to understand the underpinnings of that single technology period. Even if you never did use something like a document database or, you know,
Starting point is 00:36:39 something like a Kafka or whatever, like if you stayed, if you stayed in a relational database, for example. Yeah, I totally agree. You know, a lot of people kind of push back on design patterns sometimes, or at least you'll see it on Twitter, specifically the book. But to me, the argument is kind of that the language features have gotten better, so you don't need to really be programming these anymore. But good luck using modern JavaScript frameworks without observers.
Starting point is 00:37:06 And good luck using Java without builders or factories. These things do kind of become baked into the languages and frameworks, but that doesn't mean it's not worth learning about because that's how you make new ones. It's how you build those languages and frameworks and also how you use them. So yeah, I think that book is also top three. What were we talking about the the two main strategies for secondary indexes right uh so document-based partitioning and term-based partitioning
Starting point is 00:37:34 and term-based is really kind of an evolution of documents so let's start with document and so you remember our example there we talked about partitioning our credit card transactions by date and also the the example you gave of encyclopedia imagine if we kept a secondary data structure along with each partition so it's kind of like having an index in the back of each book in your encyclopedia set so you go through and pick up like aardvark to Automobile and you look in the back and look for Aardvark and it'll tell you page one, page seven or whatever. That's a very different solution from having a centralized kind of almost like table of contents or like one big index that tracks across all the references. Oh, you know what? I didn't even that never even dawned on me. But yeah, I do like this example because then like technically every encyclopedia does have a secondary index along with it that is the document one, right? Yeah, that's pretty cool.
Starting point is 00:38:35 No, it would be the term-based one. I'm sorry. But it's, it rides along with it. So with term-based is like you'd have like one book in the set that all it does is just point to different areas. And, you know, obviously in the encyclopedia, it's already organized alphabetically. But, you know, there's certain themes, like, you could look in the index and see, like, the 1920s. 1920s are referenced for the Great Depression, but also Prohibition and also, I don't know, Spanish-American War. And so those are different places where, you know, you want to go look, and those are going to be in different actual books.
Starting point is 00:39:08 But in the books themselves, you could also flip it back and see if there's anything about the 1920s. And that would be what we call document-based partitioning. It's been so long since I picked up a physical encyclopedia that maybe they don't actually have a reverse index in the back of it at all because really the like one of the key differences here was that the document-based partitioning like you said you have whatever partition has the the partition on it the secondary index is also beside it it's with it so from you have the performance gain of now when you want to do that read it's all right there on the same thing versus the term-based partitioning it was also like kind of referred to as like a global uh type of partitioning where there's think of it as going back to your elastic terminology there the master node like it knows that hey for, for this secondary index, I want to look for, um, you know, transactions based on this credit card.
Starting point is 00:40:12 It knows all of the partitions that need to, it needs to hit to run that query. And so it might get spanned across like say 12 different nodes to to do that and that's where that scatter gather comes back um so you have the complexity there so that was the downside of of the term base versus the document it was you know just the one thing doing it yeah absolutely so i just kind of um peel it back a little bit so specifically talking about document-based partitioning and we said each node now has track of its own indexes. So we can go, when we query for users, instead of going and looking through every single row and every single partitioning and comparing, is this my user ID? Is this my user ID? Is this my user ID? Now we go to each partition and say,
Starting point is 00:40:59 hey, do you have any user ID one, two, three? And no no no no yes no no no yes and so the ones that say yes are the ones that end up getting queried so it's much faster than looking at every single document in those databases but still you have to talk to every single partition to ask if it has that uh secondary index and so counting is easy because you can just go and you know like we talked about before we say hey node uh 2021 uh october 5th do you have any user id uh you know 123 and it says yes i've got three of them like okay great and that's all i need from you let's go check the 26th the 27th 28th 29th so we can count that up really easy and this is a much better solution than obviously looking at every single record.
Starting point is 00:41:50 But, oh, you know what's fun? I forgot about this. So remember Big O? So we talked about, yeah, we probably talked about that very early on. But it's the difference between a log of n search where we know, you know, things. We basically can go to each index and ask do you exist? And if so, you know, assuming that the list is ordered by the keys, we can go and easily look those up in log of n time. But if we have to look at every
Starting point is 00:42:18 data point, then we're basically looking at each record, which is O of n, where n is the document in your entire data set so bigger the data set the bigger the savings yeah because if you looked at um o of log of n it was like significantly smaller if you go back to the big o cheat sheet you know like o of n was like very linear, right? Because it's whatever your count is, right?
Starting point is 00:42:49 But the log of that would be a significantly smaller number. So, you know, you could really save some time there. Like on the big O cheat sheet, it was almost a flat line. Yeah. And so it's really great because the bigger the numbers get, the more efficient. So let's imagine you've got a data set of 1 million records. If you need to check 1 million records, you've got to do 1 million comparisons. If you want to do a log-in-based lookup of a million records, you have to do 20 comparisons.
Starting point is 00:43:22 Actually, 19. And that's max. you're doing max 20 comparisons so imagine if we go to 10 million how did i what did i do wrong because if i do a log of 1 million i got an answer of 6 uh did you do log base 2 or log base 10 oh good point good point yeah so sorry i should specify it So log base two of 10 million is. Well, I don't know how to do that. And apparently in Google log to 10 million. 23.
Starting point is 00:43:57 Yeah. So with 10 times more data, you have to do 23 point something comparisons. So three more searches. Yeah. Then I shouldn't even say three more searches, three more comparisons. Oh, right. Looking at the data. Yeah. So, yeah.
Starting point is 00:44:11 I mean, it's just a huge savings. Yeah. So that's what I mean when I say that like log of N is almost like the flat line when you look at the big O cheat sheet because, you know, you get to really big numbers for N and it's still an extremely small number, relatively speaking. It's basically the only way we can work with huge amounts of data. If it's going to be O of N or bigger, then you pretty much can't do it, at least not in real time.
Starting point is 00:44:39 So that's how when you go to Google and you see like, oh, 17.5 million results and it happened in less than a second. That's how they did it. The only way that's possible. And the downside is we do have to take a small performance set whenever we insert new items. So if we're writing a new credit card transaction, we also have to go and we have to write this into the index. But if you're querying by that secondary index uh then that's fine you know that's that's going to be worth it and there's also the downside from a um availability
Starting point is 00:45:16 kind of point of view yeah because it is like right beside that partition so you know you would have to replicate it out as well yeah so we call this a local index because it's stored locally to the partition and if that partition isn't available we can't respond to your query so if one of my you know two thousand or whatever we said seven years you know so like two three thousand um one of my partitions is unavailable for some reason i can't answer your question at all which is really fragile so so it's a you know it's a matter of query performance in case you know i have to wait for whichever slows the one that slows but also my availability is taking a hit which stinks uh one thing you know we did mention is nice about this is with data retention when the data drops
Starting point is 00:46:06 off in case of local indexes done there's no cleanup you just drop the whole index it's sorry the whole partition and the indexes go with it if we did have our indexes centrally located and we dropped you know older data which is every day we'd be dropping a partition, we have to go and we have to kind of scrub that directory of that partition saying this doesn't apply anymore. It just takes a little bit longer. Yeah. So just to elaborate on that, if you were going to clean your data, if you were only going to retain 90 days worth of data and you wanted to drop the partition for the day and you didn't use the document-based partitioning and you'd have to go back to that global partition uh or that global partitioning secondary partitioning scheme and remove every record from it which you know there could be
Starting point is 00:47:03 thousands for for like whatever that day is, depending on how successful your e-commerce shop is. Right. Which let's face it, it's going to be extremely successful because you've read this book. That's right. Totally, totally. Uh, transfers there that, that skill transfers. Yeah. Yeah. So that was, uh, that was pretty much it for document based partitioning. So in case we've had these local indexes on the partition. We take a little bit of a hit on writing, but hugely performance,
Starting point is 00:47:29 huge performance boosts on querying. So let's take that one step further for term-based partitioning, which is evolution. And like we've been, you know, talking about, we've been kind of learning the lines there a little bit.
Starting point is 00:47:39 But what it does is it takes those locals, gets rid of them, and it keeps a global index so that clients can go and talk to this global index. And then it'll tell them which partitions to go to. And so if our user only has 20 transactions over seven years and those 20 transactions only happen on 11 different partitions, then we only need to go query 11 partitions. So it's a lot less stress on the system. It's a whole lot less network traffic.
Starting point is 00:48:09 It's going to be faster because we don't have to wait on whatever is the slowest of 2000. Now we're waiting on whatever is the slowest of 11, which is really great. And it's a whole lot less fragile because we've got this, presumably distributed system now storing this much smaller data set, which is going to be really fast to query.
Starting point is 00:48:31 Yeah, they don't really talk about like how that partition might be replicated and spread across, you know, because in like it kind of it got super meta at this point because it was like, well, okay, so now we're going to create another index of our indexes. And you might want to have a keying mechanism for that, spread that across partitions to increase performance and availability and whatnot and scalability. But maybe it would be okay because these are smaller like i don't know but it did get very meta at that point yeah it's basically like you've got a database for your data and now you need a database for your indexes but it's smaller so it's a little bit easier so uh you know that's what you got going for you i did think about like as i was reading this part of the book though, like, I wonder, like, if you had to just for a challenge, if you decided, Hey, I'm going to write my own database,
Starting point is 00:49:31 right? Because like, if you remember, that's where this book starts with, right? It's like, Hey, let's just write a simple key value store, uh, to a text file. And let's, you know, we'll just have some simple bash functions to, uh, read and write data to it and start from there. Right. And, you know, imagine like the fun types of challenges that you would get into as you would start to scale this thing out. And specifically things like this. And then I got to like thinking about like how complex it would be because like, I mean, really, you know really, props to the authors and the developers that work on these technologies. Because I started thinking about just the file I.O. management of these systems.
Starting point is 00:50:20 Like, you can't write to this index at the moment because I need to put a lock on this index at the moment to keep prevent you from writing to it because maybe I'm trying to split it in the case of like automatic partitioning systems or whatnot. whatever reason you know just just the file io management alone is complex enough now to think about like these um you know the different data structures we've talked about like ss tables and and lsm trees and whatnot uh you know to think like okay now like i need to read this into memory and like where do you go for it and you know some of the file limitations like they were mentioned i want to say maybe it was like h base that was like had a 10 gig limit on partitions before it would like automatically split but it also had the capability of um combining those files back together if the if they fell below so they made the analogy of uh i say they, but I mean Martin Clipman, he made the analogy of the B-tree and how the B-trees can work to where it can split things out and then collapse them back depending on what's going on there. Thinking of like how you would write the file management alone aspect of all of this. It was like just kind of like mind blown, kind of like thinking of like, well, this is why it's taken us decades to get to here.
Starting point is 00:51:56 Right. Yeah. Oh, yeah. And yeah, it's just the low level details are so hard to get right and so important to get 100% correct. Oh, yeah. And you're going to take that and you're going to replicate that data to multiple nodes and you're going to split it up into different partitions. You're going to split those partitions, rebalance. And eventually, I think the peak is kind of like distributed transactions across multiple partitions and nodes.
Starting point is 00:52:20 And it's just amazing well it works at all one of the things that also came to mind too is that like as they're talking about all of this um you know when we're talking about the partitions and the replications we talk about like one of the things that that's a key advantage is that you want to have the multiple nodes so that they can serve uh different parts of the data because one like, like, you know, we talked about like scalability, but also, you know, availability might not even be able to fit on one machine or whatever. But when you talk about like repartitioning some of the data, you know, on the fly or whatnot, whether it be to like expand or collapse the partition based on the need,
Starting point is 00:53:02 then, and if you needed to like reassign the partition to like another node, one of the things that came to mind was in from a DB2 world in Oracle and had a similar concept, but I'm more familiar with it from a DB2 world, although the name eludes me at the moment, but their high availability solution. One of the things that you could do was you could have like a SAN, like a storage area network where it's all fiber-based, right? So you have all of these things, all of this data sits on a disk array that you are accessing over fiber. And then there's the servers can like literally um they're literally sharing the same disk the same underlying disk and so like transitioning from one node to the next doesn't necessarily have to be really all that expensive so that type of idea was really dependent on like well what's your
Starting point is 00:54:07 physical architecture and like we'll get more to it in there in the routing section um because physical architecture like really like it's that whole joke about like uh just turtles all the way down you know kind of thing because uh it really started to matter like well okay if if all your partitions are like physically separate you know because maybe it's like in an aws or or uh google cloud or azure kind of world where like you don't control the hardware but if you are the googles or the amazons or the mic Microsofts and you can control that hardware, then, you know, you can have some greater, um, some efficiencies because you could like localize that better, you know? And so your use cases can vary. Does that make sense?
Starting point is 00:55:02 Yeah. It opens things up. So, you know, if you imagine I had like a really, uh, good network Does that make sense? S3 and Google Cloud Storage and Azure Blob Storage is basically, you know, kind of what that is. They went and they created these dynamically shiftable storage options that store what you want and they keep track of it. Yeah. So in the world where like you can't control your storage, then you, you know, you have different set of needs. But when you can control that, then like if you were to be all on a sand then but you know you can have some some improvements there with like how you could go from one node to the next but also you lose uh some of this uh benefits but in in way of like uh protection like i don't know that we really talked about that. We talked about it from, well,
Starting point is 00:56:06 I guess retention or no, not retention. Availability would kind of be like under the availability area. Cause like if you lost like that one data center, if then that kind of assumes that if you have access to that sand, then you're all in the same data center or else you couldn't be on that same sand, you know,
Starting point is 00:56:24 just because of the limitations of the fiber networks. But, yeah. Yeah, it's all the way down. It's like the area is. It so is. So to take it back to term-based partitioning, basically, you know, it's just what you mentioned. So we've got the global index now,
Starting point is 00:56:41 and the benefit is that you've taken the huge benefit of searchability from document-based partitioning and made another huge leap forward in performance. Because now, instead of having to go to each partition, which, if you have thousands of them, is really slow, network-dependent, and has a kind of fragility built in, you can just go to this smaller system and say, hey, where's my stuff? And right out of the gate you reduce that to a potentially much smaller number of operations you need to complete. But the downside is that the overhead of keeping track of those indexes is much higher than with document-based partitioning, because, like we said, if things get modified,
Starting point is 00:57:27 we have to go update the index. If data drops out because of retention, we need to go modify the index. Every time we insert new data, we've got to modify the indexes. And this is this one smaller system, but it's really high traffic. And it's a database of your database.
Starting point is 00:57:43 You want to delete that? You've got to go update this other database. Yeah, exactly. And just by having a separate index alone, you've got kind of an async problem built in there. I need to insert my document, but I also need to go insert the metadata about my document. And depending on how many secondary indexes I have, there could be multiple writes that need to happen in different spots just in my indexes alone. So now every single insert into my database is, I don't know, 10, 11 inserts.
Starting point is 00:58:15 Once you add replication, it's a multiple. So each insert is now, I don't know, 20, 30 operations that need to be written across some number of nodes. And that's not all going to happen at the exact same instant. And so that's why, when we talk about some of this stuff, you're basically talking about eventual consistency on a single write, which is part of the reason I kind of thought of this chapter as being closely associated with NoSQL type things.
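To make that write amplification concrete, here's a rough back-of-the-napkin sketch in Python. Every name in it (partition_for, count_writes, the field list) is made up for illustration; this isn't any particular database's API, just the shape of the problem being described:

    # One logical insert fans out into the document write plus one write per
    # term-partitioned (global) secondary index entry, and each index entry
    # can land on a different partition than the document itself.
    def partition_for(term: str, num_partitions: int) -> int:
        # A real system would use a stable hash; Python's built-in hash()
        # is just for illustration here.
        return hash(term) % num_partitions

    def count_writes(order: dict, index_fields: list, num_partitions: int, replication: int) -> int:
        writes = 1  # the document itself
        for field in index_fields:
            # In a real system this is a network call to whichever partition
            # owns this term, and it typically happens asynchronously,
            # hence the eventual consistency on a single logical write.
            _ = partition_for(str(order[field]), num_partitions)
            writes += 1
        return writes * replication

    order = {"id": 42, "credit_card": "4111...", "order_date": "2021-11-22", "customer": "jane"}
    print(count_writes(order, ["credit_card", "order_date", "customer"], 16, 3))
    # -> 12: four logical writes, times a replication factor of three

And none of those twelve physical writes lands at the exact same instant, which is the async problem being described.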
Starting point is 00:58:41 Yeah, I know that in the last episode, Alan and I had debated whether or not you think of these indexes as another database. And so we kind of talked about that, and even in the book, I'm trying to find it now, but there was a little blurb in one of the footnotes where they did make the point of referring to it as,
Starting point is 00:58:58 yeah, you could think of these as little mini databases within your database. Even though that wasn't how I have traditionally thought of it, and that's one of the things we talked about in the last episode, I've always thought of these as just different tables.
Starting point is 00:59:29 But I do get it, because of all the complexity around it and the management and operation around maintaining it. Now, Dynamo does say that you can generally expect a fraction of a second for the inserts to happen to both the data and the indexes. But you can imagine that would kind of stink if you're inserting data and something goes wrong and it takes minutes, and if I look up the data one way it works, but if I look it up another way it doesn't. And calling the secondary indexes an optimization is misleading, because it's not just about making it faster: if that secondary index hasn't been written yet, you won't find the data. So it's not just making it faster, it makes it findable at all. Well, that depends on how you're searching for
Starting point is 01:00:16 it, though, right? Well, I assume you're trusting the indexes. What I mean, though, is if it was written to the primary index and you're only looking for orders for today, then you'll find it. But if you were to look for orders by credit card, by a specific number, then yeah, maybe that secondary index hasn't been updated yet for whatever reason, and now you can't find it. And we've all been in that situation where we're like, I don't understand, I can find this data this way, but if I use this other method, it doesn't show up yet. Right, right. Which
Starting point is 01:00:53 is kind of funny. I think in the past we've also talked about, and only because you brought up DynamoDB, it kind of reminds me of eventual consistency: we've talked about how you could go to Reddit and submit a post, and then you refresh the page and you're like, wait, where did it go? Yeah, you know? I mean, it's not exactly the same as what you're talking about, but it just kind of came to mind. Yeah. It's kind of funny to think of;
Starting point is 01:01:22 at least I think about writing the data as either it's there or it's not. And so we talked about one of the strategies, read your own writes: we say, hey, I'm going to like this post on Reddit, but don't let the client return until we verify that we can look that like up and that we've got it in our database. So it comes back and says, okay, we got your like. And then maybe you refresh the page somewhere else and it doesn't show up, because that secondary index that keeps track of those likes, by user or by post or by date or however else they do it, hasn't been updated yet. And so it shows up as here on one page for your user and not there on another, which is just kind of confusing.
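As a rough sketch of that read-your-own-writes idea, here's a toy in-memory store standing in for a real replicated database. Everything below is hypothetical; the point is just the write-then-verify loop before acknowledging the client:

    import time

    class LaggyStore:
        """Toy store where a write only becomes readable after a delay,
        mimicking replication or secondary-index lag."""
        def __init__(self, lag=0.05):
            self._data = {}
            self._lag = lag

        def write(self, key, value):
            # Value becomes visible only after the simulated lag elapses.
            self._data[key] = (value, time.monotonic() + self._lag)

        def read(self, key):
            entry = self._data.get(key)
            if entry and time.monotonic() >= entry[1]:
                return entry[0]
            return None

    def like_post(store, post_id, timeout=1.0):
        store.write(f"like:{post_id}", "your-user")
        deadline = time.monotonic() + timeout
        # Don't tell the client "ok" until we can read our own write back.
        while time.monotonic() < deadline:
            if store.read(f"like:{post_id}") is not None:
                return True
            time.sleep(0.01)
        return False

    print(like_post(LaggyStore(), "reddit-post-123"))  # True once the write is visible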
Starting point is 01:02:03 Today's episode of Coding Blocks is sponsored by Datadog, the monitoring and analytics platform for cloud-scale infrastructure and applications. Datadog's machine learning alerts, customizable dashboards, and 450-plus vendor-backed integrations make it easy to unify disparate data sources and pivot between correlated metrics and events for faster troubleshooting. By combining metrics, traces, and logs in one place, you can easily improve your application's performance. And I can't emphasize enough that combining of all that in one place, because we're talking about indexes and databases as it relates to this episode,
Starting point is 01:02:40 specifically partitions and whatnot, right? And Elasticsearch keeps coming up in the course of our conversation. Of those 450-plus integrations that Datadog has, of course, Elasticsearch is one of them. Not only that, but they actually have a single-pane-of-glass view of your Elasticsearch cluster, to know, hey, what's the current rate of your queries and the current rate of indexing? So you want to know about the search and indexing performance of your cluster. What about resource saturation and errors? All that, with JVM heap and garbage collection metrics,
Starting point is 01:03:17 network collection, single pane of glass, all of that for your Elasticsearch cluster. And that's just one example of the many technologies they cover. Yeah, if you want to make great decisions about partitioning and nodes, you've got to have the data, and this is the way to do it. It's fantastic, and it looks beautiful. But I don't actually want to talk about that, because Dash just happened, and I was just browsing YouTube, and as far as I can tell, all the talks for Dash have been uploaded. So start with the keynote and then dive into some really great stuff there. I was just watching the Ask an SRE videos.
Starting point is 01:03:50 Looks like there are two of them, and I had only watched one of them, so I'm excited to check out the other one, for North America. They've got panels, case studies, a bunch of sessions on security and incident response, supply chain attacks. There's just really great stuff, and it's all up on YouTube, so you can be watching or listening to that while you're doing the dishes or, you know, hanging out.
Starting point is 01:04:11 So give that a shot, and then go and try Datadog free by starting a 14-day trial. And you also get a free T-shirt once you install the agent. Yep. So visit datadoghq.com slash coding blocks. That's datadoghq.com slash coding blocks, to see how you can unify your monitoring today. All right. So I don't even know how we do this anymore.
Starting point is 01:04:40 Am I supposed to? Because I kind of gave up on the late-night "Hey, listener." But then even in the last episode, in the background, Alan was still doing it. So I don't know. Maybe I'm supposed to go: hey, listener, if you wouldn't mind leaving us a review,
Starting point is 01:04:58 you need to head to www.codingblocks.net slash review, where you can find some helpful links and all your late-night coding favorites. Morgan Freeman. That was great. At first, I couldn't
Starting point is 01:05:18 tell if you were doing an Al Bundy or Morgan Freeman, but then, yeah, I mean, it's obvious. Well, that's how bad my late-night radio DJ voice is then, because it was supposed to be, you know, the smooth, sweet sounds of W-J-Z-Z. No, I mean, that was straight-up Morgan Freeman, just perfect. Now that you say that, I'm trying to think of a Morgan Freeman thing, you know, a catchphrase, and I'm coming up blank. Yeah.
Starting point is 01:05:56 So, all right. Well, once again, we will not have a survey for you to answer, because, you know, it'd be a little bit odd to only have one answer, and it would be super odd if you still lost, even to yourself. That would be uncomfortable. I would feel bad for you. I don't want to put you through that. That wouldn't be fair to you. I guess we're kind of in my favorite portion of the show. Survey says... Wait, wait.
Starting point is 01:06:18 Can you read this survey as Samuel L. Jackson? Wait, Samuel L. Jackson? Yeah. Well, I might have to drop way too many F-bombs to do that. All right, we're going to have to put an explicit tag on this show if I do it that way. Say what? What? No, I'm not going to. All right, so, well, okay, fine. How about I ask you this first, then: do you know why the new Kindle screen is textured to look like paper? No, I don't. So you feel right at home. Oh, that's terrible, Micro G. Actually, that was Dad Jokes. That was the Dad Jokes API. Yeah. All right. Okay, so for this episode's survey, we ask: how many different data storage technologies do you use for your day job? Now, this is important: it's the ones that you use, not the ones that are available in your company, but
Starting point is 01:07:22 just the ones that you use. So your choices are: just the one, and it's our hammer; or two to three, it's a quaint little data pipeline; or four or more, oh my God, why do we have so many; or none, keep your data crap out of my CSS. Or, you know, whatever your front-end choice of technologies is. I'm assuming front-end. Yep.
Starting point is 01:08:00 Develop, deploy, and scale your modern applications faster and easier. Whether you're developing a personal project or managing larger workloads, you deserve simple, affordable, and accessible cloud computing solutions. You can get started on Linode today with $100 in free credit for listeners of Coding Blocks. You can find the details at linode.com slash codingblocks. And Linode has data centers around the world with the same simple and consistent pricing regardless of location. And I don't know if you've ever looked at their pricing calculator. I was just looking at their website. But because the experience for Linode is so simplified compared to other cloud vendors,
Starting point is 01:08:43 there are really only a couple of boxes and sliders to deal with. It's really crazy, because their pricing is so reliable and affordable and predictable. It's literally: how much memory do you want, how many nodes do you want, and you've got a little slider there for transfer, and that's it. That's an incredible experience compared to some of the other things I've seen, where it's just impossible to know how much you're going to be spending. And it's such a relief to know. And with $100 in extra credit, you can see just how much you can stretch that. I mean, I was able to run a three-node Kubernetes cluster for months for less than $100.
Starting point is 01:09:21 Yeah. And when we say that you could use that $100 to go towards even personal projects, we're not kidding. What if... Now, hear me out, Joe. Hear me out, because this one's crazy. You ready? What if you just wanted to have your own CS:GO server? Oh, I didn't even think about that. Yeah. I'm not kidding. I'm not making this up. You could go to Linode, go to the marketplace, and there's a whole slew of things that you could just easily one-click add to your cluster, add to your environment, right? Including CS:GO. But I know you're thinking, yeah, but we've been talking about databases and whatnot. Okay, fine. MongoDB is in there
Starting point is 01:10:01 as well. Postgres is in there as well. You could have your traditional databases and play around with partitioning and learn about that, if that's what you want to do. But it's time to go have some fun, too, and play some games. And that $100 can go a really long way. Like Joe said, that pricing calculator makes it super easy to see what your cost is going to be. Plus, their costs are so small, you can really stretch that hundred dollars out. So it's, you know, high... what was the word I'm looking for here? Value. Bang for buck. It's extremely high value. So you can choose the data center nearest to you. You can receive 24-by-7-by-365 human support with no tiers or hands-off, you know,
Starting point is 01:10:48 or handoffs, rather, regardless of what your plan size is. Because haven't you always hated it: you call tech support at the time that you need tech support, and instead of being able to get to that person, you keep getting bounced around from one person to the next, or you can't even get to a person because you keep going through an automated system.
Starting point is 01:11:07 With Linode, it's human support, with no tiers or handoffs, regardless of your plan size. You can choose shared or dedicated compute instances, or you can use your $100 in credit for S3-compatible object storage, managed Kubernetes, and more. Yep, if it runs on Linux, it runs on Linode. Visit linode.com slash coding blocks. Again,
Starting point is 01:11:30 that's linode.com slash coding blocks, and click on the Create Free Account button to get started. All right. So let's talk a little bit about rebalancing, because we don't always get things right the first time, or things sometimes change: you have to add more nodes because you've got CPU or RAM problems, or maybe you've got lower traffic than you used to and you need to save some money. Sometimes nodes just go down and you need to be able to recover from that. Those are all examples where you might need to repartition your data, and basically, no matter how you do it, there are a couple of goals you generally want to aim for. The first is that you want to try, in most cases, to
Starting point is 01:12:13 distribute the load equally-ish. And I didn't say distribute the data equally-ish, because as we mentioned, there are different strategies for that, and sometimes you want to embrace hotspotting by having different hardware. Or you could have mismatched nodes, like different node pools where you're hooking up whatever hardware you've got in the closet. So you want to make sure the load is distributed equally, to keep consistent query times. Also, you probably want to keep the database operational during the rebalance. Now we're starting to get heavily into where I was thinking of
Starting point is 01:12:55 Kafka, as we started getting into the rebalancing and routing types of conversations, because Kafka especially can be a bit of a beast with this particular type of problem. Depending on the technology, repartitioning your data may be a click-of-a-button thing, like,
Starting point is 01:13:21 hey, it's done. Or it may be a huge effort to go and rekey all your data so that it gets into a new partition. Because with Kafka specifically, if you were to just add on another partition, it's going to be like, great, but all the old data stays over here, because that's where it is. You would have to specifically reread and rekey data to move it around. Yeah. Absolutely.
Starting point is 01:13:48 And there are a couple of different approaches and some pros and cons that we're going to get into. But there's one way that you absolutely should not partition your data, which is hashing by the number of nodes, like the number of computers involved. This part was so great. Yeah, I loved it. I hadn't considered it before. I don't think I've ever seen it partitioned that way, but the number of nodes changes, right?
Starting point is 01:14:18 Sometimes you'll add another. Sometimes one will go down, either temporarily or permanently. And every time you do that, if you're hashing by the number of nodes, then you're going to be moving data around a ton. And I didn't even realize how much it was until I looked at some of the examples in the book. Obviously, if you're going from one partition to two, at a minimum you're moving 50% of your data, right?
Starting point is 01:14:42 What I didn't understand is, even if you have, say, 100 nodes and you've got a key, I'm just going to pick one of 1000, and you drop a node so you go down to 99, well, 1000 mod 99 hashes to node 10, as opposed to node 0, so you've got to move that record. What if we went up to 102 nodes? Well, 1000 mod 102 hashes to node 82, so that key of 1000 needs to move again. We go to 103 nodes, guess what, it's moving again. So you are constantly juggling a ton of data, way more than you would think, every time you change the number of nodes, which is a ton of work. And all you need is for one part of your network to be slow to respond for the rest of the nodes to think, oh, we just lost a node, time to repartition the data. Yep.
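You can sanity-check that mod math with a couple of lines of Python:

    key = 1000
    for nodes in (100, 99, 102, 103):
        print(f"{key} mod {nodes} -> node {key % nodes}")
    # 1000 mod 100 -> node 0
    # 1000 mod 99  -> node 10
    # 1000 mod 102 -> node 82
    # 1000 mod 103 -> node 73

    # And it's not just that one key; almost everything moves:
    keys = range(100_000)
    moved = sum(1 for k in keys if k % 100 != k % 99)
    print(f"{moved / len(keys):.0%} of keys move going from 100 nodes to 99")  # ~99%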
Starting point is 01:15:35 And so yeah, you can have what they call the stampeding cattle, or whatever I've heard they call it, but basically the problem just blows out of proportion: it starts adding nodes really quickly because it sees the system is not performant, and now the rebalancing has gotten out of control and is taking forever across all these nodes, moving all this data around, which causes the need for more nodes. I hadn't heard of that term before. Is that really what it's
Starting point is 01:15:56 called? I probably got it wrong. Stampeding herd? Huh. Interesting. I mean, I could totally see it, though. It definitely spirals out of control fast. Yeah, I can't remember now. Oh, but if you remember, audience, leave a comment, because we're giving away copies of the book. Forgot to mention that. I forgot, yeah, which means I need to do the giveaway from the last episode. All right, so I wrote that down.
Starting point is 01:16:23 Yeah, there's some cool name for it, and we've talked about it on the show before. I'm sure if you heard it, you'd know it. But yeah, that is the one thing that the book cautions you about very strongly: do not do this, because you're tying yourself to the number of nodes, and you're going to be moving data around a lot more often than you might think. It's counterintuitive.
Starting point is 01:16:39 Well, the thing is that I had never thought about partitioning it that way, only because I guess I never had to. But when I read that, I was like, well, I could definitely see why somebody might be tempted to partition it that way. Yeah. Right? Like, you know,
Starting point is 01:17:06 well, hey, I have 10 nodes, let me just partition it evenly across my 10 nodes. So whatever your key is, I'll just mod that, and that's where it falls. But like you said, as your nodes come and go, it would really cause problems for your data, and you'd constantly have to move it around. And that's where that SAN idea was coming back into play for me. I was like, well, I guess it wouldn't be too bad if you were able to live in a world where everything could share the underlying disk structure, and because it's all fiber, it could be super fast. But that's not the real world, more often than not.
Starting point is 01:17:54 Yeah, luckily I've never fallen into this. But I can see myself being in a case with Kafka where, let's say, I've started a small project. And let's just say, for example purposes, I've got one broker. I can see myself having one partition for the topic that I care about. And then I add another broker. Well, now I should go ahead and add another partition. Adding partitions in Kafka, changing those partitions, is painful. You basically have to rerun everything through.
Starting point is 01:18:18 It's slow, and it's a problem. And you've got to pick when to swap over, and you've got data coming in in the meantime. It's just a pain. It's actually not even advisable, by the way. If you really wanted to do that, the more advisable way to do it in Kafka would be to set up an entirely separate topic that is partitioned the way you want, then parallel your writes to both topics, and then swap over to the new one as you can. Because otherwise, you'd have to reread the original topic from the beginning of time to rekey it, to get the data spread across the partitions the way you want.
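Here's a sketch of what that backfill might look like using the kafka-python client. The topic names, the broker address, and the simplistic "stop when caught up" timeout are all assumptions for illustration:

    from kafka import KafkaConsumer, KafkaProducer

    consumer = KafkaConsumer(
        "orders-v1",                    # old topic with too few partitions
        bootstrap_servers="localhost:9092",
        auto_offset_reset="earliest",   # reread from the beginning of time
        enable_auto_commit=False,
        consumer_timeout_ms=10_000,     # crude "we're caught up" signal
    )
    producer = KafkaProducer(bootstrap_servers="localhost:9092")

    for msg in consumer:
        # Same key, new topic: the default partitioner re-hashes the key
        # across orders-v2's larger partition count.
        producer.send("orders-v2", key=msg.key, value=msg.value)

    producer.flush()
    # Meanwhile, live writers dual-write to both topics until readers cut
    # over to orders-v2; if ordering matters, finish this backfill before
    # letting new traffic into the new topic.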
Starting point is 01:18:54 And even with the SAN idea that I mentioned, where that kind of falls apart is that it only works if you're able to take the partition as is. But in this rebalancing portion of the book, to go back to your example, I think you used HBase or Couchbase, I forget which one you said, you went from one to two nodes and you were literally rewriting 50 percent of the data, because you're physically moving the data from one partition to another. Even if you did have access to the same disks, you're still moving it to a different part of the disk, so you'd still have the I/O expense there. That's a really good point.
Starting point is 01:19:44 Yeah. And I should mention that when you go from two to three, it's not just 30% of the data that's moving. You have to move 30% of the data, but then you're also rejiggering the data on the nodes that you just dropped that data from, in order to take up that space. You're touching 100% of the records, even if you're only writing tombstones to 30% of them. You know what I mean? Yeah, because you're constantly moving bits around. Yep, absolutely. You can't just leave these giant holes of zombies, you know; you have to fill that stuff in, or otherwise you don't get the benefits of the other kinds of lookups. And depending on the underlying technology, too, you might have compaction that would eventually kick in based on the amount of data, which would likely just result
Starting point is 01:20:30 in rewriting the entire smaller portion of it. So that's where you get into that 100% thing. Yeah, it really grows out of control depending on what you're trying to do. And it was surprising, too, that some of the systems they talked about here actually have this capability as a built-in feature. I think I alluded to it before; HBase maybe was one of them, where it could dynamically expand or contract the partitions based on a file size limitation. Yeah, absolutely. HBase and RethinkDB were the examples.
Starting point is 01:21:04 And apparently Mongo has an option for it, which is really cool. Yeah, sorry, that was a typo. So, going back to the Kafka example real quick: I wanted to point out that if you care about the ordering of the events, you have to fully copy that data to the new topic before taking anything new into the new topic. So it's almost like you're going to have some downtime there. And there's no automatic way to do it, because Kafka knows that this is a terrible operation and they don't want you to do it. But I can see myself naively getting into that situation: I had one node, I have one partition, why not? I add another node. Well, crap, I've got to do that expensive operation. Fine. And we add a third one. Now I'm realizing the error of my ways.
Starting point is 01:21:46 But is there something I could have done to prevent that situation? And the answer is yes. The widely accepted solution to that problem is to have a number of partitions that is greater than your number of nodes. How many partitions are you going to have? How many nodes are you going to end up with? That's a hard problem, because you have to play a guessing game: you have to guess at how many nodes you could feasibly have in some imaginable future, and then come up with a number of partitions that's greater than that. And there's a huge benefit here, which is that
Starting point is 01:22:20 you don't have to move individual records when things move, you move whole partitions. So in the case where we say we have one node, and let's say it's got 10 partitions on it for a single topic, or a table or collection on a database, now when we add a second node, we have to move half of our partitions there, which is much easier than going record by record, because we've got these things already split off. And so we can move those partitions, and it's a much, much easier move.
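Here's a toy illustration of why that helps, using plain round-robin assignment; a real system's placement logic is smarter than this, but the shape is the same:

    NUM_PARTITIONS = 10  # fixed up front, bigger than any node count we expect

    def assign(num_nodes):
        # partition -> node; keys always map to partitions the same way
        # (e.g. hash(key) % NUM_PARTITIONS), so records never change partition
        return {p: p % num_nodes for p in range(NUM_PARTITIONS)}

    before = assign(1)  # one node owns all 10 partitions
    after = assign(2)   # add a second node

    moved = [p for p in range(NUM_PARTITIONS) if before[p] != after[p]]
    print(f"partitions that move: {moved}")  # [1, 3, 5, 7, 9]: whole partitions, not records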
Starting point is 01:22:59 You know, we were talking earlier about what the value of this book might be if you only ever stayed in one technology. And I can think back to some of my earlier experiences with Kafka, prior to reading this book, when it would come time to repartition things: why is this so complicated? I don't understand why it doesn't just support this. Why is this such a hassle? And now I'm like, yep, totally get it. Totally makes sense to me now. Yeah, I remember us running into the problem and just being
Starting point is 01:23:30 kind of mad at it. Like, what do you mean you don't have a way for me to change this? That's such an obvious feature that anyone would want. And then when you start thinking it through, it's like, oh, there are actually a bunch of trade-offs, and there are some decisions where you almost need to have a manual process, because there are things that they can't do for you. Yeah. And it's kind of like your comment a moment ago that made me think about this, because you'd said they're trying to prevent you from doing things. So, kind of going full circle, right, like our "I" is for Interface episode, right?
Starting point is 01:24:05 Interfaces being like guardrails, you could extend the same concept to whatever technology you pick. In this case, Kafka, for example, is your guardrail.
Starting point is 01:24:16 And they're trying to prevent you from doing certain operations that they've deemed to be inefficient for their use case. Right. And so they're like, no, we don't support that; if you want to do it, you're going to have to go through some pain to make it happen. Yep, absolutely. And so, you know, that example that we started with, where we said we've got seven
Starting point is 01:24:35 years of retention for credit card transactions. Sorry, I'm typing along, updating my notes here. So with that seven years of partitioning, if you're going from 10 nodes to 11, well, okay, you're going to have to go through those partitions and figure out that roughly 10% of them are going to need to move. But that other 90%, those other partitions,
Starting point is 01:24:57 you don't have to touch them, you don't have to think about them, they're out of the equation. If we didn't have this case where the number of partitions was higher than the number of nodes, we'd be looking at every piece of data, which is just awful. Yeah. And what if you have more nodes than partitions, right? What if you had thousands of nodes in that case? Well, I mean, would it even have to be that extreme? It could just be that you have three nodes and two partitions. Yeah, then you're not using all your nodes. Yeah, absolutely. You're going to have nodes that just aren't doing anything; they have no data on them. Yeah, just a waste of money. Yep. And
Starting point is 01:25:40 they say most vendors don't actually support it, so it's not even sensible in most cases. Kafka, you can absolutely do that with. Oh, yeah. And there are a couple of different cases, with consumers and producers, where you care about partitions. And the only thing I can say there is that sometimes you'll want to over-provision, so it kind of avoids a cold start problem. One use case I thought of here is if you're launching a big video game, a big online video game.
Starting point is 01:26:09 Say Call of Duty 2022 is coming out, and you know that you're going to be going from zero users to a million users in an hour. That can be a really rough scale problem. And so you can do some paper napkin math and say, you know what, let's go ahead and pre-provision 100 nodes and just have them ready, and we'll pre-provision a certain number of partitions that's greater than that number of nodes, so that when people start coming online, we can start balancing that stuff out better. So it avoids that kind of problem.
Starting point is 01:26:49 And I skipped ahead there. The downside to that pre-provisioning, though, is that it flies smack in the face of horizontal scalability, right? If you put your Kubernetes hat on and you want elasticity, then if I don't need 100 nodes, I don't want to have to pre-provision them. But specifically to pick on Kafka for a moment, in the things that I've read about Kafka provisioning, the advice is to architect the system for the next two years. So don't think about your needs today,
Starting point is 01:27:25 but what your needs are going to be in two years. That's how you need to architect your topics and brokers and replication factor and number of partitions and whatnot. Right. So you have to really think through your use case, think through where you expect to be within two years, and that's the way you should size your Kafka environment. So, like that cold start problem you mentioned, you're definitely heavily front-loading it to avoid future problems, because Kafka has those limitations around, oh gosh, I need to repartition things. So yeah. Yeah, absolutely. And so, yeah, it
Starting point is 01:28:13 kind of flies in the face of all this stuff we spend so much time talking about, like this ability to dynamically scale things. And this is a case where it's like, whoa, whoa, step back a little bit; cold starts are serious, and you want to put some thought into these systems ahead of time. Which is unfortunate, but after we talk about this stuff, it makes a lot more sense. I mean, serving data is so much harder than serving a website, or like static HTML, because anytime you have to deal with state... just think back to everything we've ever talked about with unit tests, for example. Anything that doesn't involve state is super simple, right?
Starting point is 01:28:51 It's super simple to write unit tests for. It's super simple to scale, because you're like, oh, well, if I have 1,000 of them, then it's going to be 1,000 times more performant. But anytime you have to deal with state, which is what all data is, it's infinitely more difficult to work with.
Starting point is 01:29:28 Yeah, and there's been talk about scale-to-zero systems, like when we talk about serverless functions and things like that. Sometimes they'll talk about basically having a service that's free until you need it, and then you pay for the usage. And that's really hard to do. I think a lot of people don't realize how hard it is, because you need to start somewhere, and it takes time to get up to speed. So going from zero to thousands of requests per second, that's really tough, and you're probably going to lose a lot of those first requests. You just can't accelerate that fast, because there's overhead. And then it's easy to accelerate too much, and now you're spending too much money. So scaling all the way down is really tough, and scaling fast is tough. Yeah, scaling is tough. Well, so why don't we just have a million partitions, right? We just create way more than we
Starting point is 01:29:56 need, and we never have to worry about it. Well, the downside there is that there's overhead. Yeah, and we kind of talked about that on the last episode. Again, I keep picking on Kafka here, because I keep putting on my Kafka hat as I read through this book. But the specific example that I gave last time was that for each partition, within that directory, you actually get at least two files. So then, depending on your operating system, there is a limitation on the number of files that you can have open, which in Linux you can actually configure. But then there's the overhead of just having the file handle open itself; there's some memory overhead related to that.
Starting point is 01:30:41 So you can run into problems if you try to have those million partitions; there's an amplification factor to consider, because each partition is actually a couple of files. Yeah, absolutely. And we mentioned, too, if you're rebalancing, it's really nice to have more partitions than nodes, because you only have to look at subsets of your data. On the flip side, imagine the alternative strategy of effectively no partitioning, where each record is its own partition. Right. That's the extreme other end of it. That's the same as looking at every piece of data.
Starting point is 01:31:19 It's like you have completely negated the benefits of partitioning by having one partition for every record. And so you've got to try to find that sweet spot, and it depends on your use cases, obviously, but I just want to point out that you can't just pick a huge number, go with it, and be safe. Yeah, it really helps to know your use case, know your data, and go from there. I'm not sure how this works from an elastic scaling perspective, and I've never thought about it this way related to SQL Server or Postgres or anything like that. But specifically in the Kafka world, you actually want to think about it in terms of bytes.
Starting point is 01:32:03 You want to think about it in terms of the size of what you're going to be transmitting back and forth, and what latency is going to be required for that transfer, and then you size it based on that. But I've never thought about sizing a SQL Server or a Postgres or anything else like that. Have you? No, definitely not. I kind of wonder if part of that is because Kafka kind of views things as chunks of data. As it writes stuff in, it writes it to contiguous space on disk, so it's almost like a big buffer.
Starting point is 01:32:42 And it keeps track of where the offsets are for each individual message. It's got like a separate pointer; it says, I know file one starts here, file two starts there. So it kind of throws that stuff in as a big blob and then figures it out later when you ask for it. Kind of reminds me of the deli counter at a Publix or something like that, where you go and order a pound of cheese or a pound of meat, and they go through and slice, slice, slice. The data is stored together as that big hunk, and then you come and say, well, hey, give me a pound, or give me 10 slices, and it can do that. Now I want some deli cheese, thanks. I know. And that is really how the consumers work in Kafka, too.
Starting point is 01:33:19 It's funny, this is a total tangent, but with traditional queuing systems you might say, hey, give me a record and I'll do something with it. With Kafka, you generally tell it to poll based on a batching system. You say, hey, give me either some number of messages or some amount of time, and then I'll stop. So the consumer will say, I'll take a thousand messages or ten seconds, whichever comes first, and then it'll get that big chunk of data, because like we said, Kafka stores stuff in these kinds of chunks. And it'll just slice off as much as you want, subdivide it as necessary based on the offsets, and then there you go. There's your data.
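That batching behavior looks something like this with the kafka-python client; the topic, group, and broker address below are placeholders:

    from kafka import KafkaConsumer

    consumer = KafkaConsumer(
        "orders",
        bootstrap_servers="localhost:9092",
        group_id="order-readers",
    )

    # Returns once up to 1000 records are buffered or after 10 seconds,
    # whichever comes first: one big slice off the log, not one-at-a-time.
    batch = consumer.poll(timeout_ms=10_000, max_records=1000)

    for tp, records in batch.items():
        print(f"{tp.topic}[{tp.partition}]: {len(records)} records")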
Starting point is 01:33:56 But a big part of why Kafka works so well, and why it works so well in low-latency environments, is that it's really good at dealing with chunks of data. And because of that, they think of it in terms of data size much more so than actual files, because it's not a whole bunch of little files on disk, it's like one big one. Which goes back to what I was talking about with the sizing. Yeah, absolutely. I ended up creating a whole thing, I called it the Kafka calculator, in order to figure out how to size a given environment based on the number of connections in and out, and clients, and expected data size, and what kind of latency you want, et cetera, et cetera.
Starting point is 01:34:35 And I'm not the only person that has done something like that for their particular usage of Kafka. Because at the time, when I was trying to work on sizing Kafka, there were other people that had done similar things to size it. Because it is very specific to the size of the data that you're going to use, and, to your point, to wanting to serve it in contiguous blocks.
Starting point is 01:35:06 Yeah, that was really great. There were definitely assumptions that were wrong, and if we didn't have that calculator, it would have been really hard to know how you ended up with the numbers you did.
Starting point is 01:35:18 So if you come back and say, oh, you know what, my average message size is actually seven kilobytes, not five, it's not going to change things by, you know, seven over five. It's going to be much more dramatic than that. And so it was crazy to see just how much the numbers would change based on one piece of information. Yeah. And depending on what your storage technology is, they'll have a recommendation like, hey, any given node should be responsible for no more than this amount of data or throughput
Starting point is 01:35:43 or whatnot. And that's definitely true in Kafka. They have recommendations for topics and partitions to be spread apart so that no broker is responsible for more than a certain number of partitions, in an ideal situation. And you can start to see it, especially if you were to live in a world where you're creating partitions, or I'm sorry, topics, on the fly. Maybe it's like, hey, Joe just signed up for my new service, so I'm going to go spin up these five topics related to Joe. And then your service becomes wildly popular, and you're spinning up five topics for every new user that signs up; it can grow out of control fast. And that's where this whole file handle
Starting point is 01:36:30 thing comes into play. Because now you're at the scale of Facebook, and you have three or four billion users a day logging in; it can get out of control quickly if you don't think about it up front. Yeah, and even just adding five topics is a lot of work. So you get a new customer, you add five topics, say they've got 10 partitions in each topic, which is a fairly normal number, and then a replication factor of three, so each one of those partitions is replicated three times. Well, we just added 150... wait, did I do that right? It's going to be five times ten times three. Yeah, so 150 partition replicas added to our system, which is, whoo, that adds up fast, right? Just for each customer. Yep.
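As a quick sanity check of that napkin math in Python:

    topics_per_customer = 5
    partitions_per_topic = 10
    replication_factor = 3

    # Every partition replica is real files and file handles on some broker.
    print(topics_per_customer * partitions_per_topic * replication_factor)  # 150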
Starting point is 01:37:12 All right, well, what about rebalancing? So we've talked about hotspotting a little bit, and we've talked a few times about HBase and how it's really good about doing it automatically, dynamically. I skipped a section, I know. I mean, kind of; we've talked about dynamic partitioning already, we've kind of hinted at it, because trying to determine the number of partitions up front can be difficult, so some systems will allow you to do it dynamically. I think we mentioned HBase and RethinkDB were a couple of examples of that. Yeah, I wanted to point out that it's really important to know that there's no magic algorithm
Starting point is 01:38:03 that they're using to split your load up super intelligently. They're just kind of slicing it up in chunks. So you add a new node, and it says, all right, you get half, and I'll move that stuff over. And it's really nice that you don't have to think about it; it scales really well. Cassandra has an interesting way of doing partitioning as well, because it's got the whole leaderless architecture that deals really well with adding new nodes and being available even when nodes disappear.
Starting point is 01:38:32 It's a different use case, but it just handles things really well there. But there's no magic. Ultimately, it's doing some kind of dumb math in order to figure that out. Which might work against you, though. If all it's doing is splitting based on, well, the file size has now reached some maximum,
Starting point is 01:38:51 so I'm going to give half to this one and half to the other one, then the half of the data that the new node gets could be the half that's most heavily requested, you know, if all it did was simply divide it in half. Right. Yeah, and the cold start problem is really bad there too. Like, you launch Call of Duty 2022, and HBase starts out at zero, and then it fills up really quickly and it splits in two, and then those fill up, and it just keeps going. They did talk about,
Starting point is 01:39:22 in this portion of the book, though, for that specific problem: even in the automatic world, the dynamic partitioning world, you would probably want to go ahead and start it out with some minimum that you thought would be reasonable, so that you don't get into that stampede problem you referred to earlier. Yep, absolutely. And so, yeah, we just talked about systems that automatically rebalance, and we talked about the kind of downside there. It is really nice if you have workloads that are well homogenized or known, where it doesn't grow too quickly and you don't have hotspotting
Starting point is 01:40:00 issues, so that's really nice for that. But you may have situations where you really want to protect your system, like maybe you don't want to rebalance until off hours because you don't want to cause unnecessary strain, or, you know, that video game launch we talked about, where you want to scale up ahead of time and have stuff over-provisioned for a period because you expect a really quick flood. So there are considerations there as to whether you even want it to be automatic or not. And then some systems, I mentioned Couchbase, Riak, and Voldemort, will suggest the partition assignment, but they won't do it.
Starting point is 01:40:36 They kind of wait for you to hit the enter key, in order to make sure it happens at a good time and that you're around in case something bad starts happening, like nodes starting to report as unresponsive because they're rebalancing. You don't want to end up in a situation where things go off the rails.
Starting point is 01:40:54 Yeah. Now we're getting into the really fun part, though, where it's turtles all the way down, which is the request routing part. So I'm going to throw this out there, and you tell me, as we go through this, if this is kind of what you were thinking. Basically what we're talking about, writ large, is just service discovery. But I was thinking about this from a Kubernetes point of view, where I'm like, well, you have the Kubernetes service that could be the front, the load balancer, across multiple of these things, right? But then any one of those things might then be the front for its given nodes or partitions. Because if the service is pointing to,
Starting point is 01:41:35 say, five brokers, each one of those brokers is going, oh, well, here are the nodes; it's doing its own little service discovery within it. So that's where it was turtles all the way down, because this whole request routing concept shows up even at the physical network layer, and then once you get to a particular node, in how you'd get to a virtual file system, you know. So with all that said, just keep that in the back of your mind, and then we
Starting point is 01:42:06 get into this section. Yeah, so we talked about having a database for your database, and a database for your indexes. Now we're talking about having a database of where that stuff is, yeah, and then maybe having something like a ZooKeeper in the background to keep track of kind of that communication. See, and we all thought we were being clever when we made a meme out of Xzibit. It turned out he was really onto something; he knew all of this well before we ever did. Yeah.
Starting point is 01:42:33 He knows. He knew. Yeah. And I remember one of the first things that I was really frustrated with when we first got started with Kafka:
Starting point is 01:42:39 I want to get started with Kafka, you download a Docker Compose file, and it's got like eight different systems. This is terrible. This is such a bad database. And now I understand more about it. It's like, oh, actually, it's leveraging these really smart systems that do these
Starting point is 01:42:53 really important things. And it's kind of cool to be able to see those moving parts and to be able to scale them independently. It kind of exposes more of the gears than a traditional relational database, so you can see the parts of the engine. Now that I know more about it, it's really cool to see all that stuff being there. Yeah.
Starting point is 01:43:11 It's not so cool when you don't want to be the mechanic, though. Well, yeah. I just wanted the engine. Why do I need to know all this stuff? I'm just trying to hello world here. Right. Yep. Yeah.
Starting point is 01:43:28 Right. Yep. Yeah. So this is kind of an instance of a more general problem that you'll see in distributed systems, called service discovery, which is: how do I know when new things are coming in and out? If services can scale themselves up and down independently, and users are doing deploys, and systems are just crashing, how do I know, as a client, what the heck is going on and who I should even be talking to? Specifically referring to partitioning, there are a couple of different ways to solve this. The first is that nodes keep track of each other, which is almost like an anarchist, kind of hippie view of the world. You just talk to any node, and that node knows where else you need to go. So you can imagine each node is responsible for doing routing.
Starting point is 01:44:09 As long as you know one node, that one can eventually get you to the others, because it just kind of propagates in a nice little mesh. And that's how Cassandra kind of works, in a leaderless situation where it's just very,
Starting point is 01:44:22 very resilient and highly available. So that works out really well. And eventually the clients can figure out more information about those nodes and keep track of them, and if they can't get to one, well, they can try another one in the list, and it'll just make it happen somehow. Of course, I'm going to go back to my canonical reference for this, because it made me wonder if that's what Kafka is moving to. I'm skipping ahead here a little bit, but Kafka has traditionally relied on another Apache project called ZooKeeper to handle that type of request routing, to know what node you needed to go to for a given partition. I haven't looked at it recently to see if it was implemented yet, but I know that Kafka had an open issue that they were planning to iterate towards,
Starting point is 01:45:20 where they were going to remove their dependency on ZooKeeper. And I don't know if they were planning to go to this Cassandra kind of model, where every broker would know, so you could just connect to any one of them and it'd be like, oh no, you need to go talk to my friend over here, go talk to broker number one. Hey, I haven't looked really closely at how they're doing that. I know it's ultimately stored in Kafka, so it's kind of Kafka on Kafka there. But yeah, I don't know how that works. Traditionally, they had a bootstrap service that you could talk to that was aware of all the other nodes. But yeah, I don't
Starting point is 01:45:54 know what's happening there. Yeah. I know originally, when I first started working with Elastic, just kind of experimenting in the pre-Docker world, if you wanted to spin up two nodes locally, you'd have to tell them about each other. So you'd have a hosts file entry, for example, on your computer, and it was like elastic one, elastic two,
Starting point is 01:46:11 and then in the configuration for elastic one you'd say, hey, there's also another node named elastic two, and for elastic two, you'd have to tell it about elastic one. And if you add a third, guess what?
Starting point is 01:46:20 You have to go in and update that configuration and restart the Elasticsearch service so now it knows about all three, and that's how you would bring things online. And I don't know if it still works that way; I know the Kubernetes operators or whatever kind of just do that stuff in the background. Or does a service come up and go out and look around, like, hey y'all, I'm scouring your network? I don't know how that works, but is it just doing an nmap to see who responds on what port? Yeah, which sounds crazy. It's probably not doing that. It seems
Starting point is 01:46:49 almost dangerous. Yeah, you'd hope it doesn't do it that way. That sounds like a huge security flaw. Yeah. I worked on a feature for backup software at one point that we called resource discovery, and it was basically the same thing: it would go out and look for systems that it knew how to back up. Thinking back, though, it's crazy to think we'd just look through everything we could see on your network for, I don't know, Active Directory or SQL Servers or Exchange servers: I'm just going to look for those and then try to set up backups on them. It's like, oh, you should probably not have your network open to any software that can just go in and start messing with that data. But yeah, that was fun. Oh yeah, so the first approach we talked about was nodes just knowing about each other; the second one is
Starting point is 01:47:36 a centralized routing service that the clients know about, and that's like the bootstrap service I mentioned. As long as you know how to get to the bootstrap service, the bootstrap service knows how to get everywhere else. Sometimes this is called out explicitly, like we mentioned with Kafka, where you would typically set up literally a bootstrap service, and sometimes it's kind of hidden behind something else. I think Mongo is one of the ones that has this other system set aside that does it, that they kind of mask away, but when you put together your connection string for getting to the database, that's the address that you use.
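In client terms, that bootstrap pattern looks something like this; the addresses here are placeholders. You hand the client one or two well-known entry points and it discovers the rest of the cluster from there:

    from kafka import KafkaProducer

    # Kafka: any broker in this list can answer cluster-metadata questions,
    # so the client learns about every other broker and partition from here.
    producer = KafkaProducer(bootstrap_servers=["broker1:9092", "broker2:9092"])

    # Mongo hides the same idea behind the connection string: in a sharded
    # cluster you point the client at the mongos router, not at the shards.
    # For example, with pymongo (assuming a mongos at this address):
    #   from pymongo import MongoClient
    #   client = MongoClient("mongodb://mongos-host:27017/")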
Starting point is 01:48:09 of the master node in elastic in that case yep absolutely yep so if you have one master or multiple master either ways like that's what the client talks to the client should not be directly talking to any data nodes yeah and you know unless the master said these are the ones you need to know about well even then though like the master doesn't route you to the client the master abstracts that right i'm not sure i don't know i think it kind of hides it from you so i don't i don't know okay yeah that's a good question uh i don't know how i get i'm so bad with networking stuff too like even like load balancers like i feel like i don't know how i get i'm so bad with networking stuff too like even look load balancers like i feel like i've looked up how load balancers work like a dozen
Starting point is 01:48:49 times but i still think about it like so wait when i talk to the load balancer does it like funnel the traffic through it and back out or does it tell me about the ip but then how does that work if my brain explodes well that's why i was like as i said it there i was like well i can't do that it has to like throw you over to the node, right? Because otherwise that would mean that that one master becomes the bottleneck for all IO in and out of that storage, out of that index, right? Which can't be. That would be, that sounds awful. But maybe I'm wrong and it totally does work that way.
Starting point is 01:49:25 I don't know. Let us know. You might win book. Yeah. Yeah. Enter, enter a comment in on, uh,
Starting point is 01:49:31 you can go to, uh, coding blocks.net slash episode one 72 and you can leave us a comment and, you'll be entered in for a chance to win a copy of the book. Yeah. I'm taking another, uh, note here for load balancers networking. We should do an episode on that so I know what the heck I'm talking about. Yeah.
Starting point is 01:49:53 Networking in general. Yeah, exactly. Talk about turtles all the way down, man. Oh, gosh. Yeah, no kidding. Let me change. Get into all the different layers of networking. Do you know the different layers? The seven OSI osi well i didn't mean the numbers oh yeah there's like application layer
Starting point is 01:50:12 you're talking about right like uh application the hardware layer yeah i might be able to figure it out but network translation i doubt that i'll be able to get all seven uh but you you started off pretty good though that you knew that there were seven so you know kudos well i haven't yeah i i don't know how any of it works aside but it's like one of those things that like i just happened to hear like once and for some reason my brain just like remembered that but i've never actually used it and like i never i don't know how to put it into practice or make use of that information yeah definitely it's one of those things that every time i hear it i'm like okay yeah that sounds cool that makes sense and i want to try to remember that but i never use it and so therefore because i don't put it into practice i always forget it yeah exactly like what were the
Starting point is 01:50:54 seven layers again and when you tell me is layer five really never comes into play or you know whatever exactly so it's like uh that and it's like stored in my brain right next to like the pokemon theme song we're like if i start singing it i can do it or the alphabet i can sing the song but you know you want me to use the letters i don't know i don't okay so i'm not the only one that like when you were like trying to think of a particular part of the alphabet you like instinctively like the song replays in your head. Elemental people. Yeah, exactly.
Starting point is 01:51:28 Okay. Or if you ever tried to do – someone asked you a password or you're telling someone a password that you type often and you don't have a keyboard. So you're like, it's the index figure of H. And the next one is... You have to almost type it out to remember what the password is. So we've talked about a couple ways to solve it. So nodes keep track of each other.
Starting point is 01:51:55 There is one other, which is the client be aware of every single node and partitioning data, which sounds crazy. I don't know any system that actually does that, but you can see, presumably, the clients all just keep track of it and so a new system comes online someone has to go out and say hey client here's another uh system that you need to talk to uh you know just keep that in your back pocket for when you need to and whenever that gets removed or added someone goes out to all the clients and lets them know like, you know, that's been removed.
Starting point is 01:52:27 I wonder if that happens more in like, um, I'm thinking like a Splunk or something where it's not so much like a necessarily a database, but like, I don't know. Someone has to go in and actually configure these nodes to kind of talk to, uh,
Starting point is 01:52:37 you know, the almost like agents that talk to different services. I don't know. I was trying to think through of like an example of that and just coming up blank though of an example where the client knows. Well, okay. Okay.
Starting point is 01:52:55 I guess, I guess I kind of, I kind of, I do remember now as I was reading this book that like the one thing that did kind of come to mind of this is that, gosh, I got to stop talking about Kafka. Because the client does kind of know based on the key, like, oh, this is where that's going to go. Yeah.
Starting point is 01:53:22 The producer has to know and the consumer has to know. Yeah, there is some awareness there but i don't i don't but but the way you said it though made me think of like a totally different world like you bring in a new a new one and you're like gotta reach back out to the clients and say like oh hey clients we've added a new one yeah definitely something you would not want to do if you know often so if changes happen a lot but i thought maybe dns would be an example like you're bringing a new dns server online or you're changing your dns servers and you would probably go into each uh you know node or agent or computer add the new systems
Starting point is 01:53:55 and then eventually after its transition you'd remove the old ones but that like happens never and hopefully you're not, I don't know. I guess you do use IP addresses for DNS servers, right? So, because otherwise you'd be in trouble. You had to like look up your DNS servers via DNS, but yeah,
Starting point is 01:54:20 off the rails. So yeah, that's the only thing I can think of. And the book doesn't give any examples of it. But no matter which way you go, off the rails. So yeah, that's the only thing I can think of. And the book doesn't give any examples of it. Um, but no matter which way you go, uh, partitioning,
Starting point is 01:54:28 no changes need to be applied somehow. And it's really difficult to get right. And even though it sounds like it's, you know, just another database. And this is where a zookeeper comes in, which is a really popular, uh,
Starting point is 01:54:39 system for keeping track of configurations. It's really resilient. Um, it's used with a ton of other Apache systems and just used all over the place for doing things like this or for coordinating. But you don't normally talk to Zookeeper directly. Like it's not your routing service,
Starting point is 01:54:56 but it's the thing that's in charge of updating the routing service or that the routing service will pull for changes. And so it's just a really resilient key value store that you'll see used in a lot of projects. And like we mentioned, Kafka uses a Druid, a solar,
Starting point is 01:55:11 MongoDB has something similar. So it's just a really popular tool, but it's really basic, but apparently it's really hard. And it goes, you know, one thing that's really important about it is if it's storing changes about what other nodes and services are available,
Starting point is 01:55:28 it also has to keep track of its own. Cause it not like it can defer to some other system it's the authority so if one zookeeper node goes down it needs to still be able to function and you know we've talked about different problems with kind of network partitionings or bad things that can happen like zookeeper really needs to be able to weather all of those storms. It's like the single point of failure for a lot of these systems. So then you end up with a Zookeeper cluster to keep up with. So it really does become like a whole other cluster to maintain an index to tell you how to get to your main index
Starting point is 01:56:04 and what part of that cluster you need to get to your main index and what part of that cluster you need to go to to read it. Yeah. Yes. Yeah. It's funny. The turtles all the way down. It's like,
Starting point is 01:56:12 I don't really trust my cluster. So I'm going to come up with a, another cluster that'll keep track of the nodes that are in that cluster. So in case the first one goes down again, exhibit comes to mind. He's like, yo dog, I heard you like clusters.
Starting point is 01:56:25 Yep. Yeah, absolutely. So yeah, the comes to mind. Like, yo, dog, I heard you like clusters. Yep. Yeah, absolutely. So, yeah, the idea is that they keep the functionality, like the feature set, really limited. Just really, really, really focus on being as close to perfect as possible. And so that zookeeper keeps the zoo in line. And so Cassandra and Rick react, use a gossip protocol. We mentioned about those keeping track of each other.
Starting point is 01:56:49 So they, they kind of do these things where they're constantly chatting about chatting to each other saying like, Hey, do you know I'm a node? You're a node. Do you know any other nodes? I don't know about like they're kind of playing go fish or something.
Starting point is 01:57:00 Had you heard of that information? Had you heard of this gossip protocol? A little bit. When we talked, like we, you know, data stacks, uh, something had you heard change that information had you heard of this gossip protocol a little bit when we talked um like we you know data stacks uh wanted to show a while back uh and we got a kind of a tour of how they work and cassandra and so we spent a little bit of time there looking at it so i read a little bit about then i thought i was it's it sounds cool and that's where you know i said i mentioned that like feeling almost like hippie-ish with like leaderists and um it just means that i don't know the whole spirit of Cassandra is just really cool and different to me.
Starting point is 01:57:29 And so I'd read a little bit about it then. Yeah, I just, it's not like a term you come across a lot. So I really liked it because I was like, well, it truly is what it is. Because it's like, you know, yeah, I heard that there's this other node over here called Node 3. But, you know know by the time you go to get to node three it could already be down so that's why it's not like authoritative it's just it might still be there i don't know yeah it's pretty funny to think about these notes kind of you know chatting over playing cards or whatever like it's like oh hey alice haven't seen
Starting point is 01:57:59 you in a while how's it going he's like all right you know i just met a new friend named fred the other day but oh nathan uh nathan's not around anymore and he's like okay well thanks for that update uh i'll see you later wow you took that to a dark place yeah that's how it goes like haven't heard from nathan in a while it's like oh yeah me either wow your examples man yeah we'll take him off the list i guess do you have enough sunshine in your life like you're in the sunshine state right oh wait no that's colorado is it colorado here well they get the most sunshine right what's the florida tagline it's the sunshine state oh is it yeah but but uh yeah it rains a ton here so i've always
Starting point is 01:58:42 had issue with that it's actually there's a home video of me being like 11 thinking I'm funny talking about how much rain Florida gets. What's the Jerry Seinfeld kind of like, just some joke about like like 330 days a year of sunshine or something like that versus Seattle that gets 30. Yeah, so I looked up the nickname for each state. Apparently, there's some official nickname. Colorado, they call the centennial state. And Florida, they call the sunshine state, and Florida they call the sunshine state even though Colorado gets more sunshine it's not fair yeah but they get the cold that you don't have to deal with so there you go yeah Florida officially adopted the nickname in 1970 but it like
Starting point is 01:59:40 apparently like one of the suggested search terms for Google is what's the real sunshine state. That's funny. So yeah, the gossip protocol. Um, so, and then we already mentioned though, they're like,
Starting point is 01:59:56 like elastic search has different roles, uh, that nodes can have. So you mentioned there was the master, there's the data, the ingestion, and then, uh, a routing node. Yep. master, there's the data, the ingestion, and then a routing node.
Starting point is 02:00:07 Yep. And from there, there's just one other thing I wanted to mention on real quickly, or hit on, which is parallel query execution. So, so far, when we've been talking about things, you know, primary indexes, looking things up by key, or secondary indexes, looking things up by these secondary data stores. Those, you know, kind of fit fit well like the NoSQL paradigm, but that's far from the only game in town. And so the book brings up massively parallel processing relational databases, MPP. And these are kind of relational databases that have really complex joining situations, filtering and aggregations and excluding this or excluding that.
Starting point is 02:00:44 And really joins is the big thing to me there. So how does that work? And that's where we got our old friend Query Optimizer, who's responsible for taking these big into stages that can be run independently. And then ultimately kind of going through and joining those stages together and filtering to get the final results. But when you think about it that way, you're like, well, hey, all this is really doing is it's taking that complex expression and generating a bunch of individual stages, just like the ones we've talked about that use primary indexes and secondary indexes or you know ultimately if those aren't available those full table scans going through every record in the partition that it's got set up well hey we just described everything we talked about so you can just kind of think about these massively parallel processing relational databases as everything we've talked about today but just up one level where it's able to break them down
Starting point is 02:01:45 into these components and then ultimately put them together. And the really cool thing there is that in addition to breaking them down, these are all independent stages, so we can run them at the same time. So if we take a query where you're joining a few things and filtering and sorting and only grabbing these three columns or whatever, it might break that down into 12 stages, and maybe the first four can be run in parallel. Then we do something with those results to bring them back together,
Starting point is 02:02:11 and then we can run the other eight stages in parallel from there and then put those two pieces of information together. So there's a lot of work going on underneath the covers of those CryptoQuery optimizers in order to do that. But ultimately, it's just doing the same kind of querying and filtering that we've spent this episode talking about, which is just really cool to think about it. Yeah.
Starting point is 02:02:31 This part at this portion of the book, they only give a brief overview of parallel query execution. And they basically say it's a really complex topic, and they do get into it in a later chapter in more detail. So, you know. Yep. Makes me wonder if there's a book just on, like, query optimizers. We talked about, like, NoSQL,
Starting point is 02:02:56 and, like, one of the advantages, we said, of relational databases is, like, you don't have to know as much about your data, its use cases, because the query optimizer is responsible for really figuring out how to get your data back in the most efficient way based on statistics and indexes that it has and so it frees you up to just focus on the the shape of the data modeling in the real world where no sql is kind of the opposite where you really have to know your use cases because you're the one kind of designing these smaller pieces so just kind of cool to think about the query optimizer as being this like a big brain with a whole bunch of, I don't know,
Starting point is 02:03:30 spider legs or something that's able to do all this work and put it all together for you. So you don't have to think about it as a human or a query developer. Well, I mean, you say that, but like, I know that we've definitely been in situations, specifically in like SQL related worlds, where you're looking at the execution plans. Just like, hey, why is my query not working as well as I want it to? And you look at those execution plans and like some of them, you can create some really hairy queries, you know, where like the execution plans are like, wait, what?
Starting point is 02:04:02 I can't even see this whole execution plan on one screen and I've got like this giant widescreen monitor and I still can't see it. Yeah, there's a couple of tricks from where sometimes SQL Server had a problem with it. Was it either with and statements or or statements where it was more efficient sometimes to do a union of two queries than it was to do like an or statement,
Starting point is 02:04:22 which seems totally crazy, but we saw it in practice a lot. Just kind of a defect. Hopefully they fixed that. So it always helps to know that you can't trust it fully, but in practice, those query optimizers do a great job about getting that stuff out of your face. Alright, well, yes, that's it for the chapter six. We are still not halfway through the book, but we hope you're enjoying it. And I know that we certainly are. It's definitely one of our favorite books that collectively between the three of us, I think we all agree that this is by far our favorite one so far. that you will agree once you give this book a read. So if you'd like to have a chance to own this book, you can head to www.codingblocks.net slash episode 172. Leave us a comment for a chance to win a copy of the book. And it'll definitely be in the show notes as one of the resources we
Starting point is 02:05:21 like for this episode. And with that, we head into my favorite portion of the show. It's... Oh, no, wait. This is Alan's favorite portion of the show. I'm sorry. Yeah. Well, you said that wrong. Oh, you're right, Alan.
Starting point is 02:05:34 Thank you. Yeah. So this is Alan's favorite portion of the show. It's the tip of the week. All right. And so I've got some terminal-related goodies here. So first, I think we've given Oh My ZShell as the tip of the week before, I believe, right? Yeah, pretty sure.
Starting point is 02:05:54 Yeah. So just a quick recap. It's basically a really nice plug-in for ZShell, which is the default in OSX and just probably the most popular shell at this point. And OhMyZShell is a really nice way of organizing and has plugins and a really nice ecosystem around and a really nice user experience. I was thinking about it as almost like a brew for your shell. So you can add like autocomplete for kubectl or Docker or maybe show your Git branch that you're in
Starting point is 02:06:24 and how many files you have difference. And just really nice things like that, just quality of life improvements for your actual terminal. And so that's a popular tool for that. Well, Bobby, a friend of Bobby, a friend of the show, had a really cool theme that he told me about called Power Level 10K. And we'll have a link here here and this thing is the best and
Starting point is 02:06:49 calling a theme is misleading because it makes you think like colors right but for me uh it brings in some really nice fonts which gives you some really cool icons which look really cool and it also uh is really good at uh supporting plugins so like um for example you know i mentioned get being able to see your get uh history and changes one of my favorite things is uh if i start typing k for kubectl which i have an alias for for kubernetes it will automatically pop up the context that i'm in and will show it so it knows i'm trying to do something in kubernetes uh cuddle does the same thing and will say hey this is the context you're in so if you would run this command this is where you're running it which is like a kind of a common
Starting point is 02:07:34 pain point and it's got really nice visual customizations too so my favorite thing about this is the reason why i had to switch is all that stuff i, like showing the git diffs and the branch you're on and the path you're in and the user you're logged into and the context. You can put all that stuff on a line above where you're typing. Man, I hate when your terminal takes up like half the screen is your username and the directory you're in. And if you put the git branch in there, it's like, you know, it's considerable size. It's a considerable portion of your terminal size, potentially. So any commands that you end up typing end up going off screen or whatever.
Starting point is 02:08:12 It's just annoying if you've got that terminal running in a smaller window. And another problem I've had with shell commands, sometimes I'll run something. And if it's a lot of output, you scroll up and it can be kind of hard to see where you ran the command if you run it more than once, you know, to basically see where the output of one command starts and the next begins. And so by having these like nice looking
Starting point is 02:08:35 this information break between commands, it makes it really easy to scroll, scroll, scroll. That's why I ran this command and the beginning of my output is for this run, which has been really nice. And it's got like a wizard that walks you through everything. And it's really easy to reconfigure. So you basically start this thing running.
Starting point is 02:08:52 It runs a command in the background called like init or something, power level init. And it'll say, hey, do you see this icon? Yes, no, no. Okay, would you like to install this font? Y for yes. Okay. Next, do you like this style? This style? Do you want this stuff on the next line? Do you like to install this font? Why for yes. Okay. Uh, next, do you like this style? This style? Do you want this stuff on the next line? Do you like, you know, do you like this
Starting point is 02:09:10 kind of connector? Do you want to show the time? Uh, things like that. And it just kind of walks through and asks it. And so it was just a breeze to set up and it was so nice for Kubernetes to be able to see your context easily. Very cool. Yeah. We've mentioned, um, oh my z shell uh what back in episode 138 and as a tip of the week and then in episode 149 when we talked about our favorite tools uh somebody i don't want to say it was alan i think that mentioned it there um so yeah yeah. Yeah, I know it's really popular. And I should mention, too, so I mentioned ZShell. That's not the only way to get this theme. I didn't realize this, but apparently you can just get it through Homebrew
Starting point is 02:09:52 or Arch Linux, or you can do a manual setup here. So it'll just set that stuff up for you in just kind of normal, you know, non-ZShell environments. Yeah, it is crazy, like, like how z shell has taken over because you know forever it was like uh it seemed like the battle was between forever that i recall like bash and corn shell were like the two you know most popular and now in like recent years especially after apple changed the default from bash to z shell um like that was the end of the war yeah oh and i do have one other one uh so in uh in windows i think i believe this happens
Starting point is 02:10:35 by default when you install vs code it adds to your path so that you can be in a directory say in a terminal and you just type code dot and it'll open up vs code in that terminal you can also that's really nice yeah you can also write so yeah it sets up for you and same with like if you go in your downloads folder say and you just want to open up a text file or something you downloaded from the internet and you just want to view it in vs code instead of going through the file open and browsing to that location if you're already there you can just do code in the file name and it's going to open it for you you couldn't do that in os in mac os because it doesn't automatically add to your path so i googled how to add vs code
Starting point is 02:11:11 to my path because i wanted to be able to just say like you know do a lot more stuff on terminal on mac i want to say like code file name and just open up the json file or whatever i want to see and you know have all the nice stuff i use code for. Well, I Googled it, and actually, the recommended way to add VS Code to your path is to do it through VS Code. So if you do the Command-P and just type, start typing, add to path, it'll go ahead and prompt you and say,
Starting point is 02:11:38 would you like to, you know, it's got one of the actions you can do is add VS Code to your path, and you just hit Enter, and it just does it. And it was so nice it was like wow i've been annoyed by this thing for so long and it was such an easy fix is it it's it's um it's shift there right like command shift p to bring up the oh yeah you're right i always forget yeah you're right yeah command shift p and then for windows i don't see it it's control shift p for windows code is so nice yeah and i believe it's command yeah command shift command p yeah i just thought it was cool but i think though if i remember right doesn't it prompt you like after a fresh install of it like as soon as you open it up for the first time it's
Starting point is 02:12:25 like hey do you want me to also add this to your path and you're like i mean unless you want to feel pain because you don't like yourself then you would press no but otherwise you know maybe i just missed it i didn't uh oh that sounds more likely because i hate to think that you needed like you know some a mental day health you know whatever pretty much any time a system asked me if i also want to do something because it makes my life better i'm like no like you want to check me no maybe we need to get you some help then we can location services no oh well i don't know like mobile apps i'm definitely like more uh know about you know what about um would you like to enable notifications for this website no no block block your ability to know my location and i don't
Starting point is 02:13:13 want your notifications and uh yes i will be better for it thank you now one more for you what about when you're using app it's like hey do you like using this app oh yeah i hate this i hate this yeah because if i say yes you're going to try and take me to the freaking app, it's like, hey, do you like using this app? Oh, yeah. I hate this. I hate this. Yeah, because if I say yes, you're going to try and take me to the freaking app store. I'm like, nope. I hate your app. I hate it. No, I always view it the other way around.
Starting point is 02:13:33 Really? Oh, you say no, they ask for feedback? Yeah, they want to know what you didn't like about it so they can correct it. If you do like it, they want you to leave a review so that they can get the five star. But if you don't like it, then they want to know why. How do they know that you left a review? Do they know? I feel like if I say yes and finally go review and you should review things because it's really awesome and helps them out a lot.
Starting point is 02:13:56 If you leave review, crap. How do they know? And do they keep prompting you to leave review even though you've left review i don't know i guess maybe that's if you say yes and you're like would you would you like to go to the app store to review us you just say yes and then you do it because it's really helpful and then they don't bother you anymore or maybe you say yes and you don't bother to like actually leave the review just go back to the home screen, and then they've already written a bit to their log to say, no, he left the review. Yeah.
Starting point is 02:14:29 But really, he didn't because you were a jerk. But I would do that. But maybe we're biased. I don't know. All right. So for the tip of the week, I got a couple of fun ones here for you. So one, this was mentioned a long time ago. Uh, I don't even remember who mentioned it. So I'm sorry, I can't give the
Starting point is 02:14:50 credit where credit's due, but this was mentioned in our Slack a long time ago. So, you know, as a parent, uh, you, you want to, to teach your kids things, right? And so this is a way to teach your kids all about the joys of Kafka as a bedtime story. And it's called Gently Down the Stream. And it's a website that if you go look at it, it's a very well-made website. And you can just flip through it and teach your kids all about Kafka and how it works and the joys of Kafka and whatnot. It really is written as a children's book with little otters and they're swimming and the trees in the background and everything. It's quite impressive that this person put this together for his girls. Yeah.
Starting point is 02:15:49 So there's your first fun one. And then another fun one that was making its way around the interwebs that I also saw on our Slack. So if you're not already on our Slack, by the way, you should definitely join our Slack. So I think you can go to www.codingblocks.net slash Slack, or even at the top of the page, there's a link for join our Slack. But this one was making its way around the interwebs for the article's name is Lesser Known Postgres Features. And there's like all sorts of things in here there's like oh uh you know things that you didn't know that you could do with postgres like
Starting point is 02:16:30 find overlapping ranges or generate a unique id without an extension or um keep a separate file history per database or uh multi-line quoting in, in your SQL statements, things like that. Like there's a whole slew of, there's, there's like probably a dozen and a half or two dozen different things here of things that you could do.
Starting point is 02:16:56 Like one of them was temporary views. Uh, so if you wanted to have like maybe in your, your, uh, procedure or query, you want to do something and you want the view they are available so that you could reuse it over and over and over um and maybe a view would work better for your needs rather than a cte but at the same time
Starting point is 02:17:17 if your transaction fails you don't want to leave any like real schema around. So, you know, temporary view. So there's a whole bunch of them in there. Um, ways to pivot, uh, to, to generate, to produce pivot tables, uh, how to prevent setting an auto-generated key, um, how to grant permissions on specific columns or find the value of a sequence without advancing it, uh, all sorts of great features there. So, um, we'll have links to both of those things in the show notes. And with that, uh, if you haven't already subscribed to us, you can find, um, us on iTunes, Spotify, Stitcher, wherever you like to find your podcasts. We, we certainly hope that we're there. If we're not, um, I'm not sure how you found us,
Starting point is 02:18:06 but Hey, let us know when we will figure that out. And, uh, like, like I asked earlier, um, if you dear listener, if you would please,
Starting point is 02:18:16 uh, find it in your heart to head to www.codenblocks.net slash review. Um, every time the, the late night, uh, DJ voices, it just cracked me up. dot net slash review. Every time the late night DJ voices just crack me up. Anyway, we would greatly appreciate it if you would leave us a review if you haven't already. All right. Thank you, Morgan Freeman.
Starting point is 02:18:37 And while you're at it, codingblocks.net. Check out our show notes, examples, discussion, a whole lot more. And feedback, questions, rants can be sent to our Slack and you can get there from codingblocks.net slash Slack. It's easy to go ahead and just click a link to join. And you can follow us on Twitter at codingblocks or head over to codingblocks.net and find all our clippies
Starting point is 02:18:55 or what did I call them? Oh, dillies. Dillies. Yeah, see, I'm so old. Find all our dillies at the top of the page.
