Coding Blocks - Designing Data-Intensive Applications – To B-Tree or not to B-Tree

Episode Date: April 13, 2020

We dig into the details of how databases use B-trees as we continue our discussion of Designing Data-Intensive Applications while Michael's description of median is awful, live streaming isn't for All...en, and Joe really wants to bring us back from the break.

Transcript
Starting point is 00:00:00 You're listening to Coding Blocks, episode 130. Subscribe to us and leave us a review on iTunes, Spotify, Stitcher, and more using your favorite podcast app. More and discussion, examples, notes, show, find can you where? CodingBlocks.net! Alright Yoda, send your feedback, questions, and rants to comments at CodingBlocks.net. Follow us on Twitter at CodingBlocks or head to www.codingblocks.net and find all our social links there at the top of the page, as well as links to other podcasts, aggregator things, and all that. I'm Alan Underwood.
Starting point is 00:00:36 I'm Joe Zack. I'm betting that Joe couldn't do that again if he tried. I was pretty fast at it though, right? Yeah. It was impressive. You practiced it, it sounded like. And I'm Michael Outlaw. This episode is sponsored by Datadog, a cloud-scale monitoring and analytics platform that unifies metrics, traces, and logs so you
Starting point is 00:00:56 can identify and resolve performance issues quickly. And tonight we're continuing on with chapter three, aiming to do a couple paragraphs here. So, you know, sit tight, it's gonna be a long one. I just caught that; I'm feeling spunky tonight. Uh, so last episode... last time we talked about this book, we talked about LSM trees, um, which, I forgot what it stood for... the log-structured merge... Log-structured merge trees, yeah. And tonight we're talking about B-trees, which is a more common format used by a lot of databases. But first, my favorite portion of the show.
Starting point is 00:01:36 We also talked about string-sorted tables, too. Sorted string tables. SS tables. Yeah, we did. Yes, we did. Because those are going to come back up. Yep. We're not done with those. So as we always like to do first,
Starting point is 00:01:45 we want to thank everybody that took the time, especially during this rough time where most everybody's stuck in their house, you know, for who knows how long, uh, for taking the time to write us a review. So we have on iTunes, we have a NEP 79,
Starting point is 00:02:00 Jacoby Boy 23, or Jacob Boy 23, Loaded Galaxy, Gen T, avid listener. Wouldn't it be Jacob Boyd? Jacob Boyd. Jacob Boyd 23. Good catch.
Starting point is 00:02:13 All right. Third time's the charm. I'm guessing he's a Jordan fan would be my guess. Maybe. Or maybe he's 23. Yeah. You don't know anymore. That's way too old for my taste.
Starting point is 00:02:28 I was going to say, we've moved past the era when everybody knew Jordan's number. I believe. Yeah. Where everything was 23, it instantly meant Michael Jordan. Yeah,
Starting point is 00:02:38 exactly. And you're not even a sports ball fan, and you knew that. So, I mean, that's just how ingrained it was in the culture. Yeah, exactly. The shoes. Yep.
Starting point is 00:02:45 Yep. Did you really, you had Jordans? Uh, yeah, Bob, back before the pumps came out, those are the best men.
Starting point is 00:02:54 The ones that always blew out. Yeah, man. Uh, all right. So, so the very first piece of news here, and you could thank Michael for this,
Starting point is 00:03:04 cause this was going to be my tip of the week and he stole it out from under me. Well, I mean, don't say it like that. Yeah, it totally is. I'm a little hurt by it. No, but in all seriousness, this is fantastic news for those who haven't heard about it because we've referenced this particular thing so many times. Pluralsight is free for the month of April. So if you want to learn and you want to just binge watch and cram as much information as you can into your noggin, now's a good time to do that. You can find, you can just go to codingblocks.net slash Pluralsight. It'll take you there
Starting point is 00:03:37 and yeah, sign up, get, get some, uh, get some free learning. Yeah. And just to add on to that, I mean, we're talking about 7,000 plus courses that are available. No credit card required to sign up for this. There's no watch limits. I mean, if you could devour 7,000 plus videos in the month of April, then more power to you. Yeah, this is legit learning time. Like this wouldn't come along very often. I mean, it's kind of generous. So yeah, take advantage of it. Do you want to talk about TechSmith? Yeah, I mean, I guess I threw this one in here a while back. So TechSmith, I think we mentioned on the past two episodes, they are also trying to help
Starting point is 00:04:24 during this time when everybody's, you know, it's a little bit harder for people that haven't worked from home. So they have things like Snagit for free until I think through June and some other tools. So we'll have a link in the show notes for that. So definitely go check that out. Hey, sorry. Go ahead. I was going to say, if you know of any other free resources, you should tweet at us and we'd love to retweet it and share that out. Yeah, totally.
Starting point is 00:04:50 And we'll add it to the next show. I mean, we love to help out wherever we can. And I think these would be big. And I need to silence my phone. Uh-oh. All right. This next item looks like it has Al's name all over it. Oh, yeah.
Starting point is 00:05:06 I mean, okay. Cause I actually got called out on this in Slack as well. And it's like, you know, rightfully so. So could there ever be a better case of foot-in-mouth disease, where as soon as you say something, you've got to put your foot in your mouth, cause you're like, oh man, something bit me? Right. And, you know, last episode we were, me especially, raving about Zoom:
Starting point is 00:05:32 Zoom, don't change a thing. Everything you've done is perfect. You know, you can't... don't, don't add any new features. Just leave it as is, cause you got it right the first time. Apparently not. Because immediately after that episode aired,
Starting point is 00:05:51 within like the first 24, 48 hours, there were like story after story after story about like Zoom's popularity and people started putting it under a microscope and finding problems in it. And it's just like, oh, man, come on. Like, really? Dude, this first link on CNET that I grabbed here, it's so long, man. Like, there's so much stuff. And you're like, oh, good Lord.
Starting point is 00:06:18 So here's the thing. Do we take back recommending Zoom? We're using it right now. Yeah, that's what I was going to say, we're on it right now, like, legit. I would have a hard time... I still can't, man. No, I can't take it away, even though I know I should. Okay, look, look, look. I'm not gonna stop using it, but maybe you shouldn't start using it. How's that? It's still so awesome for screen sharing and pair programming remotely.
Starting point is 00:06:56 I still love it for that. But we also know why it's so good now: because you avoid the latency hit on all that encryption stuff, right? Like, ah, that takes too much horsepower, so let's not do that. I mean, look, here's one thing I can say about this: anytime anything like this surfaces and a company doesn't just completely fold, they almost certainly get way better, because they have no option, right? Like, because like you said, they've been put under the microscope. So the chances are, assuming that they're taking this seriously, which it looks like they are, things will turn around for them, and they'll probably be on par, hopefully, with most of the
Starting point is 00:07:40 other companies out there. So. Yeah. And if someone else joins this call later, you'll know what happened. Right, if we get Zoom-bombed. But, you know, you just said they'll get to parity with, like, the other ones, and we don't want that. We want them to be better. Right, right.
Starting point is 00:07:58 That was the problem, is that the performance was always so much better. But like I said, now we know why: because there were things that they were like, ah, we don't have to do that security stuff. Man, it's so rough. We'll have some links; there's a CNET link, uh, that Alan referenced. There's also, if you listen to Security Now, their latest one, which was titled 'Zoom Go Boom,' in which, uh, similar to your CNET article, Steve Gibson, like, iterates through all of the different issues that have been found recently with it.
Starting point is 00:08:28 It's just like, yeah, can I take back that previous episode? Yeah, like he said, don't start using it. We're probably not going to quit, but you probably shouldn't start. It stinks, man. So that actually brought up another thing, right? So we talked about several of the other ones out there, and we had tried join.me back in the day. I don't remember why we ended up not going with it. I honestly can't remember.
Starting point is 00:08:54 So there's one that's been coming up a lot now lately from Jitsi.org. That's J-I-T-S-I.org. And it's a free open source one that you can basically just use and implement on your own. Who knows if it's good? Who knows if it's secure? Like at this point, we're,
Starting point is 00:09:13 we're a little gun-shy on recommending a good video conferencing software. But, you know, I so badly want to joke and be like, not me, use Zoom! But then I'm like, I can't. Yeah, yeah. I'm just gonna pretend I haven't heard anything about Zoom yet. Hopefully they fix it. Yeah, exactly, that's it. Just make sure that you're not doing any cloud recordings of stuff that you don't want out there somewhere, right? That might be a bad thing.
Starting point is 00:09:48 Alright, now, last item here: I've been doing a lot of streaming. I've kind of gone in stealth mode a little bit. I haven't really talked about it much or tweeted too much about it, but I've been doing some live coding on YouTube and a little bit of Twitch. I'm still trying to decide which I like better, but the gist there is that
Starting point is 00:10:03 I've been doing a mix of coding challenges, some project work on QIT, and also on just some kind of like learning type stuff, learning public type stuff I've been doing with like Apache Beam and it's going to be other type things. So I'm looking at mixing those three things and seeing what works and what I like doing. But I think
Starting point is 00:10:19 I've done it like 10 days in a row now and I'm going to try and keep it all up for the month of April at least to see how it's going. So if that's something you're interested in, you know, think about subscribing to one of those things and stopping by and dropping in, saying hi or whatever. And if you're not feeling it, that's fine. You can leave. Hey, let's actually take this a step further because we forgot to put it in the show notes. So for this particular episode, if you want a chance at winning designing data intensive applications, in the show notes for this page, which will be
Starting point is 00:10:51 codingblocks.net slash episode 130, leave us a comment and tell us which you would actually prefer to watch on Twitch, YouTube, because that's one of the things that Joe's actually been looking at is like, I don't really know where I should focus my effort. So it'd be kind of nice to know if anybody out there does watch any of the streaming stuff. Like, where do you do it and why? Like, what is it about the platform that you like? You know, drop us a note. It would it would help us out.
Starting point is 00:11:19 Focus our efforts. We got to come up with a way to virtually kick Joe in the shins, though. Yeah, we do. Maybe we create like a, you know, we'll have like a little Raspberry Pi listening on the internet and we'll have some little motor connected to it that sits under his desk. And, you know, when somebody clicks it, it like kicks him. I've been looking for a hardware project. I'll build. There you go.
Starting point is 00:11:41 There you go. There we go. A little motor. I kick myself plenty on the stream, especially with the exact technologies where I spend forever. What was it? The other night, I spent an hour and a half trying to do something. In like 45 minutes of it, it ended up being the problem wasn't anything that I had done. It ended up being a bug in the framework, which like never happens.
Starting point is 00:12:01 It was literally a problem with the example code in that particular version on Windows. It never happens, and believe me, I tried lots of things trying to track this down; I looked for what I could have done wrong, or should have done differently, or steps I'd kind of missed or misread, whatever. But yeah, it's so frustrating. Finally, I just Googled; I was like, I'm just gonna Google this stupid generic error message, and I immediately found the issue. I was like, oh yeah, just back the version down. Anyway. Wait, you didn't start with Google? Yeah, I don't know.
Starting point is 00:12:31 I do it. That's like usually the first thing I do is Google the error message, but I don't know. I didn't this time and I should have. That hurts. So lesson learned. Yeah. Google. Yep. All right, now Google is talking to me. Somebody's responding. That's my tip of the week.
Starting point is 00:12:46 You should Google error messages when you see them. Good one. So, yeah, on to this particular episode, and this is where we talk about B-trees, yet another indexing strategy for storing data. Hey, before we dive into that, can I ask what you mean by indexing? Yeah. So everything that we've talked about at this point, at least in this chapter, has been
Starting point is 00:13:16 about how you store data so that you can look it up fast. So indexing, right? Like, I think we even used the analogy of the yellow pages or something, right? You go back to the... it's called the index, right. You go back to the index, you look up words, usually alphabetically, like in the yellow pages, and then it'll tell you the page to go to. It's basically the same thing, except you're storing it on disk that way. OK, cool. So when we talk about B-trees and we're saying this is the most commonly used indexing structure, we're saying that this is the most common way that data storage engines are kind of writing data so that it can look things up quickly.
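To make that "yellow pages" idea concrete, here is a minimal sketch in Python. Everything in it (the keys, the byte offsets, the helper names) is invented for illustration; the point is just that an index is sorted keys plus pointers to where the data actually lives, which is also why the range lookups discussed a little further on are cheap.

```python
# A toy index: sorted keys plus byte offsets into an imaginary data file.
# All names and numbers here are invented for illustration.
import bisect

keys = ["alpha", "bravo", "charlie", "delta", "echo"]   # kept sorted
offsets = [0, 512, 1024, 4096, 8192]                    # where each record lives

def lookup(key):
    """Binary-search the keys, then jump straight to the record's offset."""
    i = bisect.bisect_left(keys, key)
    if i < len(keys) and keys[i] == key:
        return offsets[i]
    return None

def range_lookup(lo, hi):
    """Because the keys are sorted, a range is just: find the start, scan forward."""
    i = bisect.bisect_left(keys, lo)
    j = bisect.bisect_right(keys, hi)
    return list(zip(keys[i:j], offsets[i:j]))

print(lookup("charlie"))                 # 1024
print(range_lookup("bravo", "delta"))    # [('bravo', 512), ('charlie', 1024), ('delta', 4096)]
```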
Starting point is 00:13:53 Yep, that's it. And here's the crazy part. This thing was introduced, I think, before all of us were born. Yes? I'm going to say yes, definitely. I don't care what the answer is. So in 1990... no, this particular B-tree pattern was introduced in 1970, and they said that it was basically everywhere 10 years later. Yeah. So it kind of took off, and yeah, there were a whole lot less everywheres in 1970,
Starting point is 00:14:27 and there's a whole lot more everywheres now. But I think we're still in the case where B-trees are kind of leading the charge. And B-trees are the fundamental indexing strategy used in relational databases. And that's a big part of why it's so common, because those things are still... those dinosaurs are still ruling the earth. I am going to be curious, as we'll find out a little bit later in this chapter, if that continues to be the thing with the differences between SSDs and spinning disks. And we'll cover some of that in a little bit.
Starting point is 00:14:57 Well, I mean, they do make references to MySQL and SQL Server in this chapter, in this portion of the chapter specifically. So, I mean, I definitely got the impression that it was still a thing. Oh, it's, yeah, SQL Server, that's what they've been using for years. Yeah, but we'll get into the other question I have here in a minute. Oh, okay. Yeah. So, one of the things that they say here is much like the SS table that we described in the episode before last,
Starting point is 00:15:31 these B-trees also store key value pairs sorted by the key. And this makes these range lookups fast, right? Like, hey, give me 500. Well, is it between 0 and 100? No. Between 200 and 300? No. So it can
Starting point is 00:15:45 go find them pretty quick. What do you mean by review? Oh, so one of the things that I put here is just review from the last one: these log-structured indexes that we have, they store data in variable-size segments. So when we were talking about the LSM trees, you know, the way it would work is, as data was written, it would basically chunk it off at some point and then maybe merge it into a new file, maybe leave that other one over there if there weren't changes. But there was no fixed file size, which is the important part here. One could be one meg, one could be 10 megs. It didn't matter. It's just basically where they started chopping them up, if you remember. Also, um, those segments were immutable, so if you had an update to make, or a delete, you would still append those changes, and then you would rely on this thing called compaction to go
Starting point is 00:16:37 through and kind of clean up and get rid of the old stuff. And so it wasn't until that process ran that you actually freed up that disk space. Which means, if you're doing a lot of updates, you're taking up a lot more storage space that doesn't get cleaned up till later. So you still get the most recent copy of the data
Starting point is 00:16:57 when you query for it, because the LSM tree knows how to look up and find the most recent data. But it does kind of inflate your storage temporarily. Right. So... but I want to make sure I understand what you're saying, though, because even with the B-tree, you could have one of those pages that's not yet the same size as another page. No, I don't believe that's absolutely true. Yeah, because you might have more in one, because they said that where you split,
Starting point is 00:17:31 okay, like if you had to split... we're definitely getting into an implementation detail of the B-tree that we haven't gotten to yet, but just when we get to this part, it'll make sense. Because you're splitting on a median, you might not have... like, if you were going to split, and the median was, say, 250, and you only had three values that were less than 250, but you had 10 that were greater than 250, that means on one side of your tree, you only have three values.
Starting point is 00:18:02 And on the other side, you have 10. So I think the deal is there is that, um, you end up with some fragmentation, because you've got like this empty space that you can't fill up. And so, um, the page size isn't variable; like, I think you have to declare that up front, right? But, uh, potentially you've got like this kind of dead space that just isn't ever going to get used, until you do something that actually cleans that fragmentation up by kind of balancing. So then what we're going to say is, the way that the B-tree is used as it relates to a database engine is, it's going to be a fixed page size for the,
Starting point is 00:18:39 for what's going to be in that, that node of the, of the tree, but it might not be used entirely. Correct. And when it does get to be used entirely, that's when you're going to do the split for that particular node. And it might, it might result in two nodes that aren't completely used.
Starting point is 00:19:01 Well, I guess by definition they would have to not be completely used. When you go past the fill state of it. Yeah, we'll get to it here in a minute. So basically one of the things that you just covered here, that's the important difference between the B-tree and the LSMs that we talked about previously is the LSMs used, you know, variable size segments. The B-trees use fixed block sizes and they're called pages. And one of the keys that they did with this is they typically made them 4K in size, which mapped to disk sectors almost one-to-one, right? And when we're talking about disk sectors,
Starting point is 00:19:34 and this is why I said, I'm curious where this is going to go in the future with SSDs being so prevalent now and becoming cheaper for the enterprise, is this was for spinning disks, right? Spinning disks have block sizes of 4K, and so those page sizes match those segments on disk, basically one-to-one. And they even said, when you consider these writes, you can basically think of it as a hardware write, as opposed to something that's happening in software.
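Here's a rough sketch of what that fixed-size page means in practice, assuming an invented header layout (real engines like SQL Server, InnoDB, and Postgres each define their own page formats, so treat the field names and sizes here as placeholders):

```python
# Hedged sketch of a fixed-size page: every page is exactly PAGE_SIZE bytes
# on disk, whether or not it's full. The header layout is made up.
import struct

PAGE_SIZE = 4096
HEADER = struct.Struct("<HH")      # (num_entries, flags) - invented header
ENTRY = struct.Struct("<qq")       # (key, child_page_number) pairs

MAX_ENTRIES = (PAGE_SIZE - HEADER.size) // ENTRY.size  # the fan-out ceiling

def serialize_page(entries):
    """Pack (key, child) pairs into one fixed 4 KB block, padding the rest."""
    assert len(entries) <= MAX_ENTRIES, "page full - this is when you'd split"
    buf = bytearray(PAGE_SIZE)     # the unused tail stays zeroed (the 'dead space')
    HEADER.pack_into(buf, 0, len(entries), 0)
    pos = HEADER.size
    for key, child in entries:
        ENTRY.pack_into(buf, pos, key, child)
        pos += ENTRY.size
    return bytes(buf)

page = serialize_page([(100, 7), (200, 8), (300, 9)])
print(len(page), MAX_ENTRIES)      # always 4096 bytes; ~255 entries max here
```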
Starting point is 00:20:00 But that's the thing that I'm trying to make sure that we agree on here, right? I mean, a B-tree is a data structure, and it doesn't care about this 4K page size; that's irrelevant to the B-tree, the B-tree doesn't care about that kind of thing. But as it's used specifically by storage engines, that's where this fixed block size of 4K comes in. So as they're being used in this particular implementation, then they care about this fixed block size. Yeah, I think that's probably the right way to say that. I guess when we're speaking about it, we're talking about database implementations specifically, the SQL Servers, the MySQLs, those. And, you know, we didn't call this out yet, but we have talked about B-trees before.
Starting point is 00:20:45 Well, we haven't called it out in this particular episode, but we have talked about it back in episode 97, I believe it was. And, you know, we covered the details about what they are, how they work, but also there were some like visualizations and things like that. So I'll have some links to those again, to those resources again, that you can use to follow along with this. And what's kind of interesting is I'm, I think this is cool anyway. Like these databases really want to like, just own a big chunk of storage space and they want to kind of have that
Starting point is 00:21:20 close interaction with the file system, or the disk. And so they don't really go through the normal file system. Like, you're not able to go to like a SQL Server and browse on the disk to find your row. You know, that doesn't make sense. It's not some file that kind of maps... there's no file in there of your table. Right. And that's, uh,
Starting point is 00:21:39 because of this kind of the way they, they allocate these, these block sizes and work directly with the disk at a very low level, you get a lot of efficiency from that. It's just a different way of working. It's almost like its own file system implementation. The block storage is different from typical file systems that you think of
Starting point is 00:21:58 and it's also different from object storage in a cloudy system. I just thought it was kind of cool to see that the B-Tree has worked so well directly with the disks. The hardware itself, right? Yeah. Yeah.
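A hedged sketch of that "own a big chunk of storage" idea: one preallocated file treated as a flat array of fixed-size pages, with nothing resembling a per-table file. The file name and sizes are made up, and real engines may also bypass the OS cache entirely (for example with O_DIRECT on Linux), which this sketch skips for portability.

```python
# One preallocated file treated as a flat array of fixed-size pages - the
# storage engine's own little 'file system'. Names and sizes are invented.
import os

PAGE_SIZE = 4096
NUM_PAGES = 1024                          # ~4 MB claimed up front

fd = os.open("engine.db", os.O_RDWR | os.O_CREAT)
os.ftruncate(fd, PAGE_SIZE * NUM_PAGES)   # own the whole chunk of disk at once

def page_offset(page_no):
    """There is no 'file per table' - just page number -> byte offset."""
    return page_no * PAGE_SIZE

print(page_offset(42))                    # 172032: where page 42 lives
os.close(fd)
```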
Starting point is 00:22:13 Or spinning disks, we should say. Spinning disks, right. And that's, yeah, again, there's some notes further on about that. Yeah, I don't know. I don't know how it's going to change in the future. But, um, so one of the key parts of this is every page has an address that can be used to point to other pages, right? So if you have keys between 100 and 200, then there's going to be a reference to another page. And that might be a reference to 100 to 150,
Starting point is 00:22:49 and that might be a reference to another page that's 100 to 125, and eventually you'll get all the way down to where you actually have data. But it's basically pointers to other segments of the disk is what you're looking at. Yeah, that's good. What that means is the efficiency for that kind of lookup is largely dependent on how many hops you have to do. If you have to go through 100 hops, then that's bad. And like SQL Server... so if you're, you know, looking everywhere, uh, in a lot of different places for your data because it's not been indexed properly, you don't have like an index, you haven't tuned those indexes for the thing you're looking up, then you have to load a lot of pages and look for that stuff. But if you do have a good index on that thing, then you can do a minimum of
Starting point is 00:23:43 hops and go find that thing really quickly, like with a primary key lookup or something that's highly efficient. That's actually a really good point. So what he's talking about specifically is SQL Server, and what he's talking about is if you've never done this, if you go into
Starting point is 00:24:00 SQL Server Management Studio, up in the little row of buttons there, you'll see like show query execution plan. If you turn that on, when you run your query down in the bottom, it'll spit out a graph. And as you mouse over these things, it'll say something like table scan or index scan or index seek or something like that. If you see a scan, it usually means that it was super inefficient, right? Like it's going through every record until it finds it. If you see seek, it's actually going straight to where it knows it needs to go to because you
Starting point is 00:24:30 have an index in place that manages that well. So useful to know if you haven't ever done that, it's always a good interview question. If you're going somewhere where you're dealing with a lot of SQL server stuff or, yeah, I mean, I mean seek versus scan would be good to know even if it's not necessarily specific to sql server yeah just to understand like the difference of what's happening there yep yeah that's true so if you've got some metadata that helps you find that information
Starting point is 00:24:56 or even helps you narrow down, say, it's on some set of these pages, it's going to be in the first 10 of these pages or whatever, then that can be really helpful. You can keep some statistics about your data in order to find stuff faster. And the primary, the way we kind of store this, is basically we have a root page, which is where all the key searches are going to start. And I want someone else to describe it, because I have a problem with, like, visualization-type things. But who could do this? I probably could. I mean, it's just sort of split up into segments. So I think that root page is almost like the range keeper for basically the entire data set, right? So let's say
Starting point is 00:25:39 that you have between one and a million items it's going to try and break up that root segment into ranges that will fill that out, right? So maybe if you're in one to a million, it might have a segment that's from one to 10,000. And it's going to point to some space on disk over here. From 10,001 to 20,000, right? It's going to point to some other segment over here. And it's going to keep doing that until it fills out its index. But basically all it is, is ranges with pointers to the next page where you can either go narrow down the range a little bit more or to where it ends up going
Starting point is 00:26:17 further down to the data. Yes. If you're looking for key 700,053, you can go through that thing and say, well, it's not between zero and 10,000. It's not between 10,000 and 20,000. It's not between blah, blah, blah. Okay, here we go. It's going to be on this page, which is between 690,000 and 715,000, because they don't have to be equal sizes. It's up to the tree to kind of figure that out.
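Here is that exact walk as a sketch, with a tiny hand-built tree whose ranges match the numbers in the example. The node layout (dicts with "keys"/"children" for interior pages and "data" for leaves) is invented for illustration:

```python
# Tracing the lookup of key 700,053: descend from the root, following the
# range pointer at each interior page until a leaf is reached.
import bisect

leaf = {"data": {700_053: "row bytes for 700,053"}}
level2 = {"keys": [690_000, 715_000], "children": [{"data": {}}, leaf, {"data": {}}]}
root = {"keys": [500_000], "children": [{"data": {}}, level2]}

def search(node, key):
    hops = 0
    while "keys" in node:                  # still on an interior page
        i = bisect.bisect_right(node["keys"], key)
        node = node["children"][i]         # follow the range pointer
        hops += 1                          # each hop = one page read from disk
    return node["data"].get(key), hops

value, hops = search(root, 700_053)
print(value, f"- found in {hops} page reads")
```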
Starting point is 00:26:42 And so you can narrow down and then you can kind of drill into there, load up the next page and do the same thing until finally you get down to the very bottom and you actually find your data. Yeah. We haven't mentioned anything about, you know, related to the B tree with the, um,
Starting point is 00:26:57 uh, what do they call it? The, the branching factor is how they referred to it in this book. I think in the past we've referred to it as the max degrees of, uh, factor is how they referred to it in this book. I think in the past, we've referred to it as the max degrees of what you would have in a particular node before you would need to reshuffle the tree. Right. So in the example that Alan gave, I forget how many you said, Alan, that you might have in that first node. But, you know, if you said 10,000, then that's your, that's your branching factors.
Starting point is 00:27:26 You're going to have 10,000 things in there, 10,000 references, before you're going to split that off into a new node that then might cause the whole tree to reshuffle. And if you recall, if we were to just talk about these from a number point of view, how would you do that? Like, you used an example from the book where they have, uh, you know, 100 to 500. And then if you were less than 100, you pointed to this page; between 100 and 200, you pointed to this other page. Um, but when you get to that bottom page and you wanted to insert something into it again... uh, how am I saying this? You're jumping way ahead here. Oh, I am? Yeah, but
Starting point is 00:28:12 just to kind of... I think I understand what you're saying. Basically, the branching factor deals with how wide that page can go, like how many references it can have. And then, obviously, the wider each page can go, the less deep the whole tree is, right? Which is, you can fit more kind of horizontally. Right. So if you can cram a bunch of ranges into a single node, then when it goes down to the next node, you want to try and cram a bunch of ranges there too. And that way you don't have to go down two or three more nodes deep, because as Joe said earlier, the more hops you have to make from the root node to the next node, to the next node, that's time, right? That's reads to the disk and
Starting point is 00:29:01 all that. To where, if you can minimize the number of hops and get closer to the data faster, the better off you are. And that's... so the branching is going wide. That reduces the depth of the tree. Whereas, you know, if you fill up those nodes, then you have to go deeper. Well, there is a trade-off there. Oh, go ahead, Outlaw. Well, I was going to say, just to tie back into what we had said in episode 97, most operations like search, insert, delete, min, max, those would require O(h) disk accesses, where h is the height of the tree. So that's why you want this branching factor, to have this wide tree. Yep. So why not just have a huge branching factor, maybe a branching factor of a million? And the deal is, the branching factor is directly tied to the space you need to store a single page.
Starting point is 00:29:58 So sure, you can have a ton of them. But that means that whenever you load up a page now, if you have a super wide one, maybe you're loading up megabyte pages or something. That's not going to be really efficient, and it'll eat up more disk space, if you're only trying to find a single record in there. Well, there was also another strategy that was talked about here, though, and we also covered this in episode 97, which was the B+ tree, where, you know, to date, what we've talked about in this node of the tree is you might have, you know, if you were doing like zero to 500, then you might have like 100, 200, 300, 400. Whereas in the B+ tree, you would store just enough data about those
Starting point is 00:30:40 keys to know how to split it. So you might only store 1, 2, 3, 4, and 5, instead of 100, 200, 300, which would mean, because you're storing a smaller value, you could cram more into it. So you're able to better take advantage of that 4K block size, or page size, than if you, you know, didn't use this compressed version of your keys.
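A simplified sketch of that key-abbreviation trick, assuming string keys (the spoken example used numbers, but the same idea is usually described as prefix truncation on byte strings). The function name and the sample keys are invented:

```python
# Interior pages don't need full keys, just separators short enough to route
# a search between two neighboring leaves. A simplified take on the idea.
def shortest_separator(left_max, right_min):
    """Shortest prefix s of right_min with left_max < s <= right_min."""
    for n in range(1, len(right_min) + 1):
        candidate = right_min[:n]
        if candidate > left_max:
            return candidate
    return right_min

# The full boundary keys might be "hammersmith" / "hammond", but the parent
# page only needs to store a much shorter separator between them:
print(shortest_separator("hammersmith", "hammond"))  # "hammo"
```

Shorter separators mean more of them fit in one fixed-size page, which is exactly the fan-out win being described here.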
Starting point is 00:31:13 Yep. So yeah, some of the important things to note here is what we've talked about first was the ranges, right? Like this, this particular page is going to have ranges on it with references to other pages. Those other pages might be another index with more ranges. When you get down to where the actual
Starting point is 00:31:33 data is stored for your key, so your value for your key, that's called the leaf node. So everything traversing down to that thing is just a page that's going to have more index references on it. But as soon as you get down to that last one, where the data is getting picked up and used, that's your leaf node. Yeah, I wanted to mention too, so when you mentioned the branching factor, I thought it was interesting: the book mentioned that it's common to have a branching factor of several hundred, but I actually found an article about Postgres talking about it being
Starting point is 00:32:01 common to have several thousand. And so I was kind of wondering if that's related to the actual width of the rows in the table that you're looking at. I don't know if that's true or not. Like, if you've got bigger tables, bigger, wider rows, then maybe it... uh, no, yeah, it doesn't matter. It'd be based on the keys, on the number of keys that you would have in that, uh, in the table.
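That key-width question is also the quick math behind those fan-out numbers. A back-of-the-envelope sketch, with assumed 8-byte keys and 8-byte child pointers (real key widths vary, which is exactly why the branching factor does too):

```python
# Why fan-out is tied to page size: a page can only hold as many
# (key, child-pointer) ranges as fit in its bytes. Sizes are assumptions.
PAGE_SIZE = 4096
KEY_BYTES = 8          # e.g. a bigint key
POINTER_BYTES = 8      # child page number

fan_out = PAGE_SIZE // (KEY_BYTES + POINTER_BYTES)
print(fan_out)         # 256 ranges per page - the "several hundred" ballpark
```

Shrink the keys (or grow the page) and you climb toward the "several thousand" Postgres figure.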
Starting point is 00:32:24 That is totally right. You're totally right about that. And this is, of course, assuming that we're talking about, like, um, you know, a single key value, not a composite key, right? Because that's where that width could come into effect. But it still wouldn't be like every column, necessarily. Yeah, you're totally right. I don't know what I was thinking; I was thinking about the values. But yeah, the values don't even have to be stored in the tree. There could just be a pointer to those values stored some other place, in memory or whatever. Well, where you might be thinking about this, though, is if we were to talk about clustered versus non-clustered indexes, right?
Starting point is 00:32:57 Because right now, I mean, the whole conversation here is about how these storage systems might use a B-tree for an index. And if it was a clustered index, then the clustered index would have everything that it needed to answer a query in that single index. Whereas a non-clustered index would have just the bits necessary to point to, hey, this is where you go find this thing. Right. And you go to some other source. Clustered index is also kind of closely associated with how,
Starting point is 00:33:30 uh, like the order in which stuff is actually stored on disk, too. So if you kind of insert something in the middle, it'll actually kind of juggle a little bit to keep stuff, uh, sorted appropriately. It will. Well,
Starting point is 00:33:41 that's... but the key difference is what Outlaw said. The clustered index is essentially your table that's being stored, right? Whereas non-clustered indexes are going to have to keep that index up to date, so it has to know where to point things. So it is going to sort that stuff appropriately as well. And that's part of what I was trying to describe, where the shuffling would occur in the B-tree. And it's going to be, like, so super important to go and look at that visualization that we referenced in episode 97. And I'll put a link here for you guys, if you don't have access to it, but where you could
Starting point is 00:34:30 see how the B-tree will reshuffle. To where, like, when it gets down to the bottom nodes, if the branching factor is reached there, then it'll split that, but that will move a key up to the parent node. And if that causes the parent to reach its branching factor, it'll split again, and it'll go back up the tree. So what you can have happen is, something that happened in a leaf node can cause writes all the way up to the root node, the way the B-tree would work. Which, I don't know if we were going to get to this part, with the write-ahead logs? Yeah, we do. Okay. I can save that then, or not. Yep. So one of the next things we have in here
Starting point is 00:35:22 is updating the value in a B-tree. What does that actually mean in terms of databases? So first, you're going to do a search to find that key, and it's in the pages, right? Once you find that key on a leaf node in the pages, when you update that value, basically, you're just overwriting the entire block, which is kind of interesting, right? It's not like you're just updating a little chunk in a file. You take that block, you modify it, then you write the entire thing back to disk. Yeah, and you can imagine that being just really nasty if it's a big page size.
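A sketch of that whole-block overwrite on a POSIX system. The file name, offsets, and pre-sizing are invented, and a real engine would wrap this in checksums, a write-ahead log, and locking:

```python
# Update-in-place: read the 4 KB block, change a few bytes, write the WHOLE
# block back to the same location on disk.
import os

PAGE_SIZE = 4096

def read_page(fd, page_no):
    return bytearray(os.pread(fd, PAGE_SIZE, page_no * PAGE_SIZE))

def write_page(fd, page_no, page):
    assert len(page) == PAGE_SIZE          # never a partial block
    os.pwrite(fd, bytes(page), page_no * PAGE_SIZE)

fd = os.open("demo.db", os.O_RDWR | os.O_CREAT)
os.ftruncate(fd, PAGE_SIZE * 8)            # pre-size to 8 pages
page = read_page(fd, 3)                    # 1. load the leaf page
page[128:133] = b"hello"                   # 2. modify the value in place
write_page(fd, 3, page)                    # 3. rewrite the entire 4 KB block
os.close(fd)
```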
Starting point is 00:35:57 But what is really cool, and what's really different from the LSM tree, is that we are modifying values in place, which makes it a little bit easier to maintain transactional integrity, cause there's only one spot to write to, and we don't have to worry about multiple copies and making sure we always have the latest. But it also is much more efficient on disk space,
Starting point is 00:36:14 cause we're not duplicating our data. We're actually modifying the values rather than recreating the whole thing. But I want to comment on something that you said, though. When you said, like, the nasty page size... if it's only 4K, then that's your worst case. Yeah, like, you're writing a 4K block, period. Which, you know, depending on what your hardware is, unless you're on like a Raspberry Pi, something really embedded and small, otherwise that doesn't sound like a big deal, right? Yeah, I was thinking like a record with a huge
Starting point is 00:36:50 branching factor. Like, kind of the question I had a little bit ago: why not just have, like, a billion branching size, and then we could have everything in one flat structure? Well, the deal is, there's a cost to that, so we'd be writing a billion bytes or whatever. Well, you wouldn't even have that much room to store it all in a single page, is the problem. You have to keep the ranges, right? And that's the problem, is you can only store as many key ranges as you have room for on that particular page size, in that 4K block. Yeah, and then, however many keys you can fit in 4K, you're only ever going to be writing 4K, period. Well, you can... there, you can set... you can find disks with other block sizes. Okay, but,
Starting point is 00:37:32 generally speaking, what they're saying here is that typically the page size, like if you were in SQL Server and you were to create a new table, a new index, it's going to default to a 4K block size. Now, if you're using a different block size on your disk array or whatever, then, you know, that's why it's on you; you can tailor that to meet your needs. But, um, you know, if we take the default of 4K, then, if we ignore for the moment any balancing that might happen in the tree that might cause rewrites up the branch structure, then at most you're only doing one 4K block in that scenario. If you had to rewrite everything back up, then worst-case scenario, it's going to be 4K times the height of the tree, is how many blocks you'd be
Starting point is 00:38:27 rewriting. I think, is that right? Yeah, no, I guess that's not, if you're on the third level, it could potentially write to level two and back to the root node.
Starting point is 00:38:36 Yeah. But as soon as it wrote back to level two, that might, that would require a split. There would be two 4k block sizes. And then if it got back to the parent and the parent required a split, then that would be two as well. Maybe I'm overthinking it.
Starting point is 00:38:53 I don't know that you're going to have that, because you're probably just updating ranges as you move up the node structure. You're probably not writing data, because you're not writing data in the pages that are just indexes, right, the ones that are just key ranges. So that's not going to cause splits; that's going to be modifying the pointers based off the ranges. But you could totally have the splits, though. That could definitely happen. It could. So worst-case scenario, you could be writing further up, but I don't know that it'd be 4K blocks. Maybe it would. Well, each node in the tree is a 4K block size, period, every time, because it writes it, and,
Starting point is 00:39:33 like we clarified earlier when you corrected me, even if it's not fully utilized, you're still reserving it; like, this 4K block is for this page, it's written, blah blah blah. So, you know, you might have a file that's like a bunch of nulls, assuming it were to null out the whole block size, right, but only have data at the very head of it. Uh, you know, so that actually brings us to the next important point, this whole writing up: if you update a value and it doesn't exceed the page size for that particular block, then nothing changes, right? The only thing that you did was write that one 4K block, and life is dandy, right? The pointers from the parent don't have to change, and so on up the tree. So that's really important.
Starting point is 00:40:17 Have you guys ever hit that error where SQL server would tell you, you can't write more than 8,000 characters or whatever on a particular row. And it all had to do with page sizes. That's exactly what it was because it couldn't split that data across two different places. I never put two and two together, but yeah. So then to Joe's point, I've never seen it happen, but I wonder if you could get that error to go away by having a larger block size. Like changing the underlying implementation of how SQL Server was writing to disk. Yeah.
Starting point is 00:40:49 You can configure the page size too. I looked up the defaults for the common ones too, by the way. Yeah? You guys are curious. Sure. We are. SQL Server default, you want to guess? For the page size?
Starting point is 00:41:00 Yep. I'm going to say 4K. 8K. 8K, okay. Yep. I'm going to say 4K. 8K. 8K. Okay. Yep. What about Postgres? 16K.
Starting point is 00:41:14 32K. I'm going to stay with my first answer. 8K. Yep. What about my SQL? I'm going to say a trend. I see a trend. I'm going with 32K again.
Starting point is 00:41:24 16K. Dang it. One more. Oracle 32K again. 16K. Dang it. One more. Oracle. 32K. 32K. 14K. What?
Starting point is 00:41:32 All right. Well, while you're looking for that, uh, one thing that we did forget to mention when we were talking about these B-trees is that, as we're describing how this splitting of the nodes could occur, the whole reason why any
Starting point is 00:41:58 of that matters is because the whole definition of the B-tree data structure is that this is a self-balancing tree. So as it reaches that maximum degree of branching within a particular node, the splitting of the nodes, which might then recursively go up the tree, is inherent to its nature. Like, that's a key defining characteristic of the B-tree. Yep. And this next point actually goes straight to that. So if your key and value would exceed the size of the page that it's trying to write to, that's when the page is split in half. And then you now have two 4K blocks, if we're still talking about 4K. And after that data is written to one of the blocks, the new block that is created, it then goes and updates the parent node, for the parent page information to update the ranges.
Starting point is 00:42:59 So that's where that split happens, and that's why it's a self-balancing tree. Like what Outlaw said: as it exceeds it, it splits it into two halves, right? Like, it's two full-size pages with half the data on each of them. And then that way it's always trying to balance it as it goes, as it splits these pages. Yeah. And another thing, too, that we haven't said, at least in this episode, is, you know,
Starting point is 00:43:23 another thing that makes this thing so powerful and, uh, so efficient is that, um, when you do these splits of the node, then you're going to put things less than a particular value in one, like say the left child node, and anything greater in the right child node. So you can imagine how, using this, you can do a binary search as you're trying to traverse this tree. So if you had a tree where, let's say, it had a thousand things in it, and you wanted to insert a 787, and the root node was 500: you could be like, you know, okay, it's greater than 500, so I need to go into the right side of that tree. Right. So then you would traverse into whatever the right side of that is. And then, you know, again, in my...
Starting point is 00:44:16 you know, the next decision point might be, well, okay, is it less than or greater than 750? And you're like, oh, it's greater. So again, you would go into the right node of that second branch, or node, of the tree. Until you get to the spot where you're like, okay, this is the range that 787 would belong in. Is it already there, or is it not? If it's not, then I'll insert it. But like Alan just said, if the act of inserting means that that node is now at its limit, then you would split that, and the median of that node would go up into the parent node, which could then cause the splitting to be necessary again, and continue on up the tree that way. And again, we covered this in more detail in episode 97, as it relates specifically to how the B-tree operations work. So one of the interesting things that they point out here is the Big O notation: a B-tree with n keys has a depth of O(log n), which is kind of interesting. It means it'll never get huge.
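To tie that split-at-the-median behavior together, here is a condensed, from-scratch sketch. It splits any full node it meets on the way down (which keeps the code short and guarantees the parent has room), rather than splitting only on overflow as described above; the end effect of pushing the median up is the same. The tiny MAX_KEYS is chosen so splits actually fire, and real engines split fixed-size byte pages rather than Python lists, so treat this as the shape of the algorithm, not anyone's implementation:

```python
# Insert-with-split: a full node is split at its median, and the median key
# moves up - so a full root grows the tree a whole level.
import bisect

MAX_KEYS = 3

class Node:
    def __init__(self, keys=None, children=None):
        self.keys = keys or []
        self.children = children or []     # no children => leaf

def split_child(parent, i):
    """Split parent's i-th child at the median; the median key moves up."""
    child = parent.children[i]
    mid = len(child.keys) // 2
    right = Node(child.keys[mid + 1:], child.children[mid + 1:])
    parent.keys.insert(i, child.keys[mid])
    parent.children.insert(i + 1, right)
    child.keys = child.keys[:mid]
    child.children = child.children[:mid + 1]

def insert(root, key):
    if len(root.keys) == MAX_KEYS:         # full root: the tree gets deeper
        root = Node(children=[root])
        split_child(root, 0)
    node = root
    while node.children:                   # descend toward a leaf
        i = bisect.bisect_right(node.keys, key)
        if len(node.children[i].keys) == MAX_KEYS:
            split_child(node, i)           # make room before descending
            if key > node.keys[i]:
                i += 1
        node = node.children[i]
    bisect.insort(node.keys, key)          # leaves stay sorted
    return root

root = Node()
for k in [500, 250, 750, 100, 787, 300, 900, 650]:
    root = insert(root, k)
print(root.keys)                           # [500, 787]: medians pushed up by splits
```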
Starting point is 00:45:33 It's pretty good. By the way, Oracle: 4K. I don't know what the heck I found initially. Oh, okay. Okay. It's also important to point out, though, with that O(log n), that's the time complexity to search, insert, and delete on a B-tree.
Starting point is 00:45:50 They didn't go into it deep here. Yeah, they didn't go into it here. But, I mean, that's why it's so important, right? It's fast. Relatively speaking, this is a pretty fast data structure to use, and it's why it's been around for, what, 50 years now? I mean, oh gosh. Yeah, think about that, right? I don't want to think about it like that. But I mean, like, as a data structure, period, you know. Because like we said, the data structure itself doesn't have this kind of limitation on it, but now we're
Starting point is 00:46:21 putting these physical hardware constraints on it to say, like, okay, these nodes are 4K, and we'll do the best we can with the 4K. So you can see how the application of something theoretical, which was just the B-tree, now applied to real-world hardware use cases, is where some of these things start to matter. Right. Yep. Oh, here's an interesting thing that they threw out in the book that I actually really liked. They said most DBs only go three or four levels deep. And the reason being, a four-level tree with four-kilobyte pages on it
Starting point is 00:47:00 with its branching factor can store up to 256 terabytes at just four levels deep. Like, pretty nice. That's a lot of data. So yeah, I was curious as to how they got there, so I redid the math on that, because the important thing that was left out of that was that the branching factor was 500. Ah, so that's it. Wow. So how you're getting to that 256 terabytes, which I can get exactly to 256... so I'm assuming this is, like, you know, one of those 1024 kind of things getting lost in the translation here, a rounding error. But you would take, because you would have these four levels, so each level could have
Starting point is 00:47:55 500 branching factor, 500 times 500 times 500 times 500 times 4KB, and that's where you would get to the 256 terabytes. Yeah, that's cool. That's awesome. Yeah, 256 terabytes is a nice-sized database. Yeah, I'll take that SSD. Yeah. Hook me up. Yeah, but it's probably, like, with a branching factor of 500, it's like, oh, yeah, I could see that.
Starting point is 00:48:27 Like, that sounds pretty large. So, I mean, that's each level having 500 things that it could point to. Right. Right? That's it. It's just pointers. Wait. No, I'm saying that wrong.
Starting point is 00:48:48 At the top most level, you would have 500 things you're pointing to. At the second level, you would have 500 twice. So you'd have a thousand things that you're pointing to. No, no, no. Because each node. Top level is 500 pages it points to. Those 500 pages have 500 pages they each point to. Oh, yeah, yeah, yeah. You're right, you're right, you're right. Yeah, it's a tree, right? Yeah, yeah, yeah.
Starting point is 00:49:10 Like it's every single leaf or every single branch that comes down has two more coming off of it, right? Yes. Or 500 in this case. Yeah. So, yeah. Yeah, and if you want me to draw it out, I can totally describe what this tree looks like if I had to draw it. Picture a triangle. No, I'm just kidding.
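Checking that arithmetic with the numbers from the discussion, as a quick sketch (the decimal-versus-1024 mismatch they mention is exactly where the rounding fuzz lives):

```python
# Four levels of branching factor 500, with 4 KiB pages at the bottom.
branching_factor = 500
levels = 4
leaf_pages = branching_factor ** levels        # 62,500,000,000 leaf pages
capacity = leaf_pages * 4 * 1024               # 4 KiB per page
print(f"{capacity / 10**12:,.0f} TB")          # -> 256 TB (decimal terabytes)
```

Using 1024-byte kilobytes for the page but decimal terabytes for the total is what makes it land on exactly 256.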
Starting point is 00:49:29 No. Yeah. Today's episode of Coding Blocks is sponsored by Datadog, the monitoring and analytics platform for cloud-scale infrastructure and applications. DataDog's machine learning based alerts, customizable dashboards, and over 400 vendor backed integrations make it easy to unify disparate data sources and pivot between correlated metrics and events for faster troubleshooting. By combining metrics, logs, and my favorite thing, traces, in one spot, you can easily improve your application performance. Try DataDog free by starting a 14-day trial and receive a free t-shirt once you install the agent. Visit datadoghq.com slash codingblocks to see how you can unify your monitoring today. Hey, Joe here. Whispering into your ear about reviews. I know it's annoying, but check it out. I was looking at the stats and I figured
Starting point is 00:50:26 out that whenever I ask for the reviews, we get the least number of views afterwards. And so I really want to kind of do a good job this time because I want to make Outlaw and Alan feel bad because I guess I'm a jerk.
Starting point is 00:50:42 So I was kind of hoping if you could just take a minute and go to codingblocks.net slash review. And if you could just leave us a review, a really good review, that would be really great. And then it would really help my numbers look good. So I'm trying to kind of get ahead here. So if you could do that. Okay, they're coming back. So, okay.
Starting point is 00:51:02 All right. And, yeah. So thank you very much. Hey, what did I miss? I just got back. Yeah, I got a drink. Uh, anything happen there? Yeah, no, no. It's just, uh, time for Outlaw's favorite portion of the show. Okay, what portion is that? Survey says! I love how just watching you try to mock me do it is so much better. All right. So a few episodes back, we asked... and actually, Alan wasn't here for this one.
Starting point is 00:51:35 This was while he was out in – I think this is while you were in London. Yes. You can't go there anymore, but you were there for a while so uh yeah we asked how do you pronounce data or data and your choices were data where that first a is a long a or data where that first A is a short A. So... Hey, I take issue with the survey, because neither one of those have a T after the A.
Starting point is 00:52:14 What is that about? People don't say data. Neither one of those have a T after the A. What are you talking about? So, the pronunciation key to the right there. You've got D with a long A and then a D. I mean, you've got to blame the pronunciation people at Webster or wherever. Your dictionary of choice.
Starting point is 00:52:39 All right. So since Alan takes issue with it, yeah, Mr. Issue, which one do you think is the most popular answer with a percentage? The right answer is data. And I'm going to go with 55%. 55. So long A. Yes. Okay. The long A. Data. 55 So long A. Yes. Okay. The long A.
Starting point is 00:53:05 Data. 55% long A. Well, I'm going to go with data, because I remember clearly and distinctly Star Trek winning the Star Trek vs. Star Wars war. And I think that means something to people pronouncing this word. So also long A. Is that what you're saying? Also long A, yeah. And I'm gonna go with 49. All right, wait... it's best not to dwell on it. This math thing has got him messed up, man. Well, you know what, you have to know the math pretty well to be as wrong as I am. He could just be assuming that you overshot, though.
Starting point is 00:53:48 No, no. But there's two answers. One of them has to win, which means one of them has to be greater than 50%. Oh, yeah, yeah, yeah, yeah. Yeah, you're right. I don't know. It's too close to a survey. But this is Price is Right rules.
Starting point is 00:54:06 So you could say as long as you pick the correct answer and then you give 1%, right? You could still win. So that's where I was coming from. It's like he's just assuming that you're going to overshoot. That's what he did. So he just went under a few percentages. Yeah. So he gave himself some wiggle room there. So data
Starting point is 00:54:28 with a long A for 55% from Alan, and 49% from our mathematically challenged Joe. And survey says: data with a long A was by far and away the most popular choice, at 84% of the vote. My peoples. Whoa. All right.
Starting point is 00:55:02 It was awesome. That stuck in the house. All right. I was actually kind of surprised with that one. I thought this was going to be more evenly split than it was. But yeah, proper education going on all around, apparently. I like it. All right, well, I mean, there's a pronunciation key for both, so, I mean, you can pronounce it either way. In one way, you would be wrong. Oh. Wow.
Starting point is 00:55:30 Well, that's a little brutal. All right. So now you know, too. If you say the word data, then you are on the outside. You are a rabble-rouser. You are a scallywag, a troublemaker. So now you know. Okay, Pirate. All right.
Starting point is 00:55:55 So for this episode survey, Super Good Dave suggested, hey, what if we were to ask something, you know, because we're all working from home: what tools are you using to ease working from home? And this is going to be another multiple choice one. And the choices that I've got in here so far are RDP, VPN, Citrix, Zoom, Skype, Slack, WebEx, Hangouts, Teams, Discord, Pidgin, Jitsi, Messages, FaceTime, Join.me, and lastly, GoToMeeting. We'll see. This one's going to be multiple selection, right? Like, we just kind of want to see what people are doing here. Yeah. Yeah. Well, this all came about because, uh, so if you're not already part of the, uh, Coding Blocks Slack, you should definitely become part of it. And, uh, what we did was, because we're all working remote and we're all in need of social interaction, we decided to get together, uh, you know, everybody that wanted to participate in the Slack, um, and have like a large, uh, you know, Zoom call where, like,
Starting point is 00:57:03 if you were able to figure out how to Zoom bomb us, then you got to join. And during that conversation, we were talking about different tools that were being used, and someone, I think it might have been Super Good Dave, mentioned using RDP. Or no, maybe what brought it up was I had referenced, um, how there was a recent report that came out talking about how the usage of RDP had significantly increased due to everybody working from home. But the downside to that was all of the security flaws that are in the RDP protocol, which are like reasons why you shouldn't be using RDP. And then, you know, people were asking, like, oh, wait, what? Really? We shouldn't be using it? It's like, yeah, it turns out it's not a secure protocol.
Starting point is 00:57:54 So that's how it came about. So yeah, join the Slack. That's the moral of the story. Codingblocks.net slash slack. Okay. So now that we've got this framework in mind about B-trees, why we would want to use them, and some of the ways that they work, let's talk about, like, how we can make them reliable. So that's Alan.
Starting point is 00:58:20 That's me. Oh, okay. Not it for me. Okay. Yeah. So the main notion here is, because things happen in the same place as the original page, on the original... jeez, easy for you to say. Let me start this sentence over again. So these things write to the same location as the original page, right?
Starting point is 00:58:52 And the whole point of that is so no references have to change. And this is where they said, hey, think of this thing as a hardware operation, right? You're actually talking about writing to sectors on a disk somewhere. And this was the part that I was mentioning earlier that was kind of interesting when they brought this up: that works out great for spinning disks, because a lot of times that page size maps to whatever the sector size on the hard drive is. SSDs don't work that same way. Typically, when you rewrite a block on an SSD, you have to write a larger chunk of data at the same time, due to the way the chipsets work. So that's one of the things that I'm kind of curious about as time moves on: does the underlying storage technology, or how they interact with the storage technology, change,
Starting point is 00:59:40 or do they change away from B-trees? I don't know. It'll be interesting to see how that goes. Well, one thing that I was thinking about in my mind as you were talking about this was, like, you know, what is to stop you? We've talked about how they write all of these things in individual 4K blocks. And there's actually a portion of, I believe, this chapter where they're talking about sequential versus non-sequential writes. But I was thinking about it in my mind, like, well, you know, you could actually trick the system into getting a sequential block if you wanted to, by up-front allocating more than you needed. So for example, if you allocated a 16K page and then you just have, like, okay,
Starting point is 01:00:29 here's an address that's going to point to the first 4K. Here's, you know, a node that's going to point to the second 4K, and a node that's going to point to the third and the fourth, right? I'm like, why doesn't that work? What is that? Is that a thing? And if not, like, why not? I didn't find anything on it. Again, this was just something that came to me as you were talking, but do you understand what I'm saying? Because then it seems like you could have, like... and sorry to interrupt, but in that simple example that I gave of the 16K block, you could imagine it's basically like four 4K nodes, right? And, you know, okay, so what if you're not using all of it yet, but you have
Starting point is 01:01:05 an idea that you're going to, and you know at least now you have a larger contiguous block? I don't know, but maybe that's a dumb idea. I'm imagining the only reason stuff like that doesn't happen is you'd have to know pretty well in advance that that is a possibility of happening, because really what you're talking about is increasing your disk space by 4x right when you create the one page block. So, I mean, there's no doubt it might be faster because it'll shove stuff together, but you're taking a major hit up front on disk space if you're doing that and it's not all being utilized, you know, quickly, I guess, would be the thing. So I don't know.
Starting point is 01:01:45 Yeah. I mean, it might be an answer to this SSD issue, though, since it says that they rewrite large blocks at one time. Right. Which, I'm not even that familiar with how SSDs work under the covers. So, like, just what they said here was kind of news to me. Like, I didn't really know that. So, yeah, it is interesting. But here's, so this is where things get a little bit crazy on the reliability stuff. So they said some operations require multiple pages to be written,
Starting point is 01:02:17 right? So if you think about that, like you, you write data and it exceeded that 4K size that you had on your page. And so it had to split it. Well, that split also means that up to the parent node, it has to update its ranges and its references to those. And then maybe even has to go up to the other node above it if there is another parent. Well, the problem is, what if the database crashes in the middle of that? Right. And there's nothing that was backed up there. Now, when the database comes back up,
Starting point is 01:02:49 your data's in an inconsistent state because essentially you're going to have some orphaned pages out there, like the things that were written, that the new references weren't updated in the parent nodes. There's no way to get back to that data now. So you basically lost data more or less. And it could be quite a bit of data because you got to think if that page split is because you exceeded that 4K, it took half of that data, shoved it on one, took the other half on the other.
Starting point is 01:03:14 So you're not just losing that one record you had. You're potentially losing a whole bunch of data that came along for the ride in that page split. So that's one of the problems that can happen. Or in this case, it would be at least 4K plus whatever the size of the thing being indexed is, because it was the thing being indexed that caused you to split. And that's the best-case scenario. That's assuming that you don't have to split any nodes as you go up the tree, because if you got all the way up to the very last node that needed to be rewritten, and that's where you crashed, then you've lost everything that you just split below it, right?
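Here's a rough sketch of that in Python. The dictionaries are our own toy stand-ins for fixed-size disk pages, not any real engine's layout; the point is just that one overflowing insert turns into several page writes, and the parent update is the one you can't afford to lose:

```python
# A toy B-tree page split, to show why one insert can dirty multiple
# "pages" at once. Real engines work on fixed-size disk pages; these
# dicts are just stand-ins.

PAGE_CAPACITY = 4  # pretend a page holds 4 keys instead of ~4KB of them

def split_leaf(parent, leaf):
    """Split an overflowing leaf and register the new leaf with its parent."""
    mid = len(leaf["keys"]) // 2
    new_leaf = {"keys": leaf["keys"][mid:]}   # write #1: brand-new page
    leaf["keys"] = leaf["keys"][:mid]         # write #2: rewrite old page

    # Write #3: the parent's ranges and references must change too.
    # If the database crashes before this part hits disk, the split
    # pages are orphaned, because nothing points at them anymore.
    parent["children"].append(new_leaf)
    parent["separators"].append(new_leaf["keys"][0])

parent = {"separators": [], "children": []}
leaf = {"keys": [10, 20, 30, 40, 50]}  # over capacity, needs a split
parent["children"].append(leaf)
split_leaf(parent, leaf)
print(parent["separators"])  # [30], the new boundary between siblings
```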
Starting point is 01:04:01 And so now this is where we get into what I alluded to before: the write-ahead log. Right. Which is the next thing we have up here. And the whole purpose of that thing is, it goes back to the building blocks of all this from the previous stuff, which is having the append-only log that all the changes go to first. And then, after those changes are committed to disk, then it goes through and does the updating to the actual data. Yeah. And in this way, what we can do is, if the database were to crash
Starting point is 01:04:37 as we have to split and rewrite these pages, you know, assuming that was necessary, if the database crashes during that operation, then we still have that log, that append-only write-ahead log, that we can use to rebuild the state of that index.
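And a write-ahead log itself is simple enough to sketch. This is our own toy format, not SQL Server's or anybody else's; the thing that matters is the ordering: flush the log record first, touch the pages second, replay the log on startup:

```python
import json, os

LOG = "wal.log"

def log_change(change):
    # 1) Append the intended change to the log and force it to disk...
    with open(LOG, "a") as f:
        f.write(json.dumps(change) + "\n")
        f.flush()
        os.fsync(f.fileno())

def apply_change(pages, change):
    # 2) ...and only then touch the actual pages.
    pages[change["page"]] = change["value"]

def recover(pages):
    # On restart, replay the log so half-applied work gets redone.
    # Real systems checkpoint and truncate this file so it stays small.
    if not os.path.exists(LOG):
        return
    with open(LOG) as f:
        for line in f:
            apply_change(pages, json.loads(line))

pages = {}
change = {"page": "leaf-7", "value": [10, 20]}
log_change(change)            # crash after this point? recover() redoes it
apply_change(pages, change)
```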
Starting point is 01:05:02 Yeah, and they also refer to that write-ahead log as a redo log, which sort of makes sense. It does kind of mean that, at a minimum, you're always writing your data twice. Yes. Yeah. But I think that's also how, even when we talked about the LSM trees and all that stuff, right?
Starting point is 01:05:16 Like, that's kind of how they always made sure that things happened: commit everything first, and then that way you have a rollback plan if you need it, right? Well, I believe in the pre... Oh, sorry. I was saying the write-ahead log is an append-only log that keeps all the changes. So in some ways it kind of behaves
Starting point is 01:05:35 similarly to the LSM tree. Yep. My guess is they're probably way smaller. You know, like they commit that thing and then they go do the writes and then they probably just overwrite that file or they create new files all the time and wipe those off. I doubt there are a lot of long running changes like what the LSMs and those are. That'd be my guess.
Starting point is 01:05:57 I mean, this is similar to, I believe, where we started this chapter, where we were kind of referring to this concept as the transaction log, if I remember correctly. But now, it may be, to your point, that it does just represent, like, a short-term, you know, period of time, like if it was a rolling log file, you know, log.1, log.2, log.3, right? You know, maybe that's where they're drawing the distinction between the write-ahead log and the transaction log. I don't know. I mean, I'm asking. That'd be my guess, is the write-ahead log is probably a short-lived, temporary thing. Maybe for just a handful of transactions at a time would be my guess.
Starting point is 01:06:45 I've never looked at one. Yeah, otherwise they just call it the transaction log, which is a much bigger deal, right? So the next issue that they brought up that was really interesting is this issue of concurrency, right? Like you're going to, if you're in a database, you know there's more than one query being executed at any given time. There's probably lots of updates, lots of inserts, all that kind of stuff.
Starting point is 01:07:07 And the thing is, if you have one thread reading the state of something while another thread is writing the state, they're going to be inconsistent, right? Like, you're going to have references that don't point to the proper pages at that time. And in order to fix this thing, this is where databases implement what they call latches, which are more or less just lightweight locks. And if you've ever dug deep, which I know you guys have been in there, when you're working with something like SQL Server, there's, like, implicit locks that happen for you when you're doing updates or deletes or anything like that. You can actually force isolation levels when you do certain types of queries. I don't remember them all off the top of my head, but you can have read uncommitted. You can basically have things where you can have a read that will operate on the data, even if there's an update happening.
Starting point is 01:08:12 Because you're saying, I don't care if I get dirty data. I want it as fast as possible. So don't care about any locks that are in place. Just give me the data. Or you could give a hint to the query optimizer inside of your query where you'd say, select from my table with no lock. Right. And then that way you're reading that table as if nothing else in the world is touching it. You might not get back exactly what you expect, but you're also not going to be
Starting point is 01:08:39 waiting on anything. So there's trade-offs that can happen. But the key is, there are locks that are happening behind the scenes, whether you know it or not, to help make sure that your data stays in a consistent format, and so that everybody gets what they expect, unless they decide to override that behavior. So, I'd have to give this full article a bit more of a read-through than glancing at it right now, but it looks like the write-ahead log and the transaction log will be synonymous. Like, there's a Microsoft article that I'm looking at, and they actually refer to it as the write-ahead transaction log. And specifically, they say that SQL Server uses a write-ahead logging
Starting point is 01:09:20 algorithm. So, so yeah, I think it's just they're synonymous with one another. Okay. I'll give you guys this link, though. Which kind of makes sense if you think about it, because
Starting point is 01:09:33 you can truncate a transaction log in SQL Server, which basically means, hey, you still have all your data, but you can't rewind anything, right? Like, there's no history anymore on what the writes, the reads, and all that kind of stuff were. Yeah. Oh, there are five levels you can set for those isolations. I have not heard of most of them. Read uncommitted, right? Yeah, it's been a while. There's, like, weird ones, like repeatable read, snapshot, and serializable, too. Whatever that means. Yeah, it's been a while since I've gone that deep on them.
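The latches we mentioned a minute ago are easy to picture in code, too. Here's a toy version, just a mutex wrapped around a page, which is roughly what a latch is, minus all the cleverness a real engine layers on top:

```python
import threading

class Page:
    def __init__(self):
        self.keys = []
        self.latch = threading.Lock()  # the "latch": a lightweight lock

    def insert(self, key):
        with self.latch:               # writers take the latch...
            self.keys.append(key)
            self.keys.sort()

    def read(self):
        with self.latch:               # ...and so do readers, so nobody
            return list(self.keys)     # ever sees a half-updated page

page = Page()
threads = [threading.Thread(target=page.insert, args=(k,)) for k in range(10)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(page.read())  # always a consistent, sorted list
```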
Starting point is 01:10:06 But all right, so now we're actually getting into a part that Outlaw touched on earlier, and there's a couple of them that he hit on in here. So now we're going to talk about B-tree optimizations. So, you know, again, everything we've done in this chapter is sort of foundational, just understanding how files are done for databases to work. The first one is copy-on-write: instead of the write-ahead log, where it writes that thing first before it goes and updates the page on disk, this kind of gets rid of that need. Because what it does is, it will write the new 4K block that it needs, again, sticking with 4K. And then, after it's done with that, then it'll update the pointers to point to the new one. So it's basically just creating a new space and then updating the pointer to it. And then that old one kind of goes away, and that alleviates the need to do a
Starting point is 01:11:12 separate log and then also update the 4K blocks. Which, this actually seems like it might be more data, or more writes. I don't know. It's hard to say. I guess it would depend on the density of the data that you're writing. Yeah, it's interesting. I'm still kind of trying to figure out what exactly that means. But I don't know. It's pretty cool. What do you mean?
Starting point is 01:11:32 But I don't know. It's pretty cool. What do you mean? I was just thinking about the copy on write. Taking up more storage space temporarily. But I guess it allows other people to keep reading while you're writing, which is, I don't know.
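Here's the copy-on-write idea as a toy Python sketch (the book points at LMDB as an engine that works this way, if we remember right; the code below is our own stand-in, not LMDB's). Write a brand-new page, then swing the pointer as the very last step, so readers holding the old pointer stay consistent:

```python
# Toy copy-on-write page update: never modify the live page; build a
# fresh copy and swap the reference at the very end.

pages = {1: {"keys": [10, 20, 30]}}  # page_id -> page contents
root = 1                             # the pointer everyone reads through

def cow_insert(pages, page_id, key):
    """Build a modified copy of a page; never touch the original."""
    new_id = max(pages) + 1
    pages[new_id] = {"keys": sorted(pages[page_id]["keys"] + [key])}
    return new_id  # caller swings the pointer as the very last step

reader_view = root                  # a reader grabs the old pointer...
root = cow_insert(pages, root, 25)  # ...while a writer makes a new page
print(pages[reader_view]["keys"])   # [10, 20, 30], still consistent
print(pages[root]["keys"])          # [10, 20, 25, 30], the new truth
# The old page is garbage once no reader holds it; that's the
# temporary extra disk space Joe is talking about.
```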
Starting point is 01:11:48 It's just kind of interesting. So I was just kind of thinking about it: we move the pointer after it's done. So I guess that would be a nice way of allowing those dirty reads if you wanted to. Yeah. That was the way I interpreted it. And this is just a short...
Starting point is 01:11:59 your point about the disk space being used, it's just a short-term problem as you're splitting these pages, for example. So, the one that you hit on earlier, Outlaw, when you called them compacted indexes: like, if you had 1 as a key, instead of writing 00001? No, 100. Instead of writing... Oh, 100.
Starting point is 01:12:24 You'd write 1 instead of 100. Okay. So, two zeros. Instead of writing one, zero, zero, you'd just write the one. So to represent the range up to 100, you would just write the one. So this is what they called abbreviated keys. This is a way to basically store more keys and more ranges in a single page, so that you can have more branches, right? And so you decrease your node depth while increasing the width, basically, by going with more branches. And you can do that because you've reduced the size of your keys. And, you know, I mean, you could think about this even from an alphabet point of view, right? Where, like, instead of having all of "outlaw" as part of the key, you might just have O-U-T-splat, right?
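Here's a little sketch of that abbreviation trick. An interior page doesn't need the whole boundary key, just enough of it to route you to the correct side; this toy trims string boundaries down to their shortest distinguishing prefix (our own illustration, not any engine's actual algorithm):

```python
def abbreviate(prev_key, boundary_key):
    """Shortest prefix of boundary_key that still sorts after prev_key."""
    for i in range(1, len(boundary_key) + 1):
        prefix = boundary_key[:i]
        if prefix > prev_key:
            return prefix
    return boundary_key

# Separator between a page ending at "other..." and one starting at "outlaw...":
print(abbreviate("other", "outlaw"))  # -> "ou" is all the page must store
print(abbreviate("100", "200"))       # -> "2" stands in for the whole key
```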
Starting point is 01:13:11 And then that would represent everything that starts with O-U-T, right? They did mention, and again, this is where this is technically a B+ tree, and again, we talked about this in episode 97, but the author here made an interesting point: he says this variant is known as the B+ tree, although this optimization is so common that it often isn't distinguished
Starting point is 01:13:38 from other B-tree variants. Wow. So apparently this abbreviated key thing is, you know, pretty much the de facto standard. Like, it's pretty common that that's going to be done. The next one, you know, taking 4K and turning it into 16K... I don't... they're not saying that here, but there's this whole notion of trying to keep things sequential. So keep your pages sequentially next to each other. And that way, because if you think about it in terms of a spinning disk, it makes a lot of sense: as this thing spins, it gets to number one. Well, on the next turn of that, wouldn't it be nice if number two was right next to it, so that it's not having to go all the way back around the disk to get to the next one or whatever? So trying to keep... Go ahead.
Starting point is 01:14:31 No, you're good. Well, I was going to say, so it doesn't even have to move the head. It could just keep reading where it is as it makes its way around the platter. Right? Instead of just reading one section off the platter and then have to move the head a little bit to read off the next section. Instead, it's just like one contiguous read as it goes around. Yeah. So, I mean, that's one of the things that's kind of interesting we haven't talked about is they talk about seeks in terms of what the software is doing. A seek is actually a disk operation as well, right? So, what you just said is not having to move that thing around
Starting point is 01:15:07 saves time, like quite a bit of time. So having those things next to each other matters a lot, especially when you're talking about large database systems. And even though, you know, as you say that, someone might be saying, well, you know, I mean, hard drives read pretty fast. But yeah, compared to maybe 1980, sure. But if you start doing that operation a million times, where the head has to keep moving back and seeking across, now you're going to start seeing a lot of latency in there. So, whatever you can... these things might seem like micro-optimizations, but as you start to scale the use of it, then they can start to matter. Yeah.
Starting point is 01:15:49 And these optimizations, these kinds of optimizations, are the reason that people pay hundreds of thousands of dollars for, like, SQL Server and Oracle when there's a free MySQL or Postgres available. Wow. Yeah. One of the many. So here's another one that's kind of interesting that I wouldn't have even thought about, because I've never written my own database from scratch. But one of the other tricks that they do to optimize these B-trees is, we talked about how they typically store ranges, right? So 1 to 100 is going to point to one spot on disk.
Starting point is 01:16:23 101 to 200 is going to point to another spot. Well, if they are trying to put things, you know, next to each other or whatever, right now, in order to get to the next chunk of data, you'd have to go back to the parent to find the pointer to the next chunk of data, right? So what they might do instead is add additional pointers for each one of the indexes. So there'd be a pointer to the previous sibling and to the next sibling, and that way you don't have to go back up to the parent to find out where the next or the previous sibling was. So it's just another optimization to keep reducing the hops that you're making around from data to data. It's almost like treating each branch level as a linked list, a doubly linked list, so that when you're on a particular level, you can go back and forth across that level, as I bang my microphone around.
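That doubly-linked-list idea in toy form: leaves chained with sibling pointers, so a range scan walks sideways instead of climbing back up to the parent. Again, our own sketch, not any engine's actual node format:

```python
class Leaf:
    def __init__(self, keys):
        self.keys = keys
        self.next = None   # pointer to the next sibling
        self.prev = None   # ...and to the previous one

def link(left, right):
    left.next, right.prev = right, left

# Three leaves covering 1-100, 101-200, 201-300:
a, b, c = Leaf(range(1, 101)), Leaf(range(101, 201)), Leaf(range(201, 301))
link(a, b)
link(b, c)

def range_scan(start_leaf, limit):
    """Walk sibling pointers instead of re-descending from the root."""
    leaf, out = start_leaf, []
    while leaf is not None and len(out) < limit:
        out.extend(leaf.keys)
        leaf = leaf.next
    return out[:limit]

print(range_scan(a, 150)[-1])  # 150: crossed a page without touching the parent
```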
Starting point is 01:17:13 That's actually a perfect analogy. I mean, that's exactly what it is, the banging of the microphone around. I thought so. Well, yeah, that's why I did it. Now, how cool is it to see all these kinds of data systems that we talked about kind of coming into play on such big systems? Right, and foundational to these big systems, right? Like, they're the reason they work at all. So yeah, it is cool. Yeah, it is really cool. I mean, we've talked about this before, that, like, how much,
Starting point is 01:17:49 how this particular chapter, like how much we enjoyed it because there's so many things that we've taken for granted for years in our career, or careers rather, that, you know, just didn't care about
Starting point is 01:18:02 some of the underpinnings of, like, how some of that stuff worked. Like, yeah, okay, fine, I'm just going to move along. And, you know, there might be parts of it that you had to dig into for some reason and figured out. But there are so many aspects of database systems as a whole that this chapter has really, like, solidified. Yeah, it's excellent stuff. And so the last one that they had here is, they said there's B-tree variants, like fractal trees, that borrow some ideas from log-structured storage to reduce the disk seeks. So again, it sounds like a micro-optimization, but ultimately, when you talk about millions of reads
Starting point is 01:18:47 and writes a day, these things add up. So, you know, there's a lot of things that are borrowed from various different places to help these things actually do what they do effectively. Yeah, and that pretty much covers B-trees, what we talked about, but I'll take a little bit here to talk about the differences between B-trees and LSM trees. And like we said, B-trees are much more common and mature, having come out in the 70s.
Starting point is 01:19:14 And I just looked it up. LSM trees were invented in 1996. And I thought it was kind of interesting, it was like, do you invent algorithms or do you discover them? Wikipedia says invent. But yeah, just the idea is that the B-trees have been around for a lot longer. A lot of the things we just talked about there with the optimizations are things that we've learned were really helpful, based on the use cases
Starting point is 01:19:38 and things that people were doing with real-world data. So the LSM trees are still kind of the new kids on the block, and they're becoming more popular. But, I don't know, there's definitely some pros and cons here, which we're going to talk about. But kind of high level, if you just remember a few things about it: LSM trees are typically associated with faster writes, and B-trees are typically associated with being faster for reads, because of basically all the things we talked about, but essentially
Starting point is 01:20:12 because those LSM trees have to check multiple kind of spots potentially to find data. And so that's the general kind of rule of thumb. But those use cases can wildly vary based on the kinds of things that you're doing or the benchmarks can vary wildly.
Starting point is 01:20:27 So, you know, it really kind of depends on your specific use cases and the size and shape of your data. And I don't know if we talked about this last episode at all. Do you remember if we talked about write amplification? I don't. No, but I did want to, before we get into that, though, there was one other thing I wanted to cover, because we had said that for the B-tree, the big O notation for that was going to be O(log n). I think we said search and inserts and deletes, if I remember right, were the operations. And I'm looking at the Wikipedia page, I gave you guys a link to it, for the log-structured merge tree. And you can see, like, how your comment about the B-trees being faster for reads versus the LSMs being faster for writes holds up, because, on average, the minimum for the find, for example, was an O(n) operation, as well as the worst case was an O(n) operation.
Starting point is 01:21:31 So you can already see the B-tree way outperformed it. Smokes it. Yeah, that makes a lot of sense. And the whole thing with B-trees is that they maintain that balance, and so you get those consistent times, which is really nice. And we don't have that with the LSM. So I want to... oh, go ahead. The LSM was an O(1), right, which makes perfect sense because it's always appending to the end.
Starting point is 01:21:58 Whereas the B-tree was having to do this balancing act. It was still fast, but it's not nearly as fast as O(1). That's as good as it gets, right? Well, again, the B-tree has to write all data at least twice, which doesn't technically affect the big O there, but it's a consideration: once to the write-ahead log, or transaction log, and also to the tree. And then if there's any sort of splitting that goes on,
Starting point is 01:22:20 then it goes on from there. And then there's optimizations as well, where some storage engines will even go further, by kind of writing to multiple pages for, like, redundancy, or for basically making for faster lookups. And according to Wikipedia, for the writes it's the same: the B-tree is the same O(log n) operation, so that's its average and its worst case for an insert into the B-tree. Okay. So, yeah. So different use cases, a little bit different strengths.
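If you want to feel that O(log n) versus O(n) difference in code, here's a toy comparison. The B-tree flavor is a binary search over sorted keys; the naive LSM flavor checks segments newest-first, which is where that worst case comes from (real LSM engines soften this with things like SSTable indexes and Bloom filters):

```python
from bisect import bisect_left

def btree_find(sorted_keys, key):
    """B-tree flavor: O(log n) binary search within sorted pages."""
    i = bisect_left(sorted_keys, key)
    return i < len(sorted_keys) and sorted_keys[i] == key

def lsm_find(segments, key):
    """Naive LSM flavor: check segments newest-first; worst case
    you touch every segment before giving up."""
    for segment in reversed(segments):  # the newest segment wins
        if key in segment:
            return segment[key]
    return None

print(btree_find([10, 20, 30, 40], 30))              # True
print(lsm_find([{"a": 1}, {"a": 2}, {"b": 9}], "a"))  # 2, the newest value
```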
Starting point is 01:22:53 LSMs also do rewrite data. If you remember, we talked about compaction tonight, where basically we're always appending, and so if there are any updates or anything, then we basically kind of mark them as deleted. And then later, there's a separate kind of garbage collection process that goes through and kind of cleans up and shakes things down, which is rewriting data. It just kind of happens out of band. And there's a problem with that. Not only do solid-state drives not do so well with repeated writes in the same segment (those things kind of have a lifespan), but also, because...
Starting point is 01:23:30 I can say it. So I'll go there. Whoa. Okay. I was just going to say I've got a section on LSM downsides, so I was going to just save it. Okay. We can do that. So.
Starting point is 01:23:44 Alright. So LSM trees typically have better sustained write throughput, because they have lower write amplification (they're doing fewer writes, typically) and because of those sequential writes, particularly on spinning disks. And they can involve less disk space. And I forget why I was so intrigued by that point now. You remember, the book said, about the compression. You know, I was actually trying to think about that, because I was like, wait a minute, why would that be a thing? Because, again, if you were to go to the B+ tree, then wouldn't that be compressed? But I mean, it does say that the LSM trees can be compressed better, and thus often
Starting point is 01:24:35 produce smaller files on disk than the B-trees. And this is because the B-trees would leave unused space due to fragmentation, when a page is split or when a row can't fit into an existing page. Yeah, because they're keeping fixed block sizes, where the LSM tree just writes what it can and keeps moving on, right? Like, they don't care about block sizes.
Starting point is 01:24:58 And it doesn't change. So compression, it would be super annoying if you had to, like, look that stuff up and then make a change to the middle, and that's going to change the whole compression, the whole output there. So it would involve a lot more writing. So that just works better in general with data that doesn't change. The last point here for the advantages was that LSM trees also have lower fragmentation on writes, which we just mentioned half a second ago. Now the downsides.
Starting point is 01:25:24 What were you starting to say there? Well, so the book covers a portion of it, where they're talking about how, you know, any disk, be it a hard disk or a solid state, is going to have a finite amount of write bandwidth that can be used,
Starting point is 01:25:40 you know, for that. So when this compaction operation happens, depending on the number of threads that are running in the background to do that compaction, then they're competing for resources with the live operations that might want to write. So depending on how many of these pages have to be compacted, and how many worker nodes are doing this compaction at one time, you know,
Starting point is 01:26:08 you could see this bottleneck, where it could impact the write performance, and potentially the read performance, of other threads that are handling, like, you know, the live transactions. Yeah, that's a really good point. Now you're competing for that bandwidth there.
Starting point is 01:26:24 So that stinks. And yeah, so that compaction, even on disk and RAM, can have an effect there, with it running out of band still. Something I've seen out in the real world is that, with the LSM-tree-style databases, or LSM-based databases, it's possible that the compaction isn't keeping up with incoming events. So if you're doing, like, a lot of updates here, it's easier for you to run out of disk space, because that garbage collection, that kind of cleanup phase, isn't running as aggressively as the data that's coming in, that's, like, writing and kind of making that older data moot as it goes along. And so what happens is that eventually you run out of disk space, because you've got all this stuff that you can't reclaim, or that's being reclaimed slower than you're writing,
Starting point is 01:27:15 which is kind of funny, that you can, like, write less data than the disk has and still run out of disk space, because you're effectively duplicating it over and over by appending, appending, appending. What was the system that you ran into this problem with? Elasticsearch. Yeah, that's what I had in my mind. Yeah, and it's a real problem when it runs out of disk space. It basically shuts down and says, okay, hey, we're not letting anything else in, and we're in delete-only mode. And so you've got to go in there and tell it to basically do the compaction manually. But even compaction takes a little bit of disk space and stuff, and so they have actually implemented circuit breakers now that will shut it down before it gets to, like, 100%. So it'll, like, start shutting down at, like, 85%. So it's weird: when you go look at the disk, you're like, why
Starting point is 01:27:58 are things shutting down, we've got 15% free? But it's because it knows that if it gets to 100%, it could be a real problem, because it's not able to clean up after itself effectively. And this was an issue in... you know, just to refresh my own mind with the way the LSM trees worked: you might write a value and say, Michael has a value of 123, and then, oh, Michael has a value of 125. Oh, Michael has a value of 250. So you just keep appending the latest value for Michael in that scenario, but you've got it in triplicate in that example data. And it's not until the compaction operation runs that it would be like, oh no, we only need the latest value for Michael, which is, I think I said, 250.
Starting point is 01:28:51 And that would all be in, like, one file for the LSM tree, right? So that's part of it. But the other part is, you could just be having so many writes that the background compaction process just can't keep up. Yeah. No, no, no, but I was trying to explain why this is a problem. Because you would have that duplication in the LSM tree, whereas in the B-tree, you're only going to have the one key, period. So in the case of Michael, you want Michael 123. You're like, oh, well, let me
Starting point is 01:29:16 find the right node. You're going to do a binary search as you make your way through. Oh, it doesn't exist yet. Here's the page where it would belong. I'm going to write Michael one, two, three. Then you want to change it to Michael one, two, five. You do that binary search again. Oh, I've already got a value for Michael, but I'm just going to update it to one, two, five. And then you'd repeat that same operation for two, five, zero. So you'd only ever have the one instance of Michael, you know, because like Highlander, there can only be one. Yeah. You're only ever rewriting a block at a time, not an entire file. Well, a page file. A page file. Right.
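Michael's one-two-three example, in code: a toy append-only segment plus the compaction pass that keeps only the newest value per key. The key and values here are just the ones from the example; this is our own sketch, not Elasticsearch's or any real engine's format:

```python
# Toy LSM segment: we only ever append, so updates pile up...
segment = []
for value in (123, 125, 250):
    segment.append(("michael", value))  # three records for one key

def compact(segment):
    """Keep just the latest value per key: the garbage-collection pass."""
    latest = {}
    for key, value in segment:          # later records overwrite earlier
        latest[key] = value
    return list(latest.items())

print(len(segment))      # 3 records on disk before compaction
print(compact(segment))  # [('michael', 250)] afterward
```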
Starting point is 01:29:55 Yeah. I mean, and that was kind of the point that we were getting at before, or at least that I was questioning, with, like, hey, does the page file necessarily have to represent a single file, or could it be part of a larger block of a file, maybe? I mean, part of the reason why I asked that, too, is that if you ever had to go digging into SQL Server and things like that, you know, you don't often see, like, a bunch of small files that are split apart, right? You know, it's one file. In Postgres it is. You ever looked at their file structure?
Starting point is 01:30:32 It's like a directory with tons of files in it. So it's definitely treated differently. I've never dug into the specifics of it. But like in SQL Server, you have a log file and you have a data file, right? In Postgres, you have a thousand files. I've never actually looked at it to see what it was. You would think with my new love affair of Postgres I would have done that yet, but embarrassingly, I haven't. Yeah, well, you're not truly in love with it. I guess not.
Starting point is 01:31:00 I guess not. All right, so, Joe, you want to give us the rest of the negative Nancys on LSM downsides here? Can we call him Downer? Yep. Donnie Downer. So the problem with the write amplification here, because the key can exist multiple times, and just to kind of reiterate the point we've already made:
Starting point is 01:31:23 if I go on to Facebook and change my name 10 times in a relational database, maybe it gets a little kind of backed up because it's trying to write to that same record and so it's going to get kind of queued up writing to make those changes 10 times
Starting point is 01:31:34 back to back or whatever. But in an LSM tree, I'm going to duplicate that whole document 10 times. And so that's where that space issue comes from. And yeah, it basically just ties into kind of making those updates easier. And because of that, it's easier to kind of isolate
Starting point is 01:31:50 that data and keep that transaction nice and ACID-y. And I want to mention, too, that there are a bunch of other indexing strategies that we aren't getting into, like, these aren't the only ones, and I think we might talk about those next episode. Yeah, like, as it relates to, I mean, so far we're only really talking about single-column-type indexes. But if you needed to get into multi-column indexes, or multi-dimensional indexes, you know, that's where there are other strategies for that. But I did want to point out, though, that when you made that comment where you referred to the B-tree as being ACID-y: that's where that write-ahead log is important, is that that write-ahead log is helping the B-tree pattern maintain its ACID compliance, because you can still guarantee that you could recreate it if you had to.
Starting point is 01:32:57 You can recreate that structure if you had to. Because without that write-ahead log... because the B-tree would require you to shuffle these nodes around when you have to split them, if you didn't have the write-ahead log, then how would you guarantee ACID compliance? Well, the other way that they said, where you write the 4K blocks
Starting point is 01:33:17 and just repoint, right? But I'd assume that at some point you could lose something there, too, honestly, depending on how far up the chain you went. But even in that rewriting and changing of the pointers, though: if you're rewriting, that means when you do the split, not only do you have the operation where you write the new nodes for the split,
Starting point is 01:33:40 but you also have to do an update to the parent node. And if that operation failed, that's where you've lost the two below. But if it doesn't fail, and that operation then requires you to split it, then, you know, you have to write that split and now go up to the parent of it and write an update to it, which, you know, it's turtles all the way up. I would imagine that's what they do, right? Just what you were saying: if it had to go up to the top one and write another block for that, it's going to write a 4K block for all the ones that it needed to, and at the end of it all, it's going to repoint. What happens if it doesn't repoint in the middle of all that? I mean,
Starting point is 01:34:22 that's why the write-ahead log is so important. Because that write-ahead log is what guarantees that it can adhere to the ACID compliance, by saying, hey, even if I failed to update the parent for some reason, I can get back into a consistent state. Yeah, like, we lose power, and now you've successfully split and wrote the child nodes, but before you could update the parent, you lost power, right? That write-ahead log gets written to before you start doing these split operations or these update operations on the nodes in the tree. Yeah. So.
Starting point is 01:35:01 All right. Well, yeah. And you mentioned those other indexes. I don't know if we'll get into it. Maybe we will. The two-dimensional ones are kind of interesting because then you could see how if you wanted to do a search for... They gave the example of longitude-, latitude, right? And if you only had the single column indexes that we're looking at, then you have to get everything from the first longitude value first and then match on the latitude or vice versa, depending on how you're doing your search. But if you had a multidimensional index, then it could just be like, boom, there it is. Right. You could use both values right there. So, all right. Um, so we'll have lots of resources, uh, in the resources we like section as it relates to this episode. Um, of course we're going to have a link to the book. Uh, we have, we referenced episode 97 a lot. Uh, so that'll definitely be in there as well as some other random links. And with that,
Starting point is 01:36:10 we will head into Alan's favorite portion of the show. It's the tip of the week. Okay. And it looks like I'm up first. Have you ever gone and added, like, a path variable or environment variable or something? And then, when you're in Windows, you don't, or I don't, know how to refresh things, so basically I'm closing down terminals that I might have open and reopening them, and that's annoying. Sometimes I use, like, terminals in IntelliJ or Visual Studio or Code or these other places, and just because I added it via the menu or in some other window, I don't want to have to, like, restart all my stuff.
Starting point is 01:36:45 Right. So I was doing a live stream, and I mentioned that I hate that I have to do this all the time. And Greg mentioned that if I had ever installed Chocolatey, Chocolatey installs a command in my path called refreshenv.
Starting point is 01:37:04 So if you have ever installed Chocolatey, and I have a feeling that you might have, you can just type refreshenv in that environment, and it will go and refresh all those variables. That's really cool. Yeah, that's like in Bash, it's basically the equivalent of, like, source ~/.bashrc or something like that.
Starting point is 01:37:24 I think I've, I like that. I think I've actually used this and I'm trying to, I'm trying to find the repository where I would have used it. And now I can't find it. I can't find the repo. Yeah. Cause I'm always doing that. I always end up closing my terminal and reopening it.
Starting point is 01:37:39 And I'm always annoyed every single time. I always have, like, five terminals open. Yeah. Same. You know? Yeah. I like it. And it ain't coming to me now. Yeah. I like it. That's good stuff.
Starting point is 01:37:49 Yep. So thank you, Greg. Now it's killing me. I've got to know, like, how did I do it then, if I didn't do it that way? But I can't find it, so I will be forced to look for it later. All right. So I've got a couple that I didn't think we had mentioned before. One is a real simple one:
Starting point is 01:38:14 you know, our love for Docker has only grown over the last few years. And if you are doing things like a docker-compose up, or maybe you're just using a bunch of Docker containers, if you're like me and you just refuse to install things if you don't have to, and you find a way to use Docker for it, you can do docker stats, and it's kind of like doing a top, but for just the Docker processes. So you can see how much network I/O each container is using, how much processor, memory, et cetera.
Starting point is 01:38:50 It's just a really cool little thing where you can see which of your containers are getting out of control or having a problem or whatever. And then also, I think, Joe, you'd shared that you got your GCP certification. So obviously we've been doing a lot more GCP here lately, or, say, Google Cloud Platform, than we used to. And there was this nifty little shortcut that I shared with Alan, where, when you are doing a command inside of the Google Cloud Console, look near the bottom of that UI, and you should find a link that'll say equivalent REST or command line, and click that thing. And, like, for example, if you click the command line one, it'll give you the equivalent gcloud command that you can use, which is super awesome. If you find
Starting point is 01:39:52 yourself in a situation where you need to iterate on doing something and you just want to be able to tweak it here and there, but you know, mostly you want to keep things the same. And if you want to guarantee that you're keeping those things the same, then the easiest way to do that is to script it. Right. And, and this can give you the shortcut to where you could be like, well, you know what, maybe you didn't know the command off the top of your head, or you just didn't feel like writing it all out. And so you just use that UI to get you started. And then that link can, can get you the rest of the way. If I may, I would like to piggyback on this because this was so important to me
Starting point is 01:40:28 because Outlaw had this, like, 20-line, you know, gcloud command. And I'm like, dude, did you create that from scratch? Because I can't find where to do this. So the reason I want to piggyback on this is, like, we were using Dataproc, and this was only available for the command line while you were on the create stage.
Starting point is 01:40:48 So when you're going through setting up what you wanted to do, that link was down at the bottom for the command line or the rest there. But after you created it, if you went back into your cluster that you set up, there was no longer the command line gcloud statement to do it.
Starting point is 01:41:04 So the key is, do it while you're creating the thing the first time, and you get that script. Because after the fact, the only thing I had access to was the REST command. So, you know, and it is absolutely amazing. So, yes. Yep, yep. And some of those things are, like, you know, specific to Dataproc; you know, you can't go and modify that cluster. And that's why, you know, once you've created it, there are certain things that you can't modify, and that's why it's nice to be able to just script that thing out. And, you know, as you need to, like, iterate on changes to it, you can go off and redo it. And then, in my effort to never install anything ever that I don't need to have, you know, my love for Python grows with GCP. And so we found ourselves, Alan and I, working on a thing where we needed to be able to, you know, just explore some data and see what was what with it. And, you know, Python specifically had a lot of great features, you know, just real quick, that you can, like, explore the data, chart it, describe it, you know, create some nifty little graphs with it or whatever. But, you know, part of the beauty of the Python ecosystem is
Starting point is 01:42:27 the Jupyter notebooks that you can use with it. And, you know, the notebooks aren't specific to Python, although that's where they got their start, and it's certainly the most popular kernel, but there are kernels for other languages that you could use. But the point is that, you know, you could, yeah, install something like Anaconda on your environment to get Python and Jupyter Notebook capability, or you could just run it through Docker. And I'll share a link, but there's a whole official set of Jupyter notebook Docker images specific to what you might want. So if you want to do a SciPy notebook, then there's a specific Docker container for that. If you just want something a little bit more generic, they have that.
Starting point is 01:43:20 If you just want a data science one, they have that, and they tell you all the additional Python packages that come with it. So, for example, they start with the Jupyter base-notebook Docker image, and then, you know, each one of these things is going to build more onto it, right? And so, like, the data science one, I think, included PyTorch, if I remember right. So, you know, there are additional libraries that each one is going to include, and depending on what you need, you can find that.
Starting point is 01:43:54 So I'll include that link there. But it's just one more thing, one less thing that you need to install locally by way of Docker. That's wonderful. I too try to Docker all the things. I've actually spent days of my life trying not to install software and fighting to try and make it work in Docker. Yeah. It's really great too. Like I read this one article where they were talking about how,
Starting point is 01:44:19 um, you know, Docker is the new Git, and, you know, I'm kind of coming around. Yeah. Kind of called it. Nailed it. Excellent.
Starting point is 01:44:30 All right. So my tip of the week is Apache Drill. So as Outlaw mentioned, we've been working on a project together. And, man, if you've lived in a database world, or you've lived in a world with NoSQL document databases and that kind of stuff, you become familiar with the tools around those things. If you ever decide to look at quote-unquote big data tools and what's available out there, it's just mind-boggling the kind of stuff that you can do with massive sets of data. And there's a term that's used in Apache Beam called, it's like, silly parallel? It's not silly parallelized,
Starting point is 01:45:16 but it's something ridiculous. Embarrassingly parallel, embarrassingly parallel. But what's amazing is you can get into some of these toolsets, and Apache Drill is one of them that is just, like, man, this is unreal. So I have a couple links here. One is for the main website, drill.apache.org. But the things that you can query with this technology: Hadoop, MongoDB, HBase, RDBMSes, Amazon S3, Azure. Like, there are so many things that you can query.
Starting point is 01:45:49 And by query, I mean, you can basically point this tool at those things and query it like it's a SQL database. So you could say, you know, select... if you're querying your file system, you could say select file name, file size, and it will give you a list of the files in a directory and their file sizes, right? Like, just really super powerful stuff. Now, all of this is cool-sounding until you realize what you could do with that. So the second link I have is directly to getting it running with Docker.
Starting point is 01:46:25 So you can run Apache drill inside Docker, but here's the cool thing. So what if you have something like some files in S3 that you have data in? Maybe they're CSV files. Maybe they're JSON files. Maybe they're Parquet files. Who cares? But you want to query data out of those things. Like it's a database.
Starting point is 01:46:46 You can basically point it at your S3 bucket and say, you know, hey, select column name from these files and it'll happen. But what's even cooler is you can then join that against data that you have somewhere else. So if you have a database sitting out there, a SQL server, a Postgres, a MySQL, whatever, you could select data out of those files and then do a relational join to the data that you have sitting in a database. And Apache Drill will go out, do that work, put it all back together, and bring it back to you. It is unbelievable. And I mean, it's one of those things that like when you start using it, you're kind of blown away by how well it works. I know Outlaw, you worked with it a little bit, and we both walked away with positive experiences. It's got UIs. It's actually pretty easy to get up and running.
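If you want to poke at Drill without even opening the web UI, it also exposes a REST endpoint you can hit from anything. Here's a hedged little Python sketch: the file path and the column names are made up, and it assumes a Drill container is already listening on its default port, 8047:

```python
import requests  # pip install requests

# Drill's REST API accepts plain SQL; here we treat JSON files sitting
# in a (hypothetical) dfs workspace path as if they were a table.
query = """
    SELECT t.name, t.size
    FROM dfs.`/data/example.json` AS t
    LIMIT 10
"""

resp = requests.post(
    "http://localhost:8047/query.json",  # Drill's default web port
    json={"queryType": "SQL", "query": query},
)
resp.raise_for_status()
for row in resp.json()["rows"]:
    print(row)
```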
Starting point is 01:47:37 It's so awesome. I mean, he's not really giving it all the love that it feels like it should get. It really is amazing. The fact that, one, I think you're underselling just how easy it was, because they give you the instructions for running it: you just spin up a Docker container and, boom, here you have Drill, and you can go experiment and play with it and whatnot. It is relatively easy, like, super easy, to get it up and running where you can explore
Starting point is 01:48:09 data in a variety of formats, join things that shouldn't be joined together like they were a normal database table. And yet somehow this thing works. And fast. And yeah, and relatively fast too, like really good performance considering, you know, even if you're running it on a, like a single node instance on, you know, Docker container on your machine, it's still relatively quick when you consider what it's doing, joining things that shouldn't be joined together and then returning back a result. And,
Starting point is 01:48:41 you know, there, I mean, we would be doing a disservice if we didn't mention Presto in the same conversation. Because Drill is an Apache technology and Presto is a Facebook technology. But the difference here, though, is that there's two main differences, and you tell me if you disagree with this, Alan, but Presto versus Drill: Presto would require you to define a schema on all of the things that you want to query. So, you have a CSV? Great, you need to tell me what you think each of those columns should be, what their types are and things like that, right, and how you plan to use each column. And you don't have to do that with Drill. So depending on your need, maybe you call that a pro, maybe you call it a con. But the fact that
Starting point is 01:49:33 Drill has a UI on it, right? You can go in really easy, click on a query that you ran and see a visualization of the execution plan that it did. And then as if that wasn't enough, you're like, well, let me see the actual details of that execution plan. You can just go see the raw execution plan period and see like, here's where it did this selection. Here's where it did this cast and conversions or whatever. Like you can see all the gory details of it and you can see how it split all that up across all the worker nodes. If you,
Starting point is 01:50:07 once you get it into a cluster, you can see how it split it up into the worker nodes, how long each worker node ended up spending on it all from the UI. And it's just amazing that this thing even works. Yeah. And it's, it's killer. And the thing with PrestoDB,
Starting point is 01:50:23 there might be a UI. I haven't come across it. Like it was never called out. Like, hey, go to this URL and see it. So it maybe exists, but there's no question the ease at which you can just start exploring data in various different formats and joining it if you want and doing aggregations and all kinds of crazy stuff is absolutely off the charts amazing. So I would definitely recommend checking it out. Even if you have no interest in doing anything in the big data world, it's cool to know that these tools exist. And I actually have one of the...
Starting point is 01:50:56 Say what? It looks like Airbnb has on their GitHub Airpal, which is a web UI for PrestoDB. Oh, nice. Yeah. Now, that's also part of the confusion with the Presto world, too.
Starting point is 01:51:13 Oh my God. Where there's, like... how did you explain it again, Alan? There's, like, okay, so PrestoDB,
Starting point is 01:51:19 but then they, they forked off something else recently. So a company forked Presto DB. So Presto DB is by Facebook, like Outlaw said, right? And it's a parallel querying engine that will allow you to do all kinds of remote joining, just like Apache Drill. So what was irritating is when I was looking at how to do things,
Starting point is 01:51:39 I would do searches, right? Like, how do I do this? And sometimes I would land on a page that was prestosql.io, and sometimes I would land on prestodb.io. And I'd go and do something that PrestoSQL said to do, and it'd be like, this command doesn't exist. And I thought I was losing my mind. Like, I spent hours trying to figure out what was going on. And sure enough, basically, a company stepped in and said, hey, we're going to fork PrestoDB and make PrestoSQL, so that they can iterate on it faster, or do the stuff that they wanted to do that I guess Facebook wasn't allowing into their code base, right? So, at any rate, there's two different products that are essentially the same product that started diverging just not terribly long ago.
Starting point is 01:52:24 And so some things work over here, and some things work over there. It was highly frustrating. But yeah, so, at any rate, really cool stuff with the Apache Drill thing. I would highly recommend checking that out. Yeah. And look for a fork of Drill pretty soon. They'll come out with, like, Hammer Drill.
Starting point is 01:52:52 That's awesome. So how do I get you guys on my stream to show some Drill? I mean, I don't know if I'm streaming material. Like, I joined one of yours one night, and you said something really funny, and I was hooked for a few minutes. And I was like, yeah, okay. Oh, you said, I'm thorny about it. I don't remember what it was, but you said, yeah, I'm feeling a little thorny about this one. Sounds like something I would say. Yeah. Are you saying you have a voice for radio, or is that what you're trying to say, Alan? Like, wow, that's the most self-deprecating thing I think I've ever heard you say. No, no, I definitely haven't. I don't mind talking, right? Like, we do the podcast. My thing is, you know how, like, when you're working through a problem, if you vocalized everything
Starting point is 01:53:33 inside your head, you'd probably be in the ninth layer of hell by the end of it? Like, that's what I'm worried about, if I have to force myself to vocalize the things that I'm thinking while I'm going through it. Because it wouldn't be that fun to watch somebody just sitting there typing on a keyboard, like, you know, that worked. You know, it's got to be the commentary that goes with it. I have a feeling that I'll be ostracized from everybody I know by the time I'm done streaming one hour of this, right, at the end of my frustration level. So everybody will realize what a madman I am. And yeah, exactly. Like, oh. All right, well, hey, Joe here. If you want to see this, go to the comments and encourage it, because, you know, they didn't flat out say no, never, no way. So, I don't know, man, maybe we can make this happen. Yeah, Drill is pretty awesome. I think I might be able to give a talk on it.
Starting point is 01:54:28 Drill was life-changing stuff. Now's the time for it. Everybody's stuck at home, but does everybody want to watch more code? I don't know. I do. Somebody does.
Starting point is 01:54:42 Alright, well, we hope you enjoyed this conversation of To B-Tree or Not to B-Tree. That, that is the data structure. So subscribe to us on iTunes, Spotify, Stitcher, and more using your favorite podcast app. And as I heard, I think I heard somebody whispering earlier, if you haven't already, we would greatly appreciate it if you left us a review. You can find some helpful links at www.codingblocks.net slash review.
Starting point is 01:55:16 Hey, and while you're up there, make sure you check out our amazing show notes, all our examples, discussions, and more. To Slack, your rants, questions, and feedback, send. And make sure to follow us on Twitter at CodingBlocks, or head over to CodingBlocks.net where you can find all our social links at the top of the page. Alright, so...
Starting point is 01:55:36 I didn't read this section. Alright, so let's... One, two, three. Just letting you know. I don't know. Everybody, Alan here. Sorry. Sorry. Joe, would you like to bring us back in
Starting point is 01:56:26 Yep, yep, yep. Okay. Alright. You want me to do it? You okay?
