Coding Blocks - Designing Data-Intensive Applications – SSTables and LSM-Trees

Episode Date: March 16, 2020

It's time to learn about SSTables and LSM-Trees as Joe feels pretty zacked, Michael clarifies what he was looking forward to, and Allen has opinions about Dr Who....

Transcript
Starting point is 00:00:00 You're listening to Coding Blocks, episode 128. Subscribe to us and leave us a review on iTunes, Spotify, Stitcher, and more using your favorite podcast app. And check us out at CodingBlocks.net where you can find show notes, examples, discussion, and a whole lot more. Send your feedback, questions, and rants to comments at CodingBlocks.net. Follow us on Twitter at CodingBlocks or head to www.CodingBlocks.net and find all our social links there at the top of the page. And with that, I am Allen Underwood.
Starting point is 00:00:29 I'm Joe Zack. And I'm Michael Outlaw. You're Zacked. I feel pretty Zacked right now, honestly. Datadog, a cloud-scale monitoring and analytics platform that unifies metrics, traces, and logs so you can identify and resolve performance issues quickly. And About You, one of the fastest-growing e-commerce companies headquartered in Hamburg, Germany, that is growing fast and looking for motivated team members like you. And the University of California, Irvine Division of Continuing Education, one of the top 50 nationally ranked universities, UCI offers over 80 certificates and specialized programs designed for working professionals. All right, and today we're going to be talking about SS tables, LSM trees,
Starting point is 00:01:22 and basically just continuing on wherever we stopped in chapter three. And these are some data structures and some methodologies for basically storing data that are really common in certain databases that we're going to be talking about in a minute here. It's basically the greatest chapter of any book ever written. I think that's what we have surmised so far. Yeah. There is a bunch of meat and potatoes in this or a bunch of meat in this meat and potatoes book here.
Starting point is 00:01:49 So as we like to do, we want to give thanks to those who have taken the time to go up to either iTunes or Stitcher or any one of your other places where you could do so and leave us a review. So I've got iTunes today and so thank you to Dev Extremis, Caffeinated Gamer, Matt Hussey, and Index Out of Range.
Starting point is 00:02:10 And from Stitcher, we have Marcos Sagrado, More Like Coating Rocks, Am I Right? And Aspergis69. Asparagus 69. That was well done, Al. You're welcome. I'm not saying I'm out of practice for like a couple nights. All right. And don't forget we're doing a book giveaway.
Starting point is 00:02:38 So go ahead and drop that comment on this episode. You can find the link in your show notes or just go to www.codingbox.net slash episode 128. Yeah. You'll get a letter from my attorney. CodingBlocks.net slash episode 128. That's trademarked. Oh, man. You don't need to cease and desist, sir. That's right. Now, we do have a little bit of sad news, though.
Starting point is 00:03:02 Orlando Code Camp has been canceled, so my shins will be fine, I guess. But it's still a big bummer. It is a bummer. I think we all get psyched for this thing. I mean, apparently this coronavirus thing has taken hold of everything at this point. NBA, NFL, NHL, MLB. Disneyland.
Starting point is 00:03:26 Yeah. Yeah. Yeah. I was so looking forward to it. Not the coronavirus, but I was so looking forward to Orlando Code. That came out wrong. In hindsight, I've realized that now. I probably should have clarified that before. I was looking forward to Orlando Code Camp.
Starting point is 00:03:44 Well, we hope you're all healthy and doing good and still finding time to listen to podcasts without that commute. So shout out to all the new remote workers. Just wanted to get that out there. So if you're just now giving it a shot, I hope you enjoy it as much as we do. Yeah. Hey, so check this out. This is sort of like a pre-tip of the week type thing. So I just got an email about free copies of Snagit and free copies of a few other pieces of their software suite through like June 22nd or something, maybe June 30th. I don't remember
Starting point is 00:04:32 the exact dates, but I'm sure that maybe if I find a link, I'll put it in here, or if it's in the email, we'll get it in the show notes here. But if that's something that might help you out while you're going through this sort of difficult time with this whole coronavirus thing, um, check that out, because they do offer great tools and it might be something that helps ease your time doing this thing. Wait, you can get Snagit for free now? For a few months? Yeah. Um, oh, oh, you can get it. Okay. Okay. From now until June 30th. Correct. Right.
Starting point is 00:05:06 Okay. Correct. I mean, that's still pretty awesome. Yeah. It's, I mean, it's really cool. Like, don't get me wrong. I'm sure that they may end up getting some subscribers and purchases out of it afterwards. Cause after you use the tools, it's kind of hard not to go back, but it's really cool
Starting point is 00:05:22 that even if you only get use out of it for two or three months, that they're even offering that. So yeah, definitely, definitely go check that out. And I'm not finding anything when I Google it. So I'll go back to my email and grab those links out of there and put it in the show notes. Let me, let me share this link with you and you tell me if this is the one you're thinking of. So I just threw that out there in our little show notes. Whoops. Yes, yes, that's it. Totally. Ha. Yep, my Google-fu is strong. Yep, beautiful. So yeah, man, like seriously, I use Snagit daily and absolutely love it because you can take screenies, which, there's tons of tools for that, but they're, they're tools. Is that what we're calling it now, screenies? What do we call them? Screenshots? I mean, grabs? You're making me sound like the old man for saying so, but yeah, that's what I would have said. I've never heard anyone seriously say screenies. Well, we can't all be cool like me.
Starting point is 00:06:15 i would i've never heard anyone seriously say screenies well we can't all be cool like me. Well, yeah. I mean, there's that problem too. But no, so you could do it to take screenshots and mark them up and all that, which is cool. Like there's lots of things, but you can also do videos and do the same, like clip them out, crop them, do that kind of stuff. So seriously, this might help you and any kind of free resource might be good for this time. Alright, well, let's get into chapter three. Let's talk about string
Starting point is 00:06:52 sorted tables. Or no, sorted string tables? Ah, that says tables. Sorted string table. Sorted string table. Sorted string table, yeah. It's right there. First line on the show notes. No biggies though. Yeah, I'm not saying that I was reading ahead.
Starting point is 00:07:08 Why you got to be like that? Why you got to call me out? I called him out. Yeah. I'm sorry. I mean, it's not like it's public. Hey, so in fairness, raise your hand if you knew what SSTable stood for before that reading of the first line. Raise your hand if you'd ever heard of SSTable before reading this chapter.
Starting point is 00:07:31 Had you? No. So I've seen – when I saw this book, I recognized the name and I went and looked, and I'd seen some files and folders named after it – I forget if it was files or folders – in a Kafka Streams app, literally named SSTable. Hmm. Interesting. Yeah. We'll get into it a little bit later, but the originator of that is sort of somebody that we've probably heard about, but I don't want to do it now.
Starting point is 00:07:56 No. Hey, suspense. Okay. So, yeah. Let's go back to the previous episode, though, because we're kind of picking up where that left off. Right. So so in the last episode, we were talking about how a database sort of works. Right. Like with this append only log and all this kind of stuff. And it's sort of the backbone of how the whole thing works. Right.
Starting point is 00:08:18 So if you have not listened to the previous episode or you have not read this chapter, you will probably be lost with some of the concepts that we're about to go over. So just for, you know, fair warning here. Yeah. We specifically left off with hash indexes last time. So the SS table or the string sorted table or the sorted string table, And now I jacked it up and I've even read the thing. You're welcome. My job here is done. Yes. So it's basically the same notion that we were talking about with the hash table, except now when you insert these things into the log, or actually it's not the same thing as the hash table. When you insert into the log, you are trying to do this by sorting the key as you're putting it in.
Starting point is 00:09:08 Right. And so here's the thing about that: if you're doing the append only mode like the last one was, as these records come in, let's just say records are flying in at you. Right. And we got Joe, Allen, and Michael here, at least in the video order that I'm looking at. If Joe comes in first, you can't write it to the log first because then it'll be out of order, right? And so if Allen comes in right after that, well then, okay, well then that should probably go before that. And if Michael comes in next, well then he should probably be at the end because it should be AJM, right? Well, the way that you do that is you sort of have to keep these things staged somewhere first before you write it to the log. Otherwise, you're going to write it in the wrong order. And remember, the purpose of this thing is to write them into that append only log in the correct order. And I should say too that everything we talked about last episode still applies. So things like having segments that roll over so we don't have these big, you know, gigantic files, all that stuff still applies
Starting point is 00:10:15 and it complicates things a little bit in ways that we're going to talk about here in a minute. But the point is, just in the segment that you're writing to, you need to make sure that you're writing in order. And you would think that this would be my favorite of all because it's alphabetized. Right. We're going back to the whole newspaper. We'll cover the newspaper table later. Right now it's the sorted string table. Awesome.
Starting point is 00:10:43 So back a couple of years ago. Oh, man. Hey, man, if you can't laugh at yourself, what are you going to do? I can't believe I got the reference, I'll put it that way. Yeah, well, you know, you were there. So I was there. So we mentioned the downsides: basically, we've got to do more work on inserting. And remember, we said when we inserted to the log append only, we were always writing to the end of the file. And that's like the most efficient operation you could do for a bunch of different reasons and a bunch of different scenarios. And so we're saying, no, let's back away from that. Let's take a step back and let's insert this thing in order, which means that
Starting point is 00:11:18 potentially if we insert it, like at the top of that list, we've got to bump all that data down. And that was something we were trying to avoid previously. But if we do this, we make that trade-off on ingestion time there. And we get a couple things out of it, three things out of it specifically. Yeah. So what do we got for number one here? Merging segments is much faster. Sorry. I just got so excited. And it's much simpler. So remember before when we would merge two segments, we would have to go through and we'd have to look for duplicates basically because sometimes we can get logs that come in and we'd have to compact those. That was the word we used for it in order to basically get rid of redundancies and keep the latest copy of a record.
Starting point is 00:12:01 But now in this case, what we can do is just go through each list. And it's similar to how you do it in merge sort. You basically go through the lists, you keep a pointer to the top of both files, and you just throw whichever one comes next into the new segment, which makes it much faster and it keeps things sorted, which is really nice. Yep. And one of the things that they point out here too is, if you have the same key in multiple segments, basically the newer segment file, the value of that thing wins, right? Because remember, the whole point of this is you're not keeping a transaction log. You're trying to keep the state, at least in this one, right? So if the same key, Joe, comes in twice, whatever the newest one is wins.
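To make that merge-and-compact step concrete, here's a rough sketch in Java (not from the episode or the book; the segment format and names are invented for illustration). It walks two already-sorted segments with one pointer each, copies whichever key sorts first into the new segment, and when the same key appears in both, keeps the value from the newer segment and drops the stale one:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.AbstractMap.SimpleEntry;

// Hypothetical sketch of merging/compacting two sorted segments.
// Each segment is a list of (key, value) entries already sorted by key.
public class SegmentMerge {

    // Newer wins: when the same key exists in both segments,
    // the entry from `newer` is kept and the one from `older` is dropped.
    static List<Map.Entry<String, String>> merge(List<Map.Entry<String, String>> older,
                                                 List<Map.Entry<String, String>> newer) {
        List<Map.Entry<String, String>> merged = new ArrayList<>();
        int i = 0, j = 0; // one pointer per segment, just like the merge step in merge sort
        while (i < older.size() && j < newer.size()) {
            int cmp = older.get(i).getKey().compareTo(newer.get(j).getKey());
            if (cmp < 0) {
                merged.add(older.get(i++));   // older key sorts first
            } else if (cmp > 0) {
                merged.add(newer.get(j++));   // newer key sorts first
            } else {
                merged.add(newer.get(j++));   // same key: newer segment wins
                i++;                          // drop the stale value from the older segment
            }
        }
        while (i < older.size()) merged.add(older.get(i++)); // drain whatever is left
        while (j < newer.size()) merged.add(newer.get(j++));
        return merged;
    }

    public static void main(String[] args) {
        List<Map.Entry<String, String>> older = List.of(
                new SimpleEntry<>("alan", "v1"), new SimpleEntry<>("joe", "v1"));
        List<Map.Entry<String, String>> newer = List.of(
                new SimpleEntry<>("joe", "v2"), new SimpleEntry<>("michael", "v1"));
        // prints [alan=v1, joe=v2, michael=v1] -- still sorted, duplicate key compacted
        System.out.println(merge(older, newer));
    }
}
```

Because both inputs are already sorted, the output comes out sorted for free, which is exactly why this merge is so much cheaper than the compaction described in the previous episode.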
Starting point is 00:12:48 Yep, so slower on initial write, but faster to merge up those segments. And then to find the keys, you no longer have to keep these separate indexes in memory. Basically, we can do something kind of smarter where we basically can look at – am I jumping ahead here? No. So, uh, if we want to look at the current segment, it's really easy. It's basically a, uh, sparse index is what they called it. It's like, it's like the start of each segment, right? They keep a reference to the start of each segment, right? Yeah, so you can know if your item is involved in the segment or not. Basically, it's like, is this key, is this key potentially in this file?
Starting point is 00:13:26 Yes, no, and that's really fast because you only have to kind of check the ranges there, which is really nice. It saves on memory, too. Yeah, so if we back up to the previous episode, we were talking about basically all the keys would be stored in a memory hash, right? So Allen, Michael, Joe Zack, all of us would be in that one memory hash. Well, the problem with that is if that data set grows massive, think like, you know, back in the day when you had the yellow pages or something. Right. Like that could get massive. Does that even exist anymore? The yellow pages? There's like 20 of them left now.
Starting point is 00:14:00 Right. Yeah, exactly. So so the problem with that is it just gets too big. Right. Yeah, exactly. So, so the problem with that is it just gets too big. Right. And as it grows and grows, if it goes out of a city and gets into multiple cities or whatever, it just keeps getting bigger. Right. And you run out of memory and you have those constraints with this method. You don't have that. You basically, if you think about it, like the pages in an index, I mean, heck, let's, let's compare the segment to a page in an index of a book, right? You would be able to look at the very first entry on that page and know that, hey, this page starts with AL, right? You go to the next page and it starts with JO. So now you know that everything between AL and JO is going to fall on that first page somewhere, right? So it's very similar to how you would have looked through the index of a book. That's exactly what's happening here. And you've reduced the amount of memory that's required because now you're only storing a
Starting point is 00:14:55 reference to that spot in the file where that particular name was. And then everything in between, you're not indexing that. You're only indexing the beginning of the next file. Maybe. And just to sum it up real quick, I wanted to say, so comparing to what we talked about last episode with the append-only logging, we've gone away from writing to the end of the file, so we're a little bit slower.
Starting point is 00:15:22 But what we gained from that was the ability to keep things sorted, which means searching much faster and compacting much faster. But, but here's one key thing though, that I think we might've missed. I don't know if we misled earlier. It's not that we're writing to different spots in the file. We're still only append only to the file. And we'll get into a
Starting point is 00:15:46 minute how that's possible because it's not like if our records came out of order, Joe came in first, I came in second, and Outlaw came in third. It wasn't that it wrote Joe and then it saw me come in and it was like, okay, hold on, let's back up and write Alan before the Joe line. That's not what's happening, right? And we'll get into how this works here in a minute. Did I get that wrong? I'm so sorry. I totally got that wrong.
Starting point is 00:16:12 Well, yeah, because I'm honestly trying to remember it, because I'm trying to figure, I'm trying to remember, like, what was the value of the sorted string table compared to the hash index? Because in both, you had an in-memory, you just had a small index that was the key to a byte offset, where, you know, where that thing was in the file. So yeah, in the hash, the problem is if the hash got, if you had too many keys, then it could eat up all your memory, right? So if you had a million keys sitting there and you didn't have enough memory, that hash index was storing every key and every offset, right? With the sparse index, instead of that, it's saving a key that's at the beginning of a segment and its offset. And then the next segment, which is going to be a group of records, it's storing that one. So instead of having a million keys in that same example, you might only have a thousand.
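One way to picture that sparse index is as a small sorted map from "first key in each block" to a byte offset. This is just an illustrative sketch with made-up names, not how any particular engine lays it out; the point is that a floor lookup picks the single block that could hold the key, and only that block gets scanned:

```java
import java.util.Map;
import java.util.TreeMap;

// Hypothetical sparse index: one entry per block/segment, not one per key.
public class SparseIndex {
    // first key in each block -> byte offset of that block in the segment file
    private final TreeMap<String, Long> blockStarts = new TreeMap<>();

    void noteBlockStart(String firstKeyInBlock, long byteOffset) {
        blockStarts.put(firstKeyInBlock, byteOffset);
    }

    // Returns the offset of the only block that could contain `key`,
    // or -1 if the key sorts before the first block entirely.
    long blockFor(String key) {
        Map.Entry<String, Long> e = blockStarts.floorEntry(key); // greatest block start <= key
        return e == null ? -1 : e.getValue();
    }

    public static void main(String[] args) {
        SparseIndex idx = new SparseIndex();
        idx.noteBlockStart("alan", 0L);       // block 1 covers alan..izzy
        idx.noteBlockStart("joe", 4096L);     // block 2 covers joe..lucy
        idx.noteBlockStart("michael", 8192L); // block 3 covers michael..end
        System.out.println(idx.blockFor("karen"));  // 4096 -> scan only block 2
        System.out.println(idx.blockFor("aaron"));  // -1   -> sorts before the first block
    }
}
```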
Starting point is 00:17:09 Let's put this into more tangible terms then. Let's say, I remember now, if you had a hundred keys total and there were 10 keys per segment, then you only had to have 10 entries in your index, because once you knew one, you could tell, okay, well, alphabetically, if it's in between these two, then all I got to do is go to this one, and I know it's somewhere in that range. And it's a small list of things that I'm going to be going through. Exactly. And that brings us to number three right here, so the, the three parts of this was, the, the important part is this is still really fast because you
Starting point is 00:17:57 can get to the beginning of that segment. And because you're only scanning over a small chunk of data within that segment, your, your scans are still super quick, right? It's not as fast as the O of 1 operation that the in-memory index gives you, but it's not much worse. So that's really the key there. I guess worst case scenario, it would be O of N where N is your segment size. If N was your segment size, yeah, it could be that way, but it's probably closer to like O of log N or something like that because you're getting, like in the example you gave
Starting point is 00:18:32 where you have 100 index or 100 keys in your index, when you do your sparse index, you just reduce the memory footprint by a factor of 10, right? But the speed to be able to go through that index, you were able to go to the beginning of that segment and then scan through it pretty quick. So, so like we're still getting there fast. Like, uh, we, we've quoted, uh, Jeff Atwood saying, uh, I believe it was Jeff Atwood. He said that, uh,
Starting point is 00:18:59 everything is fast for small N. Right. Right. Yep. So to make sure I got it, because, uh, I totally misread this stuff: so basically we're talking about, when we do the merging of those segments, that's when we do the sorting in order to kind of keep those ranges in this SS table? No. When does that happen? That happens, that also happens then, but we're now going to talk about how it gets put in there in order. Okay, let's do that. We still haven't said, we still haven't said who came up with this strategy, though, and I know we're gonna get there. It's, it's later, I believe. Oh, oh, fine. Yeah, fine. I believe it's later. Yes, it's fine. Now, so it's fine, we'll get to it. Yes. All right, so here's the key, right? Like
Starting point is 00:19:42 and so this is why joe was tripping up just a second ago is, all right, well, if we're getting these records out of order, how are we putting them into the file doing this append only thing? Because remember, the key here is trying to create a database. This transaction log has to be fast, right? Like when you write to this thing, it has to be just as fast as possible. So you're still doing append only. So the way that you do this thing is as you get the data, you're going to get them out of order. And you have to know that is you're actually going to write them to disk, not in your transaction log, but you're going to write them to disk in a sorted structure. So sort of on the side, right? Like you're going to have this staging area where you have incoming records and you're going to add these in a way that you can then take them and write them to your append only log. Right, right. Okay. So one option for doing that is a sorted structure,
Starting point is 00:20:44 like a B tree. So like a, that's one option. Another: red-black trees, AVLs. These are basically trees that are meant to stay balanced, so maintaining, uh, basically a minimum number of hops to get to the data that you're looking for. And we keep that in memory as, uh, you know, it's basically a strategy for keeping things fast. Yep. And I said to disk a second ago, and we'll get to that in a minute. But yes, in memory, did either of you guys look up these red-black trees or these AVL trees? No. Okay.
Starting point is 00:21:15 So I did. We talked about it before, though. Okay. So here's the thing that drove me absolutely crazy. Like, as I do, and as we all do, there was a Wikipedia page that came up and I started reading that and it was in some sort of Martian or, or, um, uh, I don't know, Jupiter type speak. I, none of it made any sense to me.
Starting point is 00:21:38 like, I can't describe it well. I'll also, um, because we have talked about trees before, in episode 97, and we talked about B-trees, AVL, red-black trees. There's, we, we said there's 115 different types of trees, and it's insane. During that episode though, maybe you guys remember this, and I'm going to include a link right here for you
Starting point is 00:22:31 so you can see it. And I'll include this in the show notes as well. There was I think a professor from one of the California universities. I forget exactly which one it was. But he had a visualization where you could go and you could see how the tree worked. Do you remember that? Yeah. And so I actually landed on this site too while I was doing it. The problem is, especially with this red-black tree, is it does weird things where it kind of rotates portions of the tree. And I didn't see
Starting point is 00:23:08 that happen on these animations, which was driving me crazy because, because basically with this red black tree, without going into crazy amounts of detail, and I'm not even going to get the detail right that I'm going to tell you, but if you inserted a record, like let's say that you're doing one through 10 and, and five comes in first, then eight, then three, whatever, as data comes in,
Starting point is 00:23:30 it's basically looking at the colors of the tree and the positions of them. And if a particular rule is met, like if two reds are in line or two blacks are in line or something like that, then it's going to say, Oh, okay, well, this isn't correct.
Starting point is 00:23:44 So we're actually going to disconnect these things, rotate some of these nodes over and then connect them over here. It is mind numbing to try and read that and visualize it. Yeah. But this YouTube video that I've got there, the dude actually does it and he shows, okay, this is what happens when you add this value in here. We're going to break these nodes here because this is rule number five, right? And in rule number five, the uncle and this and this all have to connect. So the cool part is this, what they've done with this thing is they've basically made it to where you constantly have a sorted structure by basically going from the bottom left node up to its parent, then going down the bottom right node. And if it has any
Starting point is 00:24:31 children, then you're going to go down those. But essentially you're just crawling the tree from the bottom left until you get all the way back up and around and go down the right side of it and do the same thing. And it's really cool. And it seems magical. And this is why trees are so cool in the first place. But this is how you get fast sorting when you're writing by using trees. Except you're describing it as if you're like starting from one of those end nodes, right? But you don't, right? You start from the tip of the tree and then not when you work your way down, right?
Starting point is 00:25:06 Oh, no, not when you're sorry, when you, when you start, your first value is, I was thinking about like from an insert point. Yes.
Starting point is 00:25:15 Yes. You're always going to start your root node. How that stuff works is, is in the details. But when it's all done, your, your smallest value is on the bottom far left of your tree. Okay.
Starting point is 00:25:27 Sorry, I missed that part. And then you end up crawling over the node structure to get to the rest of it. Yeah. So at any rate, yeah, again, I say all that because, again, the details, you would just get lost in them. I couldn't even read it and understand it. But if you watch this, it'll at least paint in your head, oh, that's really cool how these guys are sorting things on the fly, without having to do what Joe was mentioning earlier, which is, oh, okay, well, I know that the last record I wrote was, was Joe,
Starting point is 00:25:58 and then the one before that was Mike, so I'm gonna have to bump it up two spots. Like, that's highly inefficient. But this other way, it's in memory and you're just basically reconnecting some of the tree node dots. So it's kind of cool, I think. Like, throw this stuff in there as fast as possible and retrieve it in a sorted way, and you can do really fast searches on it too. Yep. And there was a big difference that they mentioned between red-black and AVL. And I can't remember the details now. That's pretty terrible seeing as how we're recording right now. But one of them was way more efficient if you're just doing a bunch of writes. And the other one was more efficient if you had more updates and that kind of thing. So and I can't remember. It seems like the AVL was more efficient if you had updates, but I may be wrong there. I got that red-black are faster inserts and removal than AVL trees because they basically do fewer rotations when things are being balanced.
Starting point is 00:26:57 And so red-black are more common. Okay, cool. So I didn't completely screw that up. And we previously said that red-black trees were optimized for batch inserts. Okay. Cool. So I didn't completely screw that up. And we previously said that red black trees were optimized for batch inserts. Okay. Okay. And AVL trees are typically used in databases where faster retrievals are required. We're willing to do that extra work in order to get a little bit faster retrieval.
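If you want to play with the behavior without hand-rolling the rotations, the JDK's TreeMap happens to be a red-black tree under the hood, so a few lines show the property being described here: insert keys in any order, and walking the structure always comes back sorted. (Using TreeMap is purely an illustration, not a claim about what any of these databases literally use.)

```java
import java.util.TreeMap;

// Minimal illustration: a balanced tree keeps keys sorted no matter the insert order.
public class MemTableSketch {
    public static void main(String[] args) {
        TreeMap<String, String> memTable = new TreeMap<>(); // red-black tree under the hood

        // Records arrive out of order...
        memTable.put("joe", "row-1");
        memTable.put("alan", "row-2");
        memTable.put("michael", "row-3");
        memTable.put("joe", "row-4"); // same key again: the latest value simply replaces the old one

        // ...but iterating is always in key order, which is exactly what we want
        // when it's time to dump this thing out as a sorted segment.
        memTable.forEach((key, value) -> System.out.println(key + " -> " + value));
        // prints: alan -> row-2, joe -> row-4, michael -> row-3
    }
}
```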
Starting point is 00:27:18 Yep. So let's go back to this thing that we were talking about, right? So when Joe was talking about, hey, we're writing to the log, actually we're not doing that. As data comes in, we're writing that data to a memory table. In this case, it's called a mem table in this tree structure. And then that way when we've gotten to a point where we want to dump this tree structure and now append it to the log file, we have that stuff in order. Yeah, and once you've reached some sort of like predefined size threshold, then you can dump that
Starting point is 00:27:50 data from memory to disk into a new SS table. So this is basically a fast ingestion mechanism. And then we dump it out to that SS table as necessary. So it's like a, just a way of really optimizing that fast time in order to keep things sorted without doing all that crazy work. So do you think they're doing anything more than just serializing that tree out to individual files? No, I would imagine that's basically what it is, right? Just what you said. I bet that they're basically just crawling that thing and writing it out to disk as fast as possible because that's the next thing that they bring up is, hey, while this is being written to the new SS table file, that doesn't stop operations, right? You still have more records coming in. Those new records are being written to a new mem table and so while the other mem table is being dumped out to that file
Starting point is 00:28:45 which, you would imagine, to your, to your question, your point, Outlaw, is I think they're probably just trying to get it out there as quick as possible would be my guess. Oh, they write it to the, like, the SS table, uh, format, right? Yeah. So, uh, while that is being written, then we can keep writing to, uh, the new mem table, and then we can kind of keep cycling like that. And then we're serving up the read requests. Then, um, anytime you do a sort of like a search on it, the trick is that you search in the mem table first, and then you go back to the most recent segment, and then moving backwards. And that's of course assuming that your data is more likely to be, the data you're searching for is more likely to be recent, which I think is probably, they probably made
Starting point is 00:29:33 that decision. Like they didn't have to do it that way, but I assume that maybe someone smart decided to do it that way and it worked out. I mean, it's all sorted, right? So I guess the whole idea is if a search comes in, it's going to look through your mem table first because that's the most recent data you have. But then I guess going back to the segments, the whole idea is all those segments should be sorted as well. Yep. So it's just trying to find whatever, you know, basically probably looking at the first entry of each file and saying, Hey, is this the file I need to scan? Yeah.
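Pulling those pieces together, a toy sketch of the flow might look like this: writes go into a sorted memtable, the memtable gets frozen into a new sorted segment once it crosses some size threshold, and reads check the memtable first and then the segments from newest to oldest. Everything here, including the threshold and the in-memory "segments", is invented for illustration; a real engine would serialize each flushed memtable to disk as an SSTable file.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.SortedMap;
import java.util.TreeMap;

// Toy LSM-style store: one memtable plus a stack of sorted segments, newest first.
public class TinyLsmStore {
    private static final int FLUSH_THRESHOLD = 4;  // flush after this many keys (arbitrary)

    private TreeMap<String, String> memTable = new TreeMap<>();
    private final Deque<SortedMap<String, String>> segments = new ArrayDeque<>(); // newest at the head

    public void put(String key, String value) {
        memTable.put(key, value);
        if (memTable.size() >= FLUSH_THRESHOLD) {
            // A real engine would write this sorted snapshot to disk as an SSTable;
            // here we just keep it in memory and start a fresh memtable for new writes.
            segments.addFirst(memTable);
            memTable = new TreeMap<>();
        }
    }

    public String get(String key) {
        String v = memTable.get(key);                        // 1. check the memtable first
        if (v != null) return v;
        for (SortedMap<String, String> segment : segments) { // 2. then newest segment -> oldest
            v = segment.get(key);
            if (v != null) return v;                         // first hit wins: it's the newest value
        }
        return null;                                         // not found anywhere
    }

    public static void main(String[] args) {
        TinyLsmStore store = new TinyLsmStore();
        for (String k : new String[]{"joe", "alan", "michael", "zed", "joe"}) {
            store.put(k, "value-of-" + k);
        }
        System.out.println(store.get("alan"));   // found in a flushed segment
        System.out.println(store.get("nobody")); // null: had to check the memtable and every segment
    }
}
```

The miss case in that last line is exactly the weakness the hosts come back to at the end of the episode, which is where the Bloom filter comes in.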
Starting point is 00:30:08 It's an interesting, uh, runtime pattern there, because, like, you know, your first search is going to be basically log of N, where N is the size of the mem table. And then your next search is going to be log of N, where N is the size of that, that, uh, SS table, and then back, you know, all the way back into the oldest one. So it's, um, yeah, I don't know what the runtime was. Some number of times log of N is what it's going to end up being, which is, uh, which is interesting. So it's not fantastic for searches, but it's really fast at writing. So it's like, uh, this is an interesting compromise, where, uh, you want to write things really quickly, but you also want to be able to
Starting point is 00:30:38 search them in, uh, you know, pretty, pretty decent time, I guess. I don't know, were we going to talk, maybe it wasn't on tonight's episode, because, because now it's kind of, because some of this is coming back to me, where it was like, it would actually write to a thing called a write-ahead log. Like, that was its way of, um... Yeah, that's not this one. I think you're getting into the B-trees. Yeah. Okay. But it does kind of act like that, because, you know, essentially we're basically writing to this kind of in-memory mem table first, this tree, and then it gets up to disk.
Starting point is 00:31:29 So, you know, the problem is if we, you know, run into a crash or something. We got a note up here on what happened, or how we deal with crashes in this case, but it's basically the same kind of situation where we need to keep track of that mem table and, you know, basically flush that to disk and keep that kind of log there. And that was the point of the write-ahead log, right? Like, you know, let's not... don't, don't necessarily, like, write to disk first so that you have it. So if you had to, if you had to recover from a crash, you can. Once it's written to disk, then, or maybe even in parallel to it, you could try to, like, you know, get it into the tree in the correct place. But, you know, priority one is persist it. So we'll get into that here in just a second too, at least on these. Um, so one other thing to keep in mind, right? So we talked about, as data comes in, it's going into the mem table; as the mem table reaches some threshold that we say when we're writing this, that happens.
Starting point is 00:32:08 But also there's probably going to be some background processes that go through your segment files and merge those things as time goes on. Right. Because the whole point is to dedupe, get the latest values and keep compressing these things as time goes on. So that's what happens. So then the downside, and this gets into what Mike was just saying a second ago: if the database crashes in the middle of your memory table being written to, you're going to lose all that data, right? Anything that's in memory just dies. So one of the ways that they talk about avoiding this thing is you can sort of do three things at once, right? So you're writing to this memory thing and this data structure, but you can also at the same time write to an unsorted append only log file, just like we talked about in the previous episodes, as sort of a staging area. So that if, for some reason, the database crashes, you can rebuild that memory table from that temporary append-only log, get that thing sorted, and then get back to a good state.
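That "also write it to a plain unsorted append-only file" idea is essentially a write-ahead log for the memtable. Here's a hedged sketch of what that could look like; the file name, record format, and error handling are all made up for illustration. Each put is appended to the log before it touches the in-memory tree, and on startup the log is replayed to rebuild whatever memtable would otherwise have been lost:

```java
import java.io.BufferedReader;
import java.io.FileReader;
import java.io.FileWriter;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.TreeMap;

// Sketch of crash protection for the memtable via an unsorted append-only log.
public class MemTableWithLog {
    private final Path logFile;
    private final TreeMap<String, String> memTable = new TreeMap<>();

    public MemTableWithLog(Path logFile) throws IOException {
        this.logFile = logFile;
        if (Files.exists(logFile)) {
            replay(); // restart after a crash: rebuild the sorted memtable from the raw log
        }
    }

    public void put(String key, String value) throws IOException {
        // 1. persist first, in arrival order -- just an append, nothing to sort
        //    (a real engine would keep the log file open instead of reopening per write)
        try (FileWriter out = new FileWriter(logFile.toFile(), true)) { // append mode
            out.write(key + "\t" + value + "\n");
        }
        // 2. then update the in-memory sorted structure
        memTable.put(key, value);
    }

    private void replay() throws IOException {
        try (BufferedReader in = new BufferedReader(new FileReader(logFile.toFile()))) {
            String line;
            while ((line = in.readLine()) != null) {
                String[] parts = line.split("\t", 2);
                if (parts.length == 2) {
                    memTable.put(parts[0], parts[1]); // re-inserting re-sorts it for free
                }
            }
        }
    }

    // Once the memtable has been flushed out as a segment, the log can simply be thrown away.
    public void discardLogAfterFlush() throws IOException {
        Files.deleteIfExists(logFile);
    }
}
```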
Starting point is 00:33:15 And that should happen pretty quick because the whole idea is your mem table should never be massive. Yeah, so we're kind of combining the data structures. We're definitely getting more complicated now, so we're bringing more systems in, uh, but we're basically using each kind of data structure for what it's best at, and, uh, we're basically doing this by kind of splitting out our data, you know, into multiple spots and just making them do what they can do, and making the best that we can. It's pretty freaking awesome. I wonder, wonder if, I think you mentioned Kafka earlier with the sorted string tables, where you actually
Starting point is 00:33:53 saw a file or a folder named that. And now it's kind of like things are starting to click, right? As you go through this book and everything, you start thinking about it, and you're like, hey, maybe that's why keys were such a big deal to Kafka. Because even if you didn't have a key, it's like, oh, fine. If you don't define one, we'll figure it out. Because maybe it matters in the way they do the partitioning. If they're keeping this tree, if all of those partitions are like a tree in memory or like that's how it knows which partition to access. And it's like, okay,
Starting point is 00:34:28 now I kind of see why you wanted to know, might want to know what that key was. So I just so happen to have a temp directory from Kafka Streams. I definitely want to emphasize that Kafka Streams is a library that we use with Kafka frequently. I'm looking at specifically inside a state store folder, which is how they persist data that you can then do sort of stuff with in the streaming environment.
Starting point is 00:34:52 It's got SST extension files, which are sorted string tables. It's binary. I can't read it because presumably it's compressed, which we mentioned. It also has logs right in there, which i'm guessing are the files that it writes to just in case it crashes and it needs to resume
Starting point is 00:35:10 it's got some other cool stuff in there like manifest files a couple other things so the thinking might be that kafka is writing to that log file first then dealing with trying to figure out how to get it into the correct string sorted table file. Yep. And then if it crashes and restores, it's got a little bit of extra work to do in order to kind of parse things and get into the appropriate spot. But it's essentially answered both needs.
Starting point is 00:35:36 It's got the fast ingestion time and it's also got a recovery mechanism. Hey, and for anybody listening, when he tried to make the distinction between Kafka and Kafka Streams, the difference is Kafka is your storage technology, your queues and the brokers and all that. Kafka Streams is actually like an application that you can write on top of it that will allow you to stream data and process it in real time. They call it stream processing. So that's what Kafka Streams is. Right. But it's Kafka that does all the partitioning, and it's going to care about those keys. But in the example that you brought up, you were talking about an in-memory – well, you said state store specifically, but let's just say like an in-memory data store, right?
Starting point is 00:36:24 And maybe it's behind the scenes using a sorted string table too yeah and even uh we'll take it one more level deeper i just happen to know that kafka streams and by the way a lot of times i talk about kafka i totally conflate kafka streams because that's like my main mechanism for working with it and it's not the only one and that's a bad habit but i happen to know that kafkas utilizes RocksDB under the cover to keep track of its state stores. And so what we're actually seeing on disk is probably, I don't know, it's 100% for sure, but I'd be surprised if it wasn't.
Starting point is 00:36:55 But I'm pretty sure I'm actually looking at the files for RocksDB, which specifically is a, what do you call it, an LSM tree table, which we're going to talk about. You're jumping ahead here. Yeah. But RocksDB is mentioned frequently in this chapter. So it was kind of cool to kind of tie that all together and say, oh, okay, I can go here. I can look at these files.
Starting point is 00:37:16 I know what that means. I understand a little bit more about Rocks and everything. And so this is all starting to make a little bit of sense. More like Kafka Rocks, am I right? You're welcome. More like coding rocks, am I right? You got a double. Today's episode of Coding Blocks is sponsored by Datadog, the monitoring and analytics platform for cloudscale infrastructure and applications. Datadog's machine learning-based alerts, customizable dashboards, and 400-plus vendor-backed integrations make it easy to unify
Starting point is 00:37:51 disparate data sources and pivot between correlated metrics and events for faster troubleshooting. By combining metrics, traces, and logs in one place, you can easily improve your application performance. Try Datadog free by starting a 14-day trial and receive a free t-shirt once you've installed the agent. Visit datadoghq.com slash codingblocks to see how you can unify your monitoring today. Again, that's datadoghq.com slash codingblocks. All right. So, hey, if you could leave us a review, that'd datadoghq.net slash review. We tried to make it easy for you. We've got some links there. And, yeah, just smash that five star there. That would be great. Am I right?
Starting point is 00:38:50 Am I right? Yeah. All right. Well, with that, how about should we do a joke first or do we want to do the survey first? There's always time for a joke. There's always time for a joke. I like that. Well, how about this one? This one seems very irrelevant given today's circumstances and the sad news that we had to give.
Starting point is 00:39:13 So Arlene shared this tweet with me from Sam Garasi. I'm going to guess that's how I pronounce his name. And he says that the World Health Organization, also, you know, just WHO, you might see sometimes, you know, so you just pronounce it who, announced that dogs cannot get COVID-19. Dogs can be released from quarantine. I guess you could say that who let the dogs out. Terrible. So thank you, Arlene, for sharing that with us. All right. And with that, we head into my favorite portion of the show, Survey Says.
Starting point is 00:39:57 All right. So back in episode 113, no, sorry, 123, we asked, which data model do you prefer? Which really fits so well with our current topic. Your choices were relational model, I love many to many joins, sixth normal form all the things, or document model, i'll worry about the data structure when i read it or graph model it just sounds cool oh you're still using relational data models that's cute or polyglot persistence i'll use what i think makes sense for the use case. All right. Joe, how about you go first? What do you think the choice is? The answer is polyglot persistence because I get them all mixed up anyway.
Starting point is 00:40:55 And I'm going to go with 28%. 28%? I'm good at math. All right. Because this time you studied math. It would have been better if you said something below 25, though, sir. Yeah, I was going to say, there's two options. I still crack up about that.
Starting point is 00:41:19 I could not stop laughing while I was editing that last show. Every time I would get to that part. Oh, man. Okay. So for me, I would hope that everybody said polyglot persistence, but I'm pretty sure that that is not going to be the answer. And I'm going to say that everybody's going to say the relational model
Starting point is 00:41:40 because it's been around forever and that's what everybody knows and loves. And I'm going to go with 33% of the vote. All right. So we have old Joe with polyglot persistence at 28% Allen with relational model at 33%. Right. Yep. And the winner is. Yes.
Starting point is 00:42:07 Oh, I win. Yeah, baby. Joe. No way, man. I don't believe. I could see it in his face. I knew it. Where are these unicorns you speak of?
Starting point is 00:42:17 I don't believe. Polyglot persistence for the win at over 45% of the vote. Yes. I think I love everybody just a little bit more today. It just happens to be relational 90% of the time. Yeah. I choose to think this is right for the job. Yeah.
Starting point is 00:42:37 Now, relational was a strong second, no doubt about it, at 37%. So you both had your answer, your percentage under what the actual was. So you both did well in that regard. So, you know, kudos to you guys. or something like, hey, for all you polyglot persistence people out there, how do you split your persistence? Is this relational? Is it the document model? Whatever. I don't know, man. I run into so many people that are like, nope, DB all the things.
Starting point is 00:43:18 Firebase. Firebase all the things. I don't know. I really want to know now leave a comment let us know maybe a book yeah definitely hey that's what we should do yeah leave a comment explain yourself yes and that'll put you in as a an opportunity to win a copy of the book so and even if you don't want the book leave the comment and tell us what you think because i mean whatever you're probably gonna want the book though leave the comment and tell us what you think. Because, I mean, whatever.
Starting point is 00:43:45 You're probably going to want the book, though. Hey, and let's not forget that I won this one. Make sure. How rude of me to minimize that. You're right, Joe. We should have that man a ribbon. We kind of drifted a little bit. I just wanted to bring it back on topic.
Starting point is 00:44:12 You know what? We should totally tweet that out uh joe won the survey yeah that's amazing all right all right well tweeting right now how about uh how about this then how about another joke let's do it so this one this one comes from youngest son. I have no idea where he got it, so I can't source it any more than that. But it made me chuckle. I hope it does make you chuckle as well. So a priest, a minister, and a rabbit walk into a blood bank. The rabbit says, I think I might be a type O. I get it. I was waiting. I was like waiting for it. It's going to happen. And when it does, it's glorious. Joe, did you get it? No.
Starting point is 00:45:04 Oh, come on. Do I got to explain it to you? What do jokes usually start with? A priest, a minister, and a... A rabbi. Rabbi. Rabbi. The rabbit must have been a type O.
Starting point is 00:45:17 Type O? Oh, my gosh. That's so bad. I love it. Oh, my gosh. See, I have. I love it. Oh my gosh. See, I have such poor grammar that I just keep thinking like a type of what. We're back to yes, yes, yes. Yeah, man.
Starting point is 00:45:33 Oh my gosh. Yeah. Well, the fact that it was also a blood bank too, like, you know. Yeah. It works so well. It works so well on so many levels. That was good. Yeah.
Starting point is 00:45:46 All right. So then for this survey, for this episode's survey, or for this survey's episode, whichever you prefer, really, we ask, do you leave your laptop plugged in the majority of the time? Right? plugged in the majority of the time, right? So you could take the cannibalist answer all the time. I don't care about the battery. Or you could be responsible and say, no, I try to maintain my battery's life expectancy by neither fully charging it nor discharging it. Or, no, but not because I care about it.
Starting point is 00:46:35 I know where Outlaw falls in this. Yeah, it's literally called the Outlaw. I think the second thing was named after him. Option number two, like when you get your new Dell, it's not the Dell battery optimization. It's the outlaw optimization. I mean, these things, I mean, listen, we all love our laptops. We all love our laptops. But let's be honest.
Starting point is 00:47:03 There's been some stories. There have been some photos shared. They're like little, you know, little heat generators just waiting to catch on fire. But hold on. Hold on. Okay, I'm holding. Let's be completely honest here. Outlaw, you baby, your computer battery is probably more than any person on the planet
Starting point is 00:47:25 and you have also still experienced some of these swelling battery problems right like your your macbook pro turned into a rocking chair at some point so it's like i don't know like i'm not trying to skew the survey too much but it it's like, man, if it's going to happen anyways, just freaking plug it in. The fear is that it happens when you're not around. That's the fear. If you leave it plugged in all the time, you'll always notice when something changes. What? Okay, yeah, Maybe that doesn't
Starting point is 00:48:05 apply to me. I don't understand that. We're going to find out who's right and who's wrong. I don't know. I might win twice in a row. We'll see. Yeah. We've got some weird logic going on here. Yeah. I think you guys have a strong chance there, Joe. A strong chance. Sadly for you, though, we won't be covering the results of that episode next episode.
Starting point is 00:48:22 Or that survey next episode. Yeah, so you've got a little bit of time, so you're going to have to work hard to get that two-win streak. covering the results of that episode next episode or that survey. Yeah. So you got a little bit of time, so you're going to have to work hard to get that to win streak. No, I already have the answer for next one. You know what? It makes me want to go back and like score all of the surveys to see like which one of you statistically is,
Starting point is 00:48:41 you know, has the, has had the better percent win percentage. Oh my God, you know what that would entail though because you know we haven't kept track of any of this. You'd have to listen to 128 episodes. But you wouldn't have to listen to all of it though, right? But you're going to have to scan to find it. It's not like we have this SS tree index of where this stuff happens. No, no, no. I still have the source. So I could easily find the, you know, open up the project and I could find it.
Starting point is 00:49:09 Not that I'm going to do any of this. Just leave a comment. Let us know who you think wins the most. Maybe that should be in a future survey. Ooh. There you go. Oh, that'd be a good one. Oh, yeah.
Starting point is 00:49:22 Then you're going to have to prove it. Yeah. I didn't say I had to be right. This is just what the audience thinks. Yeah, yeah. Alan's already hedging. That's what's going on here. Yeah.
Starting point is 00:49:33 This episode is sponsored by About You. About You is one of the fastest growing e-commerce companies in Europe, headquartered in Hamburg, Germany. The online fashion store is currently live in 10 European markets with more than 8 million app installs, 15 million active users on its platform, which handles more than 300 million API calls per day. In 2018, About You reached a company valuation of more than $1 billion US, moving up to the exclusive circle of European unicorns. This could only be achieved by the excellent work of About You's tech teams. One third of their employees are developers and come from over 40 different nations, which truly enriches the teamwork of the company. What they all have in common
Starting point is 00:50:19 is that they are highly driven by the passion to develop the best product on the market. About You also has an award-winning organizational move model that allows developers to switch teams, ensuring constant learning and developer fulfillment. About You has built its software in-house with leading technologies like Laravel, Node.js, and TypeScript on the server side, and Vue.js and React on the client side, and even Flutter for mobile applications. Besides a variety of free drinks and fresh fruits, About You offers free language courses and helps new employees in the relocation process if they move from abroad. Moreover, developers get free tickets to About You's organized conference, Code.Talks, one of the biggest tech conferences in Europe. The conference, that is taking place in Hamburg, is visited by more than 1,500 developers. Furthermore, About You offers a well-structured
Starting point is 00:51:10 onboarding process with a buddy system and provides access to e-learning tools such as laracast.com and egghead.io. When starting at About You, you have the choice between different hardware setups as well, like MacBook or Windows Notebook and the kind of IDE that you want to work with. About You is growing fast and is constantly hunting for new and motivated team members. About You currently has positions available for full stack, front end, and Dart slash Flutter developers, a quality assurance engineer, or a project manager, as well as other exciting leadership positions. Does this sound good to you? Apply now at aboutyou.com slash job. They're looking forward to hearing from you. All right.
Starting point is 00:51:51 So coming back from this, Joe had skipped ahead a little bit here. Oops. That never happens, by the way. All this stuff that we talked about now lays the groundwork for some things that kind of play big roles in a lot of technologies out there, especially nowadays, right? So he talked about RocksDB, and it's based on this whole SSTable thing and also what's called LSMs. And there's another technology out there called LevelDB. And I think it was also,
Starting point is 00:52:26 they mentioned the weird thing, Riak or something like that. Yeah. So it works with that. But here's the cool part. These are databases that are intended to be embedded in other applications, right? So when Joe was talking about the Kafka Streams applications, they have several things like these state stores that he mentioned. And also they have these things called global KTables that basically you can load up an entire table. And it's persisted to disk using RocksDB behind the scenes. And these are little internal databases to the application that the application can use. So pretty cool stuff.
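For a sense of what "embedded" means in practice, this is roughly what talking to RocksDB from Java looks like through its rocksdbjni bindings; the whole LSM machinery (memtable, write-ahead log, SST files, compaction) runs inside your own process behind a simple put/get API. Treat it as an approximation rather than a definitive example: the exact API and options can differ between RocksDB versions, and the path here is just a placeholder.

```java
import java.nio.charset.StandardCharsets;
import org.rocksdb.Options;
import org.rocksdb.RocksDB;
import org.rocksdb.RocksDBException;

// Approximate usage of the embedded RocksDB (LSM-tree) store from Java.
// Assumes the org.rocksdb:rocksdbjni dependency is on the classpath.
public class EmbeddedRocksExample {
    public static void main(String[] args) throws RocksDBException {
        RocksDB.loadLibrary(); // loads the native library bundled with rocksdbjni

        try (Options options = new Options().setCreateIfMissing(true);
             RocksDB db = RocksDB.open(options, "/tmp/rocks-demo")) { // directory of SST + log files

            // Writes land in RocksDB's memtable (plus its write-ahead log) and get
            // flushed and compacted into sorted SST files in the background.
            db.put("joe".getBytes(StandardCharsets.UTF_8),
                   "hello".getBytes(StandardCharsets.UTF_8));

            byte[] value = db.get("joe".getBytes(StandardCharsets.UTF_8));
            System.out.println(value == null ? null : new String(value, StandardCharsets.UTF_8));
        }
    }
}
```

Peeking in that directory afterwards is how you end up seeing the .sst and log files Joe describes finding under a Kafka Streams state store.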
Starting point is 00:53:04 Yeah. that the application can use. So pretty cool stuff. Yeah, and what's really nice about those is that they're really fast at ingesting and they're pretty fast for lookups. And so they're not really perfect for either one. There's better use cases for both, but they work really well for both those use cases. And for Kafka particularly, in streaming environments, it's really great
Starting point is 00:53:20 because you potentially have a whole lot of data flying in as fast as possible and they're not really sure what you're trying to look up. So between those two capabilities, it's really great because you potentially have a whole lot of data flying in as fast as possible and they're not really sure what you're trying to look up. So between those two capabilities, it's just a great compromise. Yeah, it's a good medium. I think maybe like with last episode, correct me if I'm wrong, but I think with last episode and now with this episode, like we're finally getting into being able to apply some of the concepts from this book into like real world technology.
Starting point is 00:53:45 So last episode we talked about Bitcask and what was it? Riak. And now, now we're talking about LevelDB and RocksDB. And, and now we, but, but we can actually take it a step further because now we can like relate it
Starting point is 00:54:00 to like, Hey, here's this big, you know, behemoth of a technology that's, you know, very popular right now of Kafka, right? Or Kafka. That's backed by something here.
Starting point is 00:54:16 So like we're, you know, the point I'm trying to make is like we're starting to get to the point where like we can actually like make these concepts tangible, right? It's not theoretical. And I think the cool part is not only are we getting into this tangible part of this, but it's not super complicated, right? Like what we've talked about, we went through in two episodes, and it's a faster read than we talk about it. But the funny part is,
Starting point is 00:54:46 is when you think about what we've done here is we're really just writing log files and just finding ways to efficiently merge these log files. And now we're talking about technologies that are used in a lot of the hottest distributed data storage technologies on the planet right now. Yeah, I should mention that. You know, we said they were great for embedded use cases.
Starting point is 00:55:08 Part of that is because they are so simple. We just talked about the insert and the retrieve aspect of these tables and the restore if it crashes. Like, that's pretty much it. You could take that description right there and write your own database based on those rules, and it's going to work pretty dang well. Yeah, it's impressive. Now, here's the cool part. So we've talked about this embedded stuff, but this is what's sort of mind blowing. We've heard of Cassandra, right? Probably anybody
Starting point is 00:55:36 that's worked with any amount of distributed data at scale, Cassandra comes up. This is the technology that sort of is the backbone of that, right? Like this particular method of writing these files. Another one that if you've ever worked with truly large data, you've probably heard of HBase, which is typically part of the Hadoop ecosystem. That's another one that is using these same bones behind the scenes. And that's kind of cool when you think about we went through some pretty simple concepts and these are now backing some of the largest scale systems on the planet. Are you going to tell us where this idea came from now?
Starting point is 00:56:20 So somebody was wanting, somebody was like tripping all over this earlier about to say it. I don't know. Was that somebody, Joe? Was it me? I don't know. I don't know.
Starting point is 00:56:31 Okay. All right. Then if nobody wants to take it. Oh, it was Alan. Alan was trying to give it away. So here's what's kind of cool, right?
Starting point is 00:56:40 Like when we were talking about this whole SS table and the mem table earlier, uh, Cassandra and HBase, they took design cues from Google's own white paper that they wrote on their Bigtable technology. And in that white paper is when they actually introduced the terms SS table and mem table. So everybody's picked up on it, which is not surprising, right? Like I would venture to say that these white papers that get released are used all over the place. I know DynamoDB was written by somebody in-house over at AWS. And I'm sure that he looked at a bunch of other things and said, Hey,
Starting point is 00:57:19 I want to take some ideas from here and some ideas from here. And let's, let's write our own storage engine, right? So, um, so what's the moral of the story? Just read some white papers. Read some white papers, combine them in novel ways, and, uh, then become famous like Google Bigtable. Yeah. I mean, there really is something to be said for that, though, because, like, even, um, you know, I mean, there have been some white papers. One comes to mind that was related to a way that Amazon was applying machine learning in real time, right? And this, you know, kind of like a mini batching approach that they were taking at the time. It was an article that, you know, was pointed to us and we were in,
Starting point is 00:58:05 you know, me and a coworker read through it and it's like, Oh, that's really interesting. But yeah, it's like, you know, there really is value to trying to find those white papers and read through
Starting point is 00:58:13 them. Uh, you know, because you will find some like cutting edge kind of concepts and ideas like that. And, you know, maybe some are,
Starting point is 00:58:21 maybe not all are winners, but you know, here's an example. Like if you're following like some of the big companies, like a Google or an AWS. Yeah. That do things at scale. Facebook rocks.
Starting point is 00:58:32 DB was created by Facebook to solve the problem that they had. Right. So, um, Facebook seems to make a lot of killer technologies, by the way. Yeah. When did that happen?
Starting point is 00:58:43 Well, I don't know when you got like 3 billion users, you might have scaled to a scale that Alan is always trying to solve for. They haven't even hit my scale yet. Yeah. Well, I mean, they'll get to the billions eventually. The billions coming. Yes. Yeah.
Starting point is 00:59:02 So, yeah. And so here's the thing, like we were just talking about these SS trees and these things that we now know that Google kind of came up with the terms or, or the person that wrote this white paper at Google did. But here's the interesting thing. This stuff kind of existed beforehand, but under a different name called the LSM tree or the log structured merge
Starting point is 00:59:23 tree, because everybody's going to remember that you know but it was basically the same type thing this whole notion of storing compacted and sorted files yeah and uh gotta mention some of our favorite lsm storage engines lucene is probably the most popular nowadays and that's the backing technology that's embedded and used in both solar and elastic search which uses a very similar process and that's kind of how they're able to maintain fast ingestion speeds even though their search engines which do a lot of work and are typically known for being slow on ingestion so that's kind of like their secret sauce for
Starting point is 00:59:58 mitigating that and also just keeping up with that data and also making it searchable. Really great. Big fan. So on to optimizing. One of the problems of the LSM tree is that searching for keys that don't exist can be expensive. Because remember, we talked about that thing where you search the mem table, then you search the last segment, and then you go back to the last segment, and you've got to keep looking. So if something doesn't exist, by definition, you have to check the mem table in every single segment to see if it doesn't exist.
Starting point is 01:00:29 And, uh, that kind of stinks. And so, um, one, uh, Ooh, you know, I just saw a video about this. It's really great. Um, so one technique that's useful for getting away from this problem is basically a bloom filter, which is the algorithm. that's actually super cool if you look at how it works, but it's a probabilistic algorithm that basically gives you the answer no or maybe. And so if you ask me if a key exists in a database, I can tell you for sure no, it's not, or I can say maybe, in which case you still have to look. And that's a great compromise for this particular data structure because this is the one where it really hurts when it doesn't exist so if you can ask this uh algorithm hey does this key exist and it tells you for sure no then you can save a ton of work and it's uh in practice this algorithm works really great and remember guy royce uh guy
Starting point is 01:01:19 with a huge beard awesome fantastic amazing talks works for redis uh he just did a really great video on bloom filters and it like released it like last night so we'll have that in the show note it's and he did a really great job and he uh compared it to a tardis being bigger on the inside so there's a little hint about how that talk's gonna go okay you guys did with TARDISes, right? Yeah, for Doctor Who. Yeah. Yeah, I don't. Oh my gosh, Alan. Oh my gosh.
Starting point is 01:01:51 My other vehicle was a TARDIS. In fairness, I tried to watch that show. It was so awful that I couldn't get past it for an episode or two. Oh my gosh. It was so bad. Oh my gosh. You just lost the UK audience. I did. I did.
Starting point is 01:02:04 And I'm sorry. I did. I did. And I'm sorry. I tried. I tried. Oh, my gosh. Anyway, you know, I get very excited about these probabilistic type things. Like, I got really excited when we talked about heaps because they're mostly sorted. I thought that was really cool. And I also think Bloom filters are just a really cool algorithm.
Starting point is 01:02:22 So, it's really cool. Maybe Bloom filter had to be an algorithm that we specifically cover. Cause yeah. I'll get into that one specifically. Yeah. And it seems so weird to say that the answer could be maybe, maybe that's such a beautiful answer. It's definitely,
Starting point is 01:02:39 yeah, it's definitely no, or I don't know. Yeah. It might be. whatever it's like a teenager answer whatever whatever yeah so one challenge is that uh you know that we just mentioned that search for your key doesn't doesn't exist it's expensive so we use bloom filters and the second challenge or second way we optimize is basically determining when and how to perform the merge and compaction operations. And so there's two main strategies for that.
Starting point is 01:03:14 There's leveled compaction, in which key ranges are split into smaller tables and older data is moved to different levels. I'm guessing leveled DB uses this one. I don't know that for sure, though. It does. Leveled DB and RocksDB uses this one. I don't know that for sure, though. It does. LevelDB and RocksDB both use this. Okay, LevelCompaction. So they basically split into smaller SS tables, and then the old data is moved down to harder levels.
Starting point is 01:03:35 So kind of like, I imagine it's like a diamond being compressed into the... Anyway. The other one is SizeTieredCompaction, which HSpace uses, in which smaller and newer SS tables are merged into larger and older SS tables. So that works out pretty well for those guys, I suppose. Yeah. And here's the thing, right? Like we've covered a lot of the generalities of what's going on here. And they even pointed out in the book, like, hey, you know,
Starting point is 01:04:07 we hit on the meat of what's going on behind the scenes. There's probably tons of technical little things that they're having to do, right? Like, who knows? Syncretist file access. Like, there's probably all kinds of things. It's that stuff that you run into that you spend five days on that felt like it was going to take five minutes. I'm sure there's all kinds of details like that in this, but now you have a gist of what makes up some pretty massive databases, right? Like at least the
Starting point is 01:04:36 underpinnings of a lot of that stuff. I'll tell you, I am constantly getting my RocksDB stuck, locked, and I just delete the directory. I do the same thing. Yeah. Like, for a while there, we had code that would delete the directory on startup because we were so tired of dealing with it. That's not the right answer, by the way. No, no, it totally is. I don't even know why you would imply that it's not.
Starting point is 01:05:01 I don't care, like, how big or what the purpose of the database is. Anytime the answer is just delete the database and start over, I don't think you can claim that's the correct path to go. Pre-2015, you could have made that argument. Now with Docker, it's just so much easier to do that. It really was. Yeah. I'm trying to think.
Starting point is 01:05:23 It seems like there might have been something else in here. I mean, we're not going into the bee trees in this particular episode because as you saw, we were just going through kind of the alternate to that. And we've covered bee trees. Not in depth, though. Not in depth. We talked about file system kind of examples. But you know what? I do want to gush about this book just a little bit more because the amount of information that they have dropped in these chapters is just crazy. I mean, honestly, I've learned more from this particular book
Starting point is 01:06:07 in terms of the things that I just always kind of took for granted in my everyday tool usage that is just mind-boggling, right? Like bloom filters. I would have never even known those things existed had I not gone to some talk or like you say, Guy Royce has this thing. But like coming across that here, I mean, I didn't even know it existed. No or maybe? That's not even a thing in computer talk, right? Usually it's yes or no. It's not no or maybe. I don't know. Right. Write that as an if-else statement. If-else, I don't know. So yeah, man. I mean, seriously, like Joe actually was the one who recommended this book months ago.
Starting point is 01:06:50 And and we're all kind of like, OK, yeah, sure. Whatever. Then we pick it up. And at least me, I was like, man, this reads like a novel. Right. Like I could keep going in this thing because it's written in such a way that it's understandable. But you get so much out of it at the same time, right? Like it's, it doesn't skimp on the details, but it doesn't bury you in them either. Yeah.
Starting point is 01:07:12 At the end it makes you definitely want to create your own database. You're like, all right, I care about, you know, I want this kind of ingestion. I want this kind of search, um, compaction, you know, error recovery, like whatever. It's like you could kind of take these rules and build it up and like go with it and memory database or build your own Cassandra or if you want to create your own
Starting point is 01:07:28 RocksDB. It really looks at the underlying differences between those and the algorithms that power those. I think it's super cool. Okay. Now, as much as I'm enjoying this book, this is not inspiring me to want to go write my own database. We're still reading. That's why you haven't gotten far enough.
Starting point is 01:07:44 Oh, okay. What chapter is that? Chapter four, write your own database. Oh, I see it now. I feel like you need to embrace your inner Vlad. Yeah, right. I write database. Yes.
Starting point is 01:07:59 No. I build web server. I mean, I think for me it's just like having an appreciation for how they work, though. And the deeper we go into this and the more we continue along, even just in general with the podcast, like not even in the context of the book, though. Like we keep covering topics, you know, in the podcast that are subjects that it's like, well, I haven't thought about that since I was in school, you know, or maybe you didn't even think about it when you're in school, right? But, you know, we keep like, I guess it's kind of like, you know, honing the craft, you know, like you keep, I think, you know, an analogy that Joe had used one time before was like, you before was sharpening the saw.
Starting point is 01:08:47 That's what I've appreciated about it. As it relates specifically to this book, though, even though it does talk about bee trees again, for example. Yes, we did go and have a whole conversation on trees and bee trees are part of it. But it's like, oh, you're like further, you know, concreting that concept and that idea and like, hey, here's a specific use of how you could, you know, talk about this thing. Yep. It's pretty amazing. Oh, by the way, as soon as I was talking about the podcast, we did get reached out to one of the listeners.
Starting point is 01:09:20 I'm not going to give his whole name cause I don't know if he, if he wants that out there, but Brett wrote and said that he landed his first job. And, you know, he thought that listening to this podcast was a big part of that. And that's so amazing to us. I mean, like, seriously, it's really cool because we get to learn and we get to share that. And along the way, if it's helping other people out, like, that's so amazing. So, you know, we talk about those reviews and stuff, but I mean, it's, man, it really is some huge payback when somebody
Starting point is 01:09:50 reaches out to us out of the blue and is like, Hey man, like you totally changed my life or you, you did this. And it's just, it's really awesome. So, yeah, there was another story that we got similar to that too, uh, fromua that is similar kind of story it was like yeah it it really it's crazy to think that like here was this this idea that started out like you know i don't know six seven years ago now right and yet we're still going on with it and then people are crazy enough to listen to us. Right. But, but, you know, and it's like, it's the fact that it's like even helping people. Cause you know, I mean, that, that's, that's awesome. And it's like, it's, uh, it's so flattering. Cause it's like, wow. I, you know, yeah, you, you, you can't,
Starting point is 01:10:44 you can't put words to it. Cause you're like're like, well, I didn't think that I was possible of helping anybody like that. Yeah, it's rewarding in a way that is just, I don't know. It's really cool. So seriously, thank you. This was completely not even planned, but thank you. Everybody that does leave the feedback and stuff, it actually really does mean a lot. It really does. And it's part of the reason why we still do it, right? We study this stuff. Part of it's because we're gluttons for punishment. That's a lot of it.
Starting point is 01:11:14 We do like to learn, but it's also really fun to share and interact with people and get that feedback and go and meet people. And it does sadden us that things like Orlando Code Camp are closed right now because I mean, we have a blast doing that kind of stuff. You know how many people I met at the last one? I think I met all of Orlando.
Starting point is 01:11:36 Everybody. Yeah, you met everybody. Outlaw truly met everybody. Yeah. Hey, you know, if you're working remote and you are not digging the isolation and pop on into slack too coming box.net slash slack hang out yeah do that yeah i'm gonna have to i'm gonna have to spend a lot more time in in slack than now because of the coronavirus this episode is sponsored by the University of California,
Starting point is 01:12:05 Irvine Division of Continuing Education. Python is one of the fastest growing programming languages and UCI's Python Programming Certificate Program will prepare students for opportunities in web development, data analytics, core software development and a wide range of scientific and mathematical applications. Students will learn programming concepts including program styles, idioms, libraries, data structures, data retrieval, processing, visualization, networked application program interfaces, and databases. UCI's certificates in data science, predictive analytics, machine
Starting point is 01:12:43 learning will prepare students to gain the necessary skills to land a job in data science. Additionally, those interested in predictive analytics and machine learning will learn to improve and optimize business performance. If you're looking to become competitive in the global market, advance your career, or start a new one, UCI has the resources to support you on your new path. Spring registration is now open. Visit ce.uci.edu slash codingblocks. Again, that's ce.uci.edu slash codingblocks to learn more and reserve your seat. Again, that's ce.uci.edu slash codingblocks and reserve your seat today. And with that, we will have some resources that we like. You know, obviously this book is going to be
Starting point is 01:13:34 one of those resources. There's going to be plenty of links in here. I'll have links to episode 97. There's the video that, the YouTube video about red black trees that Joe mentioned. There's the video that the YouTube video about red, black trees that Joe mentioned. There's some links that Alan mentioned. So all of that's going to be in there.
Starting point is 01:13:52 So with that, let's head into Alan's favorite portion of the show. It's the tip of the week. Me first. This is Joe's tip of the week. That's right. Yep. And Joe's tip of the week this week comes from Joe recurs and Joe,
Starting point is 01:14:08 Joe Ridley. So thanks for sending this Joe Ridley. Did you know that you can drag and drop a folder from the finder into the terminal on a OS X and it will dump that whole path out for you. So you don't have to type it. How awesome. Did not know that. Did you know that also works on commander?
Starting point is 01:14:33 I did not know that did you know that also works on commander i did not know that yeah and powershell and command prompt everywhere i tried it this worked and this has probably been available for the last 40 years and i never knew it i remember there used to be like um what were those the the the tool the power tools like Mark yeah Windows Power Tools yeah that he created and there was like an option for
Starting point is 01:14:52 you could right click on a folder and say open command prompt here do you remember that? oh they still have that it's not even part of power tools it's part of Windows now oh is it now
Starting point is 01:15:01 part of Windows? yeah as a matter of fact here just for kicks I don't know if you have to right click and hold down shift um there's a it might even be alt but there's command prompt here or something like that uh maybe it's is it alt get batch here that's all i need ah man i can't find it, but I have too many things in my context.
Starting point is 01:15:27 Anyways, yes. It's actually, they stole it from Power Tools and I don't remember what it is now. So, excellent. It's so apropos that it was Joe recursing Joe because it's our Joe recursing that Joe giving a tip.
Starting point is 01:15:43 Oh, and I have another tip too because you guys always have lots of tips. Uh-oh. It's our Joe, recursion.joe, giving a tip. Yeah, so anyway, yeah. Oh, and I have another tip, too, because you guys always have lots of tips. Uh-oh. So you mentioned today the Big Table waiver. By the way, I don't appreciate that the T is not capitalized in Big Table. It's one word. I agree. T is not capitalized. Bigtable?
Starting point is 01:16:01 No. It should be Big Table. I agree. It should be prominent. Why would they do that? I think we should write it. We should do a pull request agree it's prominent why would they do that i think we should write it we should do a pull request yeah i'm gonna fix that crap everywhere well that's okay because mem table is all lowercase that's fine i'm just fine yeah wait how come that one's fine i don't know big just implies you go big or go home so it should be big table looks awful
Starting point is 01:16:21 yeah it does yeah okay This is crazy talk now. Anyway, we mentioned the Bigtable paper and the Dynamo paper tonight. Did you know there's a collection of papers called Papers We Love and it's run by a group that basically collects the best papers
Starting point is 01:16:39 in computer science and it arranges them and they even have meetups around the world where they get together and discuss like a paper every month or whatever and if you go i've got a link here to their github which actually just collects all the papers and you can find all these papers and they're even sorted by category so if you want to find the two papers we referenced tonight they're in the data store category and you can find both of them right in there oh my god you could lose half your life on this site yeah they're really good and they're so long and they're dense yeah oh my god
Starting point is 01:17:11 sometimes white papers can get a little overwhelming yeah sure but uh you should play a game you just like spin a little dial and you're like it picks two and then like you try to make a business out of it like maybe it works i I don't know. There are meetup groups for Papers We Love. I thought there was a conference for it too. Wasn't there? It probably used to be. Oh yeah, pre-coronavirus.
Starting point is 01:17:36 Back when we used to be able to get out of the house, I think there was a conference. Now... I've got them here in Atlanta. Do they have them down in Orlando? They don't Got them here in Atlanta. Do they have them down in Orlando? They don't care about you in Orlando. Nope.
Starting point is 01:17:50 Yeah. So sad. Sorry. Yeah. Yeah. Yeah. Pete, PWL conf.org.
Starting point is 01:18:00 So. Yeah. I don't know if they're planning on a 2021, but it wouldn't be until the fall if they were going to just judging by past timeframes. Yeah, I'll keep an eye on that. Yeah, that seems like a fun one to go to. You'll need lots of coffee. Yeah. All right. So, uh, so for my tip of the week, so I, oh man, I was like, how has,
Starting point is 01:18:34 how has this never like entered my life before? How, how, how did I even never bother to check to see if this was a thing, but in Slack, Sid shared with us this get tip and rightfully so he tagged me on it. He was like, Hey, you might like this as a tip of the week. And I'm like, I love that. It's a tip of the week. So have you ever found yourself in, in the mode where you want to, let's say you have a branch checked out. Obviously this is going to be a get tip. I don't know what you were thinking if you thought otherwise, but you have a branch checked out, you're doing some work or whatever. And somebody's like, Hey man, uh, can you pivot on that and like fix this other thing? So like now you find yourself in a need for, uh, you know, to change
Starting point is 01:19:24 gears. Right. And so you're like, okay, hold on, you know, maybe you're like Joe and you're just like going to commit everything regardless of what working state it's in. Or maybe you'll be like, you know what, I'll be a good person and I'll just stash the changes for now and then I'll, you know, check out another branch or whatever, right? So we've all been there, right? Or what you might even be tempted to do is to say, you know what? I'm going to create a clone, a second clone of it,
Starting point is 01:19:52 and then I'll just work out of that, right? And like you might even have like a web browser pointing to both locations, you know, so that you can test different things, right? We've been there. We've all done that, right? Well, with Git Worktree, you don't have to do that second clone. So the way it works,
Starting point is 01:20:15 what'll happen is with Worktree, and I tried this out, it's so beautiful, is let's say you have a repo cloned, right? With Worktree, using that repo that you already have cloned, you can copy all of the code into another directory, right, that is following a separate branch. That's amazing. So you're already using the code that you have already cloned locally to then spawn this other branch off into another folder, right? Or directory. So, and you're not recop working tree available that you can work in. Right? That's beautiful. And I checked it out. So like on our
Starting point is 01:21:16 repo that we mostly work in, you know, we work in a I mean, it's definitely not the largest repo on the planet, but you know, it's of a decent size, I would say. If you do a get work tree command on it to check out some other branch in another directory, that directory is 10%. In our case, it was 10% the size of my main repo directory. Wow. So it was a significant savings in terms of like how fast I could like spawn this other thing up. And I'm not wasting space on my, on my drive by having a duplicate repository around.
Starting point is 01:21:57 So here's the way this would work. You could, let's say that you wanted to, you wanted to, uh, check out a, a new branch called MyHotFix, right? And you wanted to put this in your temp directory, right? Like that's where you wanted to start it.
Starting point is 01:22:17 So you would get space work tree, space add minus B, then space my hotfix. And that part is going to, the dash B hotfix is going to create the branch for you called my hotfix. Then slash temp, because you're telling it where to put it, space master, because you're trying to track master in that example. So again, that command would be get space work tree space add space minus B my hotfix space slash temp space master. So let me, here, I'll tell you what, so that you guys can follow along to see it a little bit better here. Get work tree, add minus B, my hot fix. If I could spell that correctly, uh, slash temp master. So there's an example of what that command would look like. Right. And, and it, like I said, you're doing a couple of things in that command all at once, right? So you're tracking master. Uh, well you're, you're doing a couple things in that command all in once right so you're tracking master uh
Starting point is 01:23:25 well you're you're creating a branch called my hotfix you're copying the repo into slash temp or at least you know not not the dot get directory though and then tracking master so awesome so awesome really cool didn't know it existed yeah yeah it's a ton of space and and there were like uh they were seeing me in other tips now that i've forgotten that he uh added on to it um but but the point is if you're not already on our slack you gotta be on our slack because amazing people like sid are sharing nuggets of knowledge that you just gotta know right like as soon as i found as soon as he said this i was like oh that's so beautiful i will never forget that one ever again like this is this is forever going to change my life so thank you sid for sharing that with us and uh and then one other one other quick um you know tip here um you
Starting point is 01:24:20 guys watched silicon valley did you like silicon valley Did you watch it? Oh, yeah. Yeah? Oh, yeah. But do you feel like a little bit of a void in your life now that there's no more episodes? My wife won't let me watch it while she's in the room because it makes her uncomfortable because dude's always jacking up, right? So I've got like two or three episodes left. That's a weird way to describe it since there was one weird, awkward episode. That's a strange way to describe it since there was one weird awkward episode that's a strange way to describe it alan i'm just gonna say i'm just gonna leave it there but this guy is just a constant failure i'll put it like that and my wife cannot take it she cannot stand it that he just no matter what he does goes wrong well well let's just say that like as as people that are in technology right i think we could all appreciate the humor of
Starting point is 01:25:06 that show and and now that it is you know they have finished it even though you haven't caught up to it uh you know you are in the last season you know there's a little bit of a void there but there's this new show called mythic quest which which aims to fill that void. And if you haven't already seen it, it's really good. You got to give it a try. What's it on? Okay, so this is where it's going to get you.
Starting point is 01:25:34 It's an Apple TV Plus show, right? Get out of here. But, I mean, you can get Apple TV, a year of Apple TV for free. You might already have it if you've purchased a recent device in like the last 6, 12 months. You probably already get Apple TV for free and you didn't even realize it. So that's the downside is it is on Apple TV+.
Starting point is 01:25:58 But it's really funny except instead of the perspective of a technology company that is creating middle out compression to rule the internet, the, the premise is a technology company that creates games. So mythic quest is the name of their game. Okay. I will check it out. I have Apple plus.
Starting point is 01:26:21 I've never watched a single show on it, but I will. There's so many great things about it. I'm just saying. Right. You're going to like it. You're going to like it. Coming into my queue.
Starting point is 01:26:32 All right. So I have a handful of tips here. The first one is going to be from one of our good friends on Slack, Sean Martz. Sean. I just basically take anything he gives me and I republish it because he's always giving just golden stuff, right? This one is really good. It's called exorcism.io. That's E-X-E-R-C-I-S-M.io. And like he told me about this at first the name's cool but secondly this is the the heading on their page exorcism code practice and mentorship for everyone level up your programming skills with
Starting point is 01:27:14 3,325 exercises across 50 languages an insightful discussion with our dedicated team of welcoming mentors. Exorcism is 100% free forever. So like if you want to go practice and get some experience with some coding languages or some technologies, like they've got 50 of them. You know, I see PL sequel here. I see MIPS. I don't even know what some of these things are.
Starting point is 01:27:44 I see Swift, PHP, TypeScript, Kotlin, here i see mips i don't even know what some of these things are um i see swift php um typescript cotland cotland so they have cold fusion why is that on there um so so so seriously like there are tools like this that exist that you can just go get your hands dirty with something and, and learn some stuff. So, and, and I forget what he told me he was learning over there at this point. Cause he's kind of like all of us. He's just bounces around all over the place, checking stuff out. But yeah, man, go check this out. Really cool stuff. Um, then go ahead. Nope. So my next up is going to be a tool that I actually came across at some point. And then ironically enough, Joe Zach has also done some work with this thing and it's called elastic search dump. So I had a need to move a bunch of data from a newer version of Elasticsearch to an older version of Elasticsearch.
Starting point is 01:28:46 Now, it'd be great if you could use the re-index method in Elasticsearch to do that, but that only works on the same versions or maybe going from an older version to a newer version. So that kind of killed me. Well, I have a link to this. It's on GitHub and it's a Node application that will basically allow you, you can either install Node and install this thing if you want to do it, or much like Outlaw, I try and find every way in the world to use Docker to where I don't have to install anything. And I run this thing in a Docker container so that that entire application is running behind the scenes. All I have to do is map a drive into it and I can dump all the data from Elasticsearch into a file that'll show up on my hard drive, right? And then if I need to then re-index that into Elasticsearch, I can run that same Docker container and change it from outputs to inputs. And life is dandy. Like, it's absolutely amazing.
Starting point is 01:29:50 So if you ever have a need to move a bunch of data or export a bunch of data from Elasticsearch into a JSON format or something like that, this tool is absolutely fantastic. So I thought Exorcism sounded familiar and sure enough, we have talked about that this will be the third time this resource has made its way onto the air. How do I forget this stuff?
Starting point is 01:30:17 I don't know. Well, I mean, are you using a bee tree? What kind of memory mapper do you have going on there? What kind of index is it? Mine's glossy. Yeah. It's a compression format.
Starting point is 01:30:28 Well, I use a write-ahead log, so I always make sure that I write to the disk first. I definitely have probabilistic memory for sure. Yeah. Yeah, I think you're using a bloom filter for sure. Maybe. Yep, it's either no or maybe. her for sure. Maybe. I will include some links to the past shows where we have discussed
Starting point is 01:30:49 it too in case if you're curious. We have talked about Exorcism back in episode 26 and episode 78. Okay, that's why I don't remember it. We're talking years ago now. So basically Sean got these from those shows.
Starting point is 01:31:06 He spit them back at me. Yeah. Episode 26 goes back to April of 2015. So almost five years ago is when that site made its debut into our conversations, right? And then again in April, three years later in april 2018 so i actually used a little bit back then but it's it looks way way better now it does look really cool i'm not taking that away from it i got colin now everything must be good so this last tip is not really a full blown tip. This is more along the lines of, of, well, I guess it is. And so piggybacking on this whole, this Docker thing. So I've been doing work with Kubernetes of late, and I also have things that stand up in Docker composes, right?
Starting point is 01:31:58 So Docker compose, if, if you've never messed with it, I'm not going to go deep into it, but it allows you to stand up multiple containers or services at the same time. Right. And it's beautiful because you can kind of sort of spin up a server farm for all intents and purposes. When you do that, it creates its own network. So like, you know, Joe, Zach, myself and Outlaw, we all work in a stack that has Kafka, Elasticsearch, Postgres, and some other stuff, right? So when you spin these things up, you typically name them that way, right? Like, hey, this particular host, we're going to call that Elasticsearch because it's easy to remember. Well, that's all fine and dandy because all the containers that are on that network can talk to each other with those host names. So if my Postgres database needs to talk to Elasticsearch,
Starting point is 01:32:45 it can just say, hey, Elasticsearch dot whatever, right? Let's connect that way. Well, I've started messing with Kubernetes and I'm spinning that up because I want to be able to scale things out and see how things work on that side of things. That's not something Docker Compose necessarily gives you. Well, one of the things that's a little bit frustrating is when you start up pods in other places, or even if you're doing Docker runs or something like that, there's this, it's not on the same network because essentially it spins up a virtual switch or a virtual network for you behind the scenes and Docker compose, right? Well, what I wanted to point out is if you have a need to spin up another container to
Starting point is 01:33:26 try something out and you want it to have access to your Elasticsearch thing without having to poke a bunch of ports and holes and stuff through, when you do Docker run, you can say dash dash network and pass in the name of the network that all your Docker composed containers are attached to. And then you can reference everything in there just by their host names, right? And you don't have to open up new ports because when that Docker container gets run, it's actually being put onto that same virtual network with it. And it allows you to interact with things just like it was spun up there in the first place. That's pretty slick. So if, for example, if you, let's say you already had Docker Compose going and you already had, you mentioned like Kafka and Elastic and Postgres.
Starting point is 01:34:11 So let's say you already have those three, right? So maybe you already have your Postgres is, you know, being fed data from Kafka, for example. Kafka is also maybe feeding your elastic, but maybe you wanted some reason. You had some need to maybe spin up a web server that wasn't already part of that compose, right? So you wanted to like a Docker run engine X or something, but you want it to be able to connect to that. You're just maybe experimenting.
Starting point is 01:34:41 So you want it to be able to connect to that Postgres database or that elastic instance that you already have going, you Docker run dash dash network with whatever the name of the network was, and it'll be able to connect to it. That's pretty slick. I've never tried that. Yeah. And actually going back in memory now, the reason why I added it here is that's what I did with the elastic search dump. So when I first pulled the dump, I was pulling from, you know,
Starting point is 01:35:05 a public, not really public, but something that I had access to that was external. But then I wanted to feed that data into my internal Elasticsearch and I didn't want to have to poke a bunch of ports and holes and stuff in the Docker Compose to make that work. So I just said, hey, I'm going to run this Elasticsearch dump thing. I'm going to attach it to my local Docker Compose network. And then I can just say, hey, pump it into HTTP colon slash slash Elasticsearch colon 9200 and life was dandy. Right. So, yeah. Because the alternative would have been to like create a new to add it to your Docker Compose 2, which would also be like I mean, if it's a one off thing, right? Yeah. You don't
Starting point is 01:35:43 necessarily it's like you said, if it's a one offoff thing. That's the whole thing, right? Yeah. You don't necessarily, it's like you said, if it's a one-off thing or you're experimenting, you don't necessarily want to add it to your compose because it's not part of your regular stack. But it is nice to know that if you ever need to connect something to it, you can pass in the network. Yeah. Now, as Alan would say, that's beautiful. Yeah, it is beautiful. Beautiful. Let's go back a few episodes, too.
Starting point is 01:36:05 Yeah. One of us makes the better Alan, I'm just saying. All right. So that's it for this episode. We talked about SSTables, LSM Trees. Hopefully I didn't confuse anyone too bad when I got mixed up there. And if I did, just let me know at alan at codingblocks.net and apologize for that big time. All right.
Starting point is 01:36:31 All right. Well, with that, be sure to subscribe to us on iTunes, Spotify, Stitcher, more. Use your favorite podcast app just in case, you know, I don't know how you happen to hear it. Maybe a friend pointed you to the website or, like, you know, your friend loaned you a device, listen to it, but you know, you can find us at coding box.net. Uh, there'll be links all at the top of the page for all your favorite podcast destinations. Uh, and if you don't find the link there, if there's another podcast destination you have, let us know. Uh, we can add them. You can find us on just any of them. And if you haven't already left us a review, as Joe mentioned earlier and Alan mentioned as well, we do greatly appreciate those reviews. They really mean a lot to us.
Starting point is 01:37:15 So you can find some helpful links there at www.codingblocks.net slash review. And I will wait for my lawyers to tell me when I have to cease and desist. Cease and desist incoming. So while you're up there at codingblocks.net and while he's waiting on that letter, go ahead and check out our show notes, examples, discussions, and more. And send your feedback questions and screenies to the Slack. Codingblocks.net slash Slack. To codingblocks.net. No!
Starting point is 01:37:46 No! Screenies. I win! You'll see my tweet, by the way. You can see my victory tweet over on Twitter, at codingblocks, or head over to codingblocks.net. And you can find all our special social links there at the top of the page. I gotta go there now.
