Coding Blocks - Designing Data-Intensive Applications – Storage and Retrieval

Episode Date: March 2, 2020

In this episode, Allen is back, Joe knows his maff, and Michael brings the jokes, all that and more as we discuss the internals of how databases store and retrieve the data we save as we continue our deep dive into Designing Data-Intensive Applications.

Transcript
Starting point is 00:00:00 You're listening to Coding Blocks, episode one, how many, I don't know, we stopped, we've been together too long, we haven't been together long enough, it's been too long, I can't remember what number we are on, 127, let's call it. I think that's it. All right. Subscribe to us and leave us a review on iTunes, Spotify, Stitcher, and more using your favorite podcast app. And check us out at codingblocks.net where you can find show notes, examples, discussion, and episode numbers.
Starting point is 00:00:30 We have a Slack, too, that you can hit us up at. And you can email us at comments@codingblocks.net, too. I got left out, man. And you can follow us on Twitter at CodingBlocks or you can head to www.codingblocks.net. I actually trademarked that. And you can find all our social links there at the top of the page. With that, I'm Allen Underwood.
Starting point is 00:00:50 I'm Joe Zack. And I'm Michael Outlaw. This episode is sponsored by Datadog, a unified monitoring and analytics platform built for developers, IT operations teams, and businesses in the cloud age, and educative.io. Level up your coding skills quickly and efficiently, whether you're just starting, preparing for an interview, or just looking to grow your skill set. And Clubhouse is the fast and enjoyable project management platform that breaks down silos and brings
Starting point is 00:01:25 teams together to ship value, not just features. All right. And today we are talking about the data structures that power databases based on the third chapter of Designing Data-Intensive Applications. And this is my favorite chapter so far. And so I'm very excited about it and glad you're here to be with us on this journey. Hey, in fairness, each chapter that you read after the previous one was your favorite one, right? Like that's kind of how it happened. Yeah. So, but I have read chapters since this one. Oh, this is still the one that like sticks out to me. So there's other chapters that I like, but this was the one where I was like, all right, let me get the popcorn. All right. I dig it.
Starting point is 00:02:02 Uh, well, I mean, I've got an opinion on that. I would say that I'm not saying that the other chapters were bad, but compared to like past books that we've covered and everything, I'm like, eh, you know, whatever. I mean, it's good stuff. What? It was good stuff. This is my favorite book that I've read so far. But this chapter that we are about to cover tonight.
Starting point is 00:02:25 Okay. Oh my God. This is like the one chapter like if you only have one chapter of anything that you are ever going to read for the rest of your
Starting point is 00:02:41 life and you want to be a developer, you need to read this chapter. It's good. It is. It is. Not only is it good, it's that good is what you meant to say.
Starting point is 00:02:51 Yeah. Yeah. I corrected that for you. Yeah. Because like, I think back on it and I'm like, man, I wish I had this chapter at the start of my career. Hmm.
Starting point is 00:03:00 It, yeah, we'll get into why here shortly, but first. Okay. So yeah, but first, uh, you know, as we like to do, we like to say thank you to everybody that took the time out of their busy day to leave us a review. Collector of Much Stuff, Momentum Mori, Brian Briefree, Isla Dar. Oh, you put the L before the Y. You can't do that. I put the L before the Y.
Starting point is 00:03:34 Oh. Isla. Why are you still laughing at me? How about Isla Dar or Isla Dar? Okay. There you go. That's what I was going to say. Yeah, right. And then James Speaker.
Starting point is 00:03:51 Very good. All right. And I got to say a big thanks to iDigily. Appreciate those reviews. You know we live for those. So thank you very much. Yeah. And like, okay.
Starting point is 00:04:03 So, I don't know if you gathered some of my excitement at the start, right? But I think it's been a minute since the three of us have been together. It's been a little while. Yeah, like at least a month. It's been kind of crazy, and it feels like... well, it actually feels like it's been longer than that, though, right? Because, yeah, recording-wise it might have been a month, but, you know, tack on another two weeks or so before that. It's been a while since the three of us have gotten together. We've all been crazy, crazy busy. And I think even the last episode that we recorded, we weren't together.
Starting point is 00:04:38 I think you had to record remotely, if I remember right. So this might be the first time this year. I'm like giddy, like, my friends are over! Mom, can Allen spend the night? That's a great segue to say that we're actually going to be hanging out in meatspace
Starting point is 00:04:57 at Orlando Code Camp coming up March 28th this year, and registration is open by the time you hear this, and it's free. And so not only do you get to hang out with us, but free lunch and a free shirt. So if you are anywhere within travel distance to Orlando on the 28th, you should come on down, because it's gonna be awesome. Uh, looking at about like 100 talks from a ton of speakers. It's gonna be fantastic. 14 different rooms, uh, just jam-packed with, uh, free, awesome, great talks. And us.
Starting point is 00:05:25 And us. And we went down to it last year, and I think both of us spoke at it as well there. And it really is a great event. Like, Santosh and the people who put that thing together over there, like, they do a killer job. So definitely come. I mean, you'll learn a lot of stuff, and it's fun. And I'm giving some sort of talk on Kubernetes. And, Joe, you're giving some sort of talk on? We're gonna be tracking UFOs with, uh, streaming architectures, Kafka, and GraphQL.
Starting point is 00:05:51 Very nice. Technically, we all three spoke at that conference, if you recall. Oh, Allen did a lot of speaking. But no, no, no, I'm not talking about, I'm not talking about at the booth. I'm talking about like in the rooms. Remember? Oh, you were part of the panel. Well, that's right. I don't really count that, because you showed up like in the last three minutes. Like a boss. He was not fashionably late. He was not fashionably late. He showed up, and he did get carried out of the, the pre-party, uh, almost by a guy. Uh, you were on his shoulders. He was like taking you off to some cooler party. I don't know. That was weird. Yeah, you're definitely telling this like in a weird way. Yeah, go meet us for drinks.
Starting point is 00:06:30 We'll tell you the full story. Yeah, that's right. So definitely, if you're going to be in the area, come hang out with us. Come talk to us. We definitely love to meet you all and come to our talks. Hopefully, they'll be good. Yeah, you'll definitely find me at the booth. So definitely stop by, say hi. I'm sure I'll have some swag there for you to pick up. And this year people won't be writing their email addresses down, right?
Starting point is 00:06:55 Like that stuff. We might've upgraded. It'll be a much smoother experience. Can't read your handwriting. So also I want to mention for this episode, go ahead and drop that comment on the website, and you will be eligible to win a free book that we'll ship to you. International is totally fine. We love it, in fact. So go ahead and do that right now while you're thinking about it. And that book would be Designing Data-Intensive Applications, as that is the topic we are covering. And the most exciting chapter that we're going to dig in tonight,
Starting point is 00:07:26 storage and retrieval. Now- We should say we're not going to finish that chapter tonight because you know how we do. Wait, how do we do? You know how we do. Very long way. Yeah. But I know what you're thinking because you're like, outlaw, hold on. How can storage and retrieval be the most exciting chapter of the book and like the chapter that the single chapter of any book that you should ever
Starting point is 00:07:50 read, it is. Have you ever wondered how databases work? That's why it's, I mean, yeah, let me put it to you like this. Uh, we've each been in our careers for a minute and, uh, you know, been using databases. Did you ever think to take the time to think about like how the data was being written to disk? No.
Starting point is 00:08:16 Right. It's something easy to overlook, right? Well, you just assumed it was boring, right? Yeah. That's the thing,
Starting point is 00:08:22 right? Is you assume that, well, I mean, they figured all this out. Why do I need to think about it? I just need to think about the SQL queries. There's the thing. They figured it out.
Starting point is 00:08:32 Why do I have to think about how they did it? Right? I don't, right? Like, why do I care? I don't care. All I got to do is focus on, like, does my query perform? Do I need to add an index? Is there already an index there I can use?
Starting point is 00:08:45 Blah, blah, blah. Yep. And we already mentioned how I incorrectly think that we are in the golden age of database systems because that's not actually what golden age means. But I still feel that way because there are so many good choices to make and it seems like we had a kind of explosion of them a couple years ago. And
Starting point is 00:09:02 after kind of reading this chapter and reading the rest of this book, I feel like I understand why there are so many, why they have differences, why there hasn't just been one to rule them all, and why they all exist, and the kinds of trade-offs and things you have to consider when choosing one. And by looking at like a deep, kind of deep dive on how it works underneath, I feel like I'm able to tie it into other things I've known about, like data structures and algorithms, trees, things like that,
Starting point is 00:09:26 that I, you know, I kind of know a little bit about. And so like bringing these two things together, two worlds that I know a bit about and finding a common commonality between them has just been really exciting. And, and I don't want to take away anything that's coming up, but like,
Starting point is 00:09:39 that was definitely one of the things that I loved about this chapter was that it does talk about, it basically is like, okay, hey, let's just think about this. Like, what if we had to start from scratch and write our own database from scratch? Like, where might we start, right? And he starts off with just writing a key value pair to a flat file using two simple bash functions that he creates in his script, right? And just starting out small and then starts building on there. And then as the chapter progresses and moves on, then the more complicated concepts that Joe just mentioned, where he starts talking about, where you would start to think about like, Hey,
Starting point is 00:10:27 this is where other data structures might be beneficial, right? This is where a B tree might be helpful. This is where an LSM tree might be beneficial. Like those start, those things start to like crop into the conversation. Right. Just so organically. But do you think, I mean, just to set the ground here, do you think that this is more interesting to us now because data is now so massive? to where we need to understand that stuff because those choices of the systems that you implement or you adopt actually have a huge impact on how things work. Is that why this stuff now feels more important to us? Yeah, I keep wondering.
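The two bash functions being described look roughly like this, a paraphrase of the book's toy key-value store built on a single append-only file. The file name `database` follows the book's example; the sample records here are made up:

```shell
# A paraphrase of the book's toy key-value store: one append-only file.
rm -f database   # start the demo from a clean file

db_set () {
    # Write path: append "key,value" to the end of the file.
    echo "$1,$2" >> database
}

db_get () {
    # Read path: scan the WHOLE file for the key, strip the "key," prefix,
    # and keep only the last match -- the most recent write wins.
    grep "^$1," database | sed -e "s/^$1,//" | tail -n 1
}

db_set alan '{"city":"Atlanta"}'
db_set joe '{"city":"Orlando"}'
db_set alan '{"city":"Kennesaw"}'   # an update is just another append

db_get alan   # prints {"city":"Kennesaw"} -- the older Atlanta line is still in the file
```

Note that `db_set` never touches existing data, while `db_get` has to read everything, which is exactly the write-fast, read-slow trade-off the rest of the chapter (B-trees, LSM trees) is about fixing.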
Starting point is 00:11:16 Is this so interesting to me because it's so interesting or is it just because it's literally interesting to me because of its relevance to my day-to-day life? And I don't know the answer. I am guessing by the comments and feedback that we've been getting that a lot of people find it very interesting too. So we've got that going for us, I guess. Yeah, true. I think for me, it's just a part of it is just having taken things for granted. Like, like, I mean, it's one thing to even talk about,
Starting point is 00:11:47 like, did you ever care about how a database is written to disk? Did you ever think to care about how a cube is written? Right. Right. Like, yeah. Like who cares? Like OLAP. I don't care. Right.
Starting point is 00:11:56 I don't care. Right. Or at least I never thought about it before. And now I, you know, reading this, I'm like, Oh man,
Starting point is 00:12:02 that's so awesome. Right. Yeah. I mean, I'll tie this into some of the stuff that I know that Joe and I have been working on is you look at things like a technology from Google Cloud is called BigQuery. And one of its claims to fame or one of the reasons why people want to use it is because they wrote their own storage format for the data that comes in because it enables them to do faster analytical queries and that kind of stuff, right? So that's all stuff that ties into what we'll be talking about here and what we'll be continuing on as we get on through this chapter.
Starting point is 00:12:36 So I guess with that, let's go ahead and start with just some basics, right? Because we've got to get everybody on the same playing field here. Yeah, so I went to Wikipedia and looked up what a database actually was because, um, you know, we throw that term around a lot. And I think a lot of times people will have kind of, um, we'll just think of the databases that they're kind of used to using, but I really wanted to kind of hone in on the definition, because this book kind of starts separating things a little bit and talking about the various different parts, particularly like the storage engine and like the query engine. Like a couple episodes ago, we talked
Starting point is 00:13:07 about specifically the languages. And we talked about even how some graph databases, like you can swap in and out, like the language and the syntax that you're using. And it is responsible for mapping that to what it actually performs underneath. And so, you know, I just wanted to kind of go take a look. And the basic definition, no matter where you look it up, it's basically just like, yeah, it's organized data, and sometimes you can access it. And that's kind of it. So the example that Al mentioned of writing with bash scripts to basically just update a file,
Starting point is 00:13:38 yeah, I mean, that's kind of a database. That's it. Just a collection of data. And then when you talk to a developer about it, though, of course, when you say a database, like we've got these kind of these preconceived notions, like,
Starting point is 00:13:51 uh, we're generally talking about a database management system at that point. We're talking about the database. Yes. But we're also talking about like the APIs and all the things that kind of go around with that, like the, the query languages and,
Starting point is 00:14:01 you know, even like the ways like you, uh, can organize that data, either like sharding or partitioning or accessing the way you control access to it, like whether, you know, like users or what you can do, like file permissions, work, stuff like that. So those are the kind of things that we usually think of when we talk about databases. We're really talking about that whole system there.
Starting point is 00:14:20 I guarantee you, if you asked any member of your team, your development team, hey, I need you to create me a new database, no one is going to start with a file, right? They're going to immediately jump off to like whatever the platform of choice that your group already uses, be it Postgres or SQL Server or Oracle, whatever, MySQL. They're going to jump into that, and that's what they're going to create. That's going to be your database, right? Notice Microsoft Access doesn't count, right? Even though technically it is.
Starting point is 00:14:53 It technically meets the definition here. So one thing I want to point out is we said database management system. There's actually two kind of big flavors of these things that are worth calling out is you'll typically see them called either RDBMSs, for relational database management systems. So all the ones that Outlaw just said a second ago, SQL Server, Oracle, MySQL, Postgres, those are relational database systems, right? And then you have the other ones that we've talked about in previous episodes that are your NoSQL or your document databases, right? So your MongoDBs, I think CouchDB falls in there. There's a lot of those, right?
Starting point is 00:15:28 So they're both database management systems because they both have those APIs and those access controls and all that kind of stuff. But there are different technologies sitting on top of them that turn them to either relational or document database storage. So just keep that in your head that it's still a database system is nothing more than a collection of data, right? And how it's stored is the big difference in how it's used.
Starting point is 00:15:53 Yeah. And you know what, you mentioned a lot about how, you know, someone says create a new database, I go to SQL Server and I right click and I do that. But what's funny is like, depending on what databases you're using, some of the more modern ones are multi-model now. So if it comes to like Cosmos and you say create a new database, it's like, okay, well, tell me a little bit about your use cases. Uh, same with Dynamo. And even, you know, like MySQL has different storage engines that are better for different things. And so it's just kind of funny to see that our world
Starting point is 00:16:19 is expanding, which can be frustrating, because those are all new things that we need to understand, and there's trade-offs associated with each of those decisions. But it's also an exciting time to live. There's things that are evolving and growing and things that we can do easily now that were really hard to do a couple years ago. Yeah, when you started to go down the path of the two systems, for some reason, I wasn't even thinking about relational versus document, even though that was just a topic of a recent episode that we did. I thought you were going to go down the path of OLTP versus OLAP. Right.
Starting point is 00:16:53 I just assumed that's where you – so maybe there's three. Yeah, there's even more. I mean – Have we talked about that on this show? I don't know in depth. It's coming up in this chapter. I don't think we're going to get there tonight, though. Maybe we'll see.
Starting point is 00:17:12 So it'll be a surprise. Let's keep hope alive, man. One thing to point out here, though, is like Joe said, things are changing a lot. Keep your eyes open. I've actually got a blog post coming out about that. Be aware of the things that are out there. Don't just do what you've always done because you've always done it, right? That doesn't necessarily make sense.
Starting point is 00:17:34 Look at what the use case is and pick the tools that make sense. Right? Oh, man. We've always used VB. And so tonight we're definitely going to be focusing on – TIOBE says that's a big one. You should use that one. That app or that page hasn't been updated in like 10 years, I'm pretty sure.
Starting point is 00:17:58 So tonight we're going to be focused mainly on how things are basically stored and retrieved. We're going to start going down this path. The first one is kind of talking about why you should care about how the data is stored and retrieved. And that's kind of something that before reading the book, I would have thought like, well, I have no plans or interest in competing with Oracle or SQL Server or whatever. So why should I care? Is it enough to know how to perform well and how to write good queries and how to use the analyzer and stuff in order
Starting point is 00:18:26 to get the performance that way? Why should I care? What I'm getting at is that you also need to be able to make choices about which storage engines to use. If you don't understand the trade-offs and why, say, Elastic is good at some things and the things that it's
Starting point is 00:18:41 bad at too, then it's really easy to get either suckered by marketing or to go with the decision that kind of by default rather than making the decision. So I think it's important to kind of have that knowledge. And it's also just really fun to understand. Well, the part that I liked was the actual statement from the book. Hey,
Starting point is 00:19:00 just because you're not going to create your own storage engine from scratch doesn't mean that you shouldn't understand it. Because like you said, it'll help you choose things better, right? Like, yeah, like you said earlier, when was the last time that you were like, hey, somebody, we need a database. And you were like, all right, I'm going to go write some bash shell scripts, right? Like that's not what happens. So, but understanding it is hugely important, I think. I mean, this whole chapter is all about it. It focuses mostly on like how you can optimize things for writes and how you can optimize things for reads.
Starting point is 00:19:37 And then where the tradeoffs or balances are between the two, those two needs in different paradigms that might exist, right? And with the goal of, by the time you finish this chapter, you should at least have enough of an understanding that you can then pick which technology best meets what your main use case that you're trying to solve is going to be. If the thing that you're trying to solve is going to be, I need very fast reads and the writes are less important, right? Then at least you can have an understanding as to like what you should be looking for in your engine. You know, I like to always bring up that course I took on Educative, Grokking the System Design Interview. You would not believe how many times you're reading through like the Twitters or Ubers or whatever's architectures.
Starting point is 00:20:28 The question of what data storage you're going to use boils down to first deciding what your read and write traffic looks like. Because that informs, it lets you pick a whole kind of category. And then when you start thinking about what you're going to query and whether it's like transactional or basically doing aggregations and analytical, that's like a whole other category. So you can just eliminate a huge number of choices by knowing those two things alone. And it's really exciting to kind of see like, oh, if I start with these things, like I can immediately kind of hone in on some things. And I bet if you kind of take a look at
Starting point is 00:21:01 slicing your business use cases on whatever you're working on today and kind of slicing things that way and thinking about what you're really doing with your use cases, you might quickly figure out that you are using a suboptimal solution. And that could be fine. You know, maybe it's working fine for you. It doesn't mean it's, you know, that you shouldn't still do that because, you know, you've got experience with it or you've already got it or whatever. But it's still, it's good to know that like, Hey, there are tools that are specifically designed for the things that I'm doing. And, you know, hopefully if, if you know,
Starting point is 00:21:27 we've done it right. And if this, this all works, then you should be struggling with the things that that system is, you know, traditionally bad with, or where those mismatches are, where the requirements don't quite work with the storage engine that you've got, and you should be probably feeling some, some contention there. And this will kind of explain why and what you could do about it.
Starting point is 00:21:37 and you should be probably feeling some, some contention there. And this will kind of explain why and what you could do about it. And maybe it can't be stressed enough that in case if you weren't already trying other things or didn't already know, like you shouldn't just rely on like, say your SQL server instance to try to be your everything. But you know what, to be fair, and I agree with that, but here's the reason why people do, is over time, the SQL servers, the oracles of the world, even the Postgres, they've turned into Swiss army knives, right? Like, if you need to schedule a job, it's built into it.
Starting point is 00:22:18 If you need to do some analytical type stuff, you can do it. The SQL syntax is there. So it's understandable that everybody's latched on to those things and they don't want to walk away from them because if you know how to write a query, then you're like, hey, I know how to get out of this thing what I want to get out of it, right? But what Joe just said and what Michael's getting at as well, the important part is it might be suboptimal. So yeah, it'll work, right? How many times did you fight that thing where it's like, oh, the query is now taking 30 seconds. You know, it used to take half a second. Oh, well, you've pushed it past what it should do. Now it's
Starting point is 00:22:56 an online transactional database and it's your reporting database and it's your analysis database. So now it's doing all these things and it's doing them all suboptimally now because they're all contending for those same resources. So it's fair to know that you're probably doing it, but you're probably doing it because they built them to be able to do all this stuff, even though they might not be the ideal solution for it. Yeah, your comment about like, yeah, they are Swiss Army knives, but it can be suboptimal at other scales, is similar to what Jeff Atwood has said, that everything is fast for small n. So, yeah, you might be able to get away with text searching in SQL Server, which is a feature, right? And, you know, if it's not at a large scale, that might be fine.
Starting point is 00:23:57 It might be good enough. Yep. But for an Uber? Yeah, they can't do it. And I actually really like that Swiss Army knife analogy because Swiss Army knives have those little screwdrivers on them. And you can totally unscrew a screw with that thing. And you will be frustrated by the time that you're done with it. But it would be a whole lot easier if you just had a Phillips head screwdriver that you could go do it. They'll both get the job done, but one is going to give you a lot better experience than the other, right?
Starting point is 00:24:24 But the bonus is it comes with a toothpick, so you're probably okay. It does. I mean, that makes up for it. Yeah. And what is the toothpick in SQL Server? Do we know? Ooh. Yeah, I don't know.
Starting point is 00:24:34 All right. Leave a comment. Let us know. We'll build a book. Right. All right. All right. And so in this chapter, you should have a lot more knowledge about how to kind of choose and evaluate storage engines.
Starting point is 00:24:45 And that's really powerful and really interesting. So now it's time to get into the fun stuff, right? Like where we start digging into what is actually happening. So, yeah, this is the part where you'll kind of get an appreciation for it. So in the example that Outlaw started with earlier where there were two bash statements, right, and one that was, you know, write, and I think another one was get, the way that it works is it's an append-only file, right? So you have this text file, and every time you go to write a record, you have a key and you have a value is the way that they're going about it in the book, right? And let's just say that the key is your name and the value is, I don't know, a document about you, a contact information, right? So you have Outlaw that's going to write a line. It's going to have Outlaw as the key, and then the
Starting point is 00:25:37 value is going to be his contact information, right? Alan, same thing. Joe, same thing. The important part here is it's always write only, right? You mean append only. Append only. Yeah. Good point. You're always basically opening up that file, writing to the very end of it, closing the file. So every time, if I update my address, then I have a new line at the bottom of that file and I'm now in there twice. Yeah. Now here's the beauty of that approach, because if all you're ever doing is appending to the end of the file, all you have to do from a write perspective is just seek to the end of the file, boom, add your new line and you're done. So what you're describing is a, a, a write-enhanced file format. You're just appending to
Starting point is 00:26:24 the end of the file, and super duper fast. Like you said, it seeks to the end. It already knows where it is. Every operating system on the planet is highly efficient at doing this. Yeah, I mean, for the most part. Most file systems, you know, there's some differences, but they'll have a pointer to where the file starts, and they'll have like a size or, you know, basically some sort of indicator of where it ends. So just like array access, you can hop right to the end. Even better, if you go ahead, just leave a thread open with that file open and just have that one writer constantly streaming data to it, you can even skip those steps. So all it is doing is just moving data, you know, into that file. And, uh, I don't know if you're familiar with like zero copy,
Starting point is 00:27:02 but basically there's a couple ways you, you can kind of short circuit a couple of things in operating systems, modern operating systems, where you can actually skip running through RAM if you're writing like to a file. So you can go directly from like a network card to disk, which is crazy. And there's some, some caveats around there,
Starting point is 00:27:17 but that's kind of the gist of it. So, I mean, we were talking about super duper ridiculously optimized, like can't really imagine a better way to write data. Yep. And one of the things that they like to call out in this particular section is this file that we're talking about is called a log. So typically, as application developers, we think of logs as, oh, that's where the web server log is or that's where this log is.
Starting point is 00:27:43 My application log. Yeah, my debug output, whatever. I'm using Apache log4net, log4j. All log means is a write-only file, right? So that's what they're talking about. Append only. And append only. Yes, not write only.
Starting point is 00:27:59 And append only, always writing to the end of it. So that's the important part. So they call it a log, and the thing here that is key also is it doesn't have to be human readable. And in many cases it's not because it's not the most efficient way to store that data. There's some beautiful ways they talk about later. Yeah. So, so just be aware log and not human readable, but it is append only. And there's pointers to those, those keys or those records, right?
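A quick sketch of the append-only behavior being described, with a hypothetical log file and made-up records (not from the book). An update just lands at the end and the old line stays; compare that with rewriting a record in place, which in practice means rewriting the whole file:

```shell
# Hypothetical log file for illustration.
rm -f contacts.log
printf '%s\n' 'alan,Atlanta' 'joe,Orlando' 'michael,Kennesaw' > contacts.log

# Append-only update: seek to the end, write one line. No existing
# bytes move, no matter how big the file already is.
echo 'alan,Augusta' >> contacts.log

# In-place update with a longer value: every byte after the record
# would have to shift, so in practice you rewrite the whole file
# (here via a temp copy).
sed 's/^joe,.*/joe,Jacksonville/' contacts.log > contacts.tmp \
  && mv contacts.tmp contacts.log

cat contacts.log
```

The `sed` line is the expensive path: its cost grows with the size of the file, while the `>>` append costs roughly the same no matter how much history the log holds, which is why log-structured storage engines keep the hot write path append-only.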
Starting point is 00:28:25 And already you can start to make connections in your mind because then you start talking about, well, it's a log. You're like, oh, transaction log. Transaction log, that's a thing. For all databases. We're talking about databases here, transaction logs. And we're already starting with a very
Starting point is 00:28:41 quick definition of this log. Yes, we were talking about the very beginnings of what an actual database is. Yeah, it's funny. Like you said, let's talk about RDBases. And then you start talking about logs instantly. Like my mind was like, who cares? Okay, I guess we're getting to, you know, like how transactional systems work. But no, it's like it's literally talking about the ways to quickly write data.
Starting point is 00:29:01 And the deal is, and the reason that, you know, in addition to just being efficient and good at appending is that if you think about the opposite, if you were going and writing to a spot in a file, that means you have to seek for it. You have to find it. You have to go to it. If that information is larger than the information that you're updating, then you've got to make room by basically shifting everything else to the right. And if it's smaller, same thing, you have to shift all this data. So it's grossly inefficient compared to appending. We're not talking about a micro-optimization here. We're talking about essentially an order of magnitude difference. Hey, when we talk about this thing as a log,
Starting point is 00:29:40 does anyone else think of Ren and Stimpy while we're talking about it? A log, a log. What rolls downstairs, alone or in pairs, and over your neighbor's dog? For fun, it's a wonderful toy. It's great for a snack. It fits on your back. I just think of all the countless hours of my life I've wasted sifting through logs when the problem was glaringly obvious in retrospect.
Starting point is 00:30:08 Wait, you don't think about Ren and Stimpy? Even when he said that, you didn't think about it? You know, I do, but it's like number four on the list. Okay, fair enough. All right. Four. We've got Yuletide logs. We've got all sorts of other logs.
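The append-only log they keep coming back to, which the book famously sketches as a couple of tiny bash functions, looks something like this on the write side. This is a hypothetical Python translation for illustration, not the book's actual code, and the `contacts.log` file name and `key,value` line format are made up:

```python
def db_set(path, key, value):
    """Append-only write: never seek, never shift bytes around,
    just add a new line to the end of the log file."""
    with open(path, "a") as f:  # "a" = append mode; writes always go to end-of-file
        f.write(f"{key},{value}\n")

# Updating a key is just another append; the old entry stays in the file:
db_set("contacts.log", "alan", "555-1234")
db_set("contacts.log", "alan", "555-9999")  # Alan's newer info, appended after
```

Nothing here is human-oriented or indexed yet; it's just fast, sequential writes to the end of a file.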
Starting point is 00:30:23 All right. So the next thing we had, so you talked about writing into the middle is way more expensive, right? Like order of magnitude more expensive. Well, this is also where they start talking about reading from a log is highly inefficient, right? So we talked about the fact that this whole append only thing is amazing. It's fast. You go straight to the spot. You put in your new data. Now, if I say, hey, I want to get Alan's contact information right now, you got to scan through the entire file.
Starting point is 00:30:57 Until you get to the last record. You're going to find Alan's record in there, and then you're going to return the last one that you got. And you're kind of skipping over something, but it's important, though. Okay. When you say the last one that you got, because as we mentioned with this append only log, you made the example of like where you could,
Starting point is 00:31:32 if you updated your entry, it was in there a second time. Right. And so that's why it's important. Like, I don't know, maybe you updated it 50 times. Right.
Starting point is 00:31:41 Right. But it's the last one that you really want. Cause that's the one that has the most correct information. Now, in this example file that we're talking about, our contact database, you know, so far we've only really talked about like two records of real interest in there that you put in there, like yours and mine. But, you know, there might be every name in the United States inside of that thing. And think about how many times an individual might change or update their contact information.
Starting point is 00:32:12 If you had to then go and scan that, you're like, okay, well, I need Alan Underwood's last entry. Right. And the important part here is you have to scan it. Now, maybe there's some sort of hyper-efficient way to reverse your way through a file. I don't know. I haven't really had to deal with that kind of stuff that much.
Starting point is 00:32:31 But the key is you're scanning through it. You don't know where things are. And if you've ever worked with, like, what are they called? Something plans, query plans in a database. You'll typically see something that says it did an index scan or it did an index seek. If it did an index scan, it went through every single record, right?
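The read path being described here, scan the whole file and keep the last match, might look like this. It's a sketch, assuming the same one-record-per-line `key,value` log format as the append-only example, with made-up file names:

```python
def db_get(path, key):
    """O(n) read: scan every record in the log. The LAST occurrence of the
    key wins, because every update was appended after the older entries."""
    result = None
    with open(path) as f:
        for line in f:
            k, _, v = line.rstrip("\n").partition(",")
            if k == key:
                result = v  # keep overwriting; the final match survives
    return result

# A log where Alan's contact info was updated once:
with open("demo.log", "w") as f:
    f.write("alan,555-1234\nmichael,555-2222\nalan,555-9999\n")

print(db_get("demo.log", "alan"))  # -> 555-9999, not the first entry
```

Every read walks every line ever written, which is exactly the "highly optimized for writes, terrible for reads" trade-off being discussed.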
Starting point is 00:32:55 And that's what we're saying. This whole read, we've said that this thing is highly optimized for a write, is not very efficient for a read up to this point. An example here is like the National Weather Service has things all over the place, like measuring wind speed and humidity and temperature, just all over the US. And, you know, that's all sending data really quickly. We don't want to lose stuff. So there's some sort of logger, some sort of fast ingestion system that's taking all that data. But if someone wants to know what the temperature is in Oviedo, Florida,
Starting point is 00:33:27 then that's a terrible system to read from, you know, because like you said, it's going to be at the end because there's, you know, repeats. And if you want to know the temperature now, and you kind of want to start, you know, at the end, you kind of move backwards. And so it's just, it's not optimized for that. And if that's an operation that you'd be, you know, doing all the time, then you don't want to wait four minutes for it to parse through that large file to find that information. So you want to use something that's more appropriate for reading, although you can still take it in the fast way. Yep, totally.
Starting point is 00:33:56 And let me back up here. I said that you could reverse your way through a file, right? That's assuming that you know that my record is closer to the end, right? Like my record could have been the very first record in the file. The problem is you wouldn't know until you went through it. You can see justifications for wanting a write-optimized system versus a read-optimized system. That, yeah, with all the thousands of sensors that might exist out there in Joe's weather example, you could see why you might want a write-optimized system available for that data to go into. But you would then use some other system to say, like, hey, what's the local weather? Yep. Or some other form. It wouldn't be the same system
Starting point is 00:34:52 which we might get into here in just a moment. Well, you know, even the resolution matters. Like, you might have different systems. Like, you know, if I'm looking at the temperature, I want to know what the temperature was, like, maybe right now, and maybe I want to know what it's probably going to be like tomorrow. If you're, like, a storm chaser and you're studying hurricanes, you maybe want to watch how the temperature changes over the course of 11 minutes as the tornado comes in or something, you know. So you want to get a lot of checkpoints. And so just the resolution, the fidelity that you want to look at that data with, means a lot to you. And so it'd be nice if you could have potentially different systems that are optimized for those use cases, because sometimes you care a lot about the intermediary values and sometimes you don't.
Starting point is 00:35:30 I want options. That's right. And so this is where we start getting into some more of the next steps in building your own system. So this whole problem of trying to find a record in this data set, scanning, we've already said, is not optimal. And in many cases, it could be the worst, right? Like it could be O of N, right? You've got to go through every single record to find the one that you need. So the way that you solve this problem is with indexes, right?
Starting point is 00:35:59 And all this is is another data structure to store data. And we probably most commonly know it as basically a hash table, right? So the whole thing of an index is, all right, so I know that the last time that I wrote Alan, he was in position five in this file. You're going to have a hash table that has Alan as the key. And instead of storing the record, it's going to have five saying, hey, this is the position where you can go to in the file to get that information. Yeah. Like the example with the temperature too, you know, like if you know that the way you're going to be using this data most often is associated closely with location, then it might make sense to you to have an index somewhere that basically keeps
Starting point is 00:36:45 track of where that information is sorted by location. So you might be able to go to the index and say, I need info on Atlanta. And it says, you know, here's information on how to seek to places that contain Atlanta that make it so you don't have to scan through that whole big file. And you can just jump to this location or these 10 locations or whatever, some information that makes that quicker. And that's a huge value when it comes to reading. And it doesn't slow your ingestion down. It just means that you have to take on the additional overhead of maintaining these indexes. And this is another one of those cases when we talk about how other data structures could come into play here and why it's important.
Starting point is 00:37:25 Right. So Alan described this hash map, you know, with basically a key, Alan, and then a value, which is the offset to go look in the main data file for Alan's contact information. Right. If you think back to the past episodes that we've talked about, on average, a hash table lookup is O of 1. So you're
Starting point is 00:37:46 already talking about an extremely fast operation. You went from O of N, which, it's not even O of N, right? Because in our append-only file, who knows how many times Alan has been updated. Well, so it'd still be O of N, but N is not the total number of contacts, it's the total number of updates. So let's say that you had all the people in America, right? 330 million people, right? And let's say that they were in there twice. O of N is 660 million scans.
Starting point is 00:38:15 We're saying that with this hash table, it's one. You go straight to the record. On average. On average. Yeah. Now there is worst case. Right. We won't talk about it, but it's
Starting point is 00:38:26 O of N. So something you said though, Joe, is you said that this does not impact write performance. And I don't know if that's true. It depends. And that's where some of the different systems and things start kind of taking different approaches on things based on what they care about the most.
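The in-memory hash index they're describing, each key mapped to the byte offset of its latest record, can be sketched like this. It's a simplified toy, not how a real engine does it (real Bitcask-style engines also track segment file IDs, value sizes, timestamps, CRCs, and so on), and the file name is made up:

```python
index = {}  # key -> byte offset of the latest record for that key

def indexed_set(path, key, value):
    with open(path, "ab") as f:   # binary append; position starts at end-of-file
        index[key] = f.tell()     # remember where this record begins
        f.write(f"{key},{value}\n".encode())

def indexed_get(path, key):
    offset = index.get(key)       # O(1) on average -- no scanning
    if offset is None:
        return None
    with open(path, "rb") as f:
        f.seek(offset)            # jump straight to the record
        return f.readline().decode().rstrip("\n").partition(",")[2]

indexed_set("hash.log", "alan", "555-1234")
indexed_set("hash.log", "alan", "555-9999")  # the index now points at this one
print(indexed_get("hash.log", "alan"))       # -> 555-9999, via a single seek
```

The log file itself is untouched; the dictionary is the derived structure that turns the full scan into one seek.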
Starting point is 00:38:43 There are systems that will not kind of log the data until that index has been processed. And so it kind of doesn't mark this thing as done and move on to the next until it's ready. But I think, I don't know, I guess the kind of loggers that I'm thinking about are so afraid, you know, about missing a message that usually they'll kind of defer to writing things down in the log immediately and then processing after. But it doesn't have to be that way. You know, sometimes people talk about queues, which are a kind of data storage system that really cares about write speeds. Sometimes we'll talk about message queues that focus on guaranteeing at-least-once delivery, meaning it never drops a message.
Starting point is 00:39:25 So in everything they do, they're going to try really hard to always make sure they get the data no matter what, even if it's slow, even if it's not processed or whatever, they're always going to defer to that. And you have the other kind, which is at most once delivery, which is the opposite,
Starting point is 00:39:40 where it defers to never having more than one message. So it kind of makes different tradeoffs. And, you know, I'm sure there's even other specialties that kind of branch off from there. So I do want to be careful about kind of making generalizations there. But for the most part, you can think of, you know, the logging itself being fast, but that data being accessible being more of a question mark. So it depends on the storage engine. And this is why I wanted to bring
Starting point is 00:40:05 it out, right? So first, let's back up and also talk about the fact that an index is based off the original data. So anytime that you're indexing data, you get that original record in, you're trying to create a fast lookup to it for the read performance. It's deriving that index based off the original data. Now, this is why I said it depends on the storage engine. If you're talking about an online transactional database like SQL Server, Oracle, Postgres, those types of things, the more indexes you have, the slower your write is. Because it is an ACID-compliant or whatever transactional system, when it writes that
Starting point is 00:40:44 record, it also has to write all those indexes before it marks it as done. So yeah, good point. So we used the simple thing of our names earlier as the key, right? Typically when you're indexing things, you might also index it by additional stuff, right? So maybe when we wrote our contact information, we had our first name as the key initially. But then the entire record had our first, middle, last name. It also had our address. It had the zip code, all that kind of stuff. You might want to add additional indexes.
Starting point is 00:41:15 You might want to find all the people that live in a particular zip code. Right. Well, if you think about it, I mean, basically, it's almost like we're describing a phone book. And at that point, it's a composite key, which is what you're describing when you use more than one field. And in that case, last name, first name, address is the composite key in the phone book. Yep.
Starting point is 00:41:37 And so here's the key part that I'm getting at here is you can't, you don't just have to think about it as one thing, right? So when we were talking about appending to this file, we were talking about there's a key and there's a value. An index doesn't only have to be the key, right? So you could actually have another index that's derived off that data that says, Hey, I want to create an index that's based off the zip code. And so now you create a new index and it's going to keep pointers to all those other records where all those people lived in that same zip code.
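A secondary index like the zip-code one being described could be a second hash table, mapping a zip code to the offsets of all the records in that zip. The thing to notice is that every write now has to maintain both structures. This is a toy sketch; the names and record format are made up for illustration, and a real engine would also have to deal with stale offsets as records get updated:

```python
by_name = {}  # primary index: name -> offset of latest record
by_zip = {}   # secondary index: zip code -> offsets of ALL records in that zip

def add_contact(path, name, phone, zipcode):
    with open(path, "ab") as f:
        offset = f.tell()
        f.write(f"{name},{phone},{zipcode}\n".encode())
    by_name[name] = offset                         # update index #1...
    by_zip.setdefault(zipcode, []).append(offset)  # ...and index #2

def contacts_in_zip(path, zipcode):
    names = []
    with open(path, "rb") as f:
        for offset in by_zip.get(zipcode, []):
            f.seek(offset)                         # jump to each record
            names.append(f.readline().decode().split(",")[0])
    return names

add_contact("zips.log", "alan", "555-1111", "30305")
add_contact("zips.log", "joe", "555-2222", "32765")
add_contact("zips.log", "michael", "555-3333", "30305")
print(contacts_in_zip("zips.log", "30305"))  # -> ['alan', 'michael']
```

Every additional index is another bit of bookkeeping you pay for on ingest, which is exactly the trade-off being called out here.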
Starting point is 00:42:06 So that's why the write performance actually does suffer. Because as you write that first record, depending on how many indexes you have backing that for search, it's having to go in and update all those locations and all those indexes as well. Am I remembering it wrong? There wasn't like a portion in this chapter where he was basically describing other systems, though? And this might be where you were saying the engine matters. Because I thought I recalled him describing another one where it was just writing to this transaction log, and it was used for crash recovery. It could pick back up after the fact and then continue back to rebuild indexes and whatnot as necessary. That was based off it snapshotting things as it went.
Starting point is 00:42:47 Yeah. But yeah, that was different recovery models. And that was actually the pros and cons of doing some of these systems. So yeah, at any rate, going back to this, that's the trade-off. You have fast write speeds. If you need increased read speeds with this particular format we're talking about right now, you take a hit on the write as you write your indices or keep track of your indices. Yeah, that's a good point. I was kind of focusing extra on, like, specifically logger-type systems there. Then if you ever see a question that begins with,
Starting point is 00:43:26 the customer you're working with has 10,000 IoT devices, you can automatically rule out relational databases as one of the answers. It's not the use case. It's not meant for ingesting that kind of fast data. It's just not meant for that sort of thing, and it's going to fall over. So a little tidbit there. That's hilarious. Today's episode of Coding Blocks is sponsored by Datadog, the monitoring platform for cloud scale infrastructure and
Starting point is 00:43:50 applications. Datadog provides customizable dashboards, log management, and machine learning based alerts in one fully integrated platform so you can seamlessly navigate, pinpoint, and resolve performance issues in context. Monitor all your databases, cloud services, containers, and serverless functions in one place with Datadog's 400-plus vendor-backed integrations. If an outage occurs, Datadog provides seamless navigation between your logs, infrastructure metrics, and application traces in just a few clicks to minimize downtime. So go try it yourself today by starting a free 14-day trial and receive a Datadog t-shirt after installing the agent. Visit datadoghq.com slash codingblocks to see how you can enhance visibility into your stack with Datadog.
Starting point is 00:44:43 That URL, again, was datadoghq.com slash codingblocks. Okay, well, how about we get into my favorite portion of the show. It's time for a joke. You didn't see that one coming, did you? I didn't. I like it. All right. So our buddy James on Slack sent me this one, and I was like, oh, this is so great. And so topical, too.
Starting point is 00:45:09 Are we talking about the cynical developer? Yes, that would be the one. Thank you. Yes. Another great podcast. Yes. Do you know how much space Brexit will free up for the EU? I don't.
Starting point is 00:45:30 I hope it runs with Brexit. That's all I know. One GB. Oh, gosh. That's pretty good. Man, James, that guy really knows his onions. That's all I got to say about that. Yeah.
Starting point is 00:45:43 All right. So – That's actually really good. Yeah. So obviously it's time for Survey Says. All right. So a few episodes back, we asked the question that people really wanted to have an answer to. Which sci-fi series is best? And the choices were Star Trek, damn it, Jim, I'm a doctor, not a... oh, okay, fine. Or Star Wars, Han shot first. And I think this was, I think we did this survey
Starting point is 00:46:19 around the time that, uh... because it's been a minute. I think this was around the time of, uh... what was the last movie? Can you remember the name? Yeah, uh... yeah, The Last Jedi. Last Jedi, thank you. The last... yeah, the last what? The Rise of Skywalker. The last guy. It was like right around the time The Mandalorian was coming out, too, which I think has probably biased maybe the survey. Oh, okay. I guess we'll see. Good. Well, let's see then. So we'll go ahead. Joe's already throwing his opinions out there.
Starting point is 00:46:52 So we'll let Alan go first. Oh, I didn't see that one coming. No, I didn't. So I'm going to say Star Wars, Han shot first. We'll go with, I mean, there's only two chances. So I got to go greater than 50%, right? So let's say 51%. And I have spoken.
Starting point is 00:47:11 Random. Wait, wait. Did you not watch The Mandalorian? Yes, yes. You don't remember the character? Yes, you're right. Yeah, that guy was great. It was somewhat relevant.
Starting point is 00:47:22 All right. Thank you. Yeah, that was a fail on my part. I will admit. All right. Good. So I'm going to say with 33%. What?
Starting point is 00:47:32 I'm going to say Star Trek, damn it, Jim. I'm an intellectual, not an action hero. Let me see if I understand the math here. So you are supposing that Star Trek is the most popular answer with only a third of the vote between two choices. Right. I don't know.
Starting point is 00:47:54 I mean, I might be right. You might be. You know what, Joe? I like your optimism. You just might be right. Find out. We will, sir. We will.
Starting point is 00:48:10 Only because we play by Price is Right rules. Make it so. Oh, God. All right. Well, okay. Well, I have to be the bearer of bad news to one of you. Care to take a gamble?
Starting point is 00:48:33 It might be. I am wearing a Star Wars hat right now. It's got to be true. I have to have won that just because he went under 50%. So, yeah. Alan, you won. It was Star Wars. I mean, surprise, surprise, you won. And it was over 50%, wasn't it?
Starting point is 00:48:50 It was. You know, that's the funny thing about math. It was that Mandalorian. So, yeah. At about 60% of the vote, it was Star Wars. And maybe Yoda might have weighed in a little bit on it. Hey, look, let's be honest, right? I don't care if you're a man or a woman.
Starting point is 00:49:09 We can all admit we want a little Baby Yoda. Baby Yoda's pretty awesome. Yeah, I want one. I'm not going to lie. I do have a couple Baby Yodas on my shelf, actually. Oh, really? Yeah. That's a cute little thing.
Starting point is 00:49:20 I need one. Yeah, and he has his own little force capability. Right. How cute is that? And his ears wiggle. You're like, you go to change his diaper, and he has his own little force capability. How cute is that? And his ears wiggle. You go to change his diaper and he's like, no, go away. He forces you away. I think I'm actually going to go back and watch it again. That little guy made me smile every time he came on
Starting point is 00:49:38 screen. Yeah, he was so cute. Alright, well, huh. Who would have thought that 50% is the winning amount. You know what? Maybe the next episode we'll just like rehash some math. Joe's about to drop off. I was going to do another joke, but I don't know that we need to now.
Starting point is 00:50:01 I'm just saying, you know, replicators, right? No money. Totally communist universe. Wait, where did we just go? There? Star Trek? Oh, okay. All right, no more Trek. All right, so, uh, yeah, so much for humor. Uh, well, let's do another joke anyways. How about that? Oh, my head hurts. So, uh, from Slack, our... how do you pronounce this one? Our bleeder? I hope I'm saying that right. Uh, gave me this one. This is my joke for life. Oh.
Starting point is 00:50:42 That's how, that's how I how – that's your biggest hint already. So we've got the best chapter he's ever read plus the joke for life all in one episode. All right. You ready? You ready for this? I'm ready. What does a developer do before starting their car? Make – I don't know.
Starting point is 00:51:04 I have no clue. Get in it. Oh, my gosh. I gave you such a big hint. Such a hint. I just know that you liked it. I should have known. Yeah, get in it.
Starting point is 00:51:17 Wow. That's really good. Okay. Yep. So, all right. So, for today's survey, we ask the hard-hitting questions that other shows just don't even think to ask. And so, today's survey is, which fast food restaurant makes the best fries? Because the people want to know. That's right. All right. So your choices are Arby's, Burger King, Checkers, Chick-fil-A, Hardee's, In-N-Out, Jack in the Box, McDonald's, Popeye's, Steak and Shake, or Wendy's. And I'll give you a hint.
Starting point is 00:52:03 Some of you are going to be wrong. And you want to know what's great about this particular survey is I have a feeling there's going to be a lot of passion in the answers behind these, right? Yeah. You're going to have to defend your answer in the comments, and that will enter you in for a chance to win the book. Oh, man. I would actually love to see the dissertation as to why people chose one over the other instead of just choosing,
Starting point is 00:52:30 like totally leave a comment. Like, yeah, it's got to be these. And this is why. And if you don't have pommes frites in your neck of the woods, then write in and let us know what you like instead. Oh,
Starting point is 00:52:43 man, that makes me remember. So why do we call them French fries here, but they're called chips overseas? What is that? A chip is a thin, sliced, fried thing. Why are those called chips? Like, why is fish and chips not fish and fries? It hurts my brain.
Starting point is 00:53:07 Biscuits and cookies too, man. I, you know, I don't know. Wait. Well, you know, I did forget one last joke before we leave this section because as it relates specifically to our survey that we already gave the answers to, Mike RG from Slack, you might have heard his name like once or three billion times. Per episode. Yeah, per episode. He pointed me to a tweet from Parker Higgins that really makes a lot of sense and really gives you something to think about, especially as it relates to our survey
Starting point is 00:53:46 and just this architectural type conversations that we're having. So Parker says, I used to wonder why the interfaces on Star Trek are so clunky, given that it's centuries in the future, but I guess that's just enterprise software for you. That's good.
Starting point is 00:54:08 This episode is sponsored by Educative.io. Every developer knows that being a developer means constantly learning. New frameworks, languages, patterns, practices. There's so many resources out there. Where should you go? Meet Educative.io. Educative.io is a browser-based learning environment allowing you to jump right in and learn as quickly as possible without needing to set up and configure your local environment. The courses are full of interactive exercises
Starting point is 00:54:38 and playgrounds that are not only super visual, but more importantly engaging. And the text-based courses allow you to easily skim the course back and forth, just like a book. No need to scrub through hours of video just to get to the parts you care about. The incredible thing about Educative.io is that all of their courses have free trials, a 30-day return policy, so there's no risk to you. You can try any course you want and see what you think of it. And you're going to love it.
Starting point is 00:55:04 And here's the great thing. They recently introduced subscriptions. So now you can go, our listeners can go to educative.io slash coding blocks, and you can get a 10% off discount on any course or subscription. Again, that URL is educative.io slash coding blocks. And, you know, I got to bring up my favorite course, Gropking the System Design Interview, in which they go over a bunch of common architectures for, no, I shouldn't say common. They go over architectures for prominent platforms,
Starting point is 00:55:37 like say YouTube or Twitter or Uber, and break down how those systems are designed. And it'll show you just how important it is to know the read, write ratio and volume when you're trying to think about how to design a system, or if you're trying to interview doing a system design interview. So I definitely recommend checking that out. And remember,
Starting point is 00:55:57 they've got that 30 day return policy. So if it's not for you, then that's okay. You can, you can afford to try it out with no risk. Hey, and with 10% off, you can't go wrong. Yeah, absolutely. So make sure to start learning today by going to educative.io slash codingblocks. That's E-D-U-C-A-T-I-V-E dot I-O slash codingblocks. And you can get that 10%
Starting point is 00:56:23 off any course or an additional 10% off of a subscription. So let's jump back into the conversation with hash indexes. So, I mean, this is kind of a continuation of the hash map conversation that we were talking about before the break,
Starting point is 00:56:41 where, like, we might store the key in a hash table and be able to then have the luxury of doing an O of 1 lookup, and then that pointing us to an offset in the main data file that we can then go and retrieve Alan's contact information. That's right. In a nutshell.
Starting point is 00:57:09 And what's interesting is they say that they did this. I've heard of Riak, or Ree-ock, I don't even know how you say it, but I've heard of it before. But they said that this is what's done for Bitcask, which is the default storage engine for Riak. The interesting thing, though,
Starting point is 00:57:23 is they store this entire set in memory. So super fast, but you gotta have enough RAM, right? Yeah. I wonder what kind of applications people are doing with Riak. I haven't really looked into it too much. Yeah, one thing I kind of learned recently is how often databases are kind of embedded into different applications. Like Kafka embeds RocksDB in their Kafka Streams applications, and that's kind of like the most prominent example that I think of. Jaeger is the application I've been using for some tracing that lets you use, oh, sorry, uh, different kinds of databases underneath,
Starting point is 00:58:06 including Elasticsearch, can kind of power its stuff. And there was one other example I wanted to give. Well, what were you going to say? About Jaeger lets you use... Elasticsearch, ultimately, as its storage engine for displaying it. Oh, the other one, Grafana. You can have different storage engines that are kind of underneath it. And so what I think of as Grafana is a bunch of pretty graphs. Underneath, you can do, like, Prometheus or Influx, or maybe there's other choices there. But it's interesting to see that these other
Starting point is 00:58:33 applications are kind of built around databases but don't necessarily expose that database to you. And, uh, you know, there's nothing new about that. I just, like, I forget sometimes that so much of what I get out of applications that I use is often kind of granted to them by the magical powers of their embedded database. But how does SQLite not come to mind for you? I've never used SQLite. Yeah, I definitely have. I mean, you talk about it like something that's, like, embedded everywhere.
Starting point is 00:59:04 Yeah, it was, like, de facto a lot of times for mobile applications. Right. And even PWA, all the things. Yeah. Right. I guess I did a little bit when I was messing with Unity,
Starting point is 00:59:17 it was easy to embed it in there. And that's a great use case. So, like, you want a relational database inside the game? SQLite. Great choice. Yeah. Yeah.
Starting point is 00:59:23 I mean, clearly the author, Martin Kleppmann, has had a lot of experience in a lot of different database technologies, because some of these I'd never heard of. Like the Riak, I was like... it sounded more like a car. You know, the funny part is it is actually pronounced Ree-ack. I had to go look it up. So Joe said it properly first. But yeah, um, the interesting thing about this one is they say that all the keys stay in memory, but you're still appending to that file constantly. So every time you write to that file, all you're doing is just going back with that O of 1 lookup to
Starting point is 01:00:04 get back to that key, update the new pointer, and you're good. So it's hyper efficient. Right. So that's Bitcask and Riak. Yeah. Now you have to think, though, like, okay, if you're just always going to write to this file, like, what next? You're eventually going to run out of disk space, right? Like that can't be your strategy for life, right? Or can it? I mean, that's what I love about this particular chapter, by the way, is it just keeps building on. It's like, okay, well, here's
Starting point is 01:00:37 the, here's the very first problem you're going to run into. Right. So the answer is obviously no, you can't do that. So you have to come up with some other solutions. So that's where file segmenting and compaction come into play. So by that, what I mean is we gave this example where we were using a bash function set and get to write contact information to a flat file. And we made some mistakes and we had to update Alan's information 50 times. All right. So, you know, that 50th one is really the one that matters. So the compaction, what that would do is we would eliminate all those other ones and we would just
Starting point is 01:01:20 store the one entry for Alan, but, in this homebrew, homegrown version, into a new file; that's the important part, right? So again, we're still in append-only mode. The big difference is, when you go through the compaction, you're reading through the old stuff and you're basically trying to merge that into a new file that is also going to be append-only and will eventually become the new log file that everything else is writing to. You can imagine, if you're designing a new system like this and you start going down this path and you realize that you're potentially going to run out of disk space, you start thinking about how you might do this underneath. Like, I would probably pick a size, like four gigs, and I would just allocate that size on disk. And then I would start at the top and start appending. And as I started to get close to that four gig limit, then I would go and allocate a
Starting point is 01:02:08 new file. And then as soon as I hit that limit, I've already got that next four gigs allocated and open, and I can run over there and do that. At that point I can drop my pointer to the file that I had open; I can exit it. And then another process can come along and take a look at older files at some point, whenever it chooses, and go through and kind of clean things up, compact them. And that's really powerful. And it reminds me a little bit of garbage collection, except that it can cleanly segment these things off by kind of saying, like, we're not garbage collecting the stuff that's actively being written to right now, because we've made this rule where we only ever write
Starting point is 01:02:43 to the end. You know, I mean, that's one of the subtleties of this book, of this chapter that I loved about it, is that even in the scenario that Joe was just describing where you might pre-allocate this four gig file, in this chapter, he specifically discusses, like, even the performance gain that you would get from sequential writes and reads by writing all of that in one contiguous block on a spinning hard drive, right? And, like, what benefit you might get from that. Just little things like that that, you know, if you weren't thinking about it, you know, and you could easily take for granted, right? But he calls it out. Yep. This chapter is extremely thorough.
Starting point is 01:03:29 Yeah, he goes deep. I mean, it's a step-by-step on how you would actually do this from scratch for a very basic but still functional database, right? And one of the things that they point out in this whole compaction type thing, right, to what Joe was saying is typically these things happen in a background thread, right? So think about it, right? You have something that's constantly writing to your live log file, and then, you know, it's approaching the time where it's filling up and it needs to create this new segmentation, this new file. It's going to do that on a background thread. And also in the background, it's going to try and go through and find all the latest, newest records for any particular key, write them to it. And as soon as
Starting point is 01:04:10 that's done, it's basically going to do exactly what Joe said. It's going to deallocate the pointer to that old file, point it over to that new one where you've got that compacted data in it, and start writing to that. And then that garbage collection, that file garbage collection, can take place and you can delete those old files if you want, right? Because that way you won't eventually run out of space, not if you keep basically trimming the old stuff
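That compaction process they just walked through, read the old segment, keep only the latest value per key, write a new append-only file, and drop the old one, might be sketched like this. This is a toy version of the homegrown database from the chapter; the file names and the key-comma-value line format are just illustrative, not how Bitcask or any real system lays out bytes:

```python
import os
import tempfile

def compact(old_path, new_path):
    # Scan an append-only segment of "key,value" lines; later lines win,
    # because appends happen in order, so the last entry per key is the truth.
    latest = {}
    with open(old_path) as f:
        for line in f:
            key, _, value = line.rstrip("\n").partition(",")
            latest[key] = value
    # The compacted segment is itself written append-only, then swapped in.
    with open(new_path, "w") as f:
        for key, value in latest.items():
            f.write(f"{key},{value}\n")
    os.remove(old_path)  # old segment can now be garbage collected

# demo: Alan's contact info got corrected a couple of times
d = tempfile.mkdtemp()
old, new = os.path.join(d, "segment1.log"), os.path.join(d, "segment2.log")
with open(old, "w") as f:
    f.write("alan,555-0001\nalan,555-0002\njoe,555-1234\nalan,555-0050\n")
compact(old, new)
print(open(new).read())  # only the last alan entry survives
```

In a real system this would run on a background thread over closed segments only, exactly as described above, so it never contends with the live log.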
Starting point is 01:04:34 as you go along. So you can kind of envision where the system part of RDBMS comes into play, because you can already imagine, like, okay, I need a whole other separate process, maybe, to manage some of this compaction and segments and whatnot, and moving the pointers around, versus another process that's just like, okay, you're going to give me data, I'm going to take it in, I'm just going to write it to the transaction log. I'm not going to even think about it beyond that. I just read and write: read it from you and write it to the file.
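That dumb-but-fast write path they keep describing, append to the log and keep a key-to-byte-offset hash in memory, is roughly this. It's a toy sketch of the Bitcask idea, not its real record format, and the class and file names are made up for illustration:

```python
import os
import tempfile

class TinyKeydir:
    # Append-only data file plus an in-memory hash of key -> byte offset
    # of the latest record, so a read is one dict lookup and one seek.
    def __init__(self, path):
        self.f = open(path, "a+b")
        self.index = {}

    def set(self, key, value):
        self.f.seek(0, os.SEEK_END)
        offset = self.f.tell()
        self.f.write(f"{key},{value}\n".encode())
        self.f.flush()
        self.index[key] = offset  # O(1) pointer update; old record stays behind

    def get(self, key):
        if key not in self.index:
            return None
        self.f.seek(self.index[key])
        return self.f.readline().decode().rstrip("\n").split(",", 1)[1]

db = TinyKeydir(os.path.join(tempfile.mkdtemp(), "contacts.log"))
db.set("alan", "555-0001")
db.set("alan", "555-0050")  # the 50th correction
print(db.get("alan"))       # reads only the newest record
```

Every `set` just appends and moves one pointer, which is why updates never touch the middle of the file; the stale records are what compaction cleans up later.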
Starting point is 01:05:10 Yeah. I want to mention, too, the two systems I work with a lot day to day are Kafka and Elasticsearch. And both of them have this concept directly. It maps to segments and compaction. And both of them work exactly the same way. And after reading this chapter and being able to correlate the things that I've learned about, like, Elastic, one thing that I've noticed is that, with Elastic, if you fill up your disk space,
Starting point is 01:05:33 it's a problem, because it's not easy to clean up old records. When you try to compact, it needs to allocate new space so it can go through its segments and write to a new file before it can delete those old sections. So what that means is, if you've got 100% disk utilization, it's really hard to make more room and delete stuff. You can't just go in and say, okay, fine, delete the oldest thousand records, because it's like, actually, I can't truly delete anything off disk, because there's no room for me to write the new file. And so you can get into kind of a bad problem there, where you basically have to move stuff off disk in order to clean up some room, and then move that stuff back on.
Starting point is 01:06:13 So it can be a big problem in Kafka too. Actually, one thing I ran into there that I never really realized until reading this chapter is that they've got retention policies on their topics. And you can say, like, age data out after a couple of days, or, I shouldn't say age, you know, maybe only keep 50 megabytes of data around for this topic. Retention policies.
Starting point is 01:06:33 And yeah, so I kind of thought like, OK, cool. So as soon as a new record gets written, maybe it looks at the oldest record and kicks it out. That's not how it works. It basically happens when those segments roll over.
Starting point is 01:06:44 So whenever it kind of hits that limit on disk, that's when things get cleaned up, or that's when things get compacted. And that can have a big impact if you're writing a program and your program doesn't ever expect to get older data because you've got a retention policy set on. But that data doesn't actually get cleared out of that segment until compaction occurs, which in Kafka's case doesn't actually occur until a segment essentially rolls, which is that process where you start writing to a new one. So if you've got a really low volume topic, even though you've got a retention policy set to, say, three days, if it's such low volume that it doesn't ever roll over the
Starting point is 01:07:19 segment size, you could have a year's worth of data in there. Yeah. So if you're doing things like government work, where you can't keep data longer than X days, or if you've got, like, a GDPR incident and you need to wipe someone's data, and you think that you're clear because your retention policy is only three days, it may not actually be the case, because of how these things work underneath. And so it can really trip you up if you don't understand how those things work. Yeah, now, after this chapter, I can go look at those systems and kind of understand them more deeply, and I'm glad that they use the same terms for these things. Yeah, it is really interesting, too, because what you're talking about, even in the Kafka world, is that the size retention policy competes with the time retention policy.
Starting point is 01:08:00 Right, so what you're saying is, you might think, okay, well, I'll be smart about this, and I'll just make it to where, you know, these things roll over segments every minute or something, right? Because, hey, if I want to make sure that these records age out really fast, then I can do that. But the problem is now you're creating these new files constantly, right? So you have the contention of creating the new segmentation file while you're closing out the other ones, and you're having to write to disk at the same time. So it's really a balancing act, right? Like, you're going to have to set the proper size to say, hey, I think I'll get this much data, so that it'll trigger this new segmentation file, and I need it to work within this amount of time, so that they're not competing with each other and you're not constantly writing new segmentation files. Because in the Kafka world, probably very similar based off these
Starting point is 01:08:51 very simple principles we're talking about here, is each segment has basically a starting offset and an ending offset, so that when you go to seek to records, sort-of-ish, in the Kafka world, it kind of knows where to go find them. So all of these principles we're talking about here, in this very simple implementation of a database, are used in a lot of storage systems that are now adopted by massive companies. So one of the things here that I don't know if we covered is, while that background thread's running, where it's basically trying to create a new segmentation file, they also point out in this implementation, you're still going to be writing your new records to that other file. To the append-only log. To the append-only log, and you're also still reading
Starting point is 01:09:46 your offsets from that append-only log while this other segmentation file, or segmentation log, is being created, because you don't want to contend with that thing while it's creating it. Only when that thing's done do you switch over. And then they say, after it's done, after you've created that new segment file, then the old ones can be deleted. And then again, that, of course, is going to boil down to whether you have a retention policy or something, right? Like, maybe you just kill them as soon as they're done. You know, I've seen guidance from Elasticsearch that says not to run compaction, or what they call force merge in their case, to basically clean up those sections, those segments. They say not to do that to an index that's actively being written to. And now that I think about it, it's like, okay, that makes sense, because we're writing to stuff while we're trying
Starting point is 01:10:32 to clean it up. But I don't know what happens in that case. Like, does it mean that it doesn't clean up the active segment, because that would make sense to me, or does it try to clean it up and things get a little weird? Or do you lose your writes? Does it kill those? Yeah, I don't know. I don't think it loses them, but I don't know. We've got a science experiment to try out. But you know what's funny is, with these systems,
Starting point is 01:10:56 benchmarking is really important with them, not just because your data is different, but because there's so many different factors at play. Like, say, if you do set the segment size really low, then you could slow things down. But depending on your use case, maybe that's, you know, important and worth the tradeoff. So you can make a little tiny setting change and drastically change the performance of your system as a whole. And so it's really important to be able to test that stuff before you roll the changes out. And that's how, you know, small config changes can bring down, you know, large data centers on Christmas Eve or whatever.
Starting point is 01:11:26 It's always the best time to bring down your database too. Lizard Squad. Lizard Squad. That took me a minute. Yeah, that was a different incident there, but it just reminded me. So some key factors
Starting point is 01:11:44 making things work well: file format. It's mentioned here that CSV is not a great format for logs; that's comma-separated values. Typically you want to use something like a binary format that encodes the length of the string in bytes, with the actual string appended afterwards. Yeah. So someone would say, why? Yeah. So they basically said that with CSV, again, the format's not good for it, but by having the length of it, you can basically store the offset to the end of that thing, right? So I could jump to the next line without scanning through all these characters, or, like, regexing for a newline character or whatever.
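A sketch of that length-prefixed framing. The 4-byte little-endian header is an illustrative choice here, not any particular system's on-disk layout:

```python
import io
import struct

def append_record(f, payload: bytes):
    # 4-byte little-endian length prefix, then the raw bytes.
    f.write(struct.pack("<I", len(payload)))
    f.write(payload)

def scan(f):
    # Hop record to record: read the length, then read exactly that many
    # bytes. No scanning characters, no regexing for a newline.
    f.seek(0)
    while True:
        header = f.read(4)
        if len(header) < 4:
            return
        (length,) = struct.unpack("<I", header)
        yield f.read(length)

buf = io.BytesIO()
for rec in [b"alan,555-0050", b"joe,555-1234"]:
    append_record(buf, rec)
print(list(scan(buf)))
```

The payoff is the "simple add" mentioned above: the next record always starts at the current position plus four plus the length, no matter what bytes the value contains.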
Starting point is 01:12:24 So it's kind of like the thing we talked about at the beginning, where a file system, depending on your file system, might contain the start and the length of the file, so if you need to hop to the end, it's got a really easy way to do that. Same thing here. So if you're trying to seek through these things as quickly as possible, it wants to be able to go line by line, and the fastest way to get to the next line is to know exactly where that next line begins, so you can just do a simple add in order to hop to the next line. Yep. When you think about it, if you've got, like, a 300 gig file and you've got a billion lines to go through, that's a lot of hops. It's a lot of math just to get to that last item there, or somewhere near the bottom. Yeah, it's important to call out, too, though: the reason why we're even talking about CSV is because we didn't
Starting point is 01:13:03 make this callout before, but technically that was the format that he was storing everything in. It was like a key, comma, and then a string value. Yep. So that's why CSV came up. Yeah, good point. Didn't even think of bringing that up. And then deleting records requires some special attention, because you have to create a sort of tombstone record in the file for when you do the merge process. And that's something I hinted at with Elastic, where, you know, it does ingest things in a log-type format, where it keeps appending, appending, appending. And if you delete an item in an index, then it doesn't go and remove it, because, as we said, when we're dealing with logging systems that have to be really fast for ingestion, we typically only write to the end.
Starting point is 01:13:48 And so in Elastic's case, what it does is it stores the fact that you deleted this document somewhere else, and when you query, and it does its filtering and its magic and whatever, it needs to take that into account and say, oh, this one's been deleted, and exclude it from the results. Which is overhead, but
Starting point is 01:14:03 we'll probably get to that later and talk about how they could do that quickly. But the gist is to know that when you delete a document in something that's using this kind of mechanism underneath the hood, it doesn't automatically free up disk space. And so if you run out of space on Elasticsearch and you say, delete these 100,000 records, it might go and mark them as tombstoned. It might set that first bit to zero or whatever and say, hey, this is deleted. But it doesn't free that disk space up, so you still can't take in new records, even though it feels like it should have, had the delete successfully executed.
Starting point is 01:14:35 I like to think that we live in a world now where emojis are such a big thing that instead of writing a zero, it could just write a skull and crossbones. Yep, literally puts in a... And that's something, actually, with Kafka too. Since a lot of the systems that we talk about, a lot of modern kind of queuing systems or topics, deal with immutable messages, they really want you to keep things alive for, like, event sourcing, or so you can recreate the state of your document at any time, so deleting is really tough. And the way you do this is with the tombstones, like we mentioned. But you have to be careful with, you know, your clients. If you're doing something kind of naively and maybe starting from the beginning
Starting point is 01:15:13 of time and building up some sort of system or map or picture, you've also got to be able to handle things like tombstoning and removing records as they come along, too. And so your clients have to be a little bit smarter about things. And it's kind of funny that you can do all these operations on something that ends up getting deleted a few minutes later. Yeah, it is. One of the cool parts about the Kafka world, at least if you're talking about Kafka Streams, is they actually use the same term there as well. So when you're trying to delete something from a streaming process, you tombstone it.
Starting point is 01:15:42 So you basically send it a key with a null value, right? And it will mark it as ready to delete. So yeah, it seems kind of goofy at first, but it makes sense to me now. But when you think about event sourcing and replaying events, you might be tempted to say, why don't I just delete the records? It seems goofy that I'm going to do all this math on things that, later, maybe I delete 90% of. But these topics, these systems, don't know what you're doing with the data. It doesn't know if you're making decisions based on the current state of that
Starting point is 01:16:12 system. So in order to be able to replay things, it needs to replay everything, even if it ends up getting discarded at the end. And in a database, this would be what's known as a logical delete instead of a physical delete, where you basically just mark a flag on a record and say, hey, it's deleted. Ignore me when you're trying to show anything that's still alive.
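That tombstone/logical-delete idea in miniature, over a hypothetical event log (not any particular system's wire format):

```python
TOMBSTONE = None  # a key with a null value marks a delete, Kafka Streams style

def replay(events):
    # Rebuild current state from an append-only log of (key, value) events.
    # Tombstoned keys are excluded from the result, but their bytes are
    # still sitting in the log until compaction eventually runs.
    state = {}
    for key, value in events:
        if value is TOMBSTONE:
            state.pop(key, None)
        else:
            state[key] = value
    return state

log = [("alan", "555-0001"), ("joe", "555-1234"), ("alan", TOMBSTONE)]
print(replay(log))  # {'joe': '555-1234'}
```

Note the client-side burden they just mentioned: anything replaying from the beginning of time has to understand tombstones, or it will happily resurrect deleted records.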
Starting point is 01:16:31 You might also call it a soft delete. A soft delete. You might hear that term. Yep. Oh, yeah. I should mention, too, that we've been focusing a lot on logs this episode. We are still talking about relational databases right now. Yeah.
Starting point is 01:16:42 Because this is a fundamental piece of how relational databases like SQL Server, Oracle, or Postgres work; this is a big part of how they work underneath. So we're actually building up to their specific data structures, which are built on these core kind of tenets of logging. I want to correct that.
Starting point is 01:17:00 I would not say that... these are core concepts to databases, but we're not necessarily talking about a relational database. We haven't talked about relating anything to anything. We're just talking about how to store some data. Yeah, I'm just saying that this is a... So the concepts that we're talking about could apply to a document database. They could apply to OLAP.
Starting point is 01:17:20 Like, we don't care yet. Yeah, yeah, good point. I did not phrase it well. What I meant to say is, this is an important facet that plays a big role in relational databases, as well as all these other systems like I mentioned, Kafka and Elasticsearch. So it's not just relational databases, but we're getting there, is what I'm trying to say. I would agree with that. Yeah. So we kind of already hit on this one, but, you know, crash recovery, right? Like, it's not a matter of if your server is going to crash, it's a matter of when. And, you know, you mentioned, like, pre-allocating a four gig file, right? Like,
Starting point is 01:17:55 so if that's the size of your segment file, depending on the size of the segment files, it could take a minute, you know, for the server to spin back up, depending on how it's writing these files to disk and, you know, what is being written and in what order. Right?
Starting point is 01:18:16 Yeah. So this was the whole thing we were talking about, right? Like, if you didn't have this in-memory hash and now you have to rebuild this in-memory hash, you're going to have to scan that four gig file to rebuild that memory. That can take some time. And they said that Bitcask, what they do is, on occasion, so they're writing their log constantly, right? But at the same time, it will snapshot its in-memory hash and write that to disk. So if it
Starting point is 01:18:44 did crash for some reason, when the thing comes back up, it can go load up that snapshot file, load that straight into memory without having to scan the four gig file. And then that way you have your pointers right back to that data. And just to put some terminology around that: that snapshot file that you're referring to at that point would,
Starting point is 01:19:02 would, in fact, be the index. It is. That is what's being kept in memory, but it occasionally does snapshot that index out to a file, and it can reread that occasionally. Yep. You know, in the case of a server crash.
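A sketch of that snapshotting idea. JSON here is just a stand-in for however Bitcask actually serializes its hint files; the function names are made up for illustration:

```python
import json
import os
import tempfile

def snapshot(index, path):
    # Persist the in-memory key -> offset hash so a restart can load it
    # instead of rescanning a multi-gig segment file to rebuild it.
    with open(path, "w") as f:
        json.dump(index, f)

def recover(path):
    # On restart: load the last snapshot if there is one. A fuller sketch
    # would then replay only the log records written after the snapshot.
    if os.path.exists(path):
        with open(path) as f:
            return json.load(f)
    return {}  # no snapshot yet: fall back to a full scan (not shown)

hint = os.path.join(tempfile.mkdtemp(), "index.snapshot")
snapshot({"alan": 0, "joe": 14}, hint)
print(recover(hint))  # the key -> offset pointers come straight back
```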
Starting point is 01:19:28 uh depending on how large those files are you know it could definitely have an impact on uh you know i mean it's part of the trade-off that you're gonna make right because either you take the overhead of writing a bunch of little files or you're gonna like pre-allocate one large file so that you can have one large sequential file to read and write to but then either way yeah either way if it's four gigs of data in 100 files or four gigs of data in one file you're still going to take a hit on trying to go scan through all that data well it depends on what that hunt does those hundred files represent though well let's say it's the same format right right? Yeah. You're just going through it.
Starting point is 01:20:06 But one of the other things that they talk about here, another thing you need to be concerned about in this particular model that we've been building up, this underlying storage, is what happens with incomplete record writes. You know, if this thing crashes, it can use checksums and say, hey, is this thing done? Is it complete? Is everything kosher? If it's not, then it can skip the bad sections and you kind of pick up where the good stuff left off. Finally, when I mentioned concurrency control... So you got something, Outlaw? Well,
Starting point is 01:20:48 I was just going to say, like, you know, even if you were to think back to Bitcask, you know, 'cause you're talking about Bitcask, but none of us really have a lot of experience with that. But if you were to think back to, like, any other traditional kind of database, like a SQL Server,
Starting point is 01:21:02 you know, you could think about how, in the ACID compliance that you mentioned earlier, right, that thing isn't truly written, considered written, until, like, all the indexes have been updated additionally. Right. But if it did write to the log and, you know, maybe two out of five indexes got written and then it crashed, you can start to imagine now, like, oh, here's where some of the spin-up time can come from on a restart, because it rereads that log: okay, what were the last things that were done?
Starting point is 01:21:38 Did those things finish? Let me go finish those things. So now you can kind of get an idea as to, like, how does it recover from that crash and still adhere to ACID. Yeah, there's, like, checkpoints all over the place, right? So it is interesting. And Jay-Z, you want to pick back up on the concurrency? Yep, just want to mention, so for concurrency, we kind of hit on this before, but it's common for there only to be one writer that has the open file pointer
Starting point is 01:22:10 and that is responsible for streaming that data as quickly as possible to disk. But it's also common to have multiple readers. And we can do that, again, because we know that the data written, once it's written, as long as we're doing a proper log, is immutable.
Starting point is 01:22:24 So it's safe for multiple readers to be in there at a time. So that's something we can parallelize out and, you know, do a couple different interesting things with, without worrying about slowing down the writing at all. So, no locking. Yep. This is where the questions that everybody's had bouncing around in their heads come in, like, why are you just writing, append-only, right? Like, why not update? Yeah, I mean, it seems terrible. Like, you have to worry about wasting this space.
Starting point is 01:22:49 If you've got data that's update heavy, and an update is a kind of write, but it's a specialized kind of write, then this seems, you know, like a terrible idea. And so, you know, we kind of touched on some of these things before, and it can seem really inefficient at first. But like we mentioned, if you stick something into the middle of the file, not only do you have to seek to it and search for it and find it, but then, depending on if that data is larger or smaller, you potentially have to bump all the data one way or the other, which is a pain.
Starting point is 01:23:17 If you imagine, like, even deleting the first line of a file means you have to shift everything up. Now that I say that, I don't know if there's maybe an optimization for that in the file system or whatever. But as far as I know, you have to shift everything in place. Well, but I mean, if we're keeping true to the spirit of this chapter, though, you're managing this file yourself.
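The shuffling cost is easy to see over an in-memory byte buffer: deleting from the front means rewriting everything after it, while appending touches nothing already written. This is just a sketch of the idea, not what any file system actually does internally:

```python
data = bytearray(b"joseph,555-1234\nalan,555-0050\n")

# delete the first record "in place": every byte after it has to move up
first_end = data.index(b"\n") + 1
remaining = data[first_end:]  # this copy is the O(n) shuffle being discussed
data[:] = remaining
print(data)                   # only alan's record remains

# appending, by contrast, moves nothing that's already written
data += b"joe,555-1234\n"
print(len(data))
```

Scale that buffer up to a 300 gig file and the shuffle is why the chapter keeps steering everything toward append-only writes.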
Starting point is 01:23:41 Right, right. So if you're going to delete the first entry from the file, then you're responsible for shifting everything up. Yeah, yeah, that's true. I just got hung up thinking about whether there was some way maybe I could... Because the point is to try to think about, like, you know, something has to do this shuffling around, right? And, like, where are some of the weak spots, the strengths, the advantages of different ways of even reading or writing or storing this stuff on disk? Yep. And we mentioned, too, how this is particularly efficient on spinning hard drives.
Starting point is 01:24:18 But one thing that the book mentioned that we didn't really go into is that sequential operations are also more efficient on solid-state drives. I hadn't heard this before, and I haven't looked deeply into it. I was curious if anyone looked it up, or why that was. I've never looked into it, but I have also seen that, at least on performance charts and stuff where they're comparing multiple drives, like your sequential versus your random reads and all that kind of stuff, there's massive differences in them. Yeah, I mean, generally, if you really want to gauge the performance of any drive, be it a solid state or a spinning disk, it's the randoms that are going to be the true measure of how fast it is, right? Because that's what your OS is doing is
Starting point is 01:25:02 just throwing stuff all over the place. Yeah, because the sequentials are usually going to be like that; that's what they're all optimized for. Right. The sequential is if you're writing a massive chunk of a file at once. Like, you have a movie that you're bringing over and it's 30 gigs. That's your sequential write and your read. But that's not most of what's... You hope.
Starting point is 01:25:23 But that's not what most of your operating system is doing, right? It's usually files scattered all over the place, which is why your performance numbers are based off that. Yeah. When you get 550 megabytes per second random, then that's good. Yeah. You're flying. Well, actually, no. That's not even good by...
Starting point is 01:25:40 Nowadays. Nowadays. Yeah, I'm sorry, I should have added a zero to that. That was a SATA 6. I'm sorry. 2007 called, they want their hard drive back, right? So I looked up why, for SSDs, it matters at all for sequential, and what they kind of say is, basically, if you're doing
Starting point is 01:25:57 random access and you're writing to random spots on disk, then they can basically leave little holes; we're talking about fragmentation, essentially. And if you're writing a lot of data, those holes eventually need to be cleaned up. And so you're kind of forcing more garbage collection or defragmentation type operations, because you're going to be filling up space
Starting point is 01:26:17 in an inefficient way compared to a sequential write, which doesn't leave any of those holes and is able to just keep streaming that data out. But that's the same for both spinning disks and SSDs. Yeah. Yeah, I was just curious. Like, I understood, you know, you've got that physical pointer, that physical writer head, on the spinning disk, and so that made sense to me as to why sequential was really important there. But SSDs, I always kind of thought of them as basically being similar to RAM,
Starting point is 01:26:43 and so it's like, why does it matter if it's sequential or not? But it basically has to do with how things get junked up, essentially. So, similar type thing. I just imagined... Because I just assumed that it would be like, even on that disk, there's still a controller. There is. And like any other process, it's got a certain number of threads that it can do stuff with. It has to know where to go get it and all that kind of stuff. If it can just be like, oh, hey, start from this offset and read 10 gig,
Starting point is 01:27:16 that's going to be a lot faster operation than, okay, read from this offset and read 5 meg. Now go to this offset and read 2 meg. Now go to this offset and read a gig. Now go to this offset. You know what I'm saying? That was just my assumption. Well, it turns out that's part of it too. So I just read someone else that said even on smaller writes,
Starting point is 01:27:35 excluding this kind of filling up and having to do that cleanup, it's not a true zero-cost operation to kind of hop around, because when things aren't stored linearly, you do have to go back and do those calculations in order to move around and read and write to different areas. So, exactly what you said. Cool. It's close to zero, it's much,
Starting point is 01:28:07 SSDs are such a beautiful thing. True that. So concurrency and crash recovery, we mentioned how logging systems deal with these, but it's much simpler. When we get into relation databases more, we're going to talk about the things that they have to do to go pretty far out of their way in order to make those things work. To keep this back in context, though, we're talking about it's much simpler if you're in append-only mode and not updating portions of the file. Yeah, you're right. So I only wanted to bring that out because we floated away a second. So we're talking about why not update in the middle of the files, right?
Starting point is 01:28:45 Like, why not go update the value for some key? And this is one of the reasons: the concurrency and the crash recovery are much easier if you're always doing append-only. Yep. And merging old segments is a convenient, unintrusive way to avoid fragmentation. It gives us a nice, convenient pattern to follow there. It's low effort. It's hard to mess up. And it just kind of works out really well in practice. And so this fragmentation, I think it's important to understand what he was just talking about with the SSDs and the hard drives and all that. Imagine, you know, a simple example, because we're using my name, or let's use Joe this time. So let's say that he first went in there as Joseph Zack, right? And then at some point they come back and they're like, no, no, no, we want that as Joe. Well, because you don't
Starting point is 01:29:31 want to shift all those bits around on disk, because that's a really expensive operation, especially if there's 10 gigs of data after him, probably what you're going to do is you're going to write J-O-E there, and you're going to null out that next bit in there and just leave four bytes open, right? And so that's where you start running into fragmentation. And that's why they're saying, if you do this append-only thing,
Starting point is 01:29:55 then you don't have this fragmentation. You don't have all these empty blocks all over the place, because you're always putting it right at the very end. So this update-in-place causes fragmentation, and that's why they lean towards append-only mode. And if you also think about it like this,
Starting point is 01:30:12 if you were creating this database server, this database system, and this system was only ever going to be used for this database, you can kind of already get an idea where, like, if it's always pre-allocating these files of a certain size, right? And it's always going to be, like, you know, from the beginning of this computer, this server's life, it's always going to be pre-allocating them in a certain size,
Starting point is 01:30:39 sequential writes and reads. And then, oh, I've got to get a new file, so I'm going to go and pre-allocate another one, and then eventually I can age off this other one. You can see how the disk itself is always going to lend itself to being non-fragmented. You know, hopefully more often than not, the disk itself will remain that way; you're just packing them in there. At least you hope. Yeah, I mean, that would be the hope for sure. But yeah, you're just constantly churning over the same thing, right?
Starting point is 01:31:14 Like you're filling it up. It's almost like filling a glass of water, pouring it out, filling it back up. I mean, I'm kind of being optimistic here, though, and hoping that you're going to delete the same amount of records that you're inserting. So maybe it's not true then. But think about it, though, from the point of view that this is also why systems will allow you to keep the data files in one place, temp files in another, and the log in another, so you can allocate different disks for these things, for these purposes. Yeah, there's all kinds of optimizations.
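For readers following along, the append-only "simple database" they keep referring to can be sketched in a few lines of Python (our own illustration, modeled on the book's tiny two-function bash example; the file name and comma-separated record format here are made up):

```python
import os

# Minimal append-only key-value store: an "update" never rewrites old
# bytes in place, it just appends a newer record to the end of the log.
LOG_FILE = "simple.db"
if os.path.exists(LOG_FILE):
    os.remove(LOG_FILE)  # start fresh for the demo

def db_set(key, value):
    # Always append; never seek back and overwrite.
    with open(LOG_FILE, "a") as f:
        f.write(f"{key},{value}\n")

def db_get(key):
    # The last record for a key wins, so newer values shadow older ones.
    result = None
    with open(LOG_FILE) as f:
        for line in f:
            k, _, v = line.rstrip("\n").partition(",")
            if k == key:
                result = v
    return result

db_set("name", "Joseph Zach")
db_set("name", "Joe")   # no in-place rewrite, no hole left on disk
print(db_get("name"))   # Joe
```

The trade-off is that the log only ever grows, which is exactly why the segment merging and compaction they keep mentioning exists.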
Starting point is 01:31:49 We're doing the simple database right now. Yeah. Simple. Yeah. All right, so we have some downsides here. And this is going back to what we were originally talking about. So the first one is the hash table must fit in memory, right? If you don't have enough RAM, this was the whole Bitcask thing, right?
Starting point is 01:32:07 Like our hash lookup thing. Then, if you don't have enough memory, you're probably going to have to spill this over to disk, which isn't nearly as efficient, right? Because now you can't just go straight to the spot in RAM where that hash was to point to the location. Now you're actually having to go, okay, well, it's not in RAM. Let me go find it on disk and then go look it up, right? So it's an additional couple of hops. Yeah. And then, you know,
Starting point is 01:32:38 another downside here, we haven't even talked about this one yet, but range queries are going to be inefficient if you have to look up each individual key. And so what I mean by that is, like, if you had to do a query where you're like, hey, give me all of the contacts for people whose names are from A to D, right? Well, depending on how that's sorted, you could see how going through that log is going to be inefficient, if you had to look at each individual entry to find the matches there, you know, match that range query. Yeah, because the key is no longer actually the key, right? Like, it's not like it's stored as, you know, Alan, Joe, and Mike.
Starting point is 01:33:27 It's ABC123, you know, DEF245, whatever, because it's hashing that key for the fast lookup. So, yeah, these range queries are going to be super expensive because they're probably going to be scans. Scans through your hash table, more than likely. Which is an interesting point. Like, when you see something doing a table scan in something like SQL Server, or you see something doing an index scan, neither are good. But typically your index scan will probably still be faster than your table scan, because you're scanning a smaller chunk of data to get to it a lot of times.
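To make both downsides concrete, here's a rough sketch (our illustration, not Bitcask's actual code) of a hash index over an append-only log. The whole key-to-offset map has to fit in memory, a point lookup is one dict probe plus one seek, and a range query has no sort order to lean on, so it has to examine every key in the index:

```python
import os

DATA_FILE = "kv.log"
if os.path.exists(DATA_FILE):
    os.remove(DATA_FILE)  # start fresh for the demo

index = {}  # in-memory hash table: key -> byte offset of latest record

def put(key, value):
    # Append the record and remember where it starts.
    with open(DATA_FILE, "ab") as f:
        index[key] = f.tell()
        f.write(f"{key},{value}\n".encode())

def get(key):
    # Point lookup: one hash probe, one seek, one line read.
    with open(DATA_FILE, "rb") as f:
        f.seek(index[key])
        line = f.readline().decode().rstrip("\n")
        _, _, value = line.partition(",")
        return value

def range_query(lo, hi):
    # No sort order in a hash index, so we must scan EVERY key.
    return {k: get(k) for k in index if lo <= k <= hi}

put("alan", "555-0100")
put("joe", "555-0101")
put("mike", "555-0102")
print(get("joe"))             # 555-0101
print(range_query("a", "b"))  # still had to check alan, joe, AND mike
```

Sorted structures like the SSTables coming later in the chapter are what make range queries cheap; a plain hash index simply can't avoid the scan.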
Starting point is 01:34:06 So it's not efficient, but it's still more efficient than having to go through the entire data set in a lot of situations. Yeah, I mean, another way that you could think about that, this is definitely getting a little bit ahead, but to your point, the data file might contain every column. So if it was a really wide table, you'd have 100 columns in it versus the index. Even in your composite key example, like in our phone book example, we gave three columns. So it's already a much smaller... Size. Size-wise, it's already much smaller, you know,
Starting point is 01:34:45 the width of this thing is already smaller. So you can already see, like, how the size would be greatly impacted, because now you're only talking about an index. And then,
Starting point is 01:34:55 you know, there might be additional operations that could compact that even further that we haven't gotten to yet. Yep. So stay tuned. Yeah. This episode is sponsored by Clubhouse.
Starting point is 01:35:08 Clubhouse is a fast and enjoyable project management platform that breaks down silos and brings teams together to ship value, not just features. So let's face it. Slow, confusing UX is so last decade. Clubhouse is lightning fast, built for today's software teams with only the features
Starting point is 01:35:26 and the best practices you need to succeed and nothing else. And here are a few highlights about Clubhouse. They've got flexible workflows so you can easily customize workflow states for teams or projects of any size. We've got advanced filtering so you can quickly filter by project
Starting point is 01:35:43 or by team to see how everything is progressing. And you can do sprint planning, so you can set your weekly priorities with iterations and let Clubhouse run the schedule. And Clubhouse also integrates with the tools that you love. They tie into existing tools, services, and workflows. You can create a story in Slack, update the status of a story with a pull request, preview designs from Figma links, or even build your own integration with their API, and a lot of other things. And Clubhouse is an enjoyable collaboration tool. Easy drag-and-drop UI, dark mode, emoji reactions, and even more. When you're doing your best work and your team is clicking, life is good. Clubhouse has recently made all core features completely free
Starting point is 01:36:26 for teams up to 10 users. And they're offering Coding Blocks listeners two additional free months on any paid plan with unlimited users and access to premium features. So give it a try. You can go to clubhouse.io slash CodingBlocks. That's clubhouse.io slash CodingBlocks to try it today. All right, so as far as resources, of course, the book Designing Data-Intensive Applications
Starting point is 01:36:54 is fantastic. Make sure to leave that comment if you want a chance to win it. And yeah, now it's on to Alan's favorite point of the show? Or not quite. How about this? How about this? Because I had this question that was written into us, and it came to mind because you mentioned, I think you said, if you were to go after cloud certifications. I think that's how you worded it, right?
Starting point is 01:37:23 Oh, I did get one. I got one of those. So the question that we received was, are certifications worth it? So Joe... And so I was kind of curious, since you just recently
Starting point is 01:37:37 got your certification, maybe you would have an opinion on such a question. Yeah, I do. So I recently got the
Starting point is 01:37:48 GCP, the Google Cloud Platform, ACE
Starting point is 01:37:52 certification, which is their Associate Cloud Engineer. And that's the first certification I've ever gone after and gotten, and
Starting point is 01:38:02 uh, I've kind of avoided for years because I never really kind of saw the point I had a couple bad experiences with tests that didn't really I feel like actually accurately judge like how well I knew a subject like I took like a cold fusion test once and it was all about CF forms and like nobody used those at the time it was a terrible way of doing things and I was really upset about the test and I was angry I was like like, screw this. And, uh, but I, you know, I've kind of changed my mind recently a little bit about some certifications because, uh, kind of two things, two reasons. One is that when you study for a certification, particularly for something that you're kind of newer at, it really highlights
Starting point is 01:38:38 the things that you may be overlooking. So if it's, you know, something you've been doing for 10 years and you're already doing that job, then studying for the certification isn't really going to highlight areas that you're weak on, unless you're particularly willing to bone up on that language just to do it. But when you're first getting into something, it's really easy to not understand, or not realize, that you're missing big areas, or that you have big gaps or big misunderstandings in your knowledge. So it was a great way for me to learn GCP and have a goal that I was going after. So it was great for the knowledge aspect.
Starting point is 01:39:11 Also, I think that some certifications are particularly valuable now. And particularly, I'm talking about cloud and specifically Kubernetes-type certifications. There's a couple in there like Elastic, Kafka, I think that are really valuable now. But those are all ones where it's like there's a lot of knowledge that you can be tested on that could be really important because there are all these like dark little kind of corners that will trip you up and, you know, bite you on the ankles and mess you up. And so I think by having those certifications, you kind of show that you've at least done
Starting point is 01:39:44 kind of a general lay of the land, and that you aren't just strongly focused on whatever small slice of that technology your day job works with. It means that you've kind of kicked the dust off of most aspects of those platforms. And so I think that me and some organizations are kind of coming around on valuing those certifications higher. I think security is probably another great space where, if you're working in the security space, those certifications are highly valuable. Same reasons. Alan?
Starting point is 01:40:17 Yeah, I have similar feelings, and probably even a little bit further. So one of the things that we talked about in this episode was the Kafka retention policy, and things aging out or being pushed to a different segment, that type of stuff. The only reason I even know about most of that stuff is because I was actually working towards getting Kafka developer certified, right? And that's some of the stuff that you learn about as you're going through preparing for that. And that's not something I would have known otherwise. I would have assumed the same thing: hey, I set the retention policy to seven days, so after seven days, that data is gone, right?
Starting point is 01:40:53 That's not the case, right? So it's like he said, filling in those gaps is important. I will also say this, and this is important: by no means do I want people to get hung up on going and getting certifications over getting experience on things. I think that certifications help lend credibility to your knowledge and your experience and your ability to work on things. However, there's going to be people that are like, oh, they say go get certifications, so I'm just going to focus on certifications. And unfortunately, you can go take tests on things all day long and not actually understand how they work, right? So I feel like, if you're experienced in something, certifications also kind of give you some credibility to go along with it. So if you're talking to somebody within your organization and they say,
Starting point is 01:41:47 hey, how do you think we should do this? And you say, hey, I think that Kubernetes would be a good fit for this because X, Y, and Z. And they can look and say, oh, this guy's actually Kubernetes certified developer or Kubernetes certified admin, right? That lends some credibility so that it's not just, hey, I've got this crazy developer over here saying that we should do this. So I think it's two things, right? Fill in the gaps. I think it's important. And I do think that it can help you sell your
Starting point is 01:42:17 case in certain circumstances. And it could also be good for getting jobs, because let's be realistic. Nowadays, your LinkedIn, that social profile, and being able to put certifications on there is a big deal. And I actually noticed at our job, in your personal profile, like in our HR personal profile, there's a place where you can plug in certifications, right? So it can actually matter for you in multiple ways. It takes a lot of time. It takes a lot of effort. In some cases it takes a lot of money, but it can be worth it. It's funny. The downside to this is, you guys, I know you both remember this. You remember when MCSE and MCP were like the big things. And there were so many exam companies that started up that it was, hey, come over here and take our training for $1,000 and we'll get you MCSE certified, right? And you had all these people getting MCSE certified that didn't know jack about how to set up systems.
Starting point is 01:43:21 And it kind of tainted the market back then. And I think we've gotten past that, maybe a little bit. I don't know. So I'm always kind of torn on them. So the interesting thing is that
Starting point is 01:43:35 both of you targeted infrastructure-y type certifications. Nobody said a Java certification or any kind of application developer certification. You both went after infrastructure type things. So that's curious. And I don't disagree with it. It's funny that you bring up the MCSE certifications, because that's definitely put a bad opinion of certifications in my mind, where, like, I never went after them. Because, exactly like what you just described, I worked with a guy, we hired this guy, he had every certification that you could imagine from Microsoft that you might want your network and sysadmin to have. And I'm not kidding when I tell you, he asked me how to find the IP properties on the server. And I'm like, really, man?
Starting point is 01:44:40 Because I don't have any of the certifications that you have. You have every one that Microsoft offers. And so it was at that point, I kid you not, I was like, these things are worthless. I will never waste a minute of my time or a single dollar going after one. Right. Right?
Starting point is 01:44:59 Now, that said, you know, things have changed. Yeah. I will give you that. It's been a minute. And so I will give you that I probably have a very tainted, old view and perspective of it, and I should change that. Because, you know, it does look impressive when you can look at somebody's LinkedIn profile and you see those kinds of things, right? But what I would say is, whether to go after them or not, I would definitely put the emphasis on the experience, like what you said, Alan. Like, that cannot be overstated.
Starting point is 01:45:47 And if, after going through that process of gaining that experience in whatever the thing of choice is that's of interest to you, you're at a point where you can get the certification for it, by all means, man, go for it. Go ahead. Like, why not at that point? But I wouldn't start at it from the inverse. I wouldn't start with, like, oh, I should probably go get the certification in this technology that I know nothing about, so I'm going to study for the certification exam without trying to gain the experience. Because you can do that, and you can get the certification, but when it comes time to, you know, show off that knowledge, you're going to fall flat. Yeah. Right. It's true. And I mean,
Starting point is 01:46:33 I'll give a good example. So I have been working towards the Kubernetes developer one. I think it's CKAD, maybe. I think that's right. At any rate, there's a Udemy course that is, you know, get your CKAD Kubernetes certification, this course will teach you how to do it. And I'm not using it to go get the certification; it is a fantastic course to show you all the things that Kubernetes can do, right? So, yes, it's a good study guide, but more or less, it's almost like what this podcast is for me.
Starting point is 01:47:15 What I think this podcast should be to most people is, how do you improve your knowledge set without having to know everything? Because it's really hard to do, right? So there's no way I'm going to read through every single doc on the kubernetes.io website and go through everything. But if I can go through this course and this guy's like, hey, this is a ConfigMap and this is what it's used for; you know, this is a security vault and this is what it's used for; it's like, oh, okay, cool, right? Now, when I go do something, I'm not going to do it harebrained, because I'm going to have an idea that, you know what? I heard about this. I need to go look at it. So, to me, it's like an accompaniment, right? Like, it helps push you forward. It helps build your knowledge. But if you feel like you're close
Starting point is 01:48:03 to it, do it, right? Hey, look, I'm not going to lie. If you go get your cloud certification in AWS, if you become a cloud certified architect in AWS, you're probably going to have a decent paycheck. If you get your cloud certification in Azure as an architect, you're probably going to have a good paycheck, right? So these things can help you, but they should be part of your experience as you go, not the only thing. Because if you get in there with a cloud architect certification and you can't answer simple questions, you're messing it up for yourself for a long time. And you're also messing it up for everybody else that comes after you.
Starting point is 01:48:43 Yeah. And that's where it's so tough. So, to the question about which ones: I don't have any specific recommendations. I would just say, if it's already something that you are in and you feel like you've mastered it, and there's a certification available for it, go for it. By the way, I am ColdFusion certified. Nice. Very nice. Was it for CF7? That's what I saw. I took a practice exam, and it was like CF forms out the wazoo, and I had a fit. No, this was the pre-Java version of ColdFusion. So I think it was CF5.
Starting point is 01:49:17 I think it was CF5. It was an Allaire certification back at the time. I still have the bag, the green bag with the certification. Yeah, it's been a minute. That was like $700 back then, wasn't it? I think it was $500 back in the day. It was not cheap. And I didn't study for it.
Starting point is 01:49:35 How about that? And I still got it. That's awesome. I'm embarrassed to say what my last certification was in, so let's just move on. CPR? CPR. No. That was a good one, though. I mean, I did get that one at one point. But yeah, real quick before we move on, I don't want to belabor this too long, though: why didn't we mention languages? Yeah, I mean, it's a good question.
Starting point is 01:50:00 Why didn't you? I don't know why you didn't. To me, I think it's because, I mean, technically mine covered it, at least I intended for my answer to cover it: if that's what you feel like you're good at, go for it. See, and I think I'm pretty good at languages, but I just don't, I don't know, maybe it's that I want to fill in my knowledge gaps elsewhere. I think that for some of them you just don't hear a lot about it, like Java, maybe Microsoft technologies, maybe a JavaScript one, I don't know. So I'll tell you the reason. If I see, you know, a JavaScript certification from, like, Udemy or Pluralsight or, you know, Coursera or something, it makes me think that you spent a weekend with the course and took the test and did it. And all that's good, you know, it shows me that you care and you're driven. But
Starting point is 01:50:45 it doesn't tell me that you can program, that you can get the job done and you can hear a use case and then go off and make it happen. But when I hear that someone has like John Calloway from Six Figure Developer just got the Azure DevOps certification and I know he does a lot of work with it.
Starting point is 01:51:00 And so when I hear that he's got the certification, it tells me that he went and looked and zeroed in on the area and made sure that he covered all his bases. So if I talk to John and he's consulting with me on a job, I know if I ask him about something, there's not some big missing hole in his knowledge. For example, billing is something that was big on GCP. So if someone asks me a question about GCP, it's not like I don't know about billing accounts and how that works and how that relates to projects and stuff. So, like, you know, at a bare minimum, there's not some big fundamental misunderstanding with how I think about the system.
Starting point is 01:51:32 And if you're someone like you code, say like you work with AWS technologies all day long for three years, you may only work with like, you know, S3 and Dynamo and you may know nothing about IAM or the networking or any of that stuff. And so I think when it comes to technology, the cloud technology specifically, and those kind of infrastructure type ones, it's really important for those people to like show that they've got a lay of the land.
Starting point is 01:51:54 They've got a wide knowledge of that platform and not just like a, you know, kind of a very narrow view, which can be really dangerous if you hire someone who's only worked with S3 and Dynamo and they're setting up your billing accounts and maybe they don't know how to use the pricing calculator or don't even know that the pricing calculator exists or something like that.
Starting point is 01:52:13 Well, that's where they need Flo with the pricing gun. You know what? I think this answer, though, made me realize why I don't think I've ever focused on languages: it's because, like you just said, you're thinking about use cases and outcomes when you're thinking about infrastructure, things in the cloud, that kind of stuff. Just because you passed a developer certification doesn't mean you know how to program. It just means you know the pieces of the language. You know the libraries. You know where to go find things.
Starting point is 01:52:38 You know where to go find things. Well, I mean, for me, where I was thinking that, though, was that for some, it's easy to say who the owner of that thing might be to even give the certification out. So for Java, it's easy for, like, you know, back in the day Sun and now Oracle to be like, hey, I'm Oracle Java certified or whatever. Right. You know, but if there's no quote owner to it, you know, like a JavaScript, then it's kind of like you went after a Udemy one. It's like, OK, I mean, yeah. Yeah. You got the concept.
Starting point is 01:53:15 If it's your first job, you know, if you're coming out of high school or college or something and then you spend a weekend and got a React certification, I'm much more impressed by that than the person who just got out of college and didn't do that. Totally. Totally. Yeah. And we're not trying to, we're not trying to downplay people that have gone and gotten their J two E certifications or anything like that. Right?
Starting point is 01:53:33 Like it's, it's not a small amount of work. It's just why I haven't chosen to do it. Yeah. If, okay. So fair. Then if you're,
Starting point is 01:53:41 if you're brand new to it, to, to the industry, then, you know,, then the certifications might be better. But if you've been in the industry for decades and you're like, hey, I just got a JavaScript certification from Udemy, then – Well, I'm not talking – I don't know. Like when I said the Kubernetes thing, it was actually a course on how to get certified CNCF Kubernetes. Well, yeah, but that's an infrastructure one.
Starting point is 01:54:10 That was a language one. Right, yeah. Getting a certificate for completing a course on Pluralsight or Udemy is probably not what you're aiming for. Yeah. All right. Well, I think we already said that we were going to have the resources in this... Obviously, we're going to have a link to this book, Designing Data-Intensive Applications. But with that, let's get into Alan's favorite portion of the show. It's time for a joke.
Starting point is 01:54:39 I got him again. One last one. One last one, I promise. Last one. So Arlene shared this one with me, and I thought this was pretty good because, you know, we've got springtime coming up, and you're going to want to get out there and whatnot and get active. So this was, she sent me a screenshot of a tweet from Dad Jokes that said: I made a playlist for hiking. It has music from Peanuts, the Cranberries, and Eminem. I call it my trail mix. That's pretty good. I like it.
Starting point is 01:55:22 And with that, we head into Alan's favorite portion of the show. It's the tip of the week. Yeah, baby. All right. So seeing as I was missing last episode and got impersonated, or actually the episode before last where I got impersonated. There wasn't an impersonation there. That was wrong. All right.
Starting point is 01:55:39 So the first one, I was chatting on Slack, which if you're not a member of our Slack, you should be, because it's full of awesome. I was chatting with Steven Leadbeater, and we were talking about security stuff and certificates and all kinds of randomness. At any rate, he drops this link out there nonchalantly called keycloak.org, and it's amazing. We've talked about, if you go to create a side project or any project, you typically need authentication and all that kind of stuff. And it can be a pain in the butt, right? Like, you want somebody to be able to log in with their Facebook account or Google or whatever. There's this little thing called Keycloak that allows you to sort of painlessly do this. And from what I understand, it can run in a container and it can federate your authentication and all kinds of things.
Starting point is 01:56:24 So check that out. It might make your life a lot easier if you're considering creating some sort of membership type thing or anything that needs some sort of authentication. All right. This one is not developer centric, but for anybody that travels at all, man. So this came from Jamie from the .NET Core Show. So while I was over in London for the NDC conference, trying to get around in a place where public transportation is the thing and you haven't been there before can be a little overwhelming. They've got 50 different lines of railway systems or whatever. And if you're trying to get from point A to point B, it can be really overwhelming. There is an app on iOS and Android, I believe, called Citymapper. You can go to citymapper.com.
Starting point is 01:57:17 It is amazing. And when I say amazing, I cannot overstate that enough. What I should say is, if I needed to get from, I don't know, wherever I was to another part of London, I could plug it in and it'd give me like eight different ways I could get there. It would tell me roughly the times, how much walking you had to do, what rails you had to get on, how much money it was going to cost you. So, from what I understand, this is only available in, like, major cities where they have access to some of the infrastructure, so like the railway times, the subway times, and all that stuff. Amazing, like, killer application. It saved me many, many times. And then, when you say
Starting point is 01:57:58 it's limited, though, you're not kidding, man. I mean, you're talking about, like, how many cities? 41? Okay, it's big cities, right? But it's mostly cities with public transportation. And one of the key things there is they even had Uber, right? So if you wanted to skip all the public transportation, it would give you rough prices for what it would cost you to Uber from point A to point B. So that was really killer. I see a smirk on Joe's face. I don't know what that is.
Starting point is 01:58:26 I heard somebody's notifications, but I can't call them out on it, because that also meant that I was chatting while we were podcasting. That's awesome. All right. And then here's my last one. I was paying attention there, for what it's worth. I appreciate that. Yeah, you're welcome. That's fine. Yeah. All right. So here's my last one, and this one's kind of interesting. So you guys have heard, or at least if you haven't, you should know, of my love for Docker, right? And I use Docker in all kinds of crazy ways. Outlaw and I were talking about it tonight, where he'll use it just so he doesn't have to install software on a system, right? I have very similar type things. Like, if I want to run Ruby, I don't want to install Ruby 2.6 and 2.7 and 2.5 and all those on my system, right?
Starting point is 01:59:08 Like, I'd rather run a Docker container that has it all in there and I'm good. What's interesting is, when you're running Kubernetes, one of the dreams of it is you have this infrastructure and it can deploy containers out to different things, right? That's the whole point of it. Well, when you really dive into Kubernetes, you find out this notion of a node is kind of a server, all right? And then you have containers that run on different nodes. Well, one of the things that always bugged me about running Kubernetes locally is, when you install Docker, you can say, hey, turn on Kubernetes, and that's fine. But you have a one-node cluster, which really bothered me, right? Because some of the cool stuff with Kubernetes is you can create taints and things like that on the nodes. To where, let's say, for instance, you want to run a database in a Kubernetes cluster, and then you've got your application servers.
Starting point is 02:00:10 Well, that database server or that database needs to run on some beefy hardware, right? So you want that thing to run on the most powerful node that you have. And then your application stuff can run on all the other kind of nodes that are semi-powered, but they're not crazy powerful. Well, one way you can do this is have multiple VMs on your laptop or your home computer, and then you can register those things with your Kubernetes cluster. And then that way, you can say, hey, I want to run my Kubernetes containers on these various different nodes, right? They're all treated as servers. Okay. Well, you can do that. You could totally load up VirtualBox and then add a bunch of different Ubuntu servers or CentOS or whatever you want, but that's kind of a pain. Ubuntu actually made this thing called Multipass that is really sweet. It's a command-line way to spin up VMs quickly and easily. So if you wanted to, say,
Starting point is 02:01:09 have four different Ubuntu instances running so that you had four nodes for your Kubernetes cluster, it's basically multipass launch and then name it. And you have a VM running, and you can pass in the number of CPUs, the amount of RAM you want, and that kind of stuff. And you're good to go. So this was something I stumbled across. I've got to play with it some more, but it's really promising for being able to spin up VMs in a very lightweight way. So those are my tips. Oh, yeah, it's really cool.
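For reference, a rough sketch of the Multipass workflow described above. The node names, sizes, and taint key are made-up examples, and the flag spellings assume a recent Multipass and kubectl CLI:

```shell
# Spin up a couple of lightweight Ubuntu VMs to act as extra Kubernetes nodes.
# Names and sizes here are arbitrary examples.
multipass launch --name beefy-node --cpus 4 --memory 8G
multipass launch --name app-node-1 --cpus 2 --memory 2G
multipass list   # shows the running VMs and their IP addresses

# After joining the VMs to your cluster (e.g. with kubeadm join inside each VM),
# you could taint the powerful node so only database pods with a matching
# toleration get scheduled onto it:
kubectl taint nodes beefy-node workload=database:NoSchedule
```

The taint is what makes the "run the database on the beefy node" idea from the conversation work: pods without a matching toleration get pushed to the other nodes.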
Starting point is 02:01:42 I was blown away when I saw it. I do feel like, um, just to elaborate, though, on the Docker thing. Like, I do feel like a lot of times I use Docker for the dumbest things possible. Like, everybody will, like, spin up a Docker container or, like, create a Docker image and be like, oh, I want to run my app server, and I'm going to, like, install this in it and I'm going to install that in it, and now I'm going to run it and I can just, boom, I can hit it with all these different services and whatnot. And I'm like, yeah, well, you know what? I'm going to docker run for one command
Starting point is 02:02:13 so I don't have to install whatever it is. I don't want wget on my system. Yeah, yeah. That's right. Dude, there's a whole slew of Microsoft containers for mimicking Linux commands. Oh, we've talked about that. I can't remember his name off the top of my head, but I will definitely have a link to it in the resources we like
Starting point is 02:02:35 because I want to say his name is something, Stephen something. He was titled the Docker Captain for Microsoft because we've referenced his repo before of all the different Docker containers or Docker images that he created, Docker files that he has available for all that kind of stuff. And yeah, I'll be like, oh no, I don't want to install Postgres,
Starting point is 02:02:59 so I'll just docker run Postgres so that I can pg_dump some other database and I'm done. Like, you know, why bother to install Postgres when there's a Docker image already available for it? Yep. So, yeah, I use it for dumb things. No, those are amazing things. That's my guilty pleasure.
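A sketch of the kind of one-off docker run commands being described here. The image tags, hostname, user, and database names are placeholders, not anything from the show:

```shell
# One-off pg_dump without installing Postgres locally; container stdout
# is redirected to a file on the host.
docker run --rm -e PGPASSWORD=secret postgres:16 \
  pg_dump -h db.example.com -U app_user my_database > backup.sql

# Same trick for a tool like wget, using the busybox image:
docker run --rm busybox wget -qO- http://example.com
```

The `--rm` flag cleans the container up as soon as the command exits, which is what makes this pattern so cheap for one-off commands.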
Starting point is 02:03:18 Okay. So, my tip of the week. So, one, okay, last episode, two episodes back, Joe and I were talking, and Joe brought up his tip of the week, which was Muzzle. Now, do you guys ever – am I – I can't be the only one that quite often goes and looks at the source for some of these things? Oh, yeah. Okay. Thank God. Yeah.
Starting point is 02:03:48 All right. So Muzzle was Joe's tip of the week from a couple episodes back. And if you haven't checked it out already, I cannot stress enough how you need to go to MuzzleApp.com because it is hilarious to watch the messages that will fly in. So for Alan's edification, cause you weren't on that episode. If you get notifications, they're coming in.
Starting point is 02:04:12 What Muzzle does is it will automatically silence those notifications on your Mac when you are sharing your screen. So it happens automatically for you. Like, macOS has the capability to turn off those, you know, to get into a do-not-disturb mode, but this does it for you automatically. And if you notice these messages flying in, they are hilarious. Right. And so I was curious because, you know, I was like, man, I just want to read the full list of the messages. Like, you know, I can sit here and watch them come in one by one, but sometimes, like, you
Starting point is 02:04:46 know, you might blink and, you know, oh, it's already gone. And so I just wanted to, like, so I started hunting through the source because I just wanted to read, like, the full set of messages. And in doing so, I found this beautiful gem that was hidden in there that I didn't know was a thing until now, which is, uh, we've talked about, um, in the past, like, uh, Lorem Ipsum for, like, uh, picture generators as well as text generators, right? Well, there's a randomuser.me site, which is a random user generator. And what it'll do is it'll just return back, like, hey, here's the name. And, whether it's male or female, here's a photo for it. So if you wanted to create something
Starting point is 02:05:43 random, like what Muzzle has on their app for showing, like, hey, you got this notification from Sergio and you want a photo to go along with it. Like, randomuser.me will give you just that. You hit the API and boom, you'll get back a random user. So for example, if you were to go to muzzleapp.com and you were to open up your dev tools and you watch your network tab, you will see a bunch of calls going to randomuser.me, and you'll see what I'm talking about with what that payload looks like. And it's just so awesome. You're like, oh man, I never knew that was a thing. So if you ever find yourself, if you need random users
Starting point is 02:06:31 for your system: randomuser.me. That's cool. All right. So that's, that's my first tip. Uh, then my second one is, uh, I forget now, maybe it was a couple episodes back. No, I think, because I think Alan was here for that episode, where I had talked about, dang, who was it? Was it Russ that told us about it? The Git playing cards? Yes. You remember that? I remember that.
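Circling back to that first tip for a second: the randomuser.me call can be tried straight from a terminal. The jq filter here is just one way to slice the payload:

```shell
# Fetch a single random user from the API the Muzzle site calls.
# The response shape is {"results": [...], "info": {...}}; jq pulls out
# the name object and the large portrait URL from the first result.
curl -s "https://randomuser.me/api/" \
  | jq '.results[0] | {name: .name, photo: .picture.large}'
```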
Starting point is 02:07:02 Okay. So I found this other one that, like, I gotta have this in my life now. It's a Git cheat sheet coffee mug, and here it is, right there for you.
Starting point is 02:07:21 But they have these for everything. So you can go to remembertheapi.com and you can find all your favorite things there. And that's cool. They'll just be cheat sheets for whatever. But, you know, of course I was going to, like, pick out the one for Git. But yeah, there were some great options there. So, like, uh, let me just see, what were some... let me go back and refresh my memory of what some of the other ones were. Um, okay. So a Vim cheat sheet – we've talked about our love of cheat sheets. There was a regex cheat sheet. Uh, there was a, um – well, those were both good.
Starting point is 02:07:55 Then they had them in mouse pads, uh, water bottles, like whatever you want. You know, like, you want a travel mug for your coffee instead? They've got that, versus the traditional kind of, uh, um, you know, coffee mug. You want, you know, just a notebook? You need some stationery and you want your notebook to, like, stand out from everybody else? Why not have a Git cheat sheet on it? Right. And it's like, remember, you know, like when you were in school and you would get your new notebook and it would have, like, oh, hey, here's all the tables of conversions for metric measurements to imperial measurements or temperature measurements or whatever. Yeah, so now you can have a Git cheat sheet on it.
Starting point is 02:08:37 So I thought that was pretty awesome. Oh, there's a Kubernetes one. Oh, was there? I didn't even see the Kubernetes one. Really? Yeah, man. How am I still not seeing the Kubernetes one? Docker CLI, Kubernetes. What? How come you're seeing more cool stuff than I am? I'm on remembertheapi.com slash collection
Starting point is 02:08:56 slash mugs. Oh, okay. Oh, okay. Yeah, yeah, yeah. Right. Oh, computational complexity cheat sheet. Yep. Big O. Tell me you don't want that in your life. Right? Oh, yeah.
Starting point is 02:09:11 Now I see the Docker CLI one. That's great. Cron cheat sheet, man. I'm telling you. Oh, my gosh. Everybody needs that mug. Everybody. You tell me you remember every position of that. I always forget it.
Starting point is 02:09:21 I'm like, wait, is it going from least to most or most to least? All right. Whatever. Inside my dumb head. I want to do a cheat sheet mug delivery service. Every month you get a different cheat sheet for some sort of tech tool. Oh, yeah. That's amazing. They need to do that. That would be a great Christmas gift for everybody.
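For anyone else who can never remember it: crontab fields do run roughly from smallest unit to largest, left to right, with day-of-week tacked on the end. The backup script below is a made-up example:

```shell
# ┌ minute (0-59)
# │  ┌ hour (0-23)
# │  │ ┌ day of month (1-31)
# │  │ │ ┌ month (1-12)
# │  │ │ │ ┌ day of week (0-6, Sunday = 0)
# │  │ │ │ │
# 30 3 * * 1  /usr/local/bin/backup.sh   # runs at 03:30 every Monday
```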
Starting point is 02:09:40 Okay. The last one that I have here is that Alan actually told me about this one. A, how did I not know about this already? B, how has Alan not used this as a tip of the week? And C, forgive me if Alan did use this as a tip of the week and I forgot, because I swear I was listening. A reference to the comment earlier.
Starting point is 02:10:08 Yep. But in your dev tools in Chrome, if you are in your JavaScript file, you're trying to debug something maybe, and you know the specific function that you are looking for, you can type Control-Shift-O, and then it'll bring up a prompt, and you can just type in your method name, for example, and it'll navigate right to it instead of going to it by line number. Yeah, I was watching Outlaw one day. We were trying to figure something out, and he kept doing a Control-F to go find stuff.
Starting point is 02:10:41 No, no, no. I was doing Control-G. Oh, you were trying to go straight to the line or whatever, and I was like, hey, dude, just do this. You type in the method name and you're right there. He's like, oh, it's just one of those things. It's muscle memory, right? I've been doing it so long that I don't even think about it anymore. So, very nice. Yeah. All right. Well, you guys just did, like, seven tips, uh, and, uh, maybe eight. I only have one tip, but it's super good, probably, so that makes sense. Uh, whatever podcast app you're using right now, um, after you finish this episode, go and subscribe to Tabs and Spaces. This is a new podcast, but it's super lit. Uh, if you're a member
Starting point is 02:11:22 of the Slack, then you've seen, uh, many of the characters around. We got, um, it's like an all-star cast. It's basically, like, uh, I want to say, the Coding Blocks of the UK. But we got Zach Braddy, uh, James of the Cynical Developer, uh, and the GaProgMan, uh, of .NET Core and Waffling Taylors. Uh, it's an excellent show. They only got two episodes out, but I can just tell it's amazing. It's going to be amazing. And I see their first episode was 59 minutes, so I feel like, you know, there was a decision there to keep it under an hour. And the second episode is 75 minutes, so I can see that they're on a very similar trajectory to us, and they'll be at three-hour episodes in no time. And I'm looking very much forward to listening to it. And it's great. It's conversational style,
Starting point is 02:12:01 so I'm, I'm going to wager that, uh, I'll wager a few euros, or whatever, pounds. Uh, if you like this podcast, you're probably going to like this one, too. You should check it out. Hey, so I haven't listened to it yet. I'm absolutely going to, but I met Zach Braddy and I met Jamie Taylor. Unfortunately, James couldn't make it. But if this, if this podcast is half as entertaining and half as fun as the conversation and time that we had talking while I was over there, it's got to be amazing. So, yeah, you're going to be cracking up. It's great. That's excellent. Uh, I'm absolutely looking forward to it. And they've even got a sweet-looking logo. So, yeah, the site's really good. Some jerks. Hey, wait.
Starting point is 02:12:45 Our site's really good. We can use a little touch-up. There's this jam stacking. I ain't nobody got time for that. All right. Well, we hope that you've enjoyed this episode as we've dug into how to write our databases to disk and retrieve that data. And this is just the start of this awesome conversation. Next up, we're going to be talking about sorted string tables and log structured merge trees and even B trees.
Starting point is 02:13:22 That's when you get, like, a lot of bees around you. So be careful of those. Yeah. So if you happen to be listening to us because a friend pointed you to us on the website, or they're, you know, letting you use their device, you can find us on all your favorite podcast platforms. So be sure to subscribe to us there if you haven't already subscribed. So iTunes, Spotify, Stitcher, whatever your favorite podcast destination might be. And if you haven't already left us a review, we would greatly appreciate it. Obviously, you're going to get your name butchered by me. So, you know, I can only say that, you know, you're welcome for that. You can find some helpful links there to leave those reviews at www.codingblocks.net slash review, as I remember how the internet works.
Starting point is 02:14:15 See, you can't mess with the trade. Yeah, see. That's why you can't say it right. Yeah, I was going to make it a Twitter thing or something. That's right. So while you're up there at codingblocks.net, check out our fantastic show notes, discussion examples, and more. Load your feedback, questions, and rants
Starting point is 02:14:29 up into a big bag and come and just drop them in the Slack. Boom. By going to codingblocks.net slash slack and sending yourself an invite. Make sure you follow us on Twitter at @CodingBlocks, or head over to codingblocks.net, and you'll find all our social links at the top of the page. Boom.
Starting point is 02:14:46 Dude, 33%! I wasn't that far off. Hey, you can lose and still be a winner. Dude. Oh, man. That's amazing. Bernie taught me that.
Starting point is 02:15:17 We got a show to do, guys. There's two. There's two answers. I was only off by seven. But it was the majority. Oh, man. That's so funny. All right. I think I'm good. I can't hazmat.
