Coding Blocks - Nuts and Bolts of Apache Kafka
Episode Date: June 9, 2024
Topics, Partitions, and APIs, oh my! This episode we're getting further into how Apache Kafka works and its use cases. Also, Allen is staying dry, Joe goes for broke, and Michael (eventually) gets on... the right page. The full show notes are available on the website at https://www.codingblocks.net/episode236 News, Kafka Topics, Kafka APIs, Use Cases, Tip […]
Transcript
You're listening to Coding Blocks, episode 236. That's right, we're talking stuff.
Oh, about that, almost a little bit now. Yeah, whatever, it's a healthy number. And hey,
if you want to hear more, you can check us out on iTunes, Spotify, or, more likely, listening using your favorite
podcast app. And hey, leave a review if you can. Yeah, that's amazing. All right.
Hey, you can also follow us on that X at CodingBlocks, where we are super active all the time.
So with that, I'm Alan Underwood.
I'm Joe Zack. And I'm Michael Outlaw. Oh hey, and also, somebody recently said that it actually helped them out:
they didn't know we had a Slack channel.
How about that?
So, yeah, let's do it up front here.
Go join our Slack.
It's CodingBlocks.net slash Slack.
Easy to get signed up and truly some awesome people in there.
And you've probably heard us refer to them many times throughout these shows.
Maybe the most meaningful thing we've done in our lives.
Probably. Probably, probably brought, I don't know how many it is now, but at one point it was over
7,000.
So, you know, there's a lot of people in those channels.
Hey, so on this episode before, you know, we'll get into it here in a minute, but we're
continuing on the Apache Kafka sort of intro to it, just getting terminology.
And just so you know,
all the pieces involved,
we're going to pick that back up.
But before we do that,
we need Outlaw to practice his proper noun pronunciations.
Here we go.
So,
uh,
from iTunes,
thank you very much.
Anging jellies.
And I was getting some questionable looks.
Okay.
And from Spotify, Nick Brooker.
Not terrible.
Not terrible.
Not terrible.
First try.
Nice.
All right.
Wow.
That was, can you, I don't know, make it a little more fun, though?
I mean, come on.
Right?
I thought it was fun when you get it right.
Let's hear it.
What should I have said instead?
The Angine Jellies one definitely threw me.
Yeah, but you did it right.
I didn't know I was expecting it.
I was hoping for maybe some Angelinas or something.
I don't know.
Something totally wrong.
Yeah, I mean, this is why you do it. I got nothing.
Yeah, I got nothing. Hey, I don't... I just can't
come up with that stuff on the fly. Or, you know, like, I rehearsed it. Normally, when I
try to say these names and I mess them up, that's happening in real time. Like, that's not
something I put together in scripted thought, like, oh, this will be funny. You know, I get asked this a lot,
but we actually, we don't pre-plan our mistakes. They just happen naturally. Yeah, I just legitimately
can't say proper nouns. We're full of mistakes. Hey, also, we didn't put these in the notes, but
we got a couple of, you know, really nice replies on the last episode, in the comment section on episode 235.
So, yeah, this is one that I don't even think I could pronounce. I don't know if it's You Ilian.
He said, you know, thank you, and he loved your picture that you did of the inverted frog and
the scorpion, Joe, which was awesome. And then Tanya G said that she's been listening for more than 10 years
and to keep it up. So that's awesome. And then we also got an email that I thought was pretty funny.
I think, Jay-Z, you found it. Yeah, when I was asking about how we get our MP3, our audio files,
so small. They're frequently much smaller than other shows, like 90% smaller.
And we actually do have a pretty long process, which we used to have on the website, I think.
Did we?
I don't think so.
I don't think we've ever documented it.
I've got the notes I use to make it, so we can drop our secrets in there.
Yeah, but I mean, I want to say usually our files are smaller than 60 megabytes for over two hours of content,
which is definitely smaller than what a lot of people put out.
I mean, I think it's probably just the bitrate that's the most important factor.
Bitrate and MP3, not using some sort of lossless type thing because, I mean,
we're not trying to keep you know dynamics between
instruments and all that. So I assumed MP3 was the standard. So, yeah, that's why I said just
the bitrate. But, yeah, fair enough. If you're going to, like, put it out in a different format, then
maybe, you know, it's going to be a larger file. Yeah, sure. And then the one last bit of news here is Atlanta DevCon is coming in about three months.
So September 7th, that's atldevcon.com.
So if you're in the area, definitely go check that out.
It's always super inexpensive and a day full of good stuff.
I wonder if they've announced the topics yet.
You know, I wonder if they're going to call for speakers. Maybe they're still open.
I think they're still calling for speakers. Well, sponsors.
Yeah, I'm not seeing the topics yet. So, yeah, forget I brought it up. All right, it's dead to us.
All right, so picking up where we left off from last time, we're going on Kafka again, and this time let's talk about Kafka topics and what they actually mean. So Kafka topics
are where the data is written to, and I think we might have left off with that last time,
so it's worth just reiterating this: these are like the folders on the brokers
where the data is written, and they are partitioned. That means that they are distributed, or can be.
If you've only got one broker, obviously it's not distributed, right? Everything's going on to one. But
let's say that you have a three-broker setup or whatever. These will be written across multiple
brokers into what they call buckets, more or less.
And any type of new events that are written to Kafka are appended to those partitions.
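To make the partition idea concrete, here's a minimal sketch of creating a topic with a handful of partitions using the confluent-kafka Python client. The broker address, topic name, and counts are placeholders I've assumed for illustration; they aren't from the episode.

```python
from confluent_kafka.admin import AdminClient, NewTopic

# Connect to a broker; "localhost:9092" is just an assumed dev address.
admin = AdminClient({"bootstrap.servers": "localhost:9092"})

# A hypothetical "orders" topic split into 3 partitions, each copied to 3 brokers.
futures = admin.create_topics(
    [NewTopic("orders", num_partitions=3, replication_factor=3)]
)

# create_topics is asynchronous; wait for each result (raises if creation failed).
for topic, future in futures.items():
    future.result()
    print(f"Created topic {topic}")
```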
So I find that interesting because I know the three of us have looked at the folders on the
brokers and it says it's appended to them. As far as I can tell, it just writes new files, doesn't it?
I think it actually adds to the file. So, like, I don't know exactly when it breaks it up, but it will have
multiple files in there, so you don't have like a hundred thousand files if you have a hundred
thousand messages. But it does some sort of, like, logical breaking up into segments, I think it
calls them, and each one of those is a file. And then it kind of keeps buffers open to them and
just kind of keeps adding to it. Yeah. Well, the thing I was going to add to that, though,
too, is that one of the concerns that you have to be aware of in a Kafka configuration,
and this is getting kind of in the weeds but relates to your question, is the number of open
file handles that you're going to have. So for every topic that you have, you're
going to have at least two file handles open per broker. And if you did what you suggested by
having like a bunch of additional files, then you would make that problem worse. Right?
Right. But so that's why I say, like, it's a concern depending on, like, how many
topics you're going to have on your
broker, the partitions and whatnot. Oh, I should have said not for every topic, I should have said
for every partition, you're going to have the two. Oh yeah, good point. And then it's got to be
reading from and writing to those partitions simultaneously. Yeah.
And I forget what the two... one of them is the actual data, and I
forgot what the other one is now off the top of my head. I didn't realize there was going to be a test.
But, yeah, so, like, you know, if you were going to have, like... I think there's been some crazy ones.
Like, there was the... was it the trillion-dollar Cloudflare Kafka setup or something like that?
And then Uber is a big one.
Like, depending on how much usage your Kafka environment is going to go
through,
you have to be aware of that. And
you can update that, if that is a problem for you. In, like, a Linux environment,
you can change the number of open file descriptors that
can be open at one time,
but you have to be aware that there is a default max limit
in various Linux flavors. And then you also have to be aware of the overhead that each of those
would take up if you're getting into these extreme situations, you know. Right. If you're just
getting started, because you're listening to this and trying things out, you're not hitting those limits yet. Yeah, and I think
we've talked about how difficult
it can be to add partitions later.
It's not the end of the world,
but it requires some planning and some doing.
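If you're curious what the file-descriptor ceiling they mentioned looks like on the box you're testing on, here's a tiny sketch using Python's standard library (Linux/macOS only). It just reads the limit being discussed; the partition count is a made-up number for illustration, and nothing here changes your system.

```python
import resource

# RLIMIT_NOFILE is the per-process cap on open file descriptors.
soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
print(f"soft limit: {soft}, hard limit: {hard}")

# Rough back-of-napkin check: if each partition keeps a couple of
# descriptors open, a partition-heavy broker can chew through a
# default soft limit (often 1024) surprisingly fast.
partitions_on_this_broker = 5000  # hypothetical number
if partitions_on_this_broker * 2 > soft:
    print("You'd want to raise the limit (e.g. via ulimit or systemd) before going that big.")
```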
So I think when we first got started, we were like,
well, let's just give everything a thousand partitions.
That'll be fine. But because of the overhead
of those file handles and stuff, you run into
limits and realize that there's
some guidelines there for a reason. Sizing a cluster is kind of a big deal. Yeah, I'm looking
ahead in the show notes because I'm wondering if that's going to be apparent as to why. So we'll
see how far we get as to, like, why that would be a problem to add or remove partitions. I don't think
it's in this one because this is more just about the different pieces,
not necessarily a deep dive on any of them. So it's important to know about these partitions
though, because the way that this data is written and the fact that it is partitioned across multiple
brokers is what allows Kafka to scale, right? So we've talked about the Microsoft one a long time ago, their Kafka broker setup.
I forget if they had hundreds of brokers or whatever, but that is why it can do it is
because they can break up the data like this and allow multiple readers and writers.
So here's an important thing to note.
And again, I don't think that this gets into the details here, but we can sort of ad hoc this. When you write events to Kafka, any event that uses the same keying, so let's say that you're keying by your customer identifier and some other piece of data, that's going to be your key. All that data
gets written to the same broker, right? To the same area, the same spot on disk. And that's
important to know from something that we've all sort of talked about and called hotspotting.
There's probably other terms for it, but I'll let you guys fill in the blanks there. Well, it's that keying that I was referring to. That's what makes adding
or removing the partitions, you know, tedious. Because you're going to define
like a keying structure so that as each message comes in, it's going to deterministically always route
to a specific broker and partition
and then that's how the system knows, when you go to do a read, exactly
where to go to get it, because it knew exactly where it went to write it. But that's why, like,
when you go to change those numbers, then it's like, okay, well, you have to rekey things.
Yeah, it's super important, because if this broker gets the event
or that broker gets the event,
they both put them in the same partition
so that when you read a partition,
it doesn't matter which broker you read from,
you always get the same results.
So to enable that consistency,
it has to have a deterministic and repeatable,
basically, algorithm for selecting that partition
I think the default is some sort of round robin where it'll try to kind of randomly go in order
based on a checksum based on the key. And if you don't give it a key, you can pass null for
a key, then it basically just checksums the actual payload and figures it out from there. So it's still
consistent based on the message. And that's super
important to understand, because that's basically how all of the scaling, you know, is going to be
focused on that concept, right? Of being able to deterministically always know where you wrote it
and where you would read it based on some kind of key. Yeah, and so when you... go ahead.
I was going to say, well, so if you're going from 10 partitions, you know, and you've been doing
round robin for a while, now you add an 11th. You can't just start with that; you have to rebalance
across all of them, and you have to do that on the same brokers at the same time. They have to
coordinate to make sure they're all like, okay, we've got the same amount of messages now. And you
want to do that with no downtime, too. And so there's just some tricky stuff that needs to happen, and it ends up being
a couple scripts. It's not that bad, right? We've done it before on really large topics, and Kafka
does a great job of, like, still accepting new data and then knowing how to magically kind of pivot
over. And we've talked about strategies on how you could do that before, like snapshot isolation
stuff, where it'll kind of copy data and then catch up and
keep doing that until it's done. Sorry, I didn't mean to cut you off. No, this is where that keying, though, is important. Like, when you decide what the key is going to be for the
messages that you write into your topic, the hotspotting that you referred to, Alan, would be like
if you chose poorly on that. So for example,
there's the three of us. And if we were writing to a Kafka topic and we decided
that the key was going to be the user, well, depending on your circumstances,
maybe that's a fine strategy. But if Alan is way chattier than the rest of us, and let's say he's
putting out a trillion messages an hour, then that Kafka system is going to be writing all of that traffic to that single,
you know, or to a deterministic set of, you know, brokers. So it's going to
potentially hotspot to one specific topic partition, which could be, you know,
on a single broker. And therefore, you're not really taking
advantage of... you're not parallelizing the writes like you wanted to do. So that's why you
have to... you need to consider how you want to key your data. So a couple of things to add on to that.
So I was actually thinking a very similar example to what you were saying. So in that case, what Outlaw was getting at is, let's say that you had three brokers, right? And each one of us was going to a specific broker. I'm the chattiest with the trillion messages an hour. My broker's busy, right? And then Outlaw's doing 10 messages an hour and Jay-Z's doing 50 an hour. Their two brokers aren't doing anything.
So mine's probably going to sort of bottleneck a little bit while theirs aren't even really
doing anything at all, right?
And so that keying strategy matters.
And what Jay-Z said is, you know, I think by default it does round robin.
With Kafka and your producers, you can actually specify a key strategy, right?
Like, so you can sort of tell it
how you want it to place data
so that if you know you have big customers
or something like that,
to route things, split them up in different ways.
So there are ways to get around that.
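Here's a rough sketch of what keying looks like on the producer side with the confluent-kafka Python client. The key is what gets hashed to pick the partition, so every event for the same customer lands in the same place. The broker address, topic name, and customer IDs are assumptions made up for the example.

```python
from confluent_kafka import Producer

producer = Producer({"bootstrap.servers": "localhost:9092"})  # assumed dev broker

# Keying by customer ID: the default partitioner hashes the key and maps it
# to a partition, so "alan" always routes to the same partition -- which is
# also how hotspotting happens if one key is far chattier than the rest.
for customer_id, event in [("alan", "bought coffee"), ("outlaw", "bought tea")]:
    producer.produce("customer-events", key=customer_id, value=event)

producer.flush()  # block until everything outstanding is delivered
```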
And I should say round robin is not the right term for it.
It's some sort of deterministic thing that attempts to basically balance it out.
Right, right.
Based on a checksum.
So it's not like just saying, okay, I did this one last time.
Let me do the one next time.
But the result is similar to that.
Yep.
And the other thing I wanted to call out, because it's not obvious to anybody that doesn't live in this world.
When they said that if you go to change the number of partitions on something, they mentioned that you have to rekey that. Well, with rekeying that, the important part here
is you're actually moving the data on disks, right? So if you add a number, if you add more
partitions, or if you reduce the number of partitions, you're actually having to rewrite
that data to new locations. And that's
why it gets tricky because to balance it out, it's rekeying and putting that stuff into new
buckets or whatever. So it's all, it all sounds complicated. They do a pretty good job of allowing
you to do this stuff behind the scenes, but it is, it is a little bit frustrating when you're like,
oh, well, I'm just going to plan for this, and then you realize that your plan wasn't good and then you have to go back and fix things up. Yeah, you know,
I just... and looking that up, I also realized I was wrong about how null keys are handled, and they
are handled round robin. So there's no guarantee, if you don't pass a key, that it's
going to end up on the same partition. I don't know how that works with, like, multiple brokers or...
That's interesting. I guess there's got to be some sort of routing going on.
I'll have to dig into that a little bit offline,
but just wanted to get in before the comments.
All right.
So here's Kafka guarantees reads of events within a partition are always read
in the order that they are written.
So that's really important to note.
That's one of
the beauties and the simplicities of Kafka is it's just a queue. So when the rights come in,
they're going to go into that topic in the order they were received. Now, I think there are some
newer-ish features of Kafka that allow you to time-align things, and I think it will try and place it in order for you. But in just
a regular setup, the order it comes in is the order that you're going to get it out on a read.
Fault tolerance and high availability: topics can be replicated, even across regions and data
centers. Now, I have a note on this next line, and this is actually going to go
to one of the tips of the week that I have in a little bit. But if you are doing your own Kafka
cluster, right, whether it's on bare metal or on VMs or whatever, you probably don't need to worry
about this next thing I'm going to say. However, if you are running Kafka clusters in the cloud and you're going
across regions or different availability zones or something like that, there's usually a cost
to move data between regions, even within the same cloud provider, right? So if you're an AWS,
GCP, Azure, whatever, if you're moving from US East to US West or something like that, and data is being
replicated across brokers, you're paying a data transfer fee for the data that gets written there.
So be aware if you do go to do this, that you need to look into your cloud costs for every bit of things, right? It's not just data storage. It's network transfer,
CPU, all that comes into play. It can get really expensive and you don't even realize it, right?
You get the bill and you're like, whoa, why is this double what I thought? And then you realize
that you're getting hit with all these data transfer fees. Actually, they call it ingress and egress, right?
That's what you're getting charged for.
All right.
So next up here, the typical, typical Kafka setup, a simple one is you're going to have
at least a replication factor of three, you know, for high availability and high throughput.
And that's for production type workloads.
You're probably not doing it in development,
even though it probably wouldn't be a bad idea to have it.
And again,
replication is performed at the partition level of a topic.
So I think we should describe what that means then,
right?
Because I don't know,
we've gotten into like replication as it relates to Kafka.
So when you write a message to Kafka,
to a Kafka topic, you can
specify different strategies for like, do you just write it and forget about it? Or do you need to
wait for at least one broker to respond that it has successfully acknowledged the write and
committed it? Or do you need to wait for all brokers that are part of the replication strategy to respond that they have successfully committed that message to disk?
And if you're going to indicate like you mentioned, replication factor three.
So what that means is if you have a single topic and that topic is going to be split into three partitions,
I'm sorry, let me choose a different number. If you have a single topic and that topic is
going to be split into 10 partitions, then each single partition is going to be written across
the number of brokers, which in the case of a replication factor three, that's going to be at
least three brokers that that's going to be written across. So really, from a sizing perspective, you are three times sizing that topic from a storage
perspective, right? But what that also means to consider is depending on the size of your cluster,
if you only have three brokers, then that means that for every write to Kafka,
all of your cluster has to be involved, which you might not want, depending on your needs.
You probably don't want that. You probably want other brokers available to handle writes to other
partitions and whatnot. So something to consider in your Kafka sizing and configuration strategies.
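The fire-and-forget versus wait-for-one versus wait-for-all choice described above maps to the producer's acks setting. Here's a hedged sketch with the confluent-kafka client; the broker address and topic are placeholders, and the comments describe the setting as I understand it rather than anything from the episode.

```python
from confluent_kafka import Producer

# acks=0   : fire and forget, no broker acknowledgement at all
# acks=1   : the partition leader has written it; replicas may still lag
# acks=all : every in-sync replica has it before the write is acknowledged
producer = Producer({
    "bootstrap.servers": "localhost:9092",  # assumed dev broker
    "acks": "all",                           # safest, slowest of the three
})

def on_delivery(err, msg):
    # Called once the broker(s) have accepted (or rejected) the write.
    if err is not None:
        print(f"delivery failed: {err}")
    else:
        print(f"delivered to {msg.topic()} partition {msg.partition()}")

producer.produce("orders", key="cust-42", value="order placed", callback=on_delivery)
producer.flush()
```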
Yeah. So if you didn't totally follow what he said, you might have five brokers,
but a replication factor of three, so that you have more brokers than what you have replication.
So that each one's handling, you know,
different things.
And so,
so you have more available than,
than what's just what you're doing your replication on.
I will say that from my own thinking,
I don't know.
I'm curious if you guys have,
you know,
can convince me otherwise,
but I,
I can't think of a reason why you would want the replication factor to equal the size of your cluster. No, it seems like a bad idea. Oh yeah, yeah, you wouldn't
be able to take down one of the nodes, or, yeah, one of the brokers, right? Yeah, agreed. All right, I'm
going to let somebody else pick up this last one. I just did this full section here. All right, we're talking about APIs.
I was reading them up on backup strategies.
How do you backup your Kafka cluster?
I've been looking at the wrong show notes this whole time.
No wonder.
Well, the reason I mentioned it, this is something we've talked about before,
is when you've got a database,
this is not a database, but basically a data store,
and it's got replication built in and all sorts of stuff built in
around the kind of redundancy and availability,
what do you do for backups?
I know it's really common in our DBMS world to have a backup strategy
where you do a full, an incremental, whatever.
But when it comes to, like, other data stores that are kind of similar and have this stuff built in,
then you have to decide whether you want to do that and what mechanisms you want to
use, because there's not, like, a simple backup command. There's no, you know, incremental that
you can do. So you can do interesting file-level type stuff like rsync. I was just looking at
different strategies that people do in order to get
around that.
Um,
don't people use mirror maker for Kafka?
That is definitely a strategy.
Yeah.
That's probably the most common one.
But when you look at it,
it's,
it's almost like just acting like another broker kind of forwards the stuff
off.
And it's very like natural to the way Kafka operates in general,
but it's just kind of interesting. the problem with mirror maker though is that would require you to have another
cluster so yes if you if that works for your cost needs and your configuration needs and whatnot
then that's great but if all you want is just a backup of the data and not necessarily have to have a running second cluster then you
know mirror maker isn't going to be for you yeah yeah i was thinking it was particularly interesting
because like rdbms you know like back in the day like a sql server or postgres something like that
i think they had the notion that was okay for databases to go down for a while and if you had
to back up to disk and then restore from disk like oh no that stinks
we'll have to come in late tonight to do it and it is i think that kafka in particular is uh such
a different world like when you're dealing with like near real time tons of events like you
probably don't want to go down you don't ever want it to go down uh and so you know it's much
more common to have like a separate cluster running in another cloudy environment that you
just kind of back up to and so i was kind of wondering if uh if relational databases like if a new relational database came up today
came out today would it still deal with backups in the same way and i think you're solving different
problems though right like i don't know that you're going to use that relational database
for the same kind of real-time solutions that you're going to use something like a kafka for
right oh yeah for like i mean we talked about in the last episode, um, Uber as an example of a Kafka cluster. And I think we made
the point where we were saying that like the users of Uber are the consumers when you're like,
we were just theorizing on like how Uber might use it. So, well, you know, when you're, when you
view like where the other drivers are, you're consuming like where that information is, you're
not producing it versus the drivers might actually be producing data as like, Hey, here I am. So you could see
where like, if suddenly you took that, that data system down, you know, now the users don't know,
well, I don't know which driver is closest to me. I don't even know that it doesn't even look like
there's a driver anywhere near me. I guess I'll go to Lyft. So it would have a huge financial impact to,
uh,
in the use case of like an Uber where if that system was no longer available
because you're doing like routine maintenance or,
you know,
something worse.
Um,
and,
and,
and there might be like other,
you know,
more life critical type situations too,
that I'm not thinking about
and yeah and also on the sql server thing right like so you could take it down but they do have
online backup modes right so differential backups and that kind of stuff so databases have come a
long ways. But I do think, you know, that Outlaw had said the trillion dollar... I meant to say
trillion message. Yeah, yeah. But, like, in that case, is it even feasible to try and, quote-unquote,
back that up as opposed to having another cluster running that's just a failover type thing in a
database type world right yeah that's where like failover type thing in a database type world, right?
Yeah, that's where your mileage is going to vary.
You need to be hyper aware of what your use cases are and what your requirements are.
So if you can afford to take that hit of downtime of, oh, hey, my Kafka cluster went down, no worries.
Maybe you're running in, like, the
cloud and that region went down, no worries. You're like, okay, well, we're just going to, you know,
spin up our, we're going to helm install all of our stuff over into this other region
and spin that up. And I'll just, uh, backup, I'll just restore the data into those topics
based off the last, you know, however I snapshotted it. If that, if that works for you, then that might be a more cost effective,
uh, you know, disaster recovery type strategy for your needs. But like in the case that Alan
said, you know, if you're getting a trillion messages and maybe like you're an Uber,
right. Or whatever. And you can't afford that downtime, then you want a situation where you can just say, you
know i'm flipping over to this cluster that's now the live cluster on the west coast and we'll
repair the east coast in the meantime and you know or whatever you know man i can't wait to get to
one of my tips of the week because it would actually answer one of these things i need to do
it early you know i i can't do it early we need people to stick around to the end of the show. I'm going to tease it right now.
Yeah, you got to stay to the end because there is actually a pretty good solution to this that I think would sort of blow all three of our minds.
It did mine when I first read it.
So since we're like off topic just for a minute then, I was sitting here looking at the previous show notes.
I'm like, man, nothing is lining up.
I don't know where we are because everything's already been crossed out.
Like we've already talked about that. And I'm like, I'm just going to wing it.
You know, I don't.
This is weird.
I went back to it while we were talking because I was looking for something from the last episode, and that's what made me open it.
And then I didn't realize that I was still on the wrong one.
That's so funny, man.
We've only been doing this for, what, 10-plus years?
Yeah, but it is a Saturday morning, so all of us are—
Yeah, I'm missing cartoon time.
Yeah, right? We're done.
One more thing on the backup thing
Sorry, I keep bringing this up, but I was thinking about relational databases and other kind of
more traditional, or just, period, databases. Like, I keep implying that Kafka is a database. It's not a
database. It doesn't function well as one, for a variety of reasons we'll probably get into at
some point. But... oh, sorry, go ahead. But another reason is just different use cases. So, like, if I've got a
relational database there are a lot of times when i want to take a backup of that database and give
it to somebody so here's a database of all our tax information all our products whatever take it
and run it in a different environment uh kafka is yeah basically a big queue right so it's meant to
be like this more kind of organic living ecosystem.
And the data that's in there isn't necessarily important, like historically, like you wouldn't
give someone like, here are my queues for, you know, this last five minutes, you know, here you
go. It just doesn't really make sense in the same way. You know, there are different use cases and
there's compaction, you keep data around, but it's just not really designed for that. Well, that's
what I was going to say. Like, what you said is mostly true, except for the case of compaction,
where it is sort of always giving you the latest state of something, right? Which, yeah, might be
useful. And that depends on your compaction strategy, too. Right. Yeah, I mean, it's all...
which is another thing that you have to take into account in regards to sizing how much disk is going to be required for a given topic partition.
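Compaction is a per-topic setting. Here's a minimal sketch of creating a compacted topic (latest value per key is kept) with the confluent-kafka admin client; the topic name and counts are hypothetical, and real retention/compaction tuning goes well beyond this.

```python
from confluent_kafka.admin import AdminClient, NewTopic

admin = AdminClient({"bootstrap.servers": "localhost:9092"})  # assumed dev broker

# cleanup.policy=compact keeps the most recent value for each key instead of
# dropping data purely by age or size -- the "latest state of something" idea.
compacted = NewTopic(
    "customer-latest-state",          # hypothetical topic name
    num_partitions=6,
    replication_factor=3,
    config={"cleanup.policy": "compact"},
)

for topic, future in admin.create_topics([compacted]).items():
    future.result()
    print(f"Created compacted topic {topic}")
```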
So it's very simple and it's not.
Yeah.
And what's kind of funny is like you keep wanting to compare it.
I keep wanting to compare it to other things.
Like I keep wanting to say, well, it's kind of like a queue.
And in some ways it's like a database.
And what ends up happening is, like, it's kind
of become its own beast at this point. So, like, at this point, if you go looking for, like,
cloud-native Kafka solutions, you'll find AWS Kinesis and you'll find GCP Pub/Sub and whatever,
you know, Azure calls it. And so these things... like, Redpanda is another that meets the Kafka protocol.
And in a lot of ways it has kind of become its own kind of definition of a
particular data store, because it is so configurable, it can be used in so many different ways. And so
I think that's part of why we say it's a little bit like this and a little bit like that and a
little bit like this. But when you boil it all down, it's its own beast. Well, I think, like, to
your comment a minute ago about, like, it being so confusing but it's not. Like, I think it,
once you get the handle of like,
here are the levers I need to,
to concern myself with in sizing this,
then it's not so bad in the beginning when you're coming out at fresh,
then it,
it can be.
But I think for me,
the biggest,
um,
difference was there might be database administrators out there for like an Oracle or
SQL or Postgres or whatever that thought about this. But imagine if like on every table of your
database, you concerned yourself with, okay, how much space is each column going to need? And now
how many columns is that going to have? How many rows do I expect in that table? Okay. That's the size of that table. Now let me move on to the next table.
And you're going to rinse and repeat for every table in your database so that you can get an
estimate of what size your, your database is going to be. But I haven't, I don't know of anyone that's
ever done that on, on a database. Typically it's just like kind of estimated, you know, based on some, you know,
trial and error type of usage of it. And you move on about your day. And the difference here,
though, with Kafka is that in a SQL world, you can't really like say like, oh, well, that entire row, I need to be concerned with that entire row and how much size it is on average, because, you know, that entire row might not be getting
returned back in a query that often. And it might not even be written as a whole unit
in any one time either. And so you're only concerned about like the query performance,
because that's the thing that's going to return back the amount of data and like how much data
does it have to like search over and scan over versus in a Kafka world.
When you are reading and writing from those topics, you're saying, here's the full thing.
This, you know, we call it a message in a Kafka world, but I'm trying to like relate this to the to the row example in a table.
So like here's the entire row.
Boom.
And now when I do the read, I'm getting the entire thing back. And so that's why from a Kafka perspective, your concern is based on, well, how much IO can I get
out of that system, both in terms of like disk IO and network IO, like how fast can I read and
write from it, period. That's the concern. And these levers about like the partition strategy
and the replication factor and the, you know, what kind of
compaction am I going to do? What kind of retention am I going to do? That's why these, these types of
things can matter. Cause that's ultimately what's going to decide how much is on disc. And, you
know, obviously you need some overhead on that disc to be able to write more records, et cetera.
So, but I was thinking too, like you made a comment a moment ago and I don't
know if you guys had this experience, you know, going back years, you could never pull this off
today. This would not fly today. Today, I think that we as a society have like matured to a level
of, you know, when it comes to security or personally identifiable information, you know, there's now
like GDPR rules and various GDPR like rules in various countries. But I remember, you know,
first starting out as a developer, you know, I worked in a services organization and it was
quite common for us that like any customer that we had,
they would be like, Hey, we want you to build us a new website or whatever, you know, a new,
a new widget and whatever we were building. It was quite common that they would give us
a dump of their data. Like, and sometimes I'd be like, here's, here's a copy of our database.
Yeah. And, and then that's what we would use as the basis for future work.
Even if that work meant,
um,
Hey,
you're on Oracle and we're going to move you to SQL server or something
like that,
or vice versa.
Right?
Like we need to know,
we need to have,
we need to know like how to convert from one to the other or just like
where all the data is.
Like that was pretty common.
That would never fly today.
No,
you couldn't do that.
No. Yep. I remember one place we used to work, everyone would just restore a production copy of the database to their local,
and back then we had credit card numbers, too, like, unencrypted. This is how long
ago it was, which is just crazy to me now. We all just had this information. It was like,
yeah, this is like three years ago, you know. Yeah, I worked in another shop where we had this thing called the small backup
is what we called it, and the DBA had this process that would run daily that would take a backup
of the production database but it would limit like how many rows were in certain tables and
some tables it wouldn't bother at all,
and that's why it was called the small backup.
And that backup was put on a file share
that every developer, when they got in in the morning,
the first thing they'd do is grab the latest small backup,
and that's what we would dev against.
Man. Yeah, times have changed.
Yeah.
A little.
A little.
So pessimistic.
That's so uncharacteristic of you.
The boomer hour.
Right.
All right.
So I guess we can talk about Kafka APIs.
Now we can resume the show notes back on line three where we left off.
Wait, on last show notes?
Okay.
Right.
Which show notes are we talking about?
This particular episode show notes.
Okay.
Hold on.
Give me a sec.
That makes more sense.
All right.
It looks like next up is Kafka APIs.
And there are a couple of different sets of APIs here that Kafka provides.
We've got an admin API, which is used for managing things and inspecting topics: seeing how much data is in them, how many partitions you have, which brokers have information on which partitions, things
like that, other Kafka objects, and, you know, admin-type stuff. You've got the producer API, which is
meant for applications to basically write events to Kafka topics. And there are producer APIs...
there are probably APIs for all of these in just about every language you can imagine now,
which wasn't always the case.
But like JavaScript, C Sharp, all those, you're going to find producer APIs.
And also same with the next one, which is consumer APIs, which is how you read.
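And for the consumer side, here's a minimal poll loop with the confluent-kafka Python client. Within a single partition, messages come back in the order they were written, which is the ordering guarantee mentioned earlier. The group ID, topic name, and broker address are placeholders.

```python
from confluent_kafka import Consumer

consumer = Consumer({
    "bootstrap.servers": "localhost:9092",  # assumed dev broker
    "group.id": "example-group",             # hypothetical consumer group
    "auto.offset.reset": "earliest",         # start from the beginning if no offset is stored
})
consumer.subscribe(["customer-events"])

try:
    while True:
        msg = consumer.poll(1.0)  # wait up to 1 second for a message
        if msg is None:
            continue
        if msg.error():
            print(f"consumer error: {msg.error()}")
            continue
        # Offsets increase monotonically within a partition -- in-order reads.
        print(f"partition={msg.partition()} offset={msg.offset()} value={msg.value()}")
except KeyboardInterrupt:
    pass
finally:
    consumer.close()
```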
And both of these have a lot of configurations.
You can do that kind of to pick things up.
So it's like you would think, you know, naively, if you've not been familiar with this, you might think like producer API, we got to produce and that's it.
Like what else is there?
But there's a lot of configurations, a lot of interesting things you can do with those.
And there's nothing to stop an application from being both a consumer and a producer or even being multiple producers and multiple consumers and mixing things up in interesting ways.
I mean, example of that.
I'm sorry.
Yeah.
Well, I was going to say also, go ahead.
Well, I was going to add that like an example of the producer API,
like you mentioned, like you think they would just be writing to it.
But like that decision of do I need to wait for every broker to commit to writing
whatever that replication factor is that I need to, you know,
once something is written,
do I need to wait for each
of those brokers to respond that they committed it before I move on, or do I need to retry? That's
an example of what that producer API is providing for you. Oh yeah, and schema, and, yeah, all sorts of
stuff. I had an additional one there as well. Do you write out every time you get an event? Do
you automatically write it to the... as the producer,
do you write it to the broker, or do you wait some period of time to send it in
a batch?
Or do you wait for some amount of bytes to be met?
So there's,
there's all kinds of little tweaks that can have massive,
massive impacts on your performance that are left up to the producer to decide
how they want to do it.
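Those send-now-or-batch decisions are producer configs too. Here's a hedged sketch of the knobs being described; these are librdkafka-style setting names as I understand them, the values and topic are made up, and exact names and defaults are worth double-checking for your client version.

```python
from confluent_kafka import Producer

producer = Producer({
    "bootstrap.servers": "localhost:9092",  # assumed dev broker
    "linger.ms": 50,              # wait up to 50 ms to batch messages together
    "batch.num.messages": 10000,  # or send once this many messages are queued
    "compression.type": "lz4",    # batching plus compression is where throughput comes from
    "acks": "1",                  # and this interacts with the batching trade-offs above
})

for i in range(10_000):
    producer.produce("metrics", key=str(i % 10), value=f"event-{i}")
    producer.poll(0)  # serve delivery callbacks without blocking

producer.flush()
```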
Yeah. It's really tricky, and a lot of the... this is one of those things where there will be like 80 configs, and, like, if you
change this one, this other one doesn't apply. And so you've just got to do some reading there and
have some fun with it. Do some testing. Totally, do some testing. Yep. And then, Kafka Streams.
I did some of this. I think you did a lot more of it than me, Alan.
I don't know if you want to talk about it.
Yeah, so the Kafka Streams API, this one's pretty interesting.
Again, we're talking about Kafka as a platform, which is why they bring this in.
But this is what allows you to take in these data events and turn them into streaming applications, which are basically microservices. And some of the key functionality that they have
there are data transformation, stateful operations like aggregations, joins, windowing, like those
all get fairly complex. But the important part to note here is Kafka streams is built into the Kafka
ecosystem. So if you have Kafka, you also have the ability to write these streaming applications
without actually bringing in other frameworks, other platforms, that kind of stuff. I will say
we did spend quite a bit of time doing this. And I think what they did here is they borrowed from
some of the other big
streaming platforms to sort of figure out how they wanted to do things.
And some of it turns out that it just doesn't work well with,
with some approaches that you need to do in streaming applications.
So if you have like,
if you have really stateful streaming applications,
meaning that like you're trying to, let's say every event message that comes in for a customer, you're counting it or something.
You're keeping like a running tally of like how much money they've spent.
Exactly.
Yeah, that's a great example.
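The actual Kafka Streams API is a Java library, so this isn't it, but here's a toy Python sketch of the shape of that running-tally idea: consume purchase events keyed by customer, keep a running total, and write the updated total back out to another topic, which is roughly what Streams does for you, minus all the fault tolerance. Broker, topic names, and value format are assumptions.

```python
from collections import defaultdict
from confluent_kafka import Consumer, Producer

consumer = Consumer({
    "bootstrap.servers": "localhost:9092",  # assumed dev broker
    "group.id": "running-totals",
    "auto.offset.reset": "earliest",
})
consumer.subscribe(["purchases"])            # hypothetical input topic
producer = Producer({"bootstrap.servers": "localhost:9092"})

totals = defaultdict(float)  # in-memory state; real Streams backs this with a changelog topic

while True:
    msg = consumer.poll(1.0)
    if msg is None or msg.error() or msg.key() is None:
        continue
    customer = msg.key().decode()
    totals[customer] += float(msg.value())   # assume the value is just an amount
    # Publish the new running total so downstream apps can react to it.
    producer.produce("customer-totals", key=customer, value=str(totals[customer]))
    producer.poll(0)
```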
When you're doing things like that, that's a stateful operation that you need to keep in those streaming applications.
And the way that Kafka streams likes to do that
is it writes that data back to another Kafka topic. So Kafka streams API, the way that it's
implemented, it's very much as you do things, write it to a new topic. So it's very chatty
back and forth to the brokers and it keeps very little state in the applications themselves. So
that's a little bit deeper dive into Kafka streams, but it does allow you to do things, you know, basically reactively as events come in,
you can, you could do data manipulation with them. And, and when I say joining,
it's pretty crazy, the kind of stuff that you could do joining data that's coming in, you know,
thousands and thousands of messages a second and join in doing, you know, data transformation. So pretty cool way
to do real time processing. You know, I think a good example there is fraud detection. Like
you imagine that you've got a topic that takes in data about credit card transactions. So it gets
that in, maybe it has a credit card number, you know, some sort of a token or something,
and it looks up the credit card for it. It looks, maybe it runs some sort of like check
to see if that card has been used in fraud
or if that is a high quality card
or if the card is valid.
You know, it runs these different checks
and each one of those can end up being like a node
in a directed acyclic graph.
And it's writing back to Kafka topics
that it creates in the backgrounds
and kind of manages that sort of thing.
And so, you know, at the far end, that pipeline, you can have things to say,
okay, well now I know the person that did it. I know the card, I know the dollar amount.
Let me make a decision on whether or not, uh, you know, we're going to allow this or not
allow this transaction to go through. And, um, it does a lot of work for you,
which is really nice, but you can imagine how, if you get bigger and bigger and bigger, it's making these decisions for you.
And it's kind of... it's tuned for kind of smaller use cases.
And so if you get to be like Amazon scale or something, you're probably going to need more control over how this caching works and how these lookups happen in order to be efficient, not crush other data streams, and not run out of disk space or whatever. And that's where something like a Flink or, like, an Airflow, Apache Airflow, would come in
and be able to give you a little bit more control, but it's not going to be as native to Kafka. And
so I think Kafka Streams gets overlooked a lot of times because it is kind of, like,
in this small-to-medium space, and then there are these really big players that are really well known that
kind of get more attention.
Yeah.
And for what it's worth,
the primary reason we looked into Kafka streams was because it was part of
the platform,
right?
Like there's,
there's a lot of people that they look at something and they're like,
wait a second.
So you want to bring in 12 new technologies.
We're going to have to maintain all these.
Yeah.
And if you get it,
you know, quote-unquote,
for free with Kafka as a platform, then why not use it? And that's kind of the route we took. It's, hey,
all right, so you're just using the same things. But it turned out that it was not a great fit for some
of the type of stuff that we were trying to do. There's a great use case, though. It's pretty cool to be able
to say, like, take in this credit card, join it to this person, and then filter out these people, and then pump it to this output.
And that's kind of how the code reads, you know, which is really nice.
But yeah, it's just those details will get you.
Yeah.
See, Kafka Connect API, one that I have mixed feelings about, but mostly love for the most part.
It's really cool.
It allows for these reusable import/export...
they call them connectors... that you can use to either sink or, I forget what the other word is, source data, or sink data to external systems.
That's what it's built for.
But it's also just kind of a distributed task runner.
So you can build your own or people have done interesting things that can operate on it that kind of do other things but for the most part it's it's primarily been designed and tuned
for getting data out of one spot and putting it into another and it's got built-in stuff probably
doing message transform so if you want to um maybe like elevate or change the format of like a json
message or something and like extract this little bit out and make this top level uh lowercase these items uh things like that
It's also an appropriate space for, like, configurations for that import and export.
So if you want to say, like, you know, use this replica, here's the database credentials,
it's got a nice mechanism for hiding that stuff from the config, so you can kind of say, like, here
is the file where the password is or whatever, so it's not exposed somewhere. It's got a whole API around it, so you can have REST calls, for
example, to, like, pause it or start it or restart it or re-snapshot, for example, if you're doing, like,
a database backup or something with it. All right, so real quick, one of the use cases
for this and the reason why it's so important is, like he said, it allows you to read from sources and write to sinks.
So he mentioned the use case earlier of like fraud detection with credit cards,
right? If you have, excuse me,
if you have a streaming application and you want a credit card to be joined to a
user, well,
in order to do that in a Kafka streams world,
or even a Flink or something like that, you're going to need that data readily available.
And typically the way you're going to do that is you're going to sync those users to Kafka
so that you have them right there because it's super fast. Well, to do that, you might have
one of these connectors set up to read from your Oracle database or your Postgres or your SQL server or whatever.
And anytime something happens in that database, like on that user's table, it'll automatically
be synced over to Kafka, right? So that's using the source type thing and writing to it so that
then you could use that information in your Kafka streams application or your streaming app, right? So it's sort of a way, probably more or less for people to be able to share data
from various data sources within their organization
without giving access for everybody
to their database server, right?
So they can move it into Kafka and hey,
the data in Kafka you're allowed to use.
Right.
But you're not going to connect to my live transactional order database.
Right.
Like you're not allowed to do that.
So that's why these connectors are sort of important.
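Connect itself is driven by connector config plus a REST API. Here's a hedged sketch of registering a made-up Debezium Postgres source connector over that API with Python's requests. The connector name, database details, and config keys are placeholders, and the exact keys depend on the connector and version, so treat this as the shape of the thing rather than a working config.

```python
import requests

CONNECT_URL = "http://localhost:8083"  # default Kafka Connect REST port, assumed

connector = {
    "name": "users-source",  # hypothetical connector name
    "config": {
        # Debezium Postgres source -- class and keys vary by connector version.
        "connector.class": "io.debezium.connector.postgresql.PostgresConnector",
        "database.hostname": "postgres.internal",
        "database.dbname": "appdb",
        "table.include.list": "public.users",
    },
}

# Register the connector; Connect starts tailing the replication log and
# writing change events into Kafka topics for the listed tables.
resp = requests.post(f"{CONNECT_URL}/connectors", json=connector, timeout=10)
resp.raise_for_status()

# The same REST API is how you pause, resume, or restart it later, e.g.:
# requests.put(f"{CONNECT_URL}/connectors/users-source/pause", timeout=10)
```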
And the mixed feelings, I definitely want you to dive in on that a little bit, because I have some opinions on that that I'm curious if they line up with yours.
Go ahead, Outlaw.
You were about to say something. Well, I was going to say, like, it kind of relates to some of the engineering blog stuff that we've talked about from Uber, where, like, they have one massive data lake.
And then, you know, depending on the needs of a given team or application, they'll have, like, subsets of that data that gets, you know, copied over to their specific use case.
And they can put it in whatever format they want.
So you can kind of envision where a Kafka Connect,
a Kafka plus Kafka Connect type of scenario can help you achieve that type of goal.
Yes.
Sharing data in a well-thought-out way.
Yeah, totally.
So what don't you like there, Jay-Z? Yeah, this really isn't a fault
of Connect, you know. I think it has to do more with where it sits in the mix. So this is a
complicated piece of technology that's hooked up to Kafka, which is, out of the gate,
complicated. And then if you're hooking up, say, two different databases, one to pull data from, one to
put it in, that's complicated. You've got a bunch of configs there, and things go wrong sometimes. So, like, one example that I liked is, like, if you ran into, like,
a message like a row in a database that was too big and exceeded like the the payload maximum
whatever like somebody has some giant row in a database and the connector crashes it doesn't
restart by default there's no way to tell it to auto restart like it doesn't know what to do so
it just shuts down so then like say you wake up in the morning and realize, Oh no,
my data is not flowing. Something happened. Um, you have to figure out first of all,
what does that mean for your application and your use cases? If you know, you didn't get data in for
the last couple of hours or whatever. So, you know, there might be some cleanup or some work
you have to do there or just other things you have to fix. Then you have to decide, you know,
well, am I able to just resume this? Do I need to increase this payload? Do I need to reissue the cursor? So one problem
we had with this, we've seen before is like some databases will have like a cursor to kind of tail
the log like a replica, which is super efficient. But if that cursor is stale or inactive for too
long, then it can lose its place, because the replication log, the write-ahead log or whatever, depending on the database that you're using, can get passed and can lose that spot out of the buffer.
And so then when you restart the connector, it can't pick up from where it was because that information is lost.
So you have to re-snapshot, which means changing these configurations, which are declarative. Which I love. Declarative's great, we've talked about it lots of times. But sometimes you want to do procedural-type things,
like okay well let me run this configuration now in snapshot mode and when it's done we'll
put it back over to tail mode and then we'll restart it and stop it and these are procedural
things that aren't captured really well in that declarative, configuration-based approach. So that's the pain. And
then we've had some issues just with, like, probably misusing it, and having clusters
that get too big. We've had weird problems with, like... sometimes, we call it zombies, the
connector will die or get started up as part of, like, an upgrade or something, and the old one's
still going. And so now maybe data is getting kind of confused, or, you know, bad things are happening. And
then you have to really think about what that means for your application. And it's really
tough to figure out all the consequences of data not syncing, which is really more of a CDC problem
than Kafka Connect, which is why I say it's not really Kafka Connect stuff that frustrates me,
although it gets the blame.
It's just that it sits in this really critical and finicky part of your architecture.
I think so.
That's beautiful.
And the reason why I wanted him to chime in is because he's probably dealt with it more than any of us and probably most people on our teams that we work with.
Debezium.
So it's funny, like... and that one specifically came to mind because
Debezium is one of the open source connectors that allows you to basically get things from,
like, Mongo, right, and, using change data capture, get everything into Kafka. And the reason why I
wanted to bring it up is he didn't even hit on this. Like he hit on like real issues just then with Kafka Connect, like the things getting out of sync or the right ahead log getting past the cursor or whatever.
But there were things that for sure we just misconfigured initially, right?
And that misconfiguration will absolutely destroy you.
And you have no idea until you realize there's a problem, right?
And then you go in and you're like, oh, I didn't set this one bit or I didn't change this one flag or whatever.
And again, the way that it's sold is it's the simple solution for moving data.
But the reality is, and everybody should know this from dealing with different
database systems or whatever, these things are complicated, right? I mean, super complicated.
There's no such thing as simple when it comes to state. I don't care what level of the application
you're talking about. For sure. And I think that's one of the more important things that
Joe just pointed out there is this, when you use it, is a critical
piece of what you're doing, right? Moving data from one system to another. If you're going to
do it, you have a good reason for it. So you need it to work. And when it doesn't, it causes so much
pain that you just hate dealing with it. But the flip side is if you were to write your own data
poller and synchronizer yourself,
you're going to run into more problems, right?
So it's impossible.
I mean, I remember we ran into something at one point
where we were syncing data from Mongo
and we were pulling from a replica,
not realizing that it wasn't going to get all the data right
Like, we were told, hey, if you use the replica, that's it. But it was only getting the data that
that particular replica cared about. So there are so many pitfalls and so many things that
you can do wrong that aren't necessarily the fault of Connect, but it's going to look like it when it
doesn't work. So, yeah, fixing it is really tough. Like, one example, too... I think we had some sort of problem like I mentioned, so we restarted the
snapshot, which was even janky. I think it was Postgres. So the way you do it in Postgres, you
designate the snapshot table and you insert a record into it, and then it manages the state. And
it was a whole big debacle. And then once we got it figured out, it did exactly what it said it was
going to do when it re-snapshotted the data and put it into the same topic.
And now that topic had the data from the old replica and the new.
And the producers got kind of confused.
And there's a compaction process that will kind of clean things up depending on how you have it configured.
If you have it configured that way, we didn't.
And so there were, like, now these duplicate messages being replayed.
And like some applications are affected by it because they use the database that we're syncing to.
And some applications weren't affected because they were using this original source and things got out of sync.
And it just it was, you know, a fun day to figure that out.
And it's just it's just tough.
And these kind of things are it's like a distributed level so it's hard to have like good abstractions like built around programmer programming kind of interface type things because it's your logic is really distributed
amongst like eight different config files there and you're not even talking about you know code
here you're just talking about these different systems yeah totally hey one thing that i want
to bring up here, and this is part of why Kafka can be painful, like Connect, all of it.
There's very little UI. There's
very little way to see the state of your system, right?
So at the top of this, he mentioned the admin API.
So Jay-Z is very familiar with this because he actually wrote an application
to leverage that admin API so that he could check on the status of brokers, check on the status of topics, check on the status of connectors, all that kind of stuff.
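In lieu of a UI, the admin API can at least get you a quick status dump from a script. Here's a small sketch with the confluent-kafka admin client that lists brokers and the partition counts per topic, the kind of thing that tooling does in a much nicer way; the broker address is an assumed placeholder.

```python
from confluent_kafka.admin import AdminClient

admin = AdminClient({"bootstrap.servers": "localhost:9092"})  # assumed dev broker

# list_topics returns cluster metadata: brokers, topics, and their partitions.
metadata = admin.list_topics(timeout=10)

print("Brokers:")
for broker in metadata.brokers.values():
    print(f"  {broker.id} at {broker.host}:{broker.port}")

print("Topics:")
for name, topic in metadata.topics.items():
    print(f"  {name}: {len(topic.partitions)} partitions")
```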
Because by default, they kind of give you some crappy command-line shell scripts to read from a topic or to check the status of something. But for the most part, there's not a
lot of great visualization or tools from the Kafka project itself to be able to sort of see the state
of your system. And that's one of the painful things with Kafka Connect as well. Yeah, I was
definitely going to add the, like... from, you know, the Kafka project there aren't, but there
are some out there. But, yeah, they're third party. Yeah.
And they usually want to charge it. And have you guys found this?
This actually drives me a little bit crazy about the Kafka ecosystem is the
third party ones, the nice ones, it's contact for a quote. Like there's no,
there's no, Hey, here's the pricing, right?
It's not a hundred dollars a month or or
what or a hundred dollars a broker or something that's contact us for a quote it's like yeah no
i don't that's like the whole if you gotta ask you you don't have enough like i don't like that
yeah if you gotta ask the price you can't afford it yeah exactly i don't like that
And a lot of the reasoning they give for not having more of those interfaces is that they're saying, well, you shouldn't be logging in and looking at thousands of topics manually.
You should have metrics and alerts and all this stuff, which is true.
Fine.
But if I'm working on my laptop, I don't want to be running Prometheus and Grafana just to go see if I have data in my topic.
I don't want to go shell in, get the list of topics, grep for the one I think it's named.
There it is.
Okay, pull info.
Okay, now I see it.
Let me check this partition.
You know, it's like all this, you know,
shelling around and running commands and stuff.
It's just not really tuned for that kind of developer experience.
So that was frustrating.
And that's why we keep bringing up the UI.
But once it's out in production, that's where it kind of shines.
Yeah, for sure.
I mean, we've mentioned it.
Even with the negatives we've said here, we've had this stuff running for several years and mostly pretty seamless, right?
There have been issues like what you talked about with the Postgres re-snapshot and all that.
But for the most part, it just sort of works. It's one of those things that, as long as you configured it to have the number of partitions and all that kind of stuff that you care about, you kind of just forget about it, which is, yeah, awesome. I can't say that about many systems.
No. And we've got a link here for the various connectors that are available, and what I really like about this is that a lot of times the connectors are maintained by the product itself. So, for example, Mongo makes their own connectors and they provide connectors for Kafka, same with a bunch of different database vendors. So you're getting the people that really know the most about how to tail the replication, or how to interact with their systems in the most efficient way, which is something you probably wouldn't have if you were just trying to roll something by hand, trying to use a watermark to query the top 100 records greater than whatever and then save your watermark. All that stuff just goes away, and I'm always really impressed when you can just kind of act as a replica and get data streamed in, and it just works. When it does work, it's beautiful, and when it doesn't work, you probably messed something up, but you're going to have some fun fixing it.
Yeah, it's not beautiful when it doesn't work.
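For context on the hand-rolled watermark polling Joe mentions, a minimal sketch might look like this; the table, columns, and batch size are made up, and it conveniently ignores most of the edge cases a real source connector handles:

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.sql.Timestamp;
import java.time.Instant;

// Query everything newer than the last watermark, ship it downstream, save the new watermark.
public class WatermarkPoller {
    private Instant watermark = Instant.EPOCH; // in real life this would be persisted somewhere durable

    public void pollOnce(Connection conn) throws SQLException {
        String sql = "SELECT id, payload, updated_at FROM orders "
                   + "WHERE updated_at > ? ORDER BY updated_at LIMIT 100";
        try (PreparedStatement ps = conn.prepareStatement(sql)) {
            ps.setTimestamp(1, Timestamp.from(watermark));
            try (ResultSet rs = ps.executeQuery()) {
                while (rs.next()) {
                    // ship the row downstream (e.g., produce it to a topic) ...
                    Instant updated = rs.getTimestamp("updated_at").toInstant();
                    if (updated.isAfter(watermark)) {
                        watermark = updated; // advance the watermark as we go
                    }
                }
            }
        }
        // Pitfalls a connector already handles for you: rows sharing the same timestamp as the
        // watermark, deletes that never show up, clock skew, restarts, and so on.
    }
}
```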
Yeah, and it's important to point out that link is an affiliate link,
so just hit us up for pricing.
And when you use that link, you'll get $5 off your next Kafka cluster.
That's amazing.
It is pretty cool, though. They have literally dozens and dozens of these connectors, sources and sinks. So if you would like to be heralded at the top of the show like Anjing Jellies and Nick were at the start of this one, you too should leave a review. If you haven't already, you can find some helpful links at codingblocks.net/review, and there are some easy buttons there where you can leave your one star, and, you know, whatever star count you want, but you gotta start at one. You gotta start someplace, Alan. I see your faces.
Yeah.
So with that, we head into my favorite portion of the show.
It's time for Mental Blogs!
Blogs!
Blogs!
Blogs! Okay.
All right.
So, according to TechHouse trademarked rules of engagement, Joe, you are up first. I think I'm on a streak, too, right? Yeah, you are. Let's see if you can continue it. Nope. All right, I have faith in you. It depends on whether you get 18th-century literature or not. If he gets that, he wins. Anything I learned in eighth grade would be covered. I already see the category I would pick.
So, you know, I think you have potential here.
That's what I'm saying.
To keep this streak going.
All right.
The search for foreign lands.
Each response will be a country whose name is actually hidden in a word in the clue.
All right.
For the love of Pete or TV dramas in a nutshell or adventurous women or rain.
And lastly,
cringeworthy office lingo.
I mean,
definitely that one.
What was it?
Do we know what love Pete was?
I assume that was just Pete in the name or something.
Uh,
there is no,
uh,
additional description of that category.
It just,
if it's about the adventures of Pete and Pete,
I got that one.
It's just for the...
I'm going to say not, but...
All right.
Let's go with cringy lingo because that's my specialty.
It's one of my superpowers.
Okay.
And a five, of course.
Okay.
It's a three-word phrase meaning "at one's own expense" and a needlessly wordy way of saying unavailable.
At one's own expense, or a wordy way of saying you're unavailable...
Out of pocket? That is correct.
I was wondering if you were going to pull that one off. I got scared there for a minute. That's weird. For the Love of Pete, just to clue you in on what that one was, since you questioned it, I'll give you the one-pointer: his first kiss with future ex Kim Kardashian was in a 2021 SNL sketch, and he was dressed as Aladdin. I don't know his name. I assume it's, yeah, the Pete who married Kim Kardashian. All right. Well, you're not wrong.
It is Who is Pete Davidson?
And my favorite one, I was surprised you didn't pick this one.
I just have to read this one pointer.
TV dramas in a nutshell.
The survivors of Oceanic Flight 815 band together to battle mysterious forces on a tropical island. Lost. There you go. See, you should have picked that one. Well, here's the Pete trivia that it should have been: this is the name of Pete's tattoo on The Adventures of Pete and Pete. Is it Pete? No, the tattoo. I don't even know. Yeah, but I'll give you a hint: it was a woman that he could make dance by flexing.
It was like a nine-year-old.
But the answer is supposed to be a Pete.
All right.
Well, the answer is Petunia.
But that's not Pete.
It's kind of.
It's kind of.
All right, fine.
All right, Alan, here you go.
You ready?
This first category has your name written all over it.
You ready?
Competitive cheerleading.
You see a good cheerleader.
Yeah, I thought so.
You're always so optimistic.
Yeah.
Science.
And the cartwheeling.
Yeah.
You ever seen him cartwheel?
Professional.
That's right.
Science museums.
Six degrees of actual bacon. Sad songs.
Mother goose police blotter.
And you'll give us the nursery rhyme in question.
And this
one, I'm assuming I'm supposed to read these letters out, so this is going to be: N-I-A-L Ain't a River in Egypt. The letters N-I-A-L will appear in each correct response. And let's do the nursery rhyme blotter thing for five. Police blotter, okay.
Officers responded to anonymous reports that a local man was sequestering his wife inside a large gourd.
Oh, man.
Oh, man. Oh, man. What is the name of that? Something pumpkin. I cannot think of the name of it. I don't know, man. All right, Joe, for the steal. I don't know. Oh, man, what a missed opportunity. Peter Peter Pumpkin Eater. I knew it was a pumpkin, I just couldn't remember the name of it. So close. Had a wife and couldn't keep her.
Wait, did he put her in a gourd?
He put her in a pumpkin cell.
What?
And then he kept her very well.
Yeah, I remembered it.
I could not get the name of it.
Somehow I missed that part of that.
You did better than me.
I don't know that I could have gotten that one.
I might have gone with sad songs instead.
I thought about that,
but that was sort of depressing and I'm an optimist this morning.
Yeah.
Well,
the one pointer for that,
this weepy Sinead O'Connor ballad might not be about lost love.
Prince is rumored to have written it about his housekeeper.
Nothing Compares 2 U. Yeah. And if you haven't heard the Chris Cornell live version of that at SiriusXM, I suggest you go out to the internet and find it. Actually, I'll put a link to that in the show notes. It's so good. Cool, I love that rendition. All right, Jay-Z, your categories are U.S. "World Capitals."
And world capitals is in quotes.
U.S. World Capitals.
Okay, so like Paris, Texas.
Okay.
There we go, yeah.
Veterinary Medicine.
The Wardrobe Department.
Musical Instrument Makers.
That's half the battle.
In this category, you'll need to name a historic battle that we are going to show.
We can't do this one.
This is a visual.
We're going to show you every other letter of it, if that makes sense.
Yeah, no way.
I guess I would have to show you the letters I get, or say the letters.
And then the last,
the last one,
which is probably the best one.
Um,
Roget's butt.
How do you spell that?
R O G E T.
Why?
You assume I mispronounced it?
No.
Uh,
what's the etymology?
I just watched the thing.
I just watched the spelling bee the other day. And it was just,'s just you can't ask how to spell it then they'd be like how
do you spell snake and you'd be like what's the origin can you use it in a sentence right yeah
snake just spell it i could have told you that i could have given you the half the battle one
it's literally i i misunderstood what they were trying to say before now. I would literally tell you every other letter of the battle.
Yeah, there's no way.
I'm terrible with that stuff.
I'm going to go with music instruments, although I'm afraid it's not going to be guitars the whole time.
But maybe I'll get lucky.
I'll go with five.
Okay.
A musician.
Now, keep in mind, I am well known for my ability to read proper nouns.
So this one's on you.
All right.
A musician himself, Nodu Mulek, also made some of these instruments for Ravi Shankar.
And this one was a visual.
Oh, and I can't see what the visual was.
That might not be the best thing.
I'm going to give you a different one since that one was a visual.
You can't because you can't see it.
And neither can I.
Can I guess it anyway?
Sure.
And I just won't give you credit if it's right.
Well, that's up to you.
Yeah, I won't give credit. Is it's right well that's up to you yeah i won't get
credit uh is it a sitar it is dang it i should have let you have it i'm sorry yeah that's all
right all right um yeah all right do you want to pick a different one uh are they all pictures
for that category i will tell you the three-pointer is the okay the rest aren't uh let's go for uh
ah geez let's go with um i'm gonna hate myself for this but uh
what am i doing musical instruments for four okay we got this
again this is probably not gonna be paris texas which is like that and like there's rome and athens i know about okay uh again on you for the proper noun
this is your fault yep the nippon gaki company was the corporate ancestor of this large instrument maker and motorcycle manufacturer?
Yamaha.
That is correct.
That's so ridiculous, man.
I was sitting here on
Stradivarius waiting to bust it out.
I can't remember the piano when I was trying to remember
Stravinsky something. Anyway, doesn't matter.
Well, I'll go ahead and give you a hint.
Stratocaster was not part of
any of it
It should have been. Yeah. And for the record, neither was Paris, Texas. Oh, wow. Okay. Yeah, but I will give you the one-pointer for the U.S. "World Capitals": this Lone Star State capital moonlights as the live music capital of the world. So now you can kind of get a flavor of where they were going with it.
So Texas?
Yeah, but you got to name the city.
Austin.
Yes, that's correct.
That's Austin is a world capital?
It's the world capital of music.
Live music world capital.
Okay, I see what you're saying now.
I see.
It wasn't a transplanted city capital.
What about Nashville, though?
New Orleans?
I mean, come on.
I take some issue with that.
Yeah.
I mean, it's not on here, but I could see how Nashville's Broadway Street would definitely be a strong contender. If you've never been there, any bar you go to has like three or four levels of different live music. All right, so I have a question. I get one more question, right? That is your question. And, yes. I get five points? Cool. All right, well, actually, that's the Final Jeopardy, and you have nothing to bet. Well, see, that's what I was going to say, because we talked about this before: when one person gets two and the other person gets one, there's no chance of ever getting anywhere, so we're always supposed to get two each. Okay, fine. Yeah, so there we go. So then do we want to do it where the person who gets that extra round can choose from any of the previous topics given? Yeah, we can do that too. That's fun. Yeah. Well, remember what, any of them, or I should get to pick the topic? Yeah, you really should. That's kind of evil. All right, you guys decide how you want me to do it, and then I'll read off where the topics are. Give me the Mother Goose thing for four. I'm gonna get this one. Okay. Police received multiple reports at 10 PM of a man running through town and
tapping on windows in his nightgown.
Really?
Peter Piper.
Nope.
Joe, for the steal.
Uh,
I mean, don't I get the, um, Jack, Jack, candlestick, whatever?
That's it.
No, I don't know.
Wee Willie Winkie.
I don't even know that one.
All right.
Wee Willie Winkie.
Sorry.
I hadn't heard of that one either.
Wee Willie Winkie.
All right.
So I just got toasted.
Yeah, pretty much.
All right.
Let me go ahead and make note of these zeros.
Well, here you go. How do you spell it all, so we can both lose? Okay. So, yeah, really this only matters for Jay-Z, but the final round is Artists, and you already gave me your point value, Joe.
I'm assuming I think I already know what it is.
Yeah.
Exhumed in 2017 to settle a paternity suit, his mustache had preserved its classic 10 past 10 position,
according to the Spanish press.
What was the first word? Paternity something. according to the Spanish press.
What was the first word?
Paternity something.
Exhumed in 2017.
Wow.
To settle a paternity suit.
His mustache had preserved its classic 10 past 10 position according to the Spanish press.
Okay.
This was a message, I think.
But thank you.
It's not Picasso, is it?
He's not Spanish.
I don't know.
No, that's what I'm trying to think.
Yeah, I don't know.
Dun, dun, dun, dun, dun. Yeah, I don't know. Oh, Joe gave me an answer.
Picasso? That is wrong. Dang. Leading the witness. Yeah.
I'm going to say that Alan sabotaged you on that one on purpose.
The answer was Salvador Dali.
Oh, my gosh.
Wow.
Oh, my gosh.
I'm so terrible.
Okay.
Yep.
Yep.
All righty.
So, on to Kafka.
Let's talk about use cases. Yeah, so what have we got, number one there? Message queues, usually talking about replacing something like ActiveMQ or RabbitMQ. So this goes back to how we keep referring to this thing as a queue, right? It's basically, what are the competitors you're going to think about if you were even considering a message queue kind of system? And RabbitMQ is probably the bigger of those two, between ActiveMQ and RabbitMQ. I guess it depends on whether you're in a cloud world or maybe doing more on-premise type stuff; ActiveMQ seems to be used more in the cloud, I think, but either way. Okay, yeah. So the message brokers are often used for responsive types of processing and decoupling systems.
I think we kind of already covered that.
But Kafka is usually a great alternative that scales, generally has faster throughput, and offers more functionality for the reasons we've talked about already.
The available APIs that are there, the ability to write streams, applications, whether you're using the Kafka APIs or you're using
something else like Beam
or Flink, but then
the ability to scale those brokers out
and scale the reads and writes out
across the size of your cluster,
it just scales
well.
Yep. I concur.
I forgot where we were. Website activity. Previous episode's show notes. Yeah, yeah, this is a great one. So, like Alan said, website activity tracking. I'm sure everyone's seen Google Analytics by now, or Microsoft's got one, and we've talked about OpenTelemetry several times. But this is a really good way of tracking activity, which would be like someone scrolls down.
They click.
They go to this page.
They go to that page.
They check out.
They click this button.
That kind of stuff is really nice for Kafka because it's just going to be the stream of events that flows into Kafka and gets saved.
And then you can look at it later for figuring out what happened or analytics type purposes, planning sales, yada, yada.
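A minimal sketch of that activity-tracking idea, just to make it concrete; the topic name and event shape are invented for illustration:

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

// Every click/scroll/page view becomes an event appended to a topic; analytics jobs read it later.
public class ActivityTracker {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // placeholder broker
        props.put("key.serializer", StringSerializer.class.getName());
        props.put("value.serializer", StringSerializer.class.getName());

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // Key by session so one user's events land on the same partition, in order
            String sessionId = "session-1234";
            String event = "{\"type\":\"page_view\",\"page\":\"/checkout\",\"ts\":1717900000}";
            producer.send(new ProducerRecord<>("site-activity", sessionId, event));
        }
    }
}
```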
Am I remembering it wrong where like LinkedIn,
Kafka was originally created by people at LinkedIn for the purposes of being able to
show you that little bitty link on your LinkedIn page of like who's viewed you,
who has looked at your profile?
Am I remembering it correctly?
No, I want to say that was Pinot, I think. Maybe I'm wrong. No, because Kafka way predates Pinot. Yeah, I don't know. I mean, it was created at LinkedIn. Yeah, it was created at LinkedIn for sure. All right, I'm gonna look that up, because I thought it was. Yeah, there's a nice article I just found on it. I'm still kind of trying to get to the key part: they were trying to get to real-time processing. So while they're looking that up, I don't know on that one.
Another one is metrics, for aggregating statistics from distributed applications,
right? So that's using like the real-time streaming and aggregation windowing type thing.
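To make the windowed-aggregation idea concrete, here's a small Kafka Streams sketch that rolls raw measurements up into per-minute counts; topic names are made up, and the exact windowing call shown (TimeWindows.ofSizeWithNoGrace) is the newer API, with older clients using TimeWindows.of:

```java
import java.time.Duration;
import java.util.Properties;
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.KStream;
import org.apache.kafka.streams.kstream.TimeWindows;

// Aggregate raw stats from many services into one-minute counts per key.
public class MetricsRollup {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "metrics-rollup");
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
        props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());

        StreamsBuilder builder = new StreamsBuilder();
        KStream<String, String> raw = builder.stream("raw-metrics"); // key = service name, value = one measurement
        raw.groupByKey()
           .windowedBy(TimeWindows.ofSizeWithNoGrace(Duration.ofMinutes(1)))
           .count()                                                  // events per service per minute
           .toStream((windowedKey, count) ->
                   windowedKey.key() + "@" + windowedKey.window().startTime()) // flatten the windowed key
           .mapValues(count -> Long.toString(count))
           .to("metrics-per-minute");

        new KafkaStreams(builder.build(), props).start();
    }
}
```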
That's one. Okay. So the one-liner, and this is on a LinkedIn page. Well, I guess, yeah.
Kafka's origin story at LinkedIn, the problem they originally set out to solve was low latency ingestion of large amounts of event data from the LinkedIn website and infrastructure into a Lambda architecture that harnessed Hadoop and real-time event processing systems.
So the key was the real-time processing.
Yep.
Another one of the use cases is log aggregation.
So rather than using logs that were written to HDFS
or a file system or cloud storage or something like that,
writing it to Kafka,
and the primary reason for it,
and this makes a ton of sense,
is you abstract away the file system completely. So if you're using something like cloud storage,
then you're going to have to use an S3 protocol or a GCS protocol or whatever, Azure blob storage,
or if you're writing to a file system, then you're doing network file shares. If you're writing to Kafka, you just produce and consume using their protocol, and that's it. It just gets rid of the files completely, so that makes a ton of sense. Yep. And also just IoT devices: if you've got a bunch of temperature sensors or something, that's a great way to kind of get that together. I guess that's more like metrics and logs, so we already covered that. My bad. Well, you can take the next one. All right, stream processing: taking events and further enriching those, which is like the fraud detection example we gave, and we talked about Kafka Streams and Flink and Beam, and there are other solutions like Spark that are common for that, I mean, kind of Spark with an asterisk. I guess even event sourcing is something we've talked about a few times, which is basically storing the state changes.
So like a ledger, like a bank account or something would be an example here.
It's used commonly where it's like we add $500, now you spend $600, you add $100.
And at some point you can say, okay, what's the balance?
And you take a look at that snapshot in time and say, this is the balance at the time.
But you can imagine how keeping track of those transactions is really important.
But I kind of take issue with that example, though.
Have we used a bank example for Kafka?
Because that seems like not a great use,
especially depending on what your compaction and retention settings are for that,
you could potentially lose history of those transactions.
And for financial purposes, you don't want to lose that.
Yeah, and I think that's a really good point.
You wouldn't want to keep the data there permanently,
but if you just wanted to stand this in front of a more traditional database,
then I think it makes sense to just kind of have every credit card swipe go into Kafka at some point.
And then later you can have something kind of slowly pull that stuff out and arrange it and save it.
I guess for the event sourcing,
this is why I keep going back to that Uber example though,
because you don't care about where the driver was an hour ago.
You care about where the driver is now. So the,
so the event is the driver's current location and periodically he's saying,
I'm here now, I'm here now, I'm here now. Right? And as the user of the app, you're able to track that driver as he moves around nearby you, but you don't care to see the history of where the driver was. Well, no, but for event sourcing, so it's a different use case. While maybe you wouldn't use Kafka for your event sourcing for transactions from a bank, it's actually a good example, though, right? Like, you started with $100. You added $20, took out $30, added $50, took out $40, whatever. Event sourcing is just the whole notion of replaying events from the beginning to get to the state that you want right now, and so it can be used for that. Uber is probably taking your coordinates from GPS at the time, so it's like a snapshot of where you are, but an event sourcing example would be like, you drove 500 feet straight, took a left, and drove 20 feet straight. Yeah, it's replaying events, that's all it is, and so it's perfect for that type of thing.
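A tiny sketch of the replay idea from the ledger example, no Kafka required to see the point; the numbers match the $500/$600/$100 exchange above:

```java
import java.util.List;

// Event sourcing in miniature: the log stores the changes (credits and debits),
// and the current state is just a replay of those events from the beginning.
public class LedgerReplay {
    // one event per record; positive = deposit, negative = withdrawal
    record Transaction(String account, long cents) {}

    static long balance(List<Transaction> events) {
        long balance = 0;
        for (Transaction t : events) {
            balance += t.cents(); // replaying events rebuilds the state
        }
        return balance;
    }

    public static void main(String[] args) {
        List<Transaction> events = List.of(
            new Transaction("acct-1", 50_000),   // add $500
            new Transaction("acct-1", -60_000),  // spend $600
            new Transaction("acct-1", 10_000));  // add $100
        System.out.println("Balance in cents: " + balance(events)); // prints 0
        // Michael's caveat applies: if compaction or retention drops old events,
        // you can no longer replay your way back to the correct state.
    }
}
```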
Yep. And the final one here is a commit log, which we talked about, you know, replication, CDC, change data capture: basically using Kafka as an external commit log, or write-ahead log, or transaction log, whatever you want to call it. Basically keeping track of all the events that led up to the current state, and you can use that to sync databases or something. So you can imagine a database
that if you were building a database today,
you could consider kind of outsourcing your log
and bring in Kafka as a dependency
in order to kind of keep Kafka there
as that backbone for replicas.
Probably not a great idea.
You're going to lose some efficiency there
because all the other things it could do
and configurations you have to make,
but it wouldn't be a bad idea to get you going and kind of outsource part of it. Well, let's expand on that idea, though, because I had never considered that. But think about this type of thing, let's just brainstorm this live and see where it goes. The advantage of having the relational database is it would allow you to do your reads, to write custom queries to read just certain parts of the data and everything.
You could do maybe some aggregations
over portions of that data as you're redoing the read.
There are legit use cases that relational databases have, namely that you could relate this data to that data
type of thing, right? So if you wanted to use Kafka as the, in front of that, because the
downside to relational database is you can't parallelize the writes, but you can parallelize the reads,
right? In your traditional databases. But if you were to put Kafka in front of that to where like
anything that wanted to read from your database could read directly from the database, but to
write to it, it would instead have to go through this process. Like it would, it would write to
Kafka and then there'd be some kind of a streams app on the Kafka side. Oh,
I ran out of space on my zoom or on my thing. So we're using the backup recording. We're going to
see how this works. And this is going to be nasty because the recording started midway through.
It sure did. Good thing I did it. This is going to be a cluster. And not the good kind of Kafka cluster.
So talking through this, you write to the Kafka queue.
Some kind of streams app would then be able to write in like batch or whatever to your database,
whichever one is in charge of the writes at that time, right? I mean, that kind of seems like a pretty cool thing if you really needed to harness the power of that relational database. That's where I think the key would come in.
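Sketching the brainstorm as code, under the assumption that a plain consumer (rather than a full Streams app) owns the database writes; the topic, table, and connection details are made up:

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.time.Duration;
import java.util.List;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

// Producers write to a topic; one consumer owns the relational writes and batches them
// into the database, while reads hit the database directly.
public class WriteBehindConsumer {
    static void run(KafkaConsumer<String, String> consumer, Connection db) throws Exception {
        db.setAutoCommit(false);                 // we commit the JDBC batch explicitly
        consumer.subscribe(List.of("pending-writes"));
        while (true) {
            ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(1));
            if (records.isEmpty()) continue;
            try (PreparedStatement ps = db.prepareStatement(
                    "INSERT INTO orders (id, payload) VALUES (?, ?)")) {
                for (ConsumerRecord<String, String> r : records) {
                    ps.setString(1, r.key());
                    ps.setString(2, r.value());
                    ps.addBatch();               // one batched write instead of many small ones
                }
                ps.executeBatch();
                db.commit();                     // commit the DB batch first...
                consumer.commitSync();           // ...then the Kafka offsets (at-least-once)
            }
        }
    }
}
```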
Yeah, remember we talked a little bit about this when we were doing the Designing Data-Intensive Applications book. One of the things we talked about was, I think it was the LSM-tree databases.
I forget what it stands for now, logs, something, merge.
But basically, it would take in data and keep it kind of in a buffer.
And then as that buffer filled, it would push it down into more persistent storage.
And so Elastic kind of works that way, where it takes in data and then over time it routes it to the appropriate place. But what it does is it lets you say, okay, I got the data, faster, to the producer, and it's like, I've got it from here, I'll get it to the right place. And when that read comes in, it goes and gets routed to whatever node the data should come from. Hey, so just to catch everybody up, it didn't run out of space yet.
I misread the message out of the corner of my eye.
It is that it has limited space available.
So my recording is still good, fortunately.
But I have like 14 minutes left.
So I'm just saying like maybe we are at a good stopping point.
Yeah, yeah. Maybe we are at a good stopping point. Yeah.
Yeah.
And we just like move on.
Yeah.
Let's do the tip of the week.
Yeah.
Let's do the tip of the week.
All right.
So I got 12 minutes for tip of the week.
I know what you're saying.
It's my portion.
It's my favorite portion of the show.
It's the tip of the week.
All right.
Moving on.
All right.
Well.
Speak fast.
I got a good one for you. Rémi Gallego, sorry, Rémi, about the name pronunciation, is a music producer that makes music under a variety of names. The Algorithm is probably the most well known of them; they also make music under their own name, and under the name, sorry again, Boucle Infinie. Rémi's French, and the Boucle stuff is actually my favorite. Almost all the music is instrumental, kind of either IDM-type music or synthwave, if you're familiar with those terms, but it often has a hard rock edge, and the person also makes a lot of video game music. And so the two albums I'm going to recommend are from this person, and it's just great coding music if you like either dancey-type music or kind of hard-rocky-type music, because it kind of straddles the line. The two soundtracks are for these games: The Last Spell, excellent game, and also Hell is Other Demons, which is also an excellent game. So we'll have a link there to YouTube that's got a combination of videos tagged with that person, including those two albums, and they're excellent. Very cool. All right. And then I got another one here.
We have talked about K9s, the Kubernetes-focused TUI (terminal user interface), several times. It can be used to look up information about things other than just Kubernetes resources. Like, we've talked about :helm before, and :events, or sorry, that's the new one. So with :helm you can look up Helm packages even though that's not a native Kubernetes construct, and there's also one for events, so :events. And events are pretty useful for figuring out why something didn't happen, like a pod was killed because it exceeded the memory boundary for too long, or a scale-down event happened on a node. And so that's something that, because it's not like a Kubernetes resource, really, you don't think too much of it, but it's something that you can view in K9s.
There are other resources.
You can hit the question mark and see all of them.
And it is dynamic.
So if you install like an operator,
for example, like Kafka,
then it'll install a Kafka Connect resource.
And then you can go :kafkaconnect, or whatever that name is, and see that in there.
So that's pretty cool.
And there's actually a couple other ones I think we've mentioned before
that are just kind of interesting, like Popeye, X-Ray, and monitoring.
You can do a colon and you can see all those with that question mark there,
which just show interesting information about your cluster.
And it's all just kind of built in for free.
That's great.
You should use it.
Excellent.
The events was a new one for me.
That's awesome.
All right.
So I had I had a few.
I'm going to narrow them down because I don't want a lot of recording.
So the first one.
So the first one is warp stream so i even found out about this because uh micro g
had sent this i think in our episode discussion in slack but all right here's here's the gist of it
and this goes to the backup thing that we were talking about earlier out earlier. WarpStream, the whole idea is you will not run a Kafka cluster. What WarpStream did
was take the Kafka protocol, so producers, consumers, all that kind of stuff, and then
they rebuilt it on top of object storage in the cloud. So the beauty is you don't have to worry
about partition sizes. You don't have to worry about disk drive sizes.
You don't have to worry about any of that stuff.
Everything gets written.
Let's say that you're an AWS.
Everything gets written to an S3 object.
And then your producer's writing just like it's writing to Kafka, but it's getting written to S3.
Your consumers are consuming.
They don't know it, but they're
consuming from S3. So anytime something new is written, they're getting that. The only real
downside, as far as I can tell with this is the latency is higher, right? So if you're writing
to Kafka, your latency is milliseconds at most, right? From the time that you write to the time
that you get that read notification in your consumer.
With something like S3 or Azure Blob Storage or GCP, GCS, it can be closer to a second.
So as long as you can take a second as opposed to milliseconds for that time in between the
write and being notified of it, that's really the primary downside.
Otherwise, you get all the benefit or as far as I could tell, I haven't, I haven't deep dived this,
but you get most of the benefits of Kafka with almost none of the downsides. So super duper
cool thing. I have an article link here. I actually want to do an episode on this later
because I would love to sort of deep dive and find out more, but super interesting. I wonder if it supports all the same APIs, like a Flink or Beam? I would imagine it does, because if you're doing Flink or Beam or any of those, you're just using Kafka as a source, and if they're adhering to the Kafka protocol, you should be good, right? That's the key. That's it.
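To illustrate the protocol-compatibility point: in principle the client code doesn't change, only the bootstrap address. The agent hostname below is purely hypothetical, not a documented WarpStream endpoint:

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

// Same producer API as against a regular Kafka broker; only the bootstrap address differs.
public class ProtocolCompatibleProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "warpstream-agent.internal:9092"); // hypothetical agent address
        props.put("key.serializer", StringSerializer.class.getName());
        props.put("value.serializer", StringSerializer.class.getName());

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // The trade-off is the higher end-to-end latency (object storage) described above.
            producer.send(new ProducerRecord<>("events", "key-1", "hello"));
        }
    }
}
```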
So, again, super cool.
And one of the other things this gets rid of,
so first off, it takes care of that backup problem we were talking about earlier because everything's written to, you know, super high available storage.
And the other thing that it takes care of is not dealing with that inter-region
or cross-region ingress-egress cost
because you're not doing that.
You're just using the cloud blob.
So when we were talking about backups or if you have multiple Kafka clusters,
you're having to write across regions,
and you're getting charged for that network communication
that's moving across boundaries.
You don't get hit with that.
But you still have to write into storage. You have to write to the storage, but you're not going across things, so your storage is highly available, and it depends on whether your bucket's multi-region; if you do multi-region, then it's already built in. They do say, I want to say on their site, that doing it this way can also be like an eighth of the cost of running your own Kafka clusters.
Again, if you can deal with the latency.
So I've already eaten up three minutes.
Next one, I actually had put this in my tips before Outlaw had even mentioned it.
There's the blog article also.
I thought it
was beautiful. MicroG had actually shared this also in the episode discussion, the Trillion
Message Kafka cluster for Cloudflare. I'll have a link to that. And then the only other thing I
wanted to share real quick was Jim Hummelsine, also in the episode discussion channel in our Slack
community, had mentioned that Kafka reminded him
of the mediator pattern in programming.
And it's exactly that, right?
The mediator pattern is, hey, talk to me
so that I can abstract you
from all the various implementations.
And that's exactly what Kafka is, right?
Like, hey, you talk to me
and then just everything that you want to put in here,
you can, but then you don't have to know
about the database server. You don't have to know about the database server.
You don't have to know about the web server or everything else, right?
So pretty cool.
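A toy sketch of the mediator idea Jim is drawing the analogy to, not Kafka code, just the pattern itself:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

// Producers only talk to the mediator, and the mediator fans messages out to whoever
// registered interest, so no producer needs to know about the database writer, the
// web server, or any other downstream system.
public class MessageMediator {
    private final List<Consumer<String>> subscribers = new ArrayList<>();

    public void subscribe(Consumer<String> subscriber) {
        subscribers.add(subscriber);
    }

    public void publish(String message) {
        // the only coupling is to the mediator itself
        subscribers.forEach(s -> s.accept(message));
    }

    public static void main(String[] args) {
        MessageMediator mediator = new MessageMediator();
        mediator.subscribe(msg -> System.out.println("db writer got: " + msg));
        mediator.subscribe(msg -> System.out.println("search indexer got: " + msg));
        mediator.publish("order-created");
    }
}
```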
That'd be every database thing then, right?
Because the whole point of the SQL language
was to abstract you away from the storage mechanisms.
You don't have to know how to read that actual thing.
SQL for sure, but if you want to talk to SQL Server, then you're using the ADO implementation for SQL Server, and if you want to talk to Postgres, you have your Postgres stuff, right? So Kafka, I guess, is sort of the whole idea that if you talk to me, then you can interact with all your other systems, as long as that data is synced in here. But anyways. All right, so your turn, Outlaw. I think you've got like five minutes.
Very good at math you are.
All right, so the number one tip that I wanted to give was I didn't know about this.
Someone on the team already knew about this, but we've talked about JQ in the past,
and it was a previous tip of the week, I believe in like episode 205 if I'm right. I think maybe.
And somebody can check my math on that, but I'm pretty sure. And I found this week yq, which is the equivalent of jq but for YAML specifically, though it does work with other file types too, CSVs and XML, for example. So I'll put a link in for that. It's built on top of jq, is my understanding, so it does require that jq be available, because behind the scenes it's using it. My understanding is it's converting from YAML into JSON and then letting jq do the heavy lifting, but it's not as complete as jq in some of those regards. But I'll have a link to that. And then I wanted to share some
links that MicroG gave. I didn't realize that he'd hit you up too, but anyway, you're getting a big shout out. So, I've used nmon in the past as the way to, if I log into a Unix environment and want to see what's going on, which I believe was a tool originally written by IBM for AIX, if I recall correctly. But it was basically a tool where you could visually see what's happening on the system in regards to CPU utilization, memory utilization, disk IO, or network IO. And he shared a similar one that's a little bit cooler in terms of the graphics that it uses, written in Rust, called Zenith.
So I'll have a link to that.
We've talked about Big O cheat sheets, and there was a new version of one that's kind of more simplified, a Big O cheat sheet that we'll include a link for. And then, on the topic of cheat sheets, there's yet another Git cheat sheet that we'll include a link for, which you might find handy. Otherwise, subscribe.
I think it stopped.
Yeah.
I don't know what he's saying.
I just wanted to mess with you.
I just wanted to mess with you.
Oh, man.
That's pretty good, right?
You got me.
Yeah.
Yeah.
All right.
Later.
Bye.