Coding Blocks - Nuts and Bolts of Apache Kafka
Episode Date: June 9, 2024
Topics, Partitions, and APIs, oh my! This episode we're getting further into how Apache Kafka works and its use cases. Also, Allen is staying dry, Joe goes for broke, and Michael (eventually) gets on... the right page. The full show notes are available on the website at https://www.codingblocks.net/episode236 News, Kafka Topics, Kafka APIs, Use Cases, Tip […]
Transcript
You're listening to Coding Blocks, episode 236. That's right, we're talking stuff.
Oh, about that, almost a little bit now. Yeah, whatever, it's a healthy number. And hey,
if you want to hear more, you can check us out on iTunes, Spotify, or, more likely, listening using your favorite
podcast app. And hey, leave a review if you can. Yeah, that's amazing. All right.
Hey, you can also follow us on that X at CodingBlocks, where we are super active all the time.
So with that, I'm Alan Underwood.
I'm Joe Zack. And I'm Michael Outlaw. Oh hey, and also, somebody recently said that it actually helped them out:
they didn't know we had a Slack channel.
How about that?
So, yeah, let's do it up front here.
Go join our Slack.
It's CodingBlocks.net slash Slack.
Easy to get signed up and truly some awesome people in there.
And you've probably heard us refer to them many times throughout these shows.
Maybe the most meaningful thing we've done in our lives.
Probably. Probably, probably brought, I don't know how many it is now, but at one point it was over
7,000.
So, you know, there's a lot of people in those channels.
Hey, so on this episode before, you know, we'll get into it here in a minute, but we're
continuing on the Apache Kafka sort of intro to it, just getting terminology.
And just so you know,
all the pieces involved,
we're going to pick that back up.
But before we do that,
we need Outlaw to practice his proper noun pronunciations.
Here we go.
So,
uh,
from iTunes,
thank you very much.
Anging jellies.
And I was getting some questionable looks.
Okay.
And from Spotify, Nick Brooker.
Not terrible.
Not terrible.
Not terrible.
First try.
Nice.
All right.
Wow.
That was, can you, I don't know, make it a little more fun, though?
I mean, come on.
Right?
I thought it was fun when you get it right.
Let's hear it.
What should I have said instead?
The Angine Jellies one definitely threw me.
Yeah, but you did it right.
I didn't know I was expecting it.
I was hoping for maybe some Angelinas or something.
I don't know.
Something totally wrong.
Yeah, I mean, this is why you do it. I got nothing.
Yeah, I got nothing. Hey, I don't... I just can't
come up with that stuff on the fly. Or, you know, like, I rehearsed it. Normally, when I
try to say these names and I mess them up, that's happening in real time. Like, that's not
something I put together in scripted thought, like, oh, this will be funny. You know, I get asked this a lot,
but we actually, we don't pre-plan our mistakes. They just happen naturally. Yeah, I just legitimately
can't say proper nouns. We're full of mistakes. Hey, also, we didn't put these in the notes, but
we got a couple of, you know, really nice replies on the last episode, in the comment section on episode 235.
So, yeah, this is one that I don't even think I could pronounce. I don't know if it's You Ilian.
He said, you know, thank you, and he loved your picture that you did of the inverted frog and
the scorpion, Joe, which was awesome. And then Tanya G said that she's been listening for more than 10 years
and to keep it up. So that's awesome. And then we also got an email that I thought was pretty funny.
I think, Jay-Z, you found it. Yeah, when I was asking about how we get our MP3, our audio files,
so small. They're frequently much smaller than other shows, like 90% smaller.
And we actually do have a pretty long process, which we used to have on the website, I think.
Did we?
I don't think so.
I don't think we've ever documented it.
I've got the notes I use to make it, so we can drop our secrets in there.
Yeah, but I mean, I want to say usually our files are smaller than 60 megabytes for over two hours of content,
which is definitely smaller than what a lot of people put out.
I mean, I think it's probably just the bitrate that's the most important factor.
Bitrate and MP3, not using some sort of lossless type thing because, I mean,
we're not trying to keep you know dynamics between
instruments and all that. So I assumed MP3 was the standard. So, yeah, that's why I said just
the bitrate. But, yeah, fair enough. If you're going to, like, put it out in a different format, then
maybe, you know, it's going to be a larger file. Yeah, sure. And then the one last bit of news here is Atlanta DevCon is coming in about three months.
So September 7th, that's atldevcon.com.
So if you're in the area, definitely go check that out.
It's always super inexpensive and a day full of good stuff.
I wonder if they've announced the topics yet.
You know, I wonder if they're going to call for speakers. Maybe they're still open.
I think they're still calling for speakers. Well, sponsors.
Yeah, I'm not seeing the topics yet. So, yeah, forget I brought it up. All right, it's dead to us.
All right, so picking up where we left off from last time, we're going on Kafka again, and this time let's talk about Kafka topics and what they actually mean. So Kafka topics
are where the data is written to, and I think we might have left off with that last time,
so it's worth just reiterating this: these are like the folders on the brokers
where the data is written, and they are partitioned. That means that they are distributed, or can be.
If you've only got one broker, obviously it's not distributed, right? Everything's going on to one. But
let's say that you have a three-broker setup or whatever. These will be written across multiple
brokers into what they call buckets, more or less.
And any type of new events that are written to Kafka are appended to those partitions.
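To make the partition idea concrete, here's a minimal sketch of creating a topic with a handful of partitions using the confluent-kafka Python client. The broker address, topic name, and counts are placeholders I've assumed for illustration; they aren't from the episode.

```python
from confluent_kafka.admin import AdminClient, NewTopic

# Connect to a broker; "localhost:9092" is just an assumed dev address.
admin = AdminClient({"bootstrap.servers": "localhost:9092"})

# A hypothetical "orders" topic split into 3 partitions, each copied to 3 brokers.
futures = admin.create_topics(
    [NewTopic("orders", num_partitions=3, replication_factor=3)]
)

# create_topics is asynchronous; wait for each result (raises if creation failed).
for topic, future in futures.items():
    future.result()
    print(f"Created topic {topic}")
```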
So I find that interesting because I know the three of us have looked at the folders on the
brokers and it says it's appended to them. As far as I can tell, it just writes new files, doesn't it?
I think it actually adds to the file. So, like, I don't know exactly when it breaks it up, but it will have
multiple files in there, so you don't have like a hundred thousand files if you have a hundred
thousand messages. But it does some sort of, like, logical breaking up into segments, I think it
calls them, and each one of those is a file. And then it kind of keeps buffers open to them and
just kind of keeps adding to it. Yeah. Well, the thing I was going to add to that, though,
too, is that one of the concerns that you have to be aware of in a Kafka configuration,
and this is getting kind of in the weeds but relates to your question, is the number of open
file handles that you're going to have. So for every topic that you have, you're
going to have at least two file handles open per broker. And if you did what you suggested by
having like a bunch of additional files, then you would make that problem worse. Right?
Right. But so that's why I say, like, it's a concern depending on, like, how many
topics you're going to have on your
broker, the partitions and whatnot. Oh, I should have said not for every topic, I should have said
for every partition, you're going to have the two. Oh yeah, good point. And then it's got to be
reading from and writing to those partitions simultaneously. Yeah.
And I forget what the two... one of them is the actual data, and I
forgot what the other one is now off the top of my head. I didn't realize there was going to be a test.
But, yeah, so, like, you know, if you were going to have, like... I think there's been some crazy ones.
Like, there was the... was it the trillion-dollar Cloudflare Kafka setup or something like that?
And then Uber is a big one.
Like, depending on how much usage your Kafka environment is going to go
through,
you have to be aware of that. And
you can update that, if that is a problem for you. In, like, a Linux environment,
you can change the number of open file descriptors that
can be open at one time,
but you have to be aware that there is a default max limit
in various Linux flavors. And then you also have to be aware of the overhead that each of those
would take up if you're getting into these extreme situations, you know. Right. If you're just
getting started, because you're listening to this and trying things out, you're not hitting those limits yet. Yeah, and I think
we've talked about how difficult
it can be to add partitions later.
It's not the end of the world,
but it requires some planning and some doing.
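If you're curious what the file-descriptor ceiling they mentioned looks like on the box you're testing on, here's a tiny sketch using Python's standard library (Linux/macOS only). It just reads the limit being discussed; the partition count is a made-up number for illustration, and nothing here changes your system.

```python
import resource

# RLIMIT_NOFILE is the per-process cap on open file descriptors.
soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
print(f"soft limit: {soft}, hard limit: {hard}")

# Rough back-of-napkin check: if each partition keeps a couple of
# descriptors open, a partition-heavy broker can chew through a
# default soft limit (often 1024) surprisingly fast.
partitions_on_this_broker = 5000  # hypothetical number
if partitions_on_this_broker * 2 > soft:
    print("You'd want to raise the limit (e.g. via ulimit or systemd) before going that big.")
```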
So I think when we first got started, we were like,
well, let's just give everything a thousand partitions.
That'll be fine. But because of the overhead
of those file handles and stuff, you run into
limits and realize that there's
some guidelines there for a reason. Sizing a cluster is kind of a big deal. Yeah, I'm looking
ahead in the show notes because I'm wondering if that's going to be apparent as to why. So we'll
see how far we get as to, like, why that would be a problem to add or remove partitions. I don't think
it's in this one because this is more just about the different pieces,
not necessarily a deep dive on any of them. So it's important to know about these partitions
though, because the way that this data is written and the fact that it is partitioned across multiple
brokers is what allows Kafka to scale, right? So we've talked about the Microsoft one a long time ago, their Kafka broker setup.
I forget if they had hundreds of brokers or whatever, but that is why it can do it is
because they can break up the data like this and allow multiple readers and writers.
So here's an important thing to note.
And again, I don't think that this gets into the details here, but we can sort of ad hoc this. When you write events to Kafka, any event that uses the same keying, so let's say that you're keying by your customer identifier and some other piece of data, that's going to be your key. All that data
gets written to the same broker, right? To the same area, the same spot on disk. And that's
important to know from something that we've all sort of talked about and called hotspotting.
There's probably other terms for it, but I'll let you guys fill in the blanks there. Well, it's that keying that I was referring to. That's what makes adding
or removing the partitions, you know, tedious. Because you're going to define
like a keying structure so that as each message comes in, it's going to deterministically always route
to a specific broker and partition
and then that's how the system knows, when you go to do a read, exactly
where to go to get it, because it knew exactly where it went to write it. But that's why, like,
when you go to change those numbers, then it's like, okay, well, you have to rekey things.
Yeah, it's super important, because if this broker gets the event
or that broker gets the event,
they both put them in the same partition
so that when you read a partition,
it doesn't matter which broker you read from,
you always get the same results.
So to enable that consistency,
it has to have a deterministic and repeatable,
basically, algorithm for selecting that partition
I think the default is some sort of round robin where it'll try to kind of randomly go in order
based on a checksum based on the key. And if you don't give it a key, you can pass null for
a key, then it basically just checksums the actual payload and figures it out from there. So it's still
consistent based on the message. And that's super
important to understand, because that's basically how all of the scaling, you know, is going to be
focused on that concept, right? Of being able to deterministically always know where you wrote it
and where you would read it based on some kind of key. Yeah, and so when you... go ahead.
I was going to say, well, so if you're going from 10 partitions, you know, and you've been doing
round robin for a while, now you add an 11th. You can't just start with that; you have to rebalance
across all of them, and you have to do that on the same brokers at the same time. They have to
coordinate to make sure they're all like, okay, we've got the same amount of messages now. And you
want to do that with no downtime, too. And so there's just some tricky stuff that needs to happen, and it ends up being
a couple scripts. It's not that bad, right? We've done it before on really large topics, and Kafka
does a great job of, like, still accepting new data and then knowing how to magically kind of pivot
over. And we've talked about strategies on how you could do that before, like snapshot isolation
stuff, where it'll kind of copy data and then catch up and
keep doing that until it's done. Sorry, I didn't mean to cut you off. No, this is where that keying, though, is important. Like, when you decide what the key is going to be for the
messages that you write into your topic, the hotspotting that you referred to, Alan, would be like
if you chose poorly on that. So for example,
there's the three of us. And if we were writing to a Kafka topic and we decided
that the key was going to be the user, well, depending on your circumstances,
maybe that's a fine strategy. But if Alan is way chattier than the rest of us, and let's say he's
putting out a trillion messages an hour, then that Kafka system is going to be writing all of that traffic to that single,
you know, or to a deterministic set of, you know, brokers. So it's going to
potentially hotspot to one specific topic partition, which could be, you know,
on a single broker. And therefore, you're not really taking
advantage of... you're not parallelizing the writes like you wanted to do. So that's why you
have to... you need to consider how you want to key your data. So a couple of things to add on to that.
So I was actually thinking a very similar example to what you were saying. So in that case, what Outlaw was getting at is, let's say that you had three brokers, right? And each one of us was going to a specific broker. I'm the chattiest with the trillion messages an hour. My broker's busy, right? And then Outlaw's doing 10 messages an hour and Jay-Z's doing 50 an hour. Their two brokers aren't doing anything.
So mine's probably going to sort of bottleneck a little bit while theirs aren't even really
doing anything at all, right?
And so that keying strategy matters.
And what Jay-Z said is, you know, I think by default it does round robin.
With Kafka and your producers, you can actually specify a key strategy, right?
Like, so you can sort of tell it
how you want it to place data
so that if you know you have big customers
or something like that,
to route things, split them up in different ways.
So there are ways to get around that.
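Here's a rough sketch of what keying looks like on the producer side with the confluent-kafka Python client. The key is what gets hashed to pick the partition, so every event for the same customer lands in the same place. The broker address, topic name, and customer IDs are assumptions made up for the example.

```python
from confluent_kafka import Producer

producer = Producer({"bootstrap.servers": "localhost:9092"})  # assumed dev broker

# Keying by customer ID: the default partitioner hashes the key and maps it
# to a partition, so "alan" always routes to the same partition -- which is
# also how hotspotting happens if one key is far chattier than the rest.
for customer_id, event in [("alan", "bought coffee"), ("outlaw", "bought tea")]:
    producer.produce("customer-events", key=customer_id, value=event)

producer.flush()  # block until everything outstanding is delivered
```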
And I should say round robin is not the right term for it.
It's some sort of deterministic thing that attempts to basically balance it out.
Right, right.
Based on a checksum.
So it's not like just saying, okay, I did this one last time.
Let me do the one next time.
But the result is similar to that.
Yep.
And the other thing I wanted to call out, because it's not obvious to anybody that doesn't live in this world.
When they said that if you go to change the number of partitions on something, they mentioned that you have to rekey that. Well, with rekeying that, the important part here
is you're actually moving the data on disks, right? So if you add a number, if you add more
partitions, or if you reduce the number of partitions, you're actually having to rewrite
that data to new locations. And that's
why it gets tricky because to balance it out, it's rekeying and putting that stuff into new
buckets or whatever. So it's all, it all sounds complicated. They do a pretty good job of allowing
you to do this stuff behind the scenes, but it is, it is a little bit frustrating when you're like,
oh, well, I'm just going to plan for this, and then you realize that your plan wasn't good and then you have to go back and fix things up. Yeah, you know,
I just... and looking that up, I also realized I was wrong about how null keys are handled, and they
are handled round robin. So there's no guarantee, if you don't pass a key, that it's
going to end up on the same partition. I don't know how that works with, like, multiple brokers or...
That's interesting. I guess there's got to be some sort of routing going on.
I'll have to dig into that a little bit offline,
but just wanted to get in before the comments.
All right.
So here's Kafka guarantees reads of events within a partition are always read
in the order that they are written.
So that's really important to note.
That's one of
the beauties and the simplicities of Kafka is it's just a queue. So when the rights come in,
they're going to go into that topic in the order they were received. Now, I think there are some
newer-ish features of Kafka that allow you to time-align things, and I think it will try and place it in order for you. But in just
a regular setup, the order it comes in is the order that you're going to get it out on a read.
Fault tolerance and high availability: topics can be replicated, even across regions and data
centers. Now, I have a note on this next line, and this is actually going to go
to one of the tips of the week that I have in a little bit. But if you are doing your own Kafka
cluster, right, whether it's on bare metal or on VMs or whatever, you probably don't need to worry
about this next thing I'm going to say. However, if you are running Kafka clusters in the cloud and you're going
across regions or different availability zones or something like that, there's usually a cost
to move data between regions, even within the same cloud provider, right? So if you're an AWS,
GCP, Azure, whatever, if you're moving from US East to US West or something like that, and data is being
replicated across brokers, you're paying a data transfer fee for the data that gets written there.
So be aware if you do go to do this, that you need to look into your cloud costs for every bit of things, right? It's not just data storage. It's network transfer,
CPU, all that comes into play. It can get really expensive and you don't even realize it, right?
You get the bill and you're like, whoa, why is this double what I thought? And then you realize
that you're getting hit with all these data transfer fees. Actually, they call it ingress and egress, right?
That's what you're getting charged for.
All right.
So next up here, the typical, typical Kafka setup, a simple one is you're going to have
at least a replication factor of three, you know, for high availability and high throughput.
And that's for production type workloads.
You're probably not doing it in development,
even though it probably wouldn't be a bad idea to have it.
And again,
replication is performed at the partition level of a topic.
So I think we should describe what that means then,
right?
Because I don't know,
we've gotten into like replication as it relates to Kafka.
So when you write a message to Kafka,
to a Kafka topic, you can
specify different strategies for like, do you just write it and forget about it? Or do you need to
wait for at least one broker to respond that it has successfully acknowledged the write and
committed it? Or do you need to wait for all brokers that are part of the replication strategy to respond that they have successfully committed that message to disk?
And if you're going to indicate like you mentioned, replication factor three.
So what that means is if you have a single topic and that topic is going to be split into three partitions,
I'm sorry, let me choose a different number. If you have a single topic and that topic is
going to be split into 10 partitions, then each single partition is going to be written across
the number of brokers, which in the case of a replication factor three, that's going to be at
least three brokers that that's going to be written across. So really, from a sizing perspective, you are three times sizing that topic from a storage
perspective, right? But what that also means to consider is depending on the size of your cluster,
if you only have three brokers, then that means that for every write to Kafka,
all of your cluster has to be involved, which you might not want, depending on your needs.
You probably don't want that. You probably want other brokers available to handle writes to other
partitions and whatnot. So something to consider in your Kafka sizing and configuration strategies.
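The fire-and-forget versus wait-for-one versus wait-for-all choice described above maps to the producer's acks setting. Here's a hedged sketch with the confluent-kafka client; the broker address and topic are placeholders, and the comments describe the setting as I understand it rather than anything from the episode.

```python
from confluent_kafka import Producer

# acks=0   : fire and forget, no broker acknowledgement at all
# acks=1   : the partition leader has written it; replicas may still lag
# acks=all : every in-sync replica has it before the write is acknowledged
producer = Producer({
    "bootstrap.servers": "localhost:9092",  # assumed dev broker
    "acks": "all",                           # safest, slowest of the three
})

def on_delivery(err, msg):
    # Called once the broker(s) have accepted (or rejected) the write.
    if err is not None:
        print(f"delivery failed: {err}")
    else:
        print(f"delivered to {msg.topic()} partition {msg.partition()}")

producer.produce("orders", key="cust-42", value="order placed", callback=on_delivery)
producer.flush()
```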
Yeah. So if you didn't totally follow what he said, you might have five brokers,
but a replication factor of three, so that you have more brokers than what you have replication.
So that each one's handling, you know,
different things.
And so,
so you have more available than,
than what's just what you're doing your replication on.
I will say that from my own thinking,
I don't know.
I'm curious if you guys have,
you know,
can convince me otherwise,
but I,
I can't think of a reason why you would want the replication factor to equal the size of your cluster. No, it seems like a bad idea. Oh yeah, yeah, you wouldn't
be able to take down one of the nodes, or, yeah, one of the brokers, right? Yeah, agreed. All right, I'm
going to let somebody else pick up this last one. I just did this full section here. All right, we're talking about APIs.
I was reading them up on backup strategies.
How do you backup your Kafka cluster?
I've been looking at the wrong show notes this whole time.
No wonder.
Well, the reason I mentioned it, this is something we've talked about before,
is when you've got a database,
this is not a database, but basically a data store,
and it's got replication built in and all sorts of stuff built in
around the kind of redundancy and availability,
what do you do for backups?
I know it's really common in our DBMS world to have a backup strategy
where you do a full, an incremental, whatever.
But when it comes to, like, other data stores that are kind of similar and have this stuff built in,
then you have to decide whether you want to do that and what mechanisms you want to
use, because there's not, like, a simple backup command. There's no, you know, incremental that
you can do. So you can do interesting file-level type stuff like rsync. I was just looking at
different strategies that people do in order to get
around that.
Um,
don't people use mirror maker for Kafka?
That is definitely a strategy.
Yeah.
That's probably the most common one.
But when you look at it,
it's,
it's almost like just acting like another broker kind of forwards the stuff
off.
And it's very like natural to the way Kafka operates in general,
but it's just kind of interesting. the problem with mirror maker though is that would require you to have another
cluster so yes if you if that works for your cost needs and your configuration needs and whatnot
then that's great but if all you want is just a backup of the data and not necessarily have to have a running second cluster then you
know mirror maker isn't going to be for you yeah yeah i was thinking it was particularly interesting
because like rdbms you know like back in the day like a sql server or postgres something like that
i think they had the notion that was okay for databases to go down for a while and if you had
to back up to disk and then restore from disk like oh no that stinks
we'll have to come in late tonight to do it and it is i think that kafka in particular is uh such
a different world like when you're dealing with like near real time tons of events like you
probably don't want to go down you don't ever want it to go down uh and so you know it's much
more common to have like a separate cluster running in another cloudy environment that you
just kind of back up to and so i was kind of wondering if uh if relational databases like if a new relational database came up today
came out today would it still deal with backups in the same way and i think you're solving different
problems though right like i don't know that you're going to use that relational database
for the same kind of real-time solutions that you're going to use something like a kafka for
right oh yeah for like i mean we talked about in the last episode, um, Uber as an example of a Kafka cluster. And I think we made
the point where we were saying that like the users of Uber are the consumers when you're like,
we were just theorizing on like how Uber might use it. So, well, you know, when you're, when you
view like where the other drivers are, you're consuming like where that information is, you're
not producing it versus the drivers might actually be producing data as like, Hey, here I am. So you could see
where like, if suddenly you took that, that data system down, you know, now the users don't know,
well, I don't know which driver is closest to me. I don't even know that it doesn't even look like
there's a driver anywhere near me. I guess I'll go to Lyft. So it would have a huge financial impact to,
uh,
in the use case of like an Uber where if that system was no longer available
because you're doing like routine maintenance or,
you know,
something worse.
Um,
and,
and,
and there might be like other,
you know,
more life critical type situations too,
that I'm not thinking about
and yeah and also on the sql server thing right like so you could take it down but they do have
online backup modes right so differential backups and that kind of stuff so databases have come a
long ways. But I do think, you know, that Outlaw had said the trillion dollar... I meant to say
trillion message. Yeah, yeah. But, like, in that case, is it even feasible to try and, quote-unquote,
back that up as opposed to having another cluster running that's just a failover type thing in a
database type world right yeah that's where like failover type thing in a database type world, right?
Yeah, that's where your mileage is going to vary.
You need to be hyper aware of what your use cases are and what your requirements are.
So if you can afford to take that hit of downtime of, oh, hey, my Kafka cluster went down, no worries.
Maybe you're running in, like, the
cloud and that region went down, no worries. You're like, okay, well, we're just going to, you know,
spin up our, we're going to helm install all of our stuff over into this other region
and spin that up. And I'll just, uh, backup, I'll just restore the data into those topics
based off the last, you know, however I snapshotted it. If that, if that works for you, then that might be a more cost effective,
uh, you know, disaster recovery type strategy for your needs. But like in the case that Alan
said, you know, if you're getting a trillion messages and maybe like you're an Uber,
right. Or whatever. And you can't afford that downtime, then you want a situation where you can just say, you
know i'm flipping over to this cluster that's now the live cluster on the west coast and we'll
repair the east coast in the meantime and you know or whatever you know man i can't wait to get to
one of my tips of the week because it would actually answer one of these things i need to do
it early you know i i can't do it early we need people to stick around to the end of the show. I'm going to tease it right now.
Yeah, you got to stay to the end because there is actually a pretty good solution to this that I think would sort of blow all three of our minds.
It did mine when I first read it.
So since we're like off topic just for a minute then, I was sitting here looking at the previous show notes.
I'm like, man, nothing is lining up.
I don't know where we are because everything's already been crossed out.
Like we've already talked about that. And I'm like, I'm just going to wing it.
You know, I don't.
This is weird.
I went back to it while we were talking because I was looking for something from the last episode, and that's what made me open it.
And then I didn't realize that I was still on the wrong one.
That's so funny, man.
We've only been doing this for, what, 10-plus years?
Yeah, but it is a Saturday morning, so all of us are—
Yeah, I'm missing cartoon time.
Yeah, right? We're done.
One more thing on the backup thing
Sorry, I keep bringing this up, but I was thinking about relational databases and other kind of
more traditional, or just, period, databases. Like, I keep implying that Kafka is a database. It's not a
database. It doesn't function well as one, for a variety of reasons we'll probably get into at
some point. But... oh, sorry, go ahead. But another reason is just different use cases. So, like, if I've got a
relational database there are a lot of times when i want to take a backup of that database and give
it to somebody so here's a database of all our tax information all our products whatever take it
and run it in a different environment uh kafka is yeah basically a big queue right so it's meant to
be like this more kind of organic living ecosystem.
And the data that's in there isn't necessarily important, like historically, like you wouldn't
give someone like, here are my queues for, you know, this last five minutes, you know, here you
go. It just doesn't really make sense in the same way. You know, there are different use cases and
there's compaction, you keep data around, but it's just not really designed for that. Well, that's
what I was going to say. Like, what you said is mostly true, except for the case of compaction,
where it is sort of always giving you the latest state of something, right? Which, yeah, might be
useful. And that depends on your compaction strategy, too. Right. Yeah, I mean, it's all...
which is another thing that you have to take into account in regards to sizing how much disk is going to be required for a given topic partition.
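Compaction is a per-topic setting. Here's a minimal sketch of creating a compacted topic (latest value per key is kept) with the confluent-kafka admin client; the topic name and counts are hypothetical, and real retention/compaction tuning goes well beyond this.

```python
from confluent_kafka.admin import AdminClient, NewTopic

admin = AdminClient({"bootstrap.servers": "localhost:9092"})  # assumed dev broker

# cleanup.policy=compact keeps the most recent value for each key instead of
# dropping data purely by age or size -- the "latest state of something" idea.
compacted = NewTopic(
    "customer-latest-state",          # hypothetical topic name
    num_partitions=6,
    replication_factor=3,
    config={"cleanup.policy": "compact"},
)

for topic, future in admin.create_topics([compacted]).items():
    future.result()
    print(f"Created compacted topic {topic}")
```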
So it's very simple and it's not.
Yeah.
And what's kind of funny is like you keep wanting to compare it.
I keep wanting to compare it to other things.
Like I keep wanting to say, well, it's kind of like a queue.
And in some ways it's like a database.
And what ends up happening is, like, it's kind
of become its own beast at this point. So, like, at this point, if you go looking for, like,
cloud-native Kafka solutions, you'll find AWS Kinesis and you'll find GCP Pub/Sub and whatever,
you know, Azure calls it. And so these things... like, Redpanda is another that meets the Kafka protocol.
And in a lot of ways it has kind of become its own kind of definition of a
particular data store, because it is so configurable, it can be used in so many different ways. And so
I think that's part of why we say it's a little bit like this and a little bit like that and a
little bit like this. But when you boil it all down, it's its own beast. Well, I think, like, to
your comment a minute ago about, like, it being so confusing but it's not. Like, I think it,
once you get the handle of like,
here are the levers I need to,
to concern myself with in sizing this,
then it's not so bad in the beginning when you're coming out at fresh,
then it,
it can be.
But I think for me,
the biggest,
um,
difference was there might be database administrators out there for like an Oracle or
SQL or Postgres or whatever that thought about this. But imagine if like on every table of your
database, you concerned yourself with, okay, how much space is each column going to need? And now
how many columns is that going to have? How many rows do I expect in that table? Okay. That's the size of that table. Now let me move on to the next table.
And you're going to rinse and repeat for every table in your database so that you can get an
estimate of what size your, your database is going to be. But I haven't, I don't know of anyone that's
ever done that on, on a database. Typically it's just like kind of estimated, you know, based on some, you know,
trial and error type of usage of it. And you move on about your day. And the difference here,
though, with Kafka is that in a SQL world, you can't really like say like, oh, well, that entire row, I need to be concerned with that entire row and how much size it is on average, because, you know, that entire row might not be getting
returned back in a query that often. And it might not even be written as a whole unit
in any one time either. And so you're only concerned about like the query performance,
because that's the thing that's going to return back the amount of data and like how much data
does it have to like search over and scan over versus in a Kafka world.
When you are reading and writing from those topics, you're saying, here's the full thing.
This, you know, we call it a message in a Kafka world, but I'm trying to like relate this to the to the row example in a table.
So like here's the entire row.
Boom.
And now when I do the read, I'm getting the entire thing back. And so that's why from a Kafka perspective, your concern is based on, well, how much IO can I get
out of that system, both in terms of like disk IO and network IO, like how fast can I read and
write from it, period. That's the concern. And these levers about like the partition strategy
and the replication factor and the, you know, what kind of
compaction am I going to do? What kind of retention am I going to do? That's why these, these types of
things can matter. Cause that's ultimately what's going to decide how much is on disc. And, you
know, obviously you need some overhead on that disc to be able to write more records, et cetera.
So, but I was thinking too, like you made a comment a moment ago and I don't
know if you guys had this experience, you know, going back years, you could never pull this off
today. This would not fly today. Today, I think that we as a society have like matured to a level
of, you know, when it comes to security or personally identifiable information, you know, there's now
like GDPR rules and various GDPR like rules in various countries. But I remember, you know,
first starting out as a developer, you know, I worked in a services organization and it was
quite common for us that like any customer that we had,
they would be like, Hey, we want you to build us a new website or whatever, you know, a new,
a new widget and whatever we were building. It was quite common that they would give us
a dump of their data. Like, and sometimes I'd be like, here's, here's a copy of our database.
Yeah. And, and then that's what we would use as the basis for future work.
Even if that work meant,
um,
Hey,
you're on Oracle and we're going to move you to SQL server or something
like that,
or vice versa.
Right?
Like we need to know,
we need to have,
we need to know like how to convert from one to the other or just like
where all the data is.
Like that was pretty common.
That would never fly today.
No,
you couldn't do that.
No. Yep. I remember one place we used to work, everyone would just restore a production copy of the database to their local,
and back then we had credit card numbers, too, like, unencrypted. This is how long
ago it was, which is just crazy to me now. We all just had this information. It was like,
yeah, this is like three years ago, you know. Yeah, I worked in another shop where we had this thing called the small backup
is what we called it, and the DBA had this process that would run daily that would take a backup
of the production database but it would limit like how many rows were in certain tables and
some tables it wouldn't bother at all,
and that's why it was called the small backup.
And that backup was put on a file share
that every developer, when they got in in the morning,
the first thing they'd do is grab the latest small backup,
and that's what we would dev against.
Man. Yeah, times have changed.
Yeah.
A little.
A little.
So pessimistic.
That's so uncharacteristic of you.
The boomer hour.
Right.
All right.
So I guess we can talk about Kafka APIs.
Now we can resume the show notes back on line three where we left off.
Wait, on last show notes?
Okay.
Right.
Which show notes are we talking about?
This particular episode show notes.
Okay.
Hold on.
Give me a sec.
That makes more sense.
All right.
It looks like next up is Kafka APIs.
And there are a couple of different sets of APIs here that Kafka provides.
We've got an admin API, which is used for managing things and inspecting topics: seeing how much data is in them, how many partitions you have, which brokers have information on which partitions, things
like that, other Kafka objects, and, you know, admin-type stuff. You've got the producer API, which is
meant for applications to basically write events to Kafka topics. And there are producer APIs...
there are probably APIs for all of these in just about every language you can imagine now,
which wasn't always the case.
But like JavaScript, C Sharp, all those, you're going to find producer APIs.
And also same with the next one, which is consumer APIs, which is how you read.
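And for the consumer side, here's a minimal poll loop with the confluent-kafka Python client. Within a single partition, messages come back in the order they were written, which is the ordering guarantee mentioned earlier. The group ID, topic name, and broker address are placeholders.

```python
from confluent_kafka import Consumer

consumer = Consumer({
    "bootstrap.servers": "localhost:9092",  # assumed dev broker
    "group.id": "example-group",             # hypothetical consumer group
    "auto.offset.reset": "earliest",         # start from the beginning if no offset is stored
})
consumer.subscribe(["customer-events"])

try:
    while True:
        msg = consumer.poll(1.0)  # wait up to 1 second for a message
        if msg is None:
            continue
        if msg.error():
            print(f"consumer error: {msg.error()}")
            continue
        # Offsets increase monotonically within a partition -- in-order reads.
        print(f"partition={msg.partition()} offset={msg.offset()} value={msg.value()}")
except KeyboardInterrupt:
    pass
finally:
    consumer.close()
```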
And both of these have a lot of configurations.
You can do that kind of to pick things up.
So it's like you would think, you know, naively, if you've not been familiar with this, you might think like producer API, we got to produce and that's it.
Like what else is there?
But there's a lot of configurations, a lot of interesting things you can do with those.
And there's nothing to stop an application from being both a consumer and a producer or even being multiple producers and multiple consumers and mixing things up in interesting ways.
I mean, example of that.
I'm sorry.
Yeah.
Well, I was going to say also, go ahead.
Well, I was going to add that like an example of the producer API,
like you mentioned, like you think they would just be writing to it.
But like that decision of do I need to wait for every broker to commit to writing
whatever that replication factor is that I need to, you know,
once something is written,
do I need to wait for each
of those brokers to respond that they committed it before I move on, or do I need to retry? That's
an example of what that producer API is providing for you. Oh yeah, and schema, and, yeah, all sorts of
stuff. I had an additional one there as well. Do you write out every time you get an event? Do
you automatically write it to the... as the producer,
do you write it to the broker, or do you wait some period of time to send it in
a batch?
Or do you wait for some amount of bytes to be met?
So there's,
there's all kinds of little tweaks that can have massive,
massive impacts on your performance that are left up to the producer to decide
how they want to do it.
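Those send-now-or-batch decisions are producer configs too. Here's a hedged sketch of the knobs being described; these are librdkafka-style setting names as I understand them, the values and topic are made up, and exact names and defaults are worth double-checking for your client version.

```python
from confluent_kafka import Producer

producer = Producer({
    "bootstrap.servers": "localhost:9092",  # assumed dev broker
    "linger.ms": 50,              # wait up to 50 ms to batch messages together
    "batch.num.messages": 10000,  # or send once this many messages are queued
    "compression.type": "lz4",    # batching plus compression is where throughput comes from
    "acks": "1",                  # and this interacts with the batching trade-offs above
})

for i in range(10_000):
    producer.produce("metrics", key=str(i % 10), value=f"event-{i}")
    producer.poll(0)  # serve delivery callbacks without blocking

producer.flush()
```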
Yeah. It's really tricky, and a lot of the... this is one of those things where there will be like 80 configs, and, like, if you
change this one, this other one doesn't apply. And so you've just got to do some reading there and
have some fun with it. Do some testing. Totally, do some testing. Yep. And then, Kafka Streams.
I did some of this. I think you did a lot more of it than me, Alan.
I don't know if you want to talk about it.
Yeah, so the Kafka Streams API, this one's pretty interesting.
Again, we're talking about Kafka as a platform, which is why they bring this in.
But this is what allows you to take in these data events and turn them into streaming applications, which are basically microservices. And some of the key functionality that they have
there are data transformation, stateful operations like aggregations, joins, windowing, like those
all get fairly complex. But the important part to note here is Kafka streams is built into the Kafka
ecosystem. So if you have Kafka, you also have the ability to write these streaming applications
without actually bringing in other frameworks, other platforms, that kind of stuff. I will say
we did spend quite a bit of time doing this. And I think what they did here is they borrowed from
some of the other big
streaming platforms to sort of figure out how they wanted to do things.
And some of it turns out that it just doesn't work well with,
with some approaches that you need to do in streaming applications.
So if you have like,
if you have really stateful streaming applications,
meaning that like you're trying to, let's say every event message that comes in for a customer, you're counting it or something.
You're keeping like a running tally of like how much money they've spent.
Exactly.
Yeah, that's a great example.
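The actual Kafka Streams API is a Java library, so this isn't it, but here's a toy Python sketch of the shape of that running-tally idea: consume purchase events keyed by customer, keep a running total, and write the updated total back out to another topic, which is roughly what Streams does for you, minus all the fault tolerance. Broker, topic names, and value format are assumptions.

```python
from collections import defaultdict
from confluent_kafka import Consumer, Producer

consumer = Consumer({
    "bootstrap.servers": "localhost:9092",  # assumed dev broker
    "group.id": "running-totals",
    "auto.offset.reset": "earliest",
})
consumer.subscribe(["purchases"])            # hypothetical input topic
producer = Producer({"bootstrap.servers": "localhost:9092"})

totals = defaultdict(float)  # in-memory state; real Streams backs this with a changelog topic

while True:
    msg = consumer.poll(1.0)
    if msg is None or msg.error() or msg.key() is None:
        continue
    customer = msg.key().decode()
    totals[customer] += float(msg.value())   # assume the value is just an amount
    # Publish the new running total so downstream apps can react to it.
    producer.produce("customer-totals", key=customer, value=str(totals[customer]))
    producer.poll(0)
```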
When you're doing things like that, that's a stateful operation that you need to keep in those streaming applications.
And the way that Kafka streams likes to do that
is it writes that data back to another Kafka topic. So Kafka streams API, the way that it's
implemented, it's very much as you do things, write it to a new topic. So it's very chatty
back and forth to the brokers and it keeps very little state in the applications themselves. So
that's a little bit deeper dive into Kafka streams, but it does allow you to do things, you know, basically reactively as events come in,
you can, you could do data manipulation with them. And, and when I say joining,
it's pretty crazy, the kind of stuff that you could do joining data that's coming in, you know,
thousands and thousands of messages a second and join in doing, you know, data transformation. So pretty cool way
to do real time processing. You know, I think a good example there is fraud detection. Like
you imagine that you've got a topic that takes in data about credit card transactions. So it gets
that in, maybe it has a credit card number, you know, some sort of a token or something,
and it looks up the credit card for it. It looks, maybe it runs some sort of like check
to see if that card has been used in fraud
or if that is a high quality card
or if the card is valid.
You know, it runs these different checks
and each one of those can end up being like a node
in a directed acyclic graph.
And it's writing back to Kafka topics
that it creates in the backgrounds
and kind of manages that sort of thing.
And so, you know, at the far end, that pipeline, you can have things to say,
okay, well now I know the person that did it. I know the card, I know the dollar amount.
Let me make a decision on whether or not, uh, you know, we're going to allow this or not
allow this transaction to go through. And, um, it does a lot of work for you,
which is really nice, but you can imagine how, if you get bigger and bigger and bigger, it's making these decisions for you.
And it's kind of... it's tuned for kind of smaller use cases.
And so if you get to be like Amazon scale or something, you're probably going to need more control over how this caching works and how these lookups happen in order to be efficient, not crush other data streams, and not run out of disk space or whatever. And that's where something like a Flink or, like, an Airflow, Apache Airflow, would come in
and be able to give you a little bit more control, but it's not going to be as native to Kafka. And
so I think Kafka Streams gets overlooked a lot of times because it is kind of, like,
in this small-to-medium space, and then there are these really big players that are really well known that
kind of get more attention.
Yeah.
And for what it's worth,
the primary reason we looked into Kafka streams was because it was part of
the platform,
right?
Like there's,
there's a lot of people that they look at something and they're like,
wait a second.
So you want to bring in 12 new technologies.
We're going to have to maintain all these.
Yeah.
And if you get it,
you know, quote-unquote,
for free with Kafka as a platform, then why not use it? And that's kind of the route we took. It's, hey,
all right, so you're just using the same things. But it turned out that it was not a great fit for some
of the type of stuff that we were trying to do. There's a great use case, though. It's pretty cool to be able
to say, like, take in this credit card, join it to this person, and then filter out these people, and then pump it to this output.
And that's kind of how the code reads, you know, which is really nice.
But yeah, it's just those details will get you.
Yeah.
See, Kafka Connect API, one that I have mixed feelings about, but mostly love for the most part.
It's really cool.
It allows for these reusable import/export...
they call them connectors... that you can use to either sink or, I forget what the other word is, source data, or sink data to external systems.
That's what it's built for.
But it's also just kind of a distributed task runner.
So you can build your own or people have done interesting things that can operate on it that kind of do other things but for the most part it's it's primarily been designed and tuned
for getting data out of one spot and putting it into another and it's got built-in stuff probably
doing message transform so if you want to um maybe like elevate or change the format of like a json
message or something and like extract this little bit out and make this top level uh lowercase these items uh things like that
It's also an appropriate space for, like, configurations for that import and export.
So if you want to say, like, you know, use this replica, here's the database credentials,
it's got a nice mechanism for hiding that stuff from the config, so you can kind of say, like, here
is the file where the password is or whatever, so it's not exposed somewhere. It's got a whole API around it, so you can have REST calls, for
example, to, like, pause it or start it or restart it or re-snapshot, for example, if you're doing, like,
a database backup or something with it. All right, so real quick, one of the use cases
for this and the reason why it's so important is, like he said, it allows you to read from sources and write to sinks.
So he mentioned the use case earlier of like fraud detection with credit cards,
right? If you have, excuse me,
if you have a streaming application and you want a credit card to be joined to a
user, well,
in order to do that in a Kafka streams world,
or even a Flink or something like that, you're going to need that data readily available.
And typically the way you're going to do that is you're going to sync those users to Kafka
so that you have them right there because it's super fast. Well, to do that, you might have
one of these connectors set up to read from your Oracle database or your Postgres or your SQL server or whatever.
And anytime something happens in that database, like on that user's table, it'll automatically
be synced over to Kafka, right? So that's using the source type thing and writing to it so that
then you could use that information in your Kafka streams application or your streaming app, right? So it's sort of a way, probably more or less for people to be able to share data
from various data sources within their organization
without giving access for everybody
to their database server, right?
So they can move it into Kafka and hey,
the data in Kafka you're allowed to use.
Right.
But you're not going to connect to my live transactional order database.
Right.
Like you're not allowed to do that.
So that's why these connectors are sort of important.
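Connect itself is driven by connector config plus a REST API. Here's a hedged sketch of registering a made-up Debezium Postgres source connector over that API with Python's requests. The connector name, database details, and config keys are placeholders, and the exact keys depend on the connector and version, so treat this as the shape of the thing rather than a working config.

```python
import requests

CONNECT_URL = "http://localhost:8083"  # default Kafka Connect REST port, assumed

connector = {
    "name": "users-source",  # hypothetical connector name
    "config": {
        # Debezium Postgres source -- class and keys vary by connector version.
        "connector.class": "io.debezium.connector.postgresql.PostgresConnector",
        "database.hostname": "postgres.internal",
        "database.dbname": "appdb",
        "table.include.list": "public.users",
    },
}

# Register the connector; Connect starts tailing the replication log and
# writing change events into Kafka topics for the listed tables.
resp = requests.post(f"{CONNECT_URL}/connectors", json=connector, timeout=10)
resp.raise_for_status()

# The same REST API is how you pause, resume, or restart it later, e.g.:
# requests.put(f"{CONNECT_URL}/connectors/users-source/pause", timeout=10)
```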
And the mixed feelings, I definitely want you to dive in on that a little bit, because I have some opinions on that that I'm curious if they line up with yours.
Go ahead, Outlaw.
You were about to say something. Well, I was going to say, like, it kind of relates to some of the engineering blog stuff that we've talked about from Uber, where, like, they have one massive data lake.
And then, you know, depending on the needs of a given team or application, they'll have, like, subsets of that data that gets, you know, copied over to their specific use case.
And they can put it in whatever format they want.
So you can kind of envision where a Kafka Connect,
a Kafka plus Kafka Connect type of scenario can help you achieve that type of goal.
Yes.
Sharing data in a well-thought-out way.
Yeah, totally.
So what don't you like there, Jay-Z? Yeah, this really isn't a fault
of Connect, you know. I think it has to do more with where it sits in the mix. So this is a
complicated piece of technology that's hooked up to Kafka, which is, out of the gate,
complicated. And then if you're hooking up, say, two different databases, one to pull data from, one to
put it in, that's complicated. You've got a bunch of configs there, and things go wrong sometimes. So, like, one example that I liked is, like, if you ran into, like,
a message like a row in a database that was too big and exceeded like the the payload maximum
whatever like somebody has some giant row in a database and the connector crashes it doesn't
restart by default there's no way to tell it to auto restart like it doesn't know what to do so
it just shuts down so then like say you wake up in the morning and realize, Oh no,
my data is not flowing. Something happened. Um, you have to figure out first of all,
what does that mean for your application and your use cases? If you know, you didn't get data in for
the last couple of hours or whatever. So, you know, there might be some cleanup or some work
you have to do there or just other things you have to fix. Then you have to decide, you know,
well, am I able to just resume this? Do I need to increase this payload? Do I need to reissue the cursor? So one problem
we had with this, we've seen before is like some databases will have like a cursor to kind of tail
the log like a replica, which is super efficient. But if that cursor is stale or inactive for too
long, then it can lose its place, because the replication log, the write-ahead log or whatever, depending on the database that you're using, can get passed and can lose that spot out of the buffer.
And so then when you restart the connector, it can't pick up from where it was because that information is lost.
So you have to re-snapshot, which means changing these configurations, which are declarative. Which I love. Declarative's great, we've talked about it lots of times. But sometimes you want to do procedural-type things,
like okay well let me run this configuration now in snapshot mode and when it's done we'll
put it back over to tail mode and then we'll restart it and stop it and these are procedural
things that aren't captured really well in that declarative, configuration-based approach. So that's the pain. And
then we've had some issues just with, like, probably misusing it, and having clusters
that get too big. We've had weird problems with, like... sometimes, we call it zombies, the
connector will die or get started up as part of, like, an upgrade or something, and the old one's
still going. And so now maybe data is getting kind of confused, or, you know, bad things are happening. And
then you have to really think about what that means for your application. And it's really
tough to figure out all the consequences of data not syncing, which is really more of a CDC problem
than Kafka Connect, which is why I say it's not really Kafka Connect stuff that frustrates me,
although it gets the blame.
It's just that it sits in this really critical and finicky part of your architecture.
I think so.
That's beautiful.
And the reason why I wanted him to chime in is because he's probably dealt with it more than any of us and probably most people on our teams that we work with.
Debezium.
So it's funny, like... and that one specifically came to mind because
Debezium is one of the open source connectors that allows you to basically get things from,
like, Mongo, right, and, using change data capture, get everything into Kafka. And the reason why I
wanted to bring it up is he didn't even hit on this. Like he hit on like real issues just then with Kafka Connect, like the things getting out of sync or the right ahead log getting past the cursor or whatever.
But there were things that for sure we just misconfigured initially, right?
And that misconfiguration will absolutely destroy you.
And you have no idea until you realize there's a problem, right?
And then you go in and you're like, oh, I didn't set this one bit or I didn't change this one flag or whatever.
And again, the way that it's sold is it's the simple solution for moving data.
But the reality is, and everybody should know this from dealing with different
database systems or whatever, these things are complicated, right? I mean, super complicated.
There's no such thing as simple when it comes to state. I don't care what level of the application
you're talking about. For sure. And I think that's one of the more important things that
Joe just pointed out there is this, when you use it, is a critical
piece of what you're doing, right? Moving data from one system to another. If you're going to
do it, you have a good reason for it. So you need it to work. And when it doesn't, it causes so much
pain that you just hate dealing with it. But the flip side is if you were to write your own data
poller and synchronizer yourself,
you're going to run into more problems, right?
So it's impossible.
I mean, I remember we ran into something at one point
where we were syncing data from Mongo
and we were pulling from a replica,
not realizing that it wasn't going to get all the data right
Like, we were told, hey, if you use the replica, that's it. But it was only getting the data that
that particular replica cared about. So there are so many pitfalls and so many things that
you can do wrong that aren't necessarily the fault of Connect, but it's going to look like it when it
doesn't work. So, yeah, fixing it is really tough. Like, one example, too... I think we had some sort of problem like I mentioned, so we restarted the
snapshot, which was even janky. I think it was Postgres. So the way you do it in Postgres, you
designate the snapshot table and you insert a record into it, and then it manages the state. And
it was a whole big debacle. And then once we got it figured out, it did exactly what it said it was
going to do when it re-snapshotted the data and put it into the same topic.
And now that topic had the data from the old replica and the new.
And the producers got kind of confused.
And there's a compaction process that will kind of clean things up depending on how you have it configured.
If you have it configured that way, we didn't.
And so there were, like, now these duplicate messages being replayed.
And like some applications are affected by it because they use the database that we're syncing to.
And some applications weren't affected because they were using this original source and things got out of sync.
And it just it was, you know, a fun day to figure that out.
And it's just it's just tough.
And these kind of things are it's like a distributed level so it's hard to have like good abstractions like built around programmer programming kind of interface type things because it's your logic is really distributed
amongst like eight different config files there and you're not even talking about you know code
here you're just talking about these different systems yeah totally hey one thing that i want
to bring up here, and this is part of why Kafka can be painful, like Connect, all of it.
There's very little UI. There's
very little way to see the state of your system, right?
So at the top of this, he mentioned the admin API.
So Jay-Z is very familiar with this because he actually wrote an application
to leverage that admin API so that he could check on the status of brokers, check on the status of topics, check on the status of connectors, all that kind of stuff.
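In lieu of a UI, the admin API can at least get you a quick status dump from a script. Here's a small sketch with the confluent-kafka admin client that lists brokers and the partition counts per topic, the kind of thing that tooling does in a much nicer way; the broker address is an assumed placeholder.

```python
from confluent_kafka.admin import AdminClient

admin = AdminClient({"bootstrap.servers": "localhost:9092"})  # assumed dev broker

# list_topics returns cluster metadata: brokers, topics, and their partitions.
metadata = admin.list_topics(timeout=10)

print("Brokers:")
for broker in metadata.brokers.values():
    print(f"  {broker.id} at {broker.host}:{broker.port}")

print("Topics:")
for name, topic in metadata.topics.items():
    print(f"  {name}: {len(topic.partitions)} partitions")
```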
Because by default, they kind of give you some crappy command-line shell scripts to read from a topic or to check the status of something. But for the most part, there's not a
lot of great visualization or tools from the Kafka project itself to be able to sort of see the state
of your system. And that's one of the painful things with Kafka Connect as well. Yeah, I was
definitely going to add the, like... from, you know, the Kafka project there aren't, but there
are some out there. But, yeah, they're third party. Yeah.
And they usually want to charge it. And have you guys found this?
This actually drives me a little bit crazy about the Kafka ecosystem is the
third party ones, the nice ones, it's contact for a quote. Like there's no,
there's no, Hey, here's the pricing, right?
It's not a hundred dollars a month or or
what or a hundred dollars a broker or something that's contact us for a quote it's like yeah no
i don't that's like the whole if you gotta ask you you don't have enough like i don't like that
yeah if you gotta ask the price you can't afford it yeah exactly i don't like that
And a lot of the reasoning they give for not having more of those interfaces is that they're saying, well, you shouldn't be logging in and looking at thousands of topics manually.
You should have metrics and alerts and all this stuff, which is true.
Fine.
But if I'm working on my laptop, I don't want to be running Prometheus and Grafana just to go see if I have data in my topic.
I don't want to go shell in, get the list of topics, grep for the one I think it's named.
There it is.
Okay, pull info.
Okay, now I see it.
Let me check this partition.
You know, it's like all this, you know,
shelling around and running commands and stuff.
It's just not really tuned for that kind of developer experience.
So that was frustrating.
And that's why we keep bringing up the UI.
But once it's out in production, that's where it kind of shines.
Yeah, for sure.
I mean, we've mentioned it.
Even with the negatives we've said here, we've had this stuff running for several years and mostly pretty seamless, right?
There have been issues like what you talked about with the Postgres re-snapshot and all that.
But for the most part, it just sort of works. It's one of those things that, as long as you configured it to have the number of partitions and all that kind of stuff that you care about, you kind of just forget about it, which is, yeah, awesome. I can't say that about many systems.
No. And we've got a link here for the various connectors that are available, and what I really like about this is that a lot of times the connectors are maintained by the product itself. So, for example, Mongo makes their own connectors and they provide connectors for Kafka, same with a bunch of different database vendors. So you're getting the people that really know the most about how to tail the replication, or how to interact with their systems in the most efficient way, which is something you probably wouldn't have if you were just trying to roll something by hand, trying to use a watermark to query the top 100 records greater than whatever and then save your watermark. All that stuff just goes away, and I'm always really impressed when you can just kind of act as a replica and get data streamed in, and it just works. When it does work, it's beautiful, and when it doesn't work, you probably messed something up, but you're going to have some fun fixing it.
Yeah, it's not beautiful when it doesn't work.
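For context on the hand-rolled watermark polling Joe mentions, a minimal sketch might look like this; the table, columns, and batch size are made up, and it conveniently ignores most of the edge cases a real source connector handles:

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.sql.Timestamp;
import java.time.Instant;

// Query everything newer than the last watermark, ship it downstream, save the new watermark.
public class WatermarkPoller {
    private Instant watermark = Instant.EPOCH; // in real life this would be persisted somewhere durable

    public void pollOnce(Connection conn) throws SQLException {
        String sql = "SELECT id, payload, updated_at FROM orders "
                   + "WHERE updated_at > ? ORDER BY updated_at LIMIT 100";
        try (PreparedStatement ps = conn.prepareStatement(sql)) {
            ps.setTimestamp(1, Timestamp.from(watermark));
            try (ResultSet rs = ps.executeQuery()) {
                while (rs.next()) {
                    // ship the row downstream (e.g., produce it to a topic) ...
                    Instant updated = rs.getTimestamp("updated_at").toInstant();
                    if (updated.isAfter(watermark)) {
                        watermark = updated; // advance the watermark as we go
                    }
                }
            }
        }
        // Pitfalls a connector already handles for you: rows sharing the same timestamp as the
        // watermark, deletes that never show up, clock skew, restarts, and so on.
    }
}
```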
Yeah, and it's important to point out that link is an affiliate link,
so just hit us up for pricing.
And when you use that link, you'll get $5 off your next Kafka cluster.
That's amazing.
It is pretty cool, though. They have literally dozens and dozens of these connectors, sources and sinks. So if you would like to be heralded at the top of the show like Anjing Jellies and Nick were at the start of this one, you too should leave a review. If you haven't already, you can find some helpful links at codingblocks.net/review, and there are some easy buttons there where you can leave your one star, and, you know, whatever star count you want, but you gotta start at one. You gotta start someplace, Alan. I see your faces.
Yeah.
So with that, we head into my favorite portion of the show.
It's time for Mental Blogs!
Blogs!
Blogs!
Blogs! Okay.
All right.
So, according to TechHouse trademarked rules of engagement, Joe, you are up first. I think I'm on a streak, too, right? Yeah, you are. Let's see if you can continue it. Nope. All right, I have faith in you. It depends on whether you get 18th-century literature or not. If he gets that, he wins. Anything I learned in eighth grade would be covered. I already see the category I would pick.
So, you know, I think you have potential here.
That's what I'm saying.
To keep this streak going.
All right.
The search for foreign lands.
Each response will be a country whose name is actually hidden in a word in the clue.
All right.
For the love of Pete or TV dramas in a nutshell or adventurous women or rain.
And lastly,
cringeworthy office lingo.
I mean,
definitely that one.
What was it?
Do we know what love Pete was?
I assume that was just Pete in the name or something.
Uh,
there is no,
uh,
additional description of that category.
It just,
if it's about the adventures of Pete and Pete,
I got that one.
It's just for the...
I'm going to say not, but...
All right.
Let's go with cringy lingo because that's my specialty.
It's one of my superpowers.
Okay.
And a five, of course.
Okay.
It's a three-word phrase meaning "at one's own expense" and a needlessly wordy way of saying unavailable.
At one's own expense, or a wordy way of saying you're unavailable...
Out of pocket? That is correct.
I was wondering if you were going to pull that one off. I got scared there for a minute. That's weird. For the Love of Pete, just to clue you in on what that one was, since you questioned it, I'll give you the one-pointer: his first kiss with future ex Kim Kardashian was in a 2021 SNL sketch, and he was dressed as Aladdin. I don't know his name. I assume it's, yeah, the Pete who married Kim Kardashian. All right. Well, you're not wrong.
It is Who is Pete Davidson?
And my favorite one, I was surprised you didn't pick this one.
I just have to read this one pointer.
TV dramas in a nutshell.
The survivors of Oceanic Flight 815 band together to battle mysterious forces on a tropical island. Lost. There you go. See, you should have picked that one. Well, here's the Pete trivia that it should have been: this is the name of Pete's tattoo on The Adventures of Pete and Pete. Is it Pete? No, the tattoo. I don't even know. Yeah, but I'll give you a hint: it was a woman that he could make dance by flexing.
It was like a nine-year-old.
But the answer is supposed to be a Pete.
All right.
Well, the answer is Petunia.
But that's not Pete.
It's kind of.
It's kind of.
All right, fine.
All right, Alan, here you go.
You ready?
This first category has your name written all over it.
You ready?
Competitive cheerleading.
You see a good cheerleader.
Yeah, I thought so.
You're always so optimistic.
Yeah.
Science.
And the cartwheeling.
Yeah.
You ever seen him cartwheel?
Professional.
That's right.
Science museums.
Six degrees of actual bacon. Sad songs.
Mother goose police blotter.
And you'll give us the nursery rhyme in question.
And this
one, I'm assuming I'm supposed to read these letters out, so this is going to be: N-I-A-L Ain't a River in Egypt. The letters N-I-A-L will appear in each correct response. And let's do the nursery rhyme blotter thing for five. Police blotter, okay.
Officers responded to anonymous reports that a local man was sequestering his wife inside a large gourd.
Oh, man.
Oh, man. Oh, man. What is the name of that? Something pumpkin. I cannot think of the name of it. I don't know, man. All right, Joe, for the steal. I don't know. Oh, man, what a missed opportunity. Peter Peter Pumpkin Eater. I knew it was a pumpkin, I just couldn't remember the name of it. So close. Had a wife and couldn't keep her.
Wait, did he put her in a gourd?
He put her in a pumpkin cell.
What?
And then he kept her very well.
Yeah, I remembered it.
I could not get the name of it.
Somehow I missed that part of that.
You did better than me.
I don't know that I could have gotten that one.
I might have gone with sad songs instead.
I thought about that,
but that was sort of depressing and I'm an optimist this morning.
Yeah.
Well,
the one pointer for that,
this weepy Sinead O'Connor ballad might not be about lost love.
Prince is rumored to have written it about his housekeeper.
Nothing Compares 2 U. Yeah. And if you haven't heard the Chris Cornell live version of that at SiriusXM, I suggest you go out to the internet and find it. Actually, I'll put a link to that in the show notes. It's so good. Cool, I love that rendition. All right, Jay-Z, your categories are U.S. "World Capitals."
And world capitals is in quotes.
U.S. World Capitals.
Okay, so like Paris, Texas.
Okay.
There we go, yeah.
Veterinary Medicine.
The Wardrobe Department.
Musical Instrument Makers.
That's half the battle.
In this category, you'll need to name a historic battle that we are going to show.
We can't do this one.
This is a visual.
We're going to show you every other letter of it, if that makes sense.
Yeah, no way.
I guess I would have to show you the letters I get, or say the letters.
And then the last,
the last one,
which is probably the best one.
Um,
Roget's butt.
How do you spell that?
R O G E T.
Why?
You assume I mispronounced it?
No.
Uh,
what's the etymology?
I just watched the thing.
I just watched the spelling bee the other day. And it was just,'s just you can't ask how to spell it then they'd be like how
do you spell snake and you'd be like what's the origin can you use it in a sentence right yeah
snake just spell it i could have told you that i could have given you the half the battle one
it's literally i i misunderstood what they were trying to say before now. I would literally tell you every other letter of the battle.
Yeah, there's no way.
I'm terrible with that stuff.
I'm going to go with music instruments, although I'm afraid it's not going to be guitars the whole time.
But maybe I'll get lucky.
I'll go with five.
Okay.
A musician.
Now, keep in mind, I am well known for my ability to read proper nouns.
So this one's on you.
All right.
A musician himself, Nodu Mulek, also made some of these instruments for Ravi Shankar.
And this one was a visual.
Oh, and I can't see what the visual was.
That might not be the best thing.
I'm going to give you a different one since that one was a visual.
You can't because you can't see it.
And neither can I.
Can I guess it anyway?
Sure.
And I just won't give you credit if it's right.
Well, that's up to you.
Yeah, I won't give credit. Is it's right well that's up to you yeah i won't get
credit uh is it a sitar it is dang it i should have let you have it i'm sorry yeah that's all
right all right um yeah all right do you want to pick a different one uh are they all pictures
for that category i will tell you the three-pointer is the okay the rest aren't uh let's go for uh
ah geez let's go with um i'm gonna hate myself for this but uh
what am i doing musical instruments for four okay we got this
again this is probably not gonna be paris texas which is like that and like there's rome and athens i know about okay uh again on you for the proper noun
this is your fault yep the nippon gaki company was the corporate ancestor of this large instrument maker and motorcycle manufacturer?
Yamaha.
That is correct.
That's so ridiculous, man.
I was sitting here on
Stradivarius waiting to bust it out.
I can't remember the piano when I was trying to remember
Stravinsky something. Anyway, doesn't matter.
Well, I'll go ahead and give you a hint.
Stratocaster was not part of
any of it
It should have been. Yeah. And for the record, neither was Paris, Texas. Oh, wow. Okay. Yeah, but I will give you the one-pointer for the U.S. "World Capitals": this Lone Star State capital moonlights as the live music capital of the world. So now you can kind of get a flavor of where they were going with it.
So Texas?
Yeah, but you got to name the city.
Austin.
Yes, that's correct.
That's Austin is a world capital?
It's the world capital of music.
Live music world capital.
Okay, I see what you're saying now.
I see.
It wasn't a transplanted city capital.
What about Nashville, though?
New Orleans?
I mean, come on.
I take some issue with that.
Yeah.
I mean, it's not on here, but I could see how Nashville's Broadway Street would definitely be a strong contender. If you've never been there, any bar you go to has like three or four levels of different live music. All right, so I have a question. I get one more question, right? That is your question. And, yes. I get five points? Cool. All right, well, actually, that's the Final Jeopardy, and you have nothing to bet. Well, see, that's what I was going to say, because we talked about this before: when one person gets two and the other person gets one, there's no chance of ever getting anywhere, so we're always supposed to get two each. Okay, fine. Yeah, so there we go. So then do we want to do it where the person who gets that extra round can choose from any of the previous topics given? Yeah, we can do that too. That's fun. Yeah. Well, remember what, any of them, or I should get to pick the topic? Yeah, you really should. That's kind of evil. All right, you guys decide how you want me to do it, and then I'll read off where the topics are. Give me the Mother Goose thing for four. I'm gonna get this one. Okay. Police received multiple reports at 10 PM of a man running through town and
tapping on windows in his nightgown.
Really?
Peter Piper.
Nope.
Joe, for the steal.
Uh,
I mean, don't I get the, um, Jack, Jack, candlestick, whatever?
That's it.
No, I don't know.
Wee Willie Winkie.
I don't even know that one.
All right.
Wee Willie Winkie.
Sorry.
I hadn't heard of that one either.
Wee Willie Winkie.
All right.
So I just got toasted.
Yeah, pretty much.
All right.
Let me go ahead and make note of these zeros.
Well, here you go. How do you spell it all, so we can both lose? Okay. So, yeah, really this only matters for Jay-Z, but the final round is Artists, and you already gave me your point value, Joe.
I'm assuming I think I already know what it is.
Yeah.
Exhumed in 2017 to settle a paternity suit, his mustache had preserved its classic 10 past 10 position,
according to the Spanish press.
What was the first word? Paternity something. according to the Spanish press.
What was the first word?
Paternity something.
Exhumed in 2017.
Wow.
To settle a paternity suit.
His mustache had preserved its classic 10 past 10 position according to the Spanish press.
Okay.
This was a message, I think.
But thank you.
It's not Picasso, is it?
He's not Spanish.
I don't know.
No, that's what I'm trying to think.
Yeah, I don't know.
Dun, dun, dun, dun, dun. Yeah, I don't know. Oh, Joe gave me an answer.
Picasso? That is wrong. Dang. Leading the witness. Yeah.
I'm going to say that Alan sabotaged you on that one on purpose.
The answer was Salvador Dali.
Oh, my gosh.
Wow.
Oh, my gosh.
I'm so terrible.
Okay.
Yep.
Yep.
All righty.
So, on to Kafka.
Let's talk about use cases. Yeah, so what have we got, number one there? Message queues, usually talking about replacing something like ActiveMQ or RabbitMQ. So this goes back to how we keep referring to this thing as a queue, right? It's basically, what are the competitors you're going to think about if you were even considering a message queue kind of system? And RabbitMQ is probably the bigger of those two, between ActiveMQ and RabbitMQ. I guess it depends on whether you're in a cloud world or maybe doing more on-premise type stuff; ActiveMQ seems to be used more in the cloud, I think, but either way. Okay, yeah. So the message brokers are often used for responsive types of processing and decoupling systems.
I think we kind of already covered that.
But Kafka is usually a great alternative that scales, generally has faster throughput, and offers more functionality for the reasons we've talked about already.
The available APIs that are there, the ability to write streams, applications, whether you're using the Kafka APIs or you're using
something else like Beam
or Flink, but then
the ability to scale those brokers out
and scale the reads and writes out
across the size of your cluster,
it just scales
well.
Yep. I concur.
I forgot where we were. Website activity. Previous episode's show notes. Yeah, yeah, this is a great one. So, like Alan said, website activity tracking. I'm sure everyone's seen Google Analytics by now, or Microsoft's got one, and we've talked about OpenTelemetry several times. But this is a really good way of tracking activity, which would be like someone scrolls down.
They click.
They go to this page.
They go to that page.
They check out.
They click this button.
That kind of stuff is really nice for Kafka because it's just going to be the stream of events that flows into Kafka and gets saved.
And then you can look at it later for figuring out what happened or analytics type purposes, planning sales, yada, yada.
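A minimal sketch of that activity-tracking idea, just to make it concrete; the topic name and event shape are invented for illustration:

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

// Every click/scroll/page view becomes an event appended to a topic; analytics jobs read it later.
public class ActivityTracker {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // placeholder broker
        props.put("key.serializer", StringSerializer.class.getName());
        props.put("value.serializer", StringSerializer.class.getName());

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // Key by session so one user's events land on the same partition, in order
            String sessionId = "session-1234";
            String event = "{\"type\":\"page_view\",\"page\":\"/checkout\",\"ts\":1717900000}";
            producer.send(new ProducerRecord<>("site-activity", sessionId, event));
        }
    }
}
```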
Am I remembering it wrong where like LinkedIn,
Kafka was originally created by people at LinkedIn for the purposes of being able to
show you that little bitty link on your LinkedIn page of like who's viewed you,
who has looked at your profile?
Am I remembering it correctly?
No, I want to say that was Pinot, I think. Maybe I'm wrong. No, because Kafka way predates Pinot. Yeah, I don't know. I mean, it was created at LinkedIn. Yeah, it was created at LinkedIn for sure. All right, I'm gonna look that up, because I thought it was. Yeah, there's a nice article I just found on it. I'm still kind of trying to get to the key part: they were trying to get to real-time processing. So while they're looking that up, I don't know on that one.
Another one is metrics, for aggregating statistics from distributed applications,
right? So that's using like the real-time streaming and aggregation windowing type thing.
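To make the windowed-aggregation idea concrete, here's a small Kafka Streams sketch that rolls raw measurements up into per-minute counts; topic names are made up, and the exact windowing call shown (TimeWindows.ofSizeWithNoGrace) is the newer API, with older clients using TimeWindows.of:

```java
import java.time.Duration;
import java.util.Properties;
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.KStream;
import org.apache.kafka.streams.kstream.TimeWindows;

// Aggregate raw stats from many services into one-minute counts per key.
public class MetricsRollup {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "metrics-rollup");
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
        props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());

        StreamsBuilder builder = new StreamsBuilder();
        KStream<String, String> raw = builder.stream("raw-metrics"); // key = service name, value = one measurement
        raw.groupByKey()
           .windowedBy(TimeWindows.ofSizeWithNoGrace(Duration.ofMinutes(1)))
           .count()                                                  // events per service per minute
           .toStream((windowedKey, count) ->
                   windowedKey.key() + "@" + windowedKey.window().startTime()) // flatten the windowed key
           .mapValues(count -> Long.toString(count))
           .to("metrics-per-minute");

        new KafkaStreams(builder.build(), props).start();
    }
}
```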
That's one. Okay. So the one-liner, and this is on a LinkedIn page. Well, I guess, yeah.
Kafka's origin story at LinkedIn, the problem they originally set out to solve was low latency ingestion of large amounts of event data from the LinkedIn website and infrastructure into a Lambda architecture that harnessed Hadoop and real-time event processing systems.
So the key was the real-time processing.
Yep.
Another one of the use cases is log aggregation.
So rather than using logs that were written to HDFS
or a file system or cloud storage or something like that,
writing it to Kafka,
and the primary reason for it,
and this makes a ton of sense,
is you abstract away the file system completely. So if you're using something like cloud storage,
then you're going to have to use an S3 protocol or a GCS protocol or whatever, Azure blob storage,
or if you're writing to a file system, then you're doing network file shares. If you're writing to Kafka, you just produce and consume using their protocol, and that's it. It just gets rid of the files completely, so that makes a ton of sense. Yep. And also just IoT devices: if you've got a bunch of temperature sensors or something, that's a great way to kind of get that together. I guess that's more like metrics and logs, so we already covered that. My bad. Well, you can take the next one. All right, stream processing: taking events and further enriching those, which is like the fraud detection example we gave, and we talked about Kafka Streams and Flink and Beam, and there are other solutions like Spark that are common for that, I mean, kind of Spark with an asterisk. I guess even event sourcing is something we've talked about a few times, which is basically storing the state changes.
So like a ledger, like a bank account or something would be an example here.
It's used commonly where it's like we add $500, now you spend $600, you add $100.
And at some point you can say, okay, what's the balance?
And you take a look at that snapshot in time and say, this is the balance at the time.
But you can imagine how keeping track of those transactions is really important.
But I kind of take issue with that example, though.
Have we used a bank example for Kafka?
Because that seems like not a great use,
especially depending on what your compaction and retention settings are for that,
you could potentially lose history of those transactions.
And for financial purposes, you don't want to lose that.
Yeah, and I think that's a really good point.
You wouldn't want to keep the data there permanently,
but if you just wanted to stand this in front of a more traditional database,
then I think it makes sense to just kind of have every credit card swipe go into Kafka at some point.
And then later you can have something kind of slowly pull that stuff out and arrange it and save it.
I guess for the event sourcing,
this is why I keep going back to that Uber example though,
because you don't care about where the driver was an hour ago.
You care about where the driver is now. So the,
so the event is the driver's current location and periodically he's saying,
I'm here now, I'm here now, I'm here now. Right? And as the user of the app, you're able to track that driver as he moves around nearby you, but you don't care to see the history of where the driver was. Well, no, but for event sourcing, so it's a different use case. While maybe you wouldn't use Kafka for your event sourcing for transactions from a bank, it's actually a good example, though, right? Like, you started with $100. You added $20, took out $30, added $50, took out $40, whatever. Event sourcing is just the whole notion of replaying events from the beginning to get to the state that you want right now, and so it can be used for that. Uber is probably taking your coordinates from GPS at the time, so it's like a snapshot of where you are, but an event sourcing example would be like, you drove 500 feet straight, took a left, and drove 20 feet straight. Yeah, it's replaying events, that's all it is, and so it's perfect for that type of thing.
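A tiny sketch of the replay idea from the ledger example, no Kafka required to see the point; the numbers match the $500/$600/$100 exchange above:

```java
import java.util.List;

// Event sourcing in miniature: the log stores the changes (credits and debits),
// and the current state is just a replay of those events from the beginning.
public class LedgerReplay {
    // one event per record; positive = deposit, negative = withdrawal
    record Transaction(String account, long cents) {}

    static long balance(List<Transaction> events) {
        long balance = 0;
        for (Transaction t : events) {
            balance += t.cents(); // replaying events rebuilds the state
        }
        return balance;
    }

    public static void main(String[] args) {
        List<Transaction> events = List.of(
            new Transaction("acct-1", 50_000),   // add $500
            new Transaction("acct-1", -60_000),  // spend $600
            new Transaction("acct-1", 10_000));  // add $100
        System.out.println("Balance in cents: " + balance(events)); // prints 0
        // Michael's caveat applies: if compaction or retention drops old events,
        // you can no longer replay your way back to the correct state.
    }
}
```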
Yep. And the final one here is a commit log, which we talked about, you know, replication, CDC, change data capture: basically using Kafka as an external commit log, or write-ahead log, or transaction log, whatever you want to call it. Basically keeping track of all the events that led up to the current state, and you can use that to sync databases or something. So you can imagine a database
that if you were building a database today,
you could consider kind of outsourcing your log
and bring in Kafka as a dependency
in order to kind of keep Kafka there
as that backbone for replicas.
Probably not a great idea.
You're going to lose some efficiency there
because all the other things it could do
and configurations you have to make,
but it wouldn't be a bad idea to get you going and kind of outsource part of it. Well, let's expand on that idea, though, because I had never considered that. But think about this type of thing, let's just brainstorm this live and see where it goes. The advantage of having the relational database is it would allow you to do your reads, to write custom queries to read just certain parts of the data and everything.
You could do maybe some aggregations
over portions of that data as you're redoing the read.
There are legit use cases that relational databases have, namely that you could relate this data to that data
type of thing, right? So if you wanted to use Kafka as the, in front of that, because the
downside to relational database is you can't parallelize the writes, but you can parallelize the reads,
right? In your traditional databases. But if you were to put Kafka in front of that to where like
anything that wanted to read from your database could read directly from the database, but to
write to it, it would instead have to go through this process. Like it would, it would write to
Kafka and then there'd be some kind of a streams app on the Kafka side. Oh,
I ran out of space on my zoom or on my thing. So we're using the backup recording. We're going to
see how this works. And this is going to be nasty because the recording started midway through.
It sure did. Good thing I did it. This is going to be a cluster. And not the good kind of Kafka cluster.
So talking through this, you write to the Kafka queue.
Some kind of streams app would then be able to write in like batch or whatever to your database,
whichever one is in charge of the writes at that time, right? I mean, that kind of seems like a pretty cool thing if you really needed to harness the power of that relational database. That's where I think the key would come in.
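Sketching the brainstorm as code, under the assumption that a plain consumer (rather than a full Streams app) owns the database writes; the topic, table, and connection details are made up:

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.time.Duration;
import java.util.List;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

// Producers write to a topic; one consumer owns the relational writes and batches them
// into the database, while reads hit the database directly.
public class WriteBehindConsumer {
    static void run(KafkaConsumer<String, String> consumer, Connection db) throws Exception {
        db.setAutoCommit(false);                 // we commit the JDBC batch explicitly
        consumer.subscribe(List.of("pending-writes"));
        while (true) {
            ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(1));
            if (records.isEmpty()) continue;
            try (PreparedStatement ps = db.prepareStatement(
                    "INSERT INTO orders (id, payload) VALUES (?, ?)")) {
                for (ConsumerRecord<String, String> r : records) {
                    ps.setString(1, r.key());
                    ps.setString(2, r.value());
                    ps.addBatch();               // one batched write instead of many small ones
                }
                ps.executeBatch();
                db.commit();                     // commit the DB batch first...
                consumer.commitSync();           // ...then the Kafka offsets (at-least-once)
            }
        }
    }
}
```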
Yeah, remember we talked a little bit about this when we were doing the Designing Data-Intensive Applications book. One of the things we talked about was, I think it was the LSM-tree databases.
I forget what it stands for now, logs, something, merge.
But basically, it would take in data and keep it kind of in a buffer.
And then as that buffer filled, it would push it down into more persistent storage.
And so Elastic kind of works that way, where it takes in data and then over time it routes it to the appropriate place. But what it does is it lets you say, okay, I got the data, faster, to the producer, and it's like, I've got it from here, I'll get it to the right place. And when that read comes in, it goes and gets routed to whatever node the data should come from. Hey, so just to catch everybody up, it didn't run out of space yet.
I misread the message out of the corner of my eye.
It is that it has limited space available.
So my recording is still good, fortunately.
But I have like 14 minutes left.
So I'm just saying like maybe we are at a good stopping point.
Yeah, yeah. Maybe we are at a good stopping point. Yeah.
Yeah.
And we just like move on.
Yeah.
Let's do the tip of the week.
Yeah.
Let's do the tip of the week.
All right.
So I got 12 minutes for tip of the week.
I know what you're saying.
It's my portion.
It's my favorite portion of the show.
It's the tip of the week.
All right.
Moving on.
All right.
Well.
Speak fast.
I got a good one for you. Rémi Gallego, sorry, Rémi, about the name pronunciation, is a music producer that makes music under a variety of names. The Algorithm is probably the most well known of them; they also make music under their own name, and under the name, sorry again, Boucle Infinie. Rémi's French, and the Boucle stuff is actually my favorite. Almost all the music is instrumental, kind of either IDM-type music or synthwave, if you're familiar with those terms, but it often has a hard rock edge, and the person also makes a lot of video game music. And so the two albums I'm going to recommend are from this person, and it's just great coding music if you like either dancey-type music or kind of hard-rocky-type music, because it kind of straddles the line. The two soundtracks are for these games: The Last Spell, excellent game, and also Hell is Other Demons, which is also an excellent game. So we'll have a link there to YouTube that's got a combination of videos tagged with that person, including those two albums, and they're excellent. Very cool. All right. And then I got another one here.
We have talked about K9s, the Kubernetes-focused TUI (terminal user interface), several times. It can be used to look up information about things other than just Kubernetes resources. Like, we've talked about :helm before, and :events, or sorry, that's the new one. So with :helm you can look up Helm packages even though that's not a native Kubernetes construct, and there's also one for events, so :events. And events are pretty useful for figuring out why something didn't happen, like a pod was killed because it exceeded the memory boundary for too long, or a scale-down event happened on a node. And so that's something that, because it's not like a Kubernetes resource, really, you don't think too much of it, but it's something that you can view in K9s.
There are other resources.
You can hit the question mark and see all of them.
And it is dynamic.
So if you install like an operator,
for example, like Kafka,
then it'll install a Kafka Connect resource.
And then you can go :kafkaconnect, or whatever that name is, and see that in there.
So that's pretty cool.
And there's actually a couple other ones I think we've mentioned before
that are just kind of interesting, like Popeye, X-Ray, and monitoring.
You can do a colon and you can see all those with that question mark there,
which just show interesting information about your cluster.
And it's all just kind of built in for free.
That's great.
You should use it.
Excellent.
The events was a new one for me.
That's awesome.
All right.
So I had I had a few.
I'm going to narrow them down because I don't want a lot of recording.
So the first one.
So the first one is warp stream so i even found out about this because uh micro g
had sent this i think in our episode discussion in slack but all right here's here's the gist of it
and this goes to the backup thing that we were talking about earlier out earlier. WarpStream, the whole idea is you will not run a Kafka cluster. What WarpStream did
was take the Kafka protocol, so producers, consumers, all that kind of stuff, and then
they rebuilt it on top of object storage in the cloud. So the beauty is you don't have to worry
about partition sizes. You don't have to worry about disk drive sizes.
You don't have to worry about any of that stuff.
Everything gets written.
Let's say that you're an AWS.
Everything gets written to an S3 object.
And then your producer's writing just like it's writing to Kafka, but it's getting written to S3.
Your consumers are consuming.
They don't know it, but they're
consuming from S3. So anytime something new is written, they're getting that. The only real
downside, as far as I can tell with this is the latency is higher, right? So if you're writing
to Kafka, your latency is milliseconds at most, right? From the time that you write to the time
that you get that read notification in your consumer.
With something like S3 or Azure Blob Storage or GCP, GCS, it can be closer to a second.
So as long as you can take a second as opposed to milliseconds for that time in between the
write and being notified of it, that's really the primary downside.
Otherwise, you get all the benefit or as far as I could tell, I haven't, I haven't deep dived this,
but you get most of the benefits of Kafka with almost none of the downsides. So super duper
cool thing. I have an article link here. I actually want to do an episode on this later
because I would love to sort of deep dive and find out more, but super interesting. I wonder if it supports all the same APIs, like a Flink or Beam? I would imagine it does, because if you're doing Flink or Beam or any of those, you're just using Kafka as a source, and if they're adhering to the Kafka protocol, you should be good, right? That's the key. That's it.
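To illustrate the protocol-compatibility point: in principle the client code doesn't change, only the bootstrap address. The agent hostname below is purely hypothetical, not a documented WarpStream endpoint:

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

// Same producer API as against a regular Kafka broker; only the bootstrap address differs.
public class ProtocolCompatibleProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "warpstream-agent.internal:9092"); // hypothetical agent address
        props.put("key.serializer", StringSerializer.class.getName());
        props.put("value.serializer", StringSerializer.class.getName());

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // The trade-off is the higher end-to-end latency (object storage) described above.
            producer.send(new ProducerRecord<>("events", "key-1", "hello"));
        }
    }
}
```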
So, again, super cool.
And one of the other things this gets rid of,
so first off, it takes care of that backup problem we were talking about earlier because everything's written to, you know, super high available storage.
And the other thing that it takes care of is not dealing with that inter-region
or cross-region ingress-egress cost
because you're not doing that.
You're just using the cloud blob.
So when we were talking about backups or if you have multiple Kafka clusters,
you're having to write across regions,
and you're getting charged for that network communication
that's moving across boundaries.
You don't get hit with that.
But you still have to write into storage. You have to write to the storage, but you're not going across things, so your storage is highly available, and it depends on whether your bucket's multi-region; if you do multi-region, then it's already built in. They do say, I want to say on their site, that doing it this way can also be like an eighth of the cost of running your own Kafka clusters.
Again, if you can deal with the latency.
So I've already eaten up three minutes.
Next one, I actually had put this in my tips before Outlaw had even mentioned it.
There's the blog article also.
I thought it
was beautiful. MicroG had actually shared this also in the episode discussion, the Trillion
Message Kafka cluster for Cloudflare. I'll have a link to that. And then the only other thing I
wanted to share real quick was Jim Hummelsine, also in the episode discussion channel in our Slack
community, had mentioned that Kafka reminded him
of the mediator pattern in programming.
And it's exactly that, right?
The mediator pattern is, hey, talk to me
so that I can abstract you
from all the various implementations.
And that's exactly what Kafka is, right?
Like, hey, you talk to me
and then just everything that you want to put in here,
you can, but then you don't have to know
about the database server. You don't have to know about the database server.
You don't have to know about the web server or everything else, right?
So pretty cool.
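A toy sketch of the mediator idea Jim is drawing the analogy to, not Kafka code, just the pattern itself:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

// Producers only talk to the mediator, and the mediator fans messages out to whoever
// registered interest, so no producer needs to know about the database writer, the
// web server, or any other downstream system.
public class MessageMediator {
    private final List<Consumer<String>> subscribers = new ArrayList<>();

    public void subscribe(Consumer<String> subscriber) {
        subscribers.add(subscriber);
    }

    public void publish(String message) {
        // the only coupling is to the mediator itself
        subscribers.forEach(s -> s.accept(message));
    }

    public static void main(String[] args) {
        MessageMediator mediator = new MessageMediator();
        mediator.subscribe(msg -> System.out.println("db writer got: " + msg));
        mediator.subscribe(msg -> System.out.println("search indexer got: " + msg));
        mediator.publish("order-created");
    }
}
```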
That'd be every database thing then, right?
Because the whole point of the SQL language
was to abstract you away from the storage mechanisms.
You don't have to know how to read that actual thing.
SQL for sure, but if you want to talk to SQL Server, then you're using the ADO implementation for SQL Server, and if you want to talk to Postgres, you have your Postgres stuff, right? So Kafka, I guess, is sort of the whole idea that if you talk to me, then you can interact with all your other systems, as long as that data is synced in here. But anyways. All right, so your turn, Outlaw. I think you've got like five minutes.
Very good at math you are.
All right, so the number one tip that I wanted to give was I didn't know about this.
Someone on the team already knew about this, but we've talked about JQ in the past,
and it was a previous tip of the week, I believe in like episode 205 if I'm right. I think maybe.
And somebody can check my math on that, but I'm pretty sure. And I found this week yq, which is the equivalent of jq but for YAML specifically, though it does work with other file types too, CSVs and XML, for example. So I'll put a link in for that. It's built on top of jq, is my understanding, so it does require that jq be available, because behind the scenes it's using it. My understanding is it's converting from YAML into JSON and then letting jq do the heavy lifting, but it's not as complete as jq in some of those regards. But I'll have a link to that. And then I wanted to share some
links that MicroG gave. I didn't realize that he'd hit you up too, but anyway, you're getting a big shout out. So, I've used nmon in the past as the way to, if I log into a Unix environment and want to see what's going on, which I believe was a tool originally written by IBM for AIX, if I recall correctly. But it was basically a tool where you could visually see what's happening on the system in regards to CPU utilization, memory utilization, disk IO, or network IO. And he shared a similar one that's a little bit cooler in terms of the graphics that it uses, written in Rust, called Zenith.
So I'll have a link to that.
We've talked about Big O cheat sheets, and there was a new version of one that's kind of more simplified, a Big O cheat sheet that we'll include a link for. And then, on the topic of cheat sheets, there's yet another Git cheat sheet that we'll include a link for, which you might find handy. Otherwise, subscribe.
I think it stopped.
Yeah.
I don't know what he's saying.
I just wanted to mess with you.
I just wanted to mess with you.
Oh, man.
That's pretty good, right?
You got me.
Yeah.
Yeah.
All right.
Later.
Bye.