Coding Blocks - Site Reliability Engineering – (Still) Monitoring Distributed Systems
Episode Date: June 6, 2022. We finished. A chapter, that is, of the Site Reliability Engineering book, as Allen asks to make it weird, Joe has his own pronunciation, and Michael follows through on his promise....
Transcript
You're listening to Coding Blocks, episode 186.
Oh, you know what?
I should have come in with the same gusto as last time.
Dang it, I forgot.
I think you scared people.
Did I?
Did I?
I don't know.
Hey, you're listening to Coding Blocks.
Yeah, there you go.
Whoa.
So, subscribe to us on iTunes, Spotify, Stitcher, or wherever you like to find your podcasts by now.
Man, if we're not there by now... I mean, we're like doubly there on Stitcher, so surely you can find us, right? We are on Amazon now, too. I've got to figure out what's going on there. I don't relish that.
I didn't know that. Yeah, it's frustrating.
Well, some places you can find us twice. So, I mean, you know, that's how nice we are.
That's right. So, make sure, while you're looking around for us, you can check us out at codingblocks.net, where you can find all our show notes, examples, discussions, and more. You can send your feedback, questions, and rants to comments at codingblocks.net, and @CodingBlocks is how you can find us on Twitter. And if you head to codingblocks.net, the social links are at the top of the page.
I'm Joe Zack.
I feel like some packets got out of order there.
That was you guys, not me.
I don't know.
Weird.
Okay, well, I'm Michael Outlaw.
And I'm Alan Underwood.
This episode is sponsored by Retool.
Stop wrestling with UI libraries,
hacking together data sources,
and figuring out access controls, and instead start shipping apps that move your business forward.
And Shortcut: you shouldn't have to project manage your project management.
Okay, so we're going to pick up with the second half of monitoring distributed systems in this particular episode. So we'll be wrapping that up.
But first, as we like to do, we like to get to some quick podcast news.
And first up, we have Outlaw reading the reviews.
Okay.
Why is it?
It's always me.
You're always like picking on me to like read the names, but I'm going to try it.
We don't have many.
We don't have many right now, right?
We don't.
I'm going to try it. From iTunes, thank you very much.
Los Paz.
Right?
Sounds good, yeah.
It's either that or Los Paz.
It could be that.
That one, yeah.
Okay.
I don't know.
Los Paz.
Honestly, I gave it my best already
so anything else
after this
might just be insulting
like that
those were my two best guesses
as to like how it would be pronounced
a lost pass
all right
so
man
I don't know
that's the news
to share with you
so
sorry
Morale, who we talked about last episode, shared a great post on SRE and TOIL, and wrote another great post. We had a discussion in the Coding Blocks Slack talking about onboarding, mentoring, hiring junior programmers, which is kind of a controversial topic. A lot of companies don't hire juniors. They don't want to hire new people. He wrote a really great post about basically why you should make the case for hiring junior developers.
And it was really good.
He came up with a kind of constructive scenario, basically talked about like, I mean, you got
to read the article, but I will have a link in the show notes, but basically kind of comparing
like what it would mean for you to work a lot of extra time and how much productivity
you would get out of that compared to hiring a junior and spending your time kind of raising them
up and how over time, you know, basically the productivity gains you
get from hiring a junior and training them up was going to beat
any sort of, you know, extra hours you're putting in and which one's healthier
and saner, you know, which one's better strategy. So it makes a good
case for it, and it has some great tips for, like, onboarding, stuff like that.
Which one gets me on the mountain bike faster?
Uh, definitely... well, I mean, over time, uh, you'll get there with the juniors.
Juniors equal mountain biking. There you go, so he's on board.
All right. And also wanted to mention we got an email from Zach asking about message brokers.
Like we've never done an episode on them.
We've talked about Kafka and the fact that we use it.
We've talked about RabbitMQ and other things.
So I think we're probably going to get one on the schedule here and we'll do a deep dive
into message queues and why you might choose one over the other.
And I mean, there's several out there, so it is a pretty good and deep topic. But Zach, if you have like a specific question that you want
to hit us up with in the interim, go ahead and shoot us an email over and, you know, we'll try
and answer any questions. Maybe I'll just reply, you know, I could do that too. That would make
way too much sense, wouldn't it? So, so yeah, anyways, that'll be upcoming.
And with that, I guess we can go ahead and dive into the nitty gritty of the second half of this
particular topic on monitoring distributed systems. So first up we have instrumentation
and performance. And honestly, before we even jump into this, I kind of like it that we're hitting some of this stuff, because I know that in our professional careers these are things that we've been dealing with. A lot of times it's just add as many things as you can find, right? Like, oh, there's some latency there.
Well, we need to track latency.
We need to alert on latency and we need to, and it's like, whoa, wait a second.
Has it been a problem?
Have we had a problem?
If we haven't, let's not just make problems, right?
Like we don't, we don't want to create things that we have to go chase for no apparent reason.
So that out of the way. And we also don't want to just have to look at more dashboards and widgets on
the, on dashboards and panels on them just for the sake of it,
which is kind of the whole point of this chapter was to like focus in on,
you know, what you, you're going to monitor. So just as a quick
reprise of the previous chapter, though: we ended with the four golden signals. If you were going to monitor nothing else, the four golden signals that you would look at, according to Google, would be latency, traffic, errors, and saturation.
Yep.
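As a rough sketch of what wiring up those four signals might look like in code, here's a minimal example using Python's prometheus_client; the metric names, the endpoint, and the failure rate are made up for illustration and aren't from the episode or the book.

# Hypothetical instrumentation of the four golden signals for one service.
from prometheus_client import Counter, Gauge, Histogram, start_http_server
import random
import time

REQUESTS = Counter("app_requests_total", "Traffic: total requests handled", ["endpoint"])
ERRORS = Counter("app_request_errors_total", "Errors: requests that failed", ["endpoint"])
LATENCY = Histogram("app_request_latency_seconds", "Latency: request duration in seconds", ["endpoint"])
IN_FLIGHT = Gauge("app_requests_in_flight", "Saturation: requests currently being worked on")

def handle_request(endpoint):
    REQUESTS.labels(endpoint).inc()
    IN_FLIGHT.inc()
    start = time.perf_counter()
    try:
        time.sleep(random.uniform(0.01, 0.2))  # stand-in for real work
        if random.random() < 0.02:             # pretend ~2% of requests blow up
            raise RuntimeError("boom")
    except RuntimeError:
        ERRORS.labels(endpoint).inc()
    finally:
        LATENCY.labels(endpoint).observe(time.perf_counter() - start)
        IN_FLIGHT.dec()

if __name__ == "__main__":
    start_http_server(8000)  # exposes /metrics for Prometheus to scrape
    while True:
        handle_request("/search")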
All right. So with that, what they say at the beginning of this is you need to be careful and not just track your times like these latencies and things as just medians or means, because we've talked about it in previous episodes.
If you're just doing the mean, then you could get some highly inaccurate things because your tails could be way off in another direction.
You're not going to know about it,
right?
Yeah.
Your outliers can be lost, and they can, uh, throw off, you know, what's really happening there in the system. They might make it look good or bad depending on what the thing is that you're trying to measure.
Yep.
Totally.
And it can also mess up your median too,
depending on what's happening on those tails.
So those you need to be very careful about.
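To put a number on that, here's a tiny sketch (plain Python, with made-up latencies) of how a single tail request drags the mean way up while the middle of the distribution barely moves:

import math

# Hypothetical latencies in milliseconds: mostly fast, one ugly tail request.
latencies = [12, 11, 13, 12, 14, 11, 12, 13, 12, 4000]
ordered = sorted(latencies)

mean = sum(ordered) / len(ordered)                  # 411 ms, dominated by the one slow request
median = ordered[len(ordered) // 2]                 # 12 ms, barely notices the outlier
p99 = ordered[math.ceil(0.99 * len(ordered)) - 1]   # 4000 ms (nearest-rank), shows the pain

print(f"mean={mean:.0f}ms  median={median}ms  p99={p99}ms")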
And this is actually something that I think I mentioned with Prometheus before was a better
way is to bucketize data in histograms.
And if you've never dealt with a histogram, first off, they're mind bending the first
time you look at them because you're like, what are you doing here? But if you think about just making buckets and then
counting how many times things happen in those buckets, it'll make a lot more sense. So the
example, go ahead. Well, I was just going to say, like, I mean, you could easily think about this
in like a classroom setting, right? Like, you know, you take a test, you take, you know, you
have a test in your class.
This is the number of students that made A's.
These are the number of students that made B's.
These are C's, et cetera, et cetera, et cetera.
Each one of those letter grades would represent a bucket.
And now you've, you know, you put a number to that.
And so now you could imagine a chart of those different buckets and what that might look like.
Yep. Now, the thing that's interesting about this is a lot of times histograms have to be predefined.
At least the truest term of histogram.
So for instance, like, it's easy with grades, like what you said, right?
Like you have A, B, C, D, E... A, B, C, D, F. You don't have an E.
And so you have those fixed number of buckets and you know those up front, which is good.
In histograms, like if you're doing something with Prometheus, if you're dealing with things like latencies, you kind of have to figure out what you want those buckets to be.
And here, like I said, they gave an example of like 0 to 10 milliseconds would be one bucket.
And then they sort of did factors of three after this. So,
um, from 10 milliseconds to, um, 300 milliseconds, 300 milliseconds to one second, et cetera.
And so, uh, factors of 30, I think is what that is actually. So, so when you set up these buckets,
every time a request comes in that was five milliseconds, you're going to put a tick mark
in zero to 10 milliseconds, you have one, right? So these counters make it to where you don't have
to keep all the low level detail around, right? These give you quick counters to where you can
easily aggregate that stuff over time periods and you can see the trends and you can see how
these things are working. I think I see there was a mistake here in the notes. It was a factor of three, but the buckets that they gave in the example were zero to ten, ten to thirty, thirty to a hundred, a hundred to three hundred, so it was roughly, you know... Okay, so I jacked that up. All right, so 30 to 100 milliseconds. All right, so
that's pretty good. Um, now the thing is, though, again, when you're defining all these (and this is a hint, I guess, on Prometheus as well) you can define all your buckets up to a point where you're like, anything over this I just want to sort of go into a catch-all. Prometheus has that in their histograms, so that if, like, let's say that you wanted to cut off at 30 seconds, right? Like anything over 30 seconds, they have a plus infinity bucket that they throw in there. And anything that went outside the bucket ranges you set up would at least hit that.
So, you know, know your monitoring tooling systems and how those work.
But, you know, hopefully that'll give you a little bit of insight.
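Just to picture the bucketizing idea, here's a toy sketch in plain Python. The bucket boundaries are borrowed from the example above, and the last bucket plays the role of Prometheus's +Inf catch-all; real Prometheus histograms store cumulative counts per upper bound, but the tick-mark counting idea is the same.

import math

# Upper bounds in milliseconds for each bucket, ending with a catch-all.
bounds = [10, 30, 100, 300, 1000, math.inf]
counts = {bound: 0 for bound in bounds}

def record(latency_ms):
    # Put a tick mark in the first bucket whose upper bound covers the observation.
    for bound in bounds:
        if latency_ms <= bound:
            counts[bound] += 1
            return

for latency in [5, 7, 22, 180, 950, 45000]:  # the 45-second request lands in the catch-all
    record(latency)

print(counts)  # {10: 2, 30: 1, 100: 0, 300: 1, 1000: 1, inf: 1}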
The next piece up that they have is choosing the appropriate resolution for measurements.
Now, I don't know about you guys.
When I was reading some of this, I don't know.
It kind of jumbled up in my head how they were talking about some of it.
But go ahead.
I was going to say, yeah, that for sure.
Even just looking at Prometheus, like, I think I understand, you know, math with chickens and, you know, numbers and stuff like that. But in Prometheus, I'm like, wait, what's an irate?
The way they kind of like put things together and like, you know, the terms they use and stuff are over my head in a lot of cases.
Yeah. In here, I think what they were trying to get to at the heart of it was if you're looking to measure something, look at your service level objectives and agreements and sort of go from there, right?
So they gave a couple of examples that I think help with this.
They said if you're targeting 99.9% uptime, then there's no reason for you to check your hard drive fullness more than twice a
minute, right? Like there's some monitoring systems that'll do it every second or every 15 seconds or
whatever you want it to be. I mean, you could force them to be more granular, but the reality is
you don't need that much data. You don't need that many data points. So,
you know, look at what your overall objectives are and work back from that. Um, go ahead.
I mean, I understood what they were getting at, but it was just also such a bit of a mind melt for me, because I was like, okay, yeah, I get that, but then also I don't want to, like, monitor too late. But, you know, that's kind of their point: if it doesn't matter anyways for the objective, then you could afford to have a little bit of a hit there and it not work against you. So
rather than like alerting too often about something, it would almost be kind of like,
you know, the story of like the
boy that cried wolf kind of thing, right? Like if the alarm is going to, you know,
ping you too often because it's too aggressive and it doesn't really matter to your SLO or SLA
that, you know, that much, then why have that noise?
Right. Yeah. It's, I mean, it's going to be hard to, I guess, as, as probably all three of us are, we like data. And so the more the merrier, but not when you're actually trying to monitor the uptime or availability of a system, because the more data you have, the harder your CPU and everything has to work to analyze and aggregate that data.
So fewer data points can be actually better for your monitoring solution.
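A quick back-of-the-envelope sketch of that reasoning (plain Python, assuming a 30-day month; the numbers are illustrative, not from the book):

# Rough error-budget math behind "don't poll more often than your SLO needs."
slo = 0.999                                   # 99.9% availability target
seconds_per_month = 30 * 24 * 60 * 60
error_budget_seconds = (1 - slo) * seconds_per_month
print(f"Monthly error budget: {error_budget_seconds / 60:.1f} minutes")  # ~43.2 minutes

# If a full disk can take the service down, checking every 60 seconds means you
# notice within a minute, a small slice of that 43-minute budget, so per-second
# polling mostly just costs you storage and CPU without improving the SLO.
check_interval_seconds = 60
print(f"Worst-case detection delay: {check_interval_seconds / 60:.1f} minute(s)")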
Well, also in this particular chapter, too, I mean, this is trying to focus the time of the human, right?
Right.
You know, Alan's favorite term.
So we're trying to make sure that, you know, when the human gets involved in whatever the problem is, that said human is focused in on the one thing. And so if you're getting alerts about the disk drive being full more often than you need to, because it doesn't matter in terms of your SLO or whatever, then, uh, you know, you're just wasting that person's time. I'm sorry, that human's time. Yes, it worked.
It's expensive too. Uh, so we got that coming up in the notes; I want to jump ahead a little bit here, but, uh, yeah, those measurements are surprisingly expensive. And some of the things, like, it's hard to really figure out ahead of time how much stuff is going to cost you, because the way they price those things is just not very human friendly.
But when you get your first bill, you realize that it's a very real cost.
Yeah.
Oh, I shouldn't have been monitoring at per second intervals.
Yeah.
I mean, you get hit with both the cost of storing the measurement as well as when you're aggregating that stuff. If you're storing per second instead of per minute, you're now aggregating, you know, 60 data points for a particular metric. And typically on these things, you'll have more than one metric that's being, you know, aggregated. So yeah, it's interesting.
And then they also say a really good thing about these
histograms is because you're not keeping the raw measure around and because you're just doing a
counter in each one of these buckets, those are way faster to aggregate, which means it's way
less intensive on your monitoring system, right? So it could keep that thing from going down as well.
Yeah, a lot of time-series databases are actually designed to scale the data down to some sort of resolution, so they'll actually compress the data. It's lossy, but much more efficient.
Yeah, and I mean, the reality is typically you don't need that low-level, crazy amount of detail, right?
Yep, but the heart wants it. The head doesn't want it, and if processing were infinitely, um, fast, then it wouldn't matter. You know, I just caught something, though, that we've said. I think we've mixed some things here, right? Because you were talking about Prometheus earlier, and then irate came up, but now we are mixing Grafana and Prometheus.
So I'm sure somebody is like screaming at their iPod cause you know,
they're playing this on an iPod.
No,
I rate is a Prometheus thing.
Well,
I know in Grafana you can choose to do a Raider.
I rate,
that's the kind of thing that I was thinking about.
But um,
yeah,
I don't,
I mean Prometheus just kind of stores the stuff right yeah i thought
it was just the time series database for it all no it's also in the prom ql stuff so um irate is
one of the prom ql functions that you can use rate and irate oh yeah yeah i mean i guess that's
what drives it right yeah yeah yeah so we didn't see i suck at this stuff. And I only know this cause I was dealing with it recently.
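For anyone else who gets tripped up by rate versus irate, here's a small sketch that hits Prometheus's HTTP query API with both. The PromQL functions are real, but the URL and the metric name are placeholders: rate() averages the per-second increase over the whole window, so it's smoother and better for alerting, while irate() only looks at the last two samples, so it's spikier and better suited to zoomed-in graphs.

import requests

PROM_URL = "http://localhost:9090/api/v1/query"  # assumes a local Prometheus
queries = {
    # Average per-second increase over the last 5 minutes: smooth, good for alerts.
    "rate": "rate(app_requests_total[5m])",
    # Per-second increase from the last two samples in the window: jumpy, good for detail graphs.
    "irate": "irate(app_requests_total[5m])",
}

for name, expr in queries.items():
    response = requests.get(PROM_URL, params={"query": expr}, timeout=5)
    print(name, response.json()["data"]["result"])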
Um,
so yeah,
this is the next piece that they have here.
And I really like this.
Keep it as simple as possible,
but no simpler,
man.
What a hard line to walk.
Like that stuff is so frustrating,
right?
Like how do you know if it's simple enough or how do you know if you didn't
have it simple enough or two,
too simple?
Like, you won't know until you do it.
I kind of want to start with the perspective of, like, follow the four golden signals. And then for any product or service that you might be tempted to have a dashboard for, just start with those four things.
Yeah, totally. And that really is the answer.
Yeah, yeah. You'll know it's too complex when you, like, start to show someone else and you're like, okay, wait, don't freak out, this top half here is for whatever, and down here on the side is for... Sure. Yeah, it's really hard to keep it simple.
Well, yeah, and especially, like, in Grafana, where you can have so many panels and it's just super easy to be like, you know what, I'm going to add another one. But you know what, this one doesn't matter as much, so to not inundate the reader I'm going to collapse this one into a tab. But then, like, every time I go to it, I'm going to expand that tab, right? And then, you know, going back to the data crunching that Alan was talking about earlier, and how compute-intensive that can become: with Grafana, you get, like, too many panels with too large a time range, and now you can start to crush Prometheus in the background as it's trying to, uh, respond to all the queries for the different panels that are on there.
So,
yeah,
I don't know.
But I mean,
like, it also begs the question, you know, because I said to create a dashboard for each of the products that you might want to. So immediately you could interpret that as like, oh, well, um, let's see, I've got Kafka, so I'm going to have a dashboard for the four signals for Kafka. And I'm going to have a Postgres maybe, so I'm going to have the four golden signals to monitor Postgres, blah, blah, blah, blah. But you know, that's probably not the kind of product that they're talking about here. You know, they would be talking about, like, well, what's the overall service doing? Like, what's the product that I'm delivering, that I'm making? Not the products that I'm using to deliver it, but the product that I've made. And now monitor the four golden signals for my product.
So that's really interesting. I mean, if you think about that, like, what you just said,
and it's true, right? Like, if you've got Kafka and 12 other technologies, you want to monitor and measure all those things, right? It's so tempting.
It is.
But the reality is all you have to care about is your SLO, right? Like, what are you trying to deliver? Is it, you know, I need to make sure that all my requests get back within a second? Then that's what you need to be tracking. And then, based off what we're about to talk about here in a second, I think you work your way backward from that over time, right? Like, if something triggers, hey, things went up above a second, why? Now you go find out what it was that caused it. And maybe without those other dashboards, you won't know, right? Like, and that's where,
that's where it hurts, right? Like if you don't have a dashboard for Kafka over here showing its
latencies and you don't have one for your Ram and your CPU, maybe you wouldn't have seen that. Well,
the CPU spiked to a hundred percent here and it went over five seconds, right? Like I, I don't
know when you start introducing those things, but I think based off a
previous conversation, you're already going to have those system metrics in place, right? But
those aren't going to be what you pay attention to. It's going to be your service metrics that
you're going to pay attention to, and you'll dig into the other ones when you need to.
Yeah, I mean, that's kind of what I'm thinking, what I'm envisioning, because like, you know,
take Datadog, for example, right? Like, you know, their whole pitch is like the single pane of glass type of experience, right? To monitor your thing.
And so what I'm advocating for is like, well, that, that single pane that you go to
should you, you know, based on what we're reading here, I'm thinking that like, okay,
you want that to be like the thing that you are building and that you are providing to the world,
right?
Whatever that is,
you,
that's not to say that you couldn't have other ones for deeper dives like
you were getting at,
you know,
for when you do need those,
but that's not your go-to dashboard.
That's not what you're watching.
That's not,
yeah,
that's not the dashboard that you had the team like focus on when they're the
on-call person, right?
Because they talked about having it where you would rotate.
It was like a quarter of the time or something like that.
Every two weeks, yeah.
Yeah.
Or every six weeks.
Depending on how large the team was, right?
Yeah. you know, I'm thinking that like you, you'd want to focus in on, you know, the overall dashboard
for your product. So I don't know, you know, let's pretend you, you wrote a new email service,
right? You know, yeah. In the background, there might be a database behind it and, you know,
you might want to care about the CPU and all that of the computers, but that wouldn't be your first dashboard that you
would go to. Yeah. And thinking in terms of a Grafana type world, I think what you would have
is you'd have those dashboards linked, right? So for instance, if you're looking at your service,
your email service, right? And all of a sudden its latencies just jump real high. You highlight that timeframe and say, click through, and then it takes you to another
dashboard and filters it to that time range that you selected on the previous one, that type of
thing, right? And then that way you can start looking at all the system metrics that were in
there. I think they were called the white box metrics, right? The ones that your CPU, your RAM,
all that kind of stuff. Then you can see, well, what might have gone wrong in this time frame.
So information architecture is really hard. Like, if you have a strict hierarchical view, it's well organized, but a lot of times you really want to see things that are kind of at cross purposes. It's almost like you want, like, Minority Report: show me, uh, the database hard drives and also show me the topics. Okay, we can see there's a correlation here, let me drill into these things. You want to see that stuff at the same time. So if you have a strict view starting at a product and kind of drilling out, it kind of makes sense for some users, not for others. Like, you want a business-facing dashboard for your, you know, CEO or whatever to look at. You want a financial thing for your CFO to look at.
Your SRE is going to want to see stuff totally different.
And the first kind of dashboard they go to doesn't necessarily fit in the same hierarchy.
It could be multiple kind of hierarchies.
And it's hard.
It's like the same things that are hard about information architecture,
the same things that are hard about kind of coding, like organization,
getting the right level of abstraction correct. That's all stuff that's
hard. It's good news, though, because this is a living system that you're going to
be keeping track of and working on alongside your real system.
Stuff can evolve as you do. But I think that means,
though, unless I misunderstood you, let's say the three of us have
some company. Let's go with the email example that I gave. So we, we start up some new email service, right?
And, and we have a dashboard that is showing like the overall health of, you know, using the four
golden signals of the, of the email system. Right. And when I'm on duty, that's the first dashboard
I go to and look at to make sure that like, there's no problems. And when Alan's on duty,
that's the first thing that he goes to look at it to verify there's no
problems.
But when you're on duty,
you're like,
well,
that's too high level for me.
I'm going to go look to see like what the database is doing and I'm going
to track that.
And so then you could on,
when you're on call for that,
that third of the time,
you could be missing overall problems.
Cause you're like, yeah, database looks fine.
But there might be problems elsewhere.
Right? So wouldn't that be wrong
what they're saying?
So I was talking about more about
different people, different roles. But if we're
all three SREs on
the same team, then ideally we would be
using the same dashboard and the same view because
otherwise, just like you said, that's a problem, that you and I are looking at totally different things to achieve the same kind of job. Um, but what I'm talking about is more like an incident response kind of thing, where you're trying to figure out why something's going wrong and you're kind of trying to drill in. And then this hierarchical view, which is great for organization around those four kind of tent poles, it's just not so great to have that stuff separate. So you start having different tabs, or maybe kind of create ad hoc dashboards to grab this from there and this from there, and you're trying to correlate stuff. I'm just saying it's basically cross-cutting concerns that sometimes you want to see together and other times you really don't want to see together. And if you just have one big pane of glass with a lot of stuff, then, you know, it's hard to really have that information feel useful to someone who's not as familiar with it and doesn't, like, live in there.
Yeah. I think, I think for the SREs,
if we're focusing on that, I think that, that having those four golden signals that you have to watch is the key, right? Just, I get what you're saying about other
people within the business will want different views of that. But I think for you maintaining
a service that needs to have a certain SLO, then you need to have the measures on screen
that matter to you, right? Like you don't even care if the database is getting pegged
as long as your requests and your responses are coming back in time.
Yeah, what I'm saying is, like, oh, sorry.
No, no, you're good.
I was going to say, like, you know, it makes sense to have one dashboard
that's kind of like, is my overall system healthy?
And then you kind of drill from there.
But do you have one dashboard with all of the CPUs from all of your services together?
No, you can't.
No, that doesn't make sense.
It makes more sense to bring them up by service.
So like, database is over here,
and it's got its CPU,
and it's got its latency,
it's got its saturation.
And you go over here,
and you know, your, I don't know,
your Elasticsearch,
or something like that,
your web app is, you know,
but sometimes you want to see that stuff together,
and you want to say like,
hey, well, the latency is bad on my web app,
and the CPU is high on the database, like maybe there's a correlation there. And that doesn't work so well when they're on separate panes; that means a human has to know that there might be some sort of correlation between these two systems, and so you kind of have to pull them both up. And wouldn't it be nice if you had a single pane of glass that really focused on, you know, maybe just user experience, or something that was kind of more purpose-built for the things that are more common? So I don't know, I'm just kind of thinking out loud about how hard it is to organize stuff. Because if you just break it down on simple lines, you know, simple is good. You should absolutely start there. But you might find yourself wanting to expand and maybe make more targeted views for drilling into common problems that you have, based on your system's behavior.
Well, I mean, the whole point here, right, was as simple as possible, no simpler. So, you know, I kind of had this thought: I think we've talked about, like, the screenshot plugin before, um, for Chrome, where you can take a picture of the whole page, even if it has to scroll; it'll scroll the page for you and take the whole thing. So if you needed to share a screenshot of your dashboard in, like, Slack or whatever your messaging platform is, and you had to use that plugin so that it can scroll the whole page, then maybe your dashboard has too much information.
Yeah, totally. So imagine, like, you know, what I said with a database and a web app: you're talking about one, two, three, four systems. That's okay. But when you get into, like, oh, we have an ingestion pipeline with 11 different nodes that do sort of, like, processing and talk to different data stores or something, then suddenly it's like, well, I want to know where things stop in the pipeline. And that, you know, starts getting rough when you're talking about having, you know, 11-plus tabs open, trying to figure out stuff.
So this is, this is where I think we need to jump into what they say, because I think at the end of
it, we're going to tie into what you just mentioned when, when you're actually trying to troubleshoot
something, right? Yeah. Um, so when they say as simple as possible, no simpler,
they say it's real easy for monitoring to become super complex.
because you're alerting on just all kinds of thresholds and
measurements.
You might even have code in there to detect possible causes.
Then you've got dashboards that like what we were talking about,
multiple dashboards up. So the monitoring can become so complex. It's difficult to change, maintain,
and it becomes fragile. That sounds familiar. Remember, um, clean architecture. That's what
they said about the code, right? Like when the code is too coupled, that's exactly what happens,
right? Changes take way longer, all this stuff, it ties in directly
to that. Um, so what they said is there are a few guidelines that you can follow in order to sort of keep these things simple. Um, first, you need the rules that find incidents to be simple. They need to be predictable and reliable. Data collection, aggregation, and alerting that is infrequently used (they said, like, less than once a quarter or something) you need to potentially think about cutting out of your system.
If it's never triggering, never hitting, then you don't need it.
It's just noise.
Data that's collected but not used on any dashboards or alerting, get rid of that too.
No reason to be collecting that data. Um, and then they said, and this is what goes into kind
of Joe, what you were talking about. And I think even, even outlaw like this, this whole thing
where you want to avoid pairing simple monitoring with other things like crash detection or logging or,
or any number of other things,
because then it gets very complicated,
right?
When you start chaining those things together to help you find the root causes
of things,
that's when your systems become super complex.
And basically what they said at the end of all that was try and keep those
systems decoupled.
It seems almost counterintuitive, but if you do keep those decoupled, you can change those and enhance those easier over time.
I think it's actually further down where they talk about – yeah, we'll get into it in a minute.
I won't bring it up. But so, so the problem is, is you want to keep your systems from being too complex because
you won't be able to effectively change them and enhance them over time.
Yeah.
So, so really it was advocating for against what, uh, Joe was talking about then.
Like you wouldn't have that ability to do the deep dives for the SRE.
Yep.
Yeah.
You'd, you'd keep it separate is kind of my take on
it so uh my my read of it so you keep it separate so you can kind of break that stuff apart and if
you need to kind of bring those things back together in order to follow some sort of trail
uh then there's better tools for that like you know maybe you're looking at uh
well distributed tracing is really the answer instead of like logs you know kind of so it's
a higher level but um you know maybe you have stuff in your playbook answer instead of like logs, you know, kind of, so it's a higher level, but,
um,
you know,
maybe you have stuff in your playbook for kind of like how to track that
stuff down or like,
Hey,
check this out.
You know,
if this looks bad,
go to step two or whatever,
you know,
so kind of help you navigate that and keep that stuff out of the
monitoring,
keep your monitoring dashboards and stuff like that.
Just unimpeded stick to the four basics,
keep it as simple as possible and no simpler.
Yeah, which, again, is hard.
Um, all right, so they have a section on tying these principles together. So Google actually has a monitoring philosophy for their SREs, and they said that it's actually hard to attain, but it sets a good foundation for these goals. So they have some questions that you should ask before you set up alerts, because they said you want to avoid pager duty burnout, which is really easy to hit pretty quickly, right?
and is actually visibly noticeable by a user.
I think that's fantastic.
Will I ever be able to ignore this alert?
And how can I avoid ignoring this alert?
That's pretty interesting.
Does this alert definitely indicate negatively impacted users? And are there cases that should be filtered out due to any number of circumstances? Like, one of the ones that they gave was, let's say that you were doing an upgrade on an app and so it was draining users off one section. Um, if those users are being drained off, then you should filter those out, right? Like, that shouldn't be a part of your metrics.
Which is interesting. I mean, that's a whole nother layer of complexity, right? Like, making sure you're filtering data on nodes that are still being drained dynamically. Yeah, man. Like, that's fun.
This kind of goes back into their previous,
uh, the previous episode too, where they were talking about like the rules.
And this was like one of those few cases where there might be rules. I think in the example
that they gave at that part of the chapter,
they were talking about like draining from a data center.
Yeah.
It shouldn't trigger an alert,
right?
Like,
yeah,
man,
that's a,
yeah,
that's fun.
And then the last one that they have with these four was,
can I take action on alert?
Does it need to be done now?
And can it be automated?
Will the action that I take be a short term or a longterm fix?
Oh,
that wasn't the last one.
There were other questions.
That's right.
And are other people getting paged about the same incident?
So basically,
am I accidentally repeating an alert that somebody else already has set up?
I need to make sure I'm not.
Yeah. Cause the last thing you want to do is waste two people's time.
Right. Um, so those questions help you with the whole notion of like what you need to be thinking
about when you're setting up a page because those things interrupt people. Well, a page in this case,
like not a webpage, but like an alert. Yeah, a pager alert.
Yeah, pager alert.
And they even call out pages are extremely fatiguing.
People can only handle a few a day.
And so the ones that you get hit with need to be real and they need to be urgent.
You don't want garbage funneling through, which I think all three of us can attest to, right? Like we've all been hit with things that you're like, oh,
the same comment's been applied to this thing 20 times in the past, you know, 12 days.
Like, why am I getting this?
Well, it quickly becomes that old, that old joke.
Like, let's say, let's say that you, you don't take this advice, right?
And so you're paging too often and unnecessarily and whatnot.
It really, it quickly becomes that old joke about like, um, uh,
something important on your, uh, part doesn't constitute an emergency on mine.
You know what I'm talking about? I'm messing up the exact quote, but you know, it ends up kind
of falling into that kind of category, right?
Where like,
you're just like,
whatever it's,
you think it's an emergency,
but I don't think that it is.
Well,
when 90% of what you get is not urgent,
right?
Then it's real easy for you to just filter out that next 10%.
Right.
Like,
Oh,
it's the same thing.
You know,
you start trying to take your alerts and figure out which of these ones you
don't care about.
It is, you know, like you come out from that perspective. Yeah. It's, it's not good. So, so to solve that, they say every page should be actionable, right? Like if you get
an, if you get a page, you should be able to do something about it. If a page does not require
a person's action or thought, then it shouldn't be a page.
Basically what they're saying is if this thing could have been automated,
then it shouldn't be interrupting somebody else's time, right?
Like unless you have to think about it, it should be done somewhere else.
They say that they should be about novel events.
I don't really like that term, but you know,
something big I guess is really what they're saying.
No, novel as in new, novel-ism. You shouldn't be paging a second time for the same thing.
Okay. I got you.
Cause they'd covered that earlier in the book too. Um, it was that, you know, whatever you're going to alert on, it should be something new, that you shouldn't already be paging on.
An example might be like, oh God,
let's say that you know that there's a physically bad cable
plugged into the server and so it's dropping packets
and so the latency is high and whatnot
and you're just ignoring those alerts right but
yet it's still paging you every five minutes about it right right that would be an example of
something like just take care of the issue or don't have the alert at all okay yeah that's fair
Um, they also called out here, it's not important whether this came from the white box or the black box monitoring. And if you go back to our previous episode,
the white box being metrics that the system gives you easily, right?
CPU counts, RAM counts,
that kind of stuff versus the black box stuff where, you know,
these aren't directly able to be monitored,
but coming from somewhere else, they don't care where they come from.
They need to be important.
And so whichever they come from, it's fine. Now, this is the part that was interesting. It goes directly
against what we were saying about trying to root cause, find some of this stuff is they said in
this, actually, I had to read this like four or five times and make sure I read it properly.
It's more important to spend effort on catching the symptoms over the causes.
And I guess their thing there is trying to find the root cause is typically more difficult,
right? Like chaining together events that happened from a slow UI response to a database or an Elasticsearch query or whatever, any number of technologies
in between, I think that's why they said it, right? Is these measurements give you the symptoms,
right? These measurements are the response is slow. All right, now you go figure it out,
right? Like, we have the dashboard for this; you go chain all the other stuff together, and this is going to require some human interaction because you have the smarts to know how it all works.
Yeah, I mean, this goes back to last episode... well, no, I was going to say, when we were talking about the keep-it-simple, no-simpler kind of thing, about the dashboard being the overall product and not necessarily, like, what does the disk space on my Postgres server look like. Oh, my Postgres server is having problems because the disk is full. Like, you know, that's the cause of the thing. But what you overall want to know is, well, how well is the overall system performing?
And that's when you want to alert on something.
And then you go dive in to figure out like,
why is it slow?
Oh,
Postgres database,
uh,
or drive is full.
Right.
And we talked about this a little bit last episode where we said,
uh,
if the system tries to be too smart,
it can often kind of bias people and what they look for and whatever.
And so,
you know, if your system says, like, hey, the database is down, instead of saying latency is up, then the person might go and, you know, check the database, and it's not down, it's fine, that alert stinks, bye. And not realize that, you know, there were facts behind that that made us say that. And, you know, it just kind of puts you off on the wrong foot, which can kind of slow things down and just be inaccurate. Yep.
Yeah.
So the interesting takeaway here is less is more,
right?
Like monitor the product that you're providing and,
and the symptoms of it,
right?
The,
the latency,
the errors,
the four pillars that they mentioned and everything else should almost be an
investigation from there.
I mean, that's really been my takeaway from some of this.
Yeah.
This episode is sponsored by Retool.
Building internal tools from scratch is slow.
It takes a lot of engineering time and resources,
so most companies just resign themselves to prioritizing a select few
or settling for inefficient hacks and workarounds for every
other internal business process. So Retool helps developers build internal tools faster so they
can focus on development time on the core product. Retool offers a complete UI component library,
so building forms, tables, and workflows is as easy as drag and drop. And hey, more importantly, Retool connects to basically any data source,
database or API,
offers app environments,
permissions and SSO out of the box
and offers an escape hatch
to use custom JavaScript when you need it.
With Retool, you can build user dashboards,
database GUIs, CRUD apps
and other software to speed up
and simplify your work without
Googling for component libraries, debugging dependencies, or rewriting boilerplate code.
Thousands of teams at companies like Amazon, DoorDash, Peloton, and Brex collaborate around
custom-built Retool apps to solve internal workflows. To learn more, visit retool.com. That's R-E-T-O-O-L dot com.
I think Joe Zach, it's his turn to ask for some sort of weird review.
Wait, why is there going to be a weird review?
It doesn't have to be weird. I'm just saying, if you have got a terrible review you've just been sitting on, been hanging onto, got in your pocket for the right moment...
See, this is why we don't ask him to do this.
Now is your chance. Oh, you've got to drop it like it's hot on us. I think that's what that expression is all about, right, leaving reviews? Uh, well, we try to make it easy for you. You can go to codingblocks.net slash review; we'll have links up there. I'll even, maybe, I could put some verbiage on there, some kind of sample reviews, some sample bad reviews, in case you need some things to jog your memory or kind of, you know, get the ball started. You know, we don't want that, uh, blank page to...
So yeah, if you've got a terrible review, we have one thing for you.
Yeah, but now if you have a great review, a good review, you know, somewhere in there, uh, you know, we'll gladly take those too. But, uh, either way, just make sure you smash the five stars, that's all that really matters, and lay it on us.
That's right. Oh, man.
well that'll be the last time we ever asked Joe to do that.
All right. Well, OK, we need a little bit of a separation here before we get into it.
So how about if I ask you this? Because with what
Joe just asked for, I'm sure we're going to
get some bad reviews now. They're going to make us cry.
So how
do you make Lady Gaga cry?
I can't think of any of her songs.
Poker Face.
Poker Face, that's it?
Yeah!
There you go.
All right.
Well, with that, we head into my favorite portion of the show, Survey Says.
All right.
So a few episodes back, we asked, how did I word this?
What's most important to you when you're looking for another job?
See, I worded it right. I had it right the first time. All right. Your choices were,
it's all about that promotion. I need the title. Or work-life balance is what matters.
I need to be able to enjoy my life and my work. Or dollar, dollar bill, y'all. More money,
more problems, and I'll do anything to have more problems
or i need some flexibility in my schedule life gets hectic or whatever it takes to get away
from this company or lastly whatever it takes to get into that company all right so 186. According to Tucko's trademark rules of engagement.
Joe, you are first.
Balance.
30%.
We're good on balance.
That's pretty good.
Man, I hate it that he picked the same one
I picked.
So I'm going to have to change mine.
I'm going to have to change mine.
I'm going to go dollar dollar bill y'all
and
30% also
okay
uh
Mathemachicken comes in with work-life balance at 30%. Alan, dollar dollar bill y'all, at 30%.
Survey says... you're both wrong.
Whoa. Okay, really?
It's flexibility... whatever it takes to get into that company.
Wow. All right, cool. 79% of the vote.
Oh, wow, that's awesome. Yeah. All right, cool. I like that. That's, uh, that's reassuring.
That's a positive way to go about this.
Yeah.
Good job.
Whoa.
Second.
I mean, there's only 21% left.
It couldn't have mattered really.
Um, yeah, it was, uh, work-life balance was number two.
Okay.
Yeah.
Wow.
Okay.
That's, that's kind of exciting.
I like that.
So we, we'll, we get on tap for this one
all right so for this episode survey we ask did you intern or co-op while you were in school
and your choices are of course i did no way school alone would have prepared me for the real world
or who has the time i was focused on studying and getting my degree as quickly as I could.
Or, well, my school was the school of hard knocks,
so it wasn't exactly called an internship,
although in the beginning I was paid like it was.
This episode is sponsored by Shortcut.
Have you ever really been happy with your project management tools?
Most are either too simple for a growing engineering team to manage everything
or too complex for anyone to want to use them without constant prodding.
Shortcut is different though, because it's better.
Shortcut is project management built specifically for software teams
and they're fast, intuitive, flexible, and powerful.
Let's look at some of their highlights.
Team-based workflows.
Individual teams can use Shortcut's default workflows or customize them to match the way they work.
Org-wide goals and roadmaps.
The work in these workflows is automatically tied into larger company goals.
It takes one click to move from a roadmap to a team's work to an individual's
updates and vice versa. Tight version control integration, whether you use GitHub, GitLab,
Bitbucket, Shortcut ties directly to them so you can update progress from the command line.
And a keyboard-friendly interface. The rest of Shortcut is just as keyboard-friendly
with their power bar, allowing you to virtually do anything without touching your mouse.
Iterations planning. Set weekly priorities and then let Shortcut run the schedule for you
with accompanying burndown charts and other reporting.
Give it a try at shortcut.com slash coding blocks. Again, that's shortcut.com slash coding blocks.
Shortcut, because you shouldn't have to project manage your project management.
So, here we go into the final stretch of monitoring distributed systems.
It's the final countdown.
Of chapter six.
Okay.
Which is, you know, mostly the way through part one of this book.
Wow.
I'm sure I sounded exactly like it too.
Yeah, totally.
That's good.
Yeah, so this section is basically talking about monitoring for the long term.
Like we said at the top of the show,
monitoring systems are tracking ever-changing software systems,
and so your monitoring systems also need some love to grow.
They need to be maintained,
and decisions that you make for it need to be made with the long term in mind. But sometimes you need to do a couple things in order to, you know, get you through the day-to-day and get you through urgent situations, because sometimes short-term fixes are important to get past the acute problems and buy you time for a real fix. An example might be here: if you've got a service that's got a memory leak, and every 24 hours or so the service is going to get killed by Kubernetes, or something is restarted or crashes or whatever.
Then maybe that's something you figure out, you write a ticket for, it's going to take a couple of days to fix.
And so you just restart that box every 12 hours until you get that
fix in for the next couple days. So, you know, that's not a monitoring fix there, but that's the kind of short-term versus long-term thinking I wanted to kind of give an example of. And I gave two, uh, kind of case studies (which I hate) about the trade-offs; in this case they are pretty good, uh, and it's interesting. Both of these
case studies were just
interesting stories that
exemplified the trade-offs that
you're going to be faced with when you're making
monitoring systems.
The first one was about Bigtable.
What? Bigtable?
The T is not capitalized.
It drives me crazy.
It always looks so weird to me. Bigtable? The T is not capitalized. Drives me crazy. It always looks so weird to me. Bigtable.
Bigtable for those actually wondering what it says.
For people who pronounce things correctly, it's Bigtable.
Is it?
I don't know. It's not how it's spelled.
The gist is that originally Bigtable's SLO was based on artificial kind of
good clients mean performance and so they
basically kind of mocked something and said this is
what we want it to look like and they had some
low level problems in storage that happened
in very rare cases that basically
you know the worst 5% of their requests
were significantly slower than the rest
and what I'm kind of imagining here is like
it's some sort of cache miss situation
or maybe something, you know,
a request
exceeds some sort of threshold that's normally
hit. And so it takes longer to
kind of process these requests. And you see like a
cliff in the graph that didn't match
their kind of artificial normal
distribution that they came up with originally.
These slow requests would trip alerts.
But ultimately the problem
was kind of transient because, you know, once that request is done,
you wouldn't see it again.
It wasn't repeated.
It was something that just kind of happened,
you know, 5% of the time.
And when someone would get the alert,
they would go check on it,
and there was ultimately nothing they could do about it.
There wasn't like some switch they could flip
to make that work.
It was a systemic problem.
So, you know, imagine what happened.
Like, people would get the
alerts, they kind of learned to recognize those alerts, and they would ignore them. Sometimes they would get an alert that they would ignore thinking it was this, and it turned out something real, something else, was going on. So it's just a problem: you can't have alerts, uh, that don't mean anything and that aren't actionable. Uh, so what do you do about it? So in this case, Google dialed back the service level objective to 75 percent, uh, the 75th percentile. I don't think they said what it was before, maybe 95th, but basically it meant fewer alerts, and they disabled email alerts, and they did this until they were able to go in and actually fix the root cause problems. So this is kind of a funny case where, like, you're actually changing the objective based on, uh, the amount of alerts that you're getting, not on what the business wants and the business needs. And so that's a no-no, it's a big no-no, but they decided to do it because it was a better solution than what they had going on. And, uh, you know, as long as your team is disciplined enough to actually go and do that fix when the fire alarm isn't ringing, that's a good thing.
Yeah.
It allowed them to at least focus on trying to solve the problem rather than,
Oh,
there's a new alert.
We let's go spend some time to see if it's the same problem as the last
alert.
Yeah,
exactly.
And like we said,
you know,
those pages are expensive.
So it's taking these people away from
the work they should be doing to fix
it to go check and make sure that there isn't
something going on. So that
was a good case where they decided to do something
short-term to kind of give them a little
bit of breathing room and then ultimately did
the right thing long-term.
I forgot to, this is kind of off-topic,
a little bit of a tangent here. Whoops, tangent alert.
Hold on, tangent alert. Hold on.
Um,
but since Joe is so consistent with his big table pronunciation, I did.
I,
I,
there's a little Easter egg in the last episode specifically for you,
Mr.
Underwood.
Uh,
anyone care to take a guess at what it might be was it in the show notes i don't know
it's about costco nope should be nope that's not the one i have no idea man i i made sure to
replace all words person with human. Did you really, man?
That's such an evil thing to do.
There's no mentions of person
on the page. That is
awful. You're a terrible
human, sir.
I forgot to mention that earlier,
because remember last episode I said that
I would do that as a joke. I did.
Oh, God. Remember last episode, I said that I would do that as a joke. I did.
In case any monkeys get in there and start banging away on the show notes.
Well, they won't get the alert.
Only a human.
Only a human, yes.
So bad.
I hate words like that.
Well, as I suspected,
someone has been adding soil to my garden.
And the plot thickens.
That's good.
Joe's just stuck over there.
I don't know.
I think...
Oh, we lost him. Yeah, I think't know. Do we, I think, uh, did, Oh, we lost him.
Yeah,
I think we did.
The joke was so funny.
It knocked him offline.
Yeah.
He said,
zoom crashed.
Well,
he can rejoin.
Yeah.
The show must go on.
So,
uh,
another story that they had in here was about Gmail.
So, uh, Gmail was originally built on a distributed process management system called Workqueue, which was adapted to long-lived processes.
And tasks would get descheduled, causing alerts, but the tasks only affected a very small number of users. The root cause bugs were difficult to fix because
ultimately the underlying system was a poor fit, right? And I mean, I know we've all been there,
or you picked the wrong technology to start something on, but you don't know that at the
beginning. You're like, this will do good enough. And then once you get into it, you're like, oh,
now you see all the problems with it, right?
Which is what happened here.
So engineers could, quote, fix the scheduler by manually interacting with it. Like imagine if you were to restart a server every 24 hours or something like that, right?
Should the team automate the manual fix or would this just stall out what the
real fix,
what should be the real fix,
right?
So there were,
there were two red flags here.
Why have,
what are we right here?
Why have root,
root toil?
Oh,
wrote tasks for engineers to perform,
which is toil.
Why doesn't the team trust itself to fix the root cause just because an alarm isn't blurring?
Blurring? Blurring.
What we said? Oh, yeah, we wrote blurring, but I guess we meant blaring.
Hey, Joe, did he guess we meant blaring. Yeah, I don't know. Hey, Joe.
Did he?
He locked up again.
Yep.
He came back.
Oh, there he is.
Is it the alarm blaring or blurring?
We'll never know.
I think he just locked up again.
I think so.
This is hilarious.
It's actually funny. We should take a screenshot of that.
I'm going to assume an alarm blaring because alarms blare.
Yes. All right. So, yeah.
Okay. So what's the takeaway? Do not think about alerts
in isolation. You must consider them in the context of the entire system
and make decisions that are good for the long-term health of the entire system. So in this Gmail case, rather than
trying to like automate a manual fix for it, just invest the time into going after the long-term
fix, which is to, you know, take the honest approach that like, hey, maybe we started out on the wrong platform.
We picked the wrong thing to solve this problem
and we need to re-architect.
And that's a tough pill to swallow when you hit that.
But when you do, you do.
So we don't have it in the notes here,
but I actually like,
I'm just going to read the first couple of sentences
of their conclusion because I think it's pretty good.
So, quote, a healthy monitoring and alerting pipeline is simple and easy to reason about.
It focuses primarily on symptoms for paging, reserving cause-oriented heuristics to serve as aids to debugging problems. So, and they go on to say that the
reason is monitoring symptoms is easier as you go up your stack. And so that goes back to what
we said earlier, right? Like you're not trying to root cause things in here. You're literally
looking at the simplest measures that you can to let you know if your service is running in the way that it should, and then leave the investigation for another path.
Yep. So, uh, we did it. Hey, we did it.
We got to the end of the book. Um,
Oh, wait. Oh man. This is only chapter six.
Yeah. How many, there were like 30 something.
There's so many
How many are there? Are we really on chapter six? Oh, we really are on chapter six.
We're on chapter six, and there are 34 chapters and A through F appendices. So at this rate, we'll finish... carry the one... divide by pi. Yeah, I think 2093, we'll be done.
Somewhere around there.
That's probably not close.
Wait, did you account for leap year?
Probably not.
Which calendar are we using?
The Gregorian?
The divide by zero.
Yeah, there we go.
That always works out well.
Joe's back.
I think that's what happened.
I think he divided by zero.
His computer went down.
Yeah.
Yeah.
So we'll have a bunch of links to the resources we like for this episode.
And you know what?
If I had to sum up coding blocks, right, I would just say two guys walked into a bar,
a third one ducked.
So with that,
I always love how like I could see the reaction where it like takes a second.
So with that,
we head into Alan's favorite portion of the show.
It's the tip of the week.
Joe,
did you ever get that one?
Cause I never saw a reaction on your face.
Get which one?
What?
Two guys walked into a bar.
The other one ducked.
Oh, yeah.
Yes.
Yeah.
I didn't hear that at all.
I think I'm still having weird issues, but I appreciate it.
Yep.
I'd like to think that I'm the one that will duck.
No.
No. You're definitely hitting the bar. I think you're hitting the bar right now.
All right. Lovely. So I stole this one, or I borrowed this one, from Murley, so I appreciate it.
This is actually a really cool one. So if you still do LAN parties, which I haven't done in years, I don't even know how big a rig I would have to carry around
to do a LAN party nowadays. But there's this really cool thing called lancache.net. If you go
there, they basically have like a Docker setup to where instead of everybody having to download a
game and getting hit with that,
you can download it once and share it with everybody in your LAN party.
So this has instructions on how to do that.
If you have data caps and stuff,
this obviously would help.
Like it's a really cool way of going about doing that.
So thank you Merle for that one.
And then I have to share this
because I've shared this with
you two guys before, and there are times that you just want to watch really cool stuff. You don't
really want to learn anything, I guess, like this podcast. We teach you one or two things
per show, but you know, there's times that you just really want to sit back and relax. Yeah, this one, I think, sometimes.
So I told Outlaw and JZ about this YouTube channel that I absolutely love.
It's called Project Farm.
And why I love this channel is this dude, he takes requests from people. So by all means, if you have some sort of tooling or some sort of home project type thing, then you're like, man, I wonder which is better.
Which pair of pliers is better?
These or these?
If you have any questions like that, submit it to this dude because he goes scientific on all this stuff.
And one of the ones I shared with these guys was the drywall anchors.
Like if you ever wanted to hang a picture up on your wall,
you go into Home Depot or Lowe's or,
or choose your store, Walmart.
And you're looking at the 50 different packs of drywall anchors.
And you're like,
well,
why is that one?
12 bucks.
And that one five,
they both say they hold 50 pounds.
Like which one should I get?
This guy.
I always go for the one that can,
like, hold a Toyota on the wall. That's the one I'm gonna... I don't care, it doesn't matter how small
the thing is that I'm gonna, like, you know, put on the wall. If it can hold it, it'll work. If it's
gonna be a pound, I want the 75-pounder that's gonna hold it, right? But in all seriousness, I have a link to
his main YouTube channel, but I also have a link to the drywall anchors, just because
this is the level of detail this guy goes in on everything, and it is so enjoyable to watch,
like, him setting up the rigs on how he's going to do it, the measuring tools that he uses to
figure out, you know, how many pounds of weight before
it broke and all. Like, it's just awesome. So at any rate, if you want to go back and
waste a few hours of your life and be entertained, go check out his YouTube channel.
All right. And for me, uh, so I've got a nice little tip here for when you're trying to test
in production, have you ever been working on a Python system
and you want to try and make some changes
on that system and you know kind of see how they work
but the problem is that, depending on, you know,
what you're doing in your situation, the kind of app you're working on...
the way a lot of Python apps work,
like Django, for example, is it loads up the Python
basically on startup. It's got all that stuff
kind of, you know, sitting around in memory, and then it goes and
executes whatever it does. What if you need to make a change on that system and you don't want
to restart it? Well, you can in Python actually dynamically reload modules. So I ran into
a case where I had a system that wasn't in production, but it was in an environment.
And I couldn't restart it because it was in a pod.
You restart it, then it spins the pod back up with the old code.
And so I was losing my changes.
And I wanted to run just some unit tests.
And so what I ended up doing is just changing the code in the pod.
And then in my unit test, I actually found some code that I'll link here
that you can import depending on your version of Python.
There's a couple different ways to do it.
But the idea is that you use this library and you tell it to reload your module.
So initially when I was running my tests, they kept failing because the code was not right.
I went and updated the files on disk and ran it and the code was still not right because the module was already loaded in memory.
Then I found this
little block of code here, and it's really simple:
basically, just import some sort of library,
or if you're on Python 2, you don't have to
even do that. You just use a function.
But for higher versions of
Python, 2.x and
on up into Python 3, there's a
library
that you just import.
It's built into the language, and you can tell it to reload that module,
which gets your changes.
So I just did that at the top of my unit test file,
used the main function, and there we go.
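For reference, a minimal sketch of that reload trick, assuming Python 3.4 or newer and a hypothetical module named my_module:

    import importlib
    import my_module  # hypothetical module whose source was just edited on disk

    # Re-execute the module's source so the in-memory definitions pick up the edits
    my_module = importlib.reload(my_module)

    # On Python 2 the built-in reload(my_module) does the same job;
    # on early Python 3 versions it lived in the imp module instead.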
So it saved me a lot of time, and, you know,
ultimately that's not a great way that you want to work,
but sometimes you got to do what you got to do.
And I just thought it was really cool. You know, it was almost like
having a script file that would say, like, recompile my Java or something in this
namespace. That's kind of the equivalent I was thinking of there, which is a pretty cool power to have.
Not something that you'd want to have on a production box. You know, you don't want to have your compilers
installed in production, probably. But it was just cool to be able to do that, and so I thought
it was neat that Python gave me the tools to kind of update
that stuff in memory.
That's pretty awesome.
I mean,
how do you run a Python app in production without having the quote compiler
on the box?
Can't.
Well,
yeah,
yeah,
yeah,
no.
Yeah.
All right.
So,
uh,
for my tip of the week,
I have some,
uh, a Docker file file words of advice.
So, you know, as we, you know, welcome to the Docker file corner.
Yeah. So I've been spending a lot of my time here lately focusing on build optimizations and things like that. Which, you know, on a large-scale
repo and application, if you have dozens of different Docker images that
you're building and whatnot, that can all matter, especially on your build
server, if you kind of come at it with the trust-nothing type of build motto, where you don't have a cache already on that build server
at the time. You know, some of these tricks can really matter.
So one of them was: if you have multiple RUN statements in your Dockerfile, right? Rather than having, like,
one RUN some command and then followed by another RUN some other command,
just concatenate those into a single one. So, like, RUN some command, ampersand
ampersand, some other command, right? And you can do that as many times as you need to.
And if you want to like break it out onto a new line,
you could just add a space backslash at the end and then go to a new line.
And you could have all of these things that you want to run as one giant run statement.
And one of the big advantages to doing that is if you had all of those run statements as multiple statements,
then what's going to happen is Docker is going to persist that as an individual, as a brand new
layer, right? To represent whatever the state of change is for that ultimate image. And, you
know, those are going to be how many layers you end up with at the end. So when you need to
like docker pull or push, you know, those are all the different images or I'm sorry, all the
different layers that you're going to end up pushing, right? But if you just did the one
run statement with a bunch of commands anded together, then it's all in one layer and it's
only whatever the final output or whatever the final state is of all of those
commands that matters. That is the ultimate layer. Does that make sense?
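Just to picture it, a rough sketch of the idea in Dockerfile form; the commands and path here are only for illustration:

    # Instead of two separate layers:
    #   RUN mkdir -p /opt/tools
    #   RUN chmod 755 /opt/tools
    #
    # ...concatenate into a single RUN, which produces a single layer whose
    # contents are just the final filesystem state after both commands:
    RUN mkdir -p /opt/tools \
        && chmod 755 /opt/tools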
So keep that in mind when I say this next statement: oftentimes you might need to install some tools, right,
in your image. So maybe you base it off of Alpine or you base it off of Ubuntu or whatever,
you know, whatever the base image is, but maybe you need to install, you know,
some other tool: curl, wget, or whatever... coming up blank on some
other things... oh, jq might be a good one that you want to install. Whatever
the thing is that you want to install, you might have some things to install. The typical pattern that
you'll see that a lot of people do is they'll say, like, RUN apk update && apk
add some package. Or if they're using apt-get instead, they might say, like, apt-get
update && apt-get install some package. Right? But the better way to do that, especially
if you're using apk, right, or like if you're on
Alpine, which is already kind of optimized for Docker, is to skip the apk update portion
and just do a straight-up apk add --no-cache some package, right? And then that'll,
that'll tell the add command to not
even look at whatever your cache is. So it's not going to try to update anything. It's just going
to go get the thing, install it, and be done with it. Because otherwise, if you do the apk update,
then you have a bunch of extra stuff added into your Docker layer. That's going to like bloat
the size of your layer and image in the end. And if you do,
if you're basing it off of, like, Ubuntu, for example, and you do something similar
with an apt-get update and apt-get install, you need to be sure that the final part of that,
that you have concatenated together, is an apt-get clean, so that you can undo the bloat that happened from the
update, because unfortunately apt-get doesn't have the same no-cache option that apk has. But both
of those are kind of building on top of the previous thing that I said, where you
would concatenate your run statements into that one thing, so that the final
result of that change is what's being persisted. And that's why it's
so important to do that apt-get clean at the end there. And apk doesn't
have that option. That's why the better option there is to just do --no-cache.
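Roughly, the two flavors look like this; curl and jq are standing in as example packages, and each line goes with its respective base image:

    # Alpine: skip apk update entirely and don't keep a package index cache
    RUN apk add --no-cache curl jq

    # Debian/Ubuntu: update, install, and clean all in the same RUN, so the
    # bloat from apt-get update isn't baked into the resulting layer
    RUN apt-get update \
        && apt-get install -y curl jq \
        && apt-get clean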
Okay. The last one here. We all like to add files ultimately from whatever we've been working on, like Joe's Python, for example, into our resulting Docker image that we're trying to create, right? And so there's a
couple of different ways that we can add those files in, namely the ADD or the COPY commands.
But here's the trick that you need to be aware of with those two things. So when it's your own
files, that's probably like not a big deal, right? Because you can add a file or a directory. But what you need to be aware of is that for both of those Docker commands, Docker needs to be able to get the actual file or directory, compute the checksum of that, and then use that checksum to verify, hey, do I already have this layer in cache or not? And if I don't, then I'll build
that layer. Otherwise, if I do have it, if the checksum computes to something I already have,
then okay, I've already got the cache and I'll just move on, right? So that all sounds like,
well, that sounds like a no-brainer, Michael. What are you even talking about?
With the ADD command, though, you can add a URL, which sounds great, because then you don't have to apk add --no-cache curl or wget and then, you know, do some option there.
Because you could just do it natively in the Dockerfile.
Right.
But what you've got to be aware of is Docker's got to compute that checksum
of the file. So it needs to go download the file, compute the checksum, and then know whether or
not it needs to rebuild the layer or not. But guess what? You've already taken the hit of
downloading the file that maybe you didn't want to download every single time
you're trying to build your image, especially if this is happening on a build server, you know,
and you're trying to like optimize this thing for speed and time or whatever, you know, you only
want to download that thing if you're truly going to rebuild that layer. And so in that case,
it would be better to use a RUN statement with curl or wget to download the file for you, because Docker will only compute the checksum of the RUN command string and not the result of whatever file you would have gotten.
So you can avoid that.
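In Dockerfile terms, the two options look something like this; the URL and the paths are made up purely for illustration:

    # ADD with a URL: per the discussion above, Docker has to download the file
    # on every build just to compute its checksum and decide whether the layer
    # is already cached
    ADD https://example.com/drivers/some-driver.jar /opt/drivers/some-driver.jar

    # RUN with curl (or wget): Docker only checksums the command string itself,
    # so if the string hasn't changed, the cached layer is reused and nothing
    # gets downloaded at all
    RUN mkdir -p /opt/drivers \
        && curl -fsSL -o /opt/drivers/some-driver.jar https://example.com/drivers/some-driver.jar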
I have a question here, because, so, I get what you're
saying. So in short, if you do the ADD with the URL, it has to download it.
There's no option, so it can't even check to see if it has a cached version of that layer available
until it downloads the file, right? So you're kind of hosed there. But how do you get around that using the RUN? Because if you do a RUN with a wget or a RUN with
a curl, it's still going to have to download that file in the RUN command.
Because I assume you're talking about putting the RUN command prior to the ADD or the COPY.
That no, no, no, no, no. Don't use the add or copy at all.
Let's say that you want to get some file into your Docker image. Okay. Maybe like a SQL JDBC
driver from, you know, Microsoft or whoever, right? You want to download their jar
to put into your Docker image. And you definitely don't
want to commit that binary to your repo, right? So you're trying to be a good person here.
If you do the ADD and then you give that URL, it has to go download the file from Microsoft to put
it into your image. But it has to download it just to even determine what the checksum of the file is before it even does anything. Versus if instead you do a RUN wget or RUN curl to download that file,
what Docker does is it computes the checksum of the string, the command,
literally the run statement, right?
It doesn't go get the file at all.
It just looks like, has this command changed at all?
If it has, I will re-execute it.
If it hasn't, I'm going to use a cached layer.
So let me back up then, because I think this might clarify it for anybody else that's listening that might have been stuck in the same headspace I was.
So what you're saying is don't use the ADD with the URL ever, because if you do that, it's always going to have to
download the file before it can even check to see if it's got a layer cached, which
almost sort of defeats the purpose at that point a little bit. Whereas if you instead didn't have
that ADD of that file from the URL and you did it in a RUN statement, Docker would be able to look
at that run statement and say, Hey, I already have
this thing cached because it's just looking at the string of the run command. It says, Hey,
we're good. I don't need to download anything. Just use that layer that I've already got cached.
Yes. Except one, you know, asterisk that I would put on that, though,
is I'm not saying to never use ADD
with a URL, because you might have a legit reason for it. I'm just calling out that you need to be aware
of what's going to happen. And that, you know, if you're trying to
avoid downloading the file every time, then you want to do the RUN instead of the ADD.
Because imagine if this was like a really large file too, you know, what if it was,
you know, like a multi-gig file that you, you know, then in that case, well, I mean,
that's a super large Docker image, I feel for you. But, you know, if you used an ADD,
you're going to take the hit of downloading that file every single
time, even if you didn't need it, versus with the RUN, you wouldn't. So here's,
here's the way to think about this. So going back to, you know, what I started on before,
as it relates to doing, like, a RUN apt-get update or a RUN apk update. Like, we've all seen
Dockerfiles that have statements like that in there, or even, like, you know, an apk add --no-cache some package or an apt-get install some package. And you've seen it where, like, it'll use
the cache, right? Because it hasn't actually executed that statement to see,
you know, hey, what was the resulting change to the file system? No, it was like,
okay, well, the command itself had this checksum, that checksum I already have in my cache, and it
hasn't changed. So I can assume that everything else is going to match. And so in the case of
downloading the file using curl or wget, it's the same thing. It's just computing the checksum of whatever that
ultimate command string is. So, cool. Good way to save you some time there.
All right. So with that said, hey, we're at the end of the episode. You know,
if you're not already subscribed to us, you can find some helpful links at, you know,
the top of the page for iTunes, Spotify, Stitcher, wherever you'd like to find your podcasts.
We're... well, I guess we gotta check, we might not be there. We're there twice on a couple platforms
and maybe not there on some. But if you're hearing this, you're probably fine, though. Yeah.
Well, I mean, you know, a friend could have been like, hey, listen to this crazy thing.
And whatever you do, please don't listen to Joe from earlier.
But if you haven't left us a review, we would greatly appreciate it.
You can find some helpful links at www.codingblocks.net slash review.
Yep.
Hey, and while you're up there at the website, check out our show notes, examples, discussions, and more, and send your feedback, questions, and rants to our Slack channel. Head
over to codingblocks.net slash slack and join the awesome community if you're not already in there.
Yeah, and make sure to follow us on Twitter at CodingBlocks, or head to codingblocks.net, where we've got our social links
at the top of the page, including a link to the reviews page, which of course you can go there and
leave a bad review.
That was awkward timing for him,
for his Zoom to crash right then
and there. Like, he is gone.