Coding Blocks - Site Reliability Engineering – (Still) Monitoring Distributed Systems

Episode Date: June 6, 2022

We finished. A chapter, that is, of the Site Reliability Engineering book as Allen asks to make it weird, Joe has his own pronunciation, and Michael follows through on his promise....

Transcript
Starting point is 00:00:00 You're listening to Coding Blocks, episode 186. Oh, you know what? I should have come in with the same gusto as last time. Dang it, I forgot. I think you scared people. Did I? Did I? I don't know.
Starting point is 00:00:12 Hey, you're listening to Coding Blocks. Yeah, there you go. Whoa. So, subscribe to us on iTunes, Spotify, Stitcher, wherever you like to find your podcasts by now. Man, if we're not there by now, I mean, we're like doubly there on Stitcher, so surely you can find us, right? We are on Amazon
Starting point is 00:00:30 now, too. I've got to figure out what's going on there. I don't relish that. I didn't know that. Yeah, it's frustrating. Well, some places you can find us twice. So, I mean, you know, that's how nice we are. That's right. So, make sure, while you're looking around for us, you can check us out at
Starting point is 00:00:45 codingblocks.net, where you can find all our show notes, examples, discussions, and more. You can send your feedback, questions, and rants to comment at codingblocks.net, and @CodingBlocks is how you can find us on Twitter. And if you want to go social, the links are at the top of the page. I'm Joe Zack. I feel like some packets got out of order there. That was you guys, not me. I don't know. Weird.
Starting point is 00:01:16 Okay, well, I'm Michael Outlaw. And I'm Allen Underwood. This episode is sponsored by Retool. Stop wrestling with UI libraries, hacking together data sources, and figuring out access controls, and instead start shipping apps that move your business forward. And Shortcut: you shouldn't have to project manage your project management. Okay, so we're going to pick up with the second half of Monitoring Distributed Systems in this particular episode. So we'll be wrapping that up.
Starting point is 00:01:46 But first, as we like to do, we like to get to some quick podcast news. And first up, we have Outlaw reading the reviews. Okay. Why is it? It's always me. You're always like picking on me to like read the names, but I'm going to try it. We don't have many. We don't have many right now, right?
Starting point is 00:02:01 We don't. I'm going to try it. From iTunes, thank you very much: Los Paz. Right? Sounds good, yeah. It's either that or Los Paz. It could be that. That one, yeah.
Starting point is 00:02:17 Okay. I don't know. Los Paz. Honestly, I gave it my best already, so anything else after this might just be insulting. Like,
Starting point is 00:02:29 those were my two best guesses as to how it would be pronounced. A lost pass. All right. So, man, I don't know, that's the news
Starting point is 00:02:39 to share with you. Sorry. Morale, who we talked about last episode, shared a great post on SRE and toil, and wrote
Starting point is 00:02:48 another great post. We had a discussion in the Coding Blocks Slack talking about onboarding, mentoring, and hiring junior programmers, which is kind of a controversial topic. A lot of companies don't hire juniors. They don't want to hire new people. He wrote a really great post about basically why you should make the case for hiring junior
Starting point is 00:03:04 developers. And it was really good. He came up with a kind of constructive scenario, basically talked about like, I mean, you got to read the article, but I will have a link in the show notes, but basically kind of comparing like what it would mean for you to work a lot of extra time and how much productivity you would get out of that compared to hiring a junior and spending your time kind of raising them up and how over time, you know, basically the productivity gains you get from hiring a junior and training them up was going to beat
Starting point is 00:03:35 any sort of, you know, extra hours you're putting in, and which one's healthier and saner, which one's the better strategy. So it makes a good case for it, and it has some great tips for onboarding, stuff like that. Which one gets me on the mountain bike faster? Definitely, well, I mean, over time you'll get there with the juniors. Juniors equal mountain biking. There you go. So he's on board. All right. And also wanted to mention we got an email from Zach asking about message brokers. Like, we've never done an episode on them. We've talked about Kafka and the fact that we use it.
Starting point is 00:04:11 We've talked about RabbitMQ and other things. So I think we're probably going to get one on the schedule here and we'll do a deep dive into message queues and why you might choose one over the other. And I mean, there's several out there, so it is a pretty good and deep topic. But Zach, if you have like a specific question that you want to hit us up with in the interim, go ahead and shoot us an email over and, you know, we'll try and answer any questions. Maybe I'll just reply, you know, I could do that too. That would make way too much sense, wouldn't it? So, so yeah, anyways, that'll be upcoming. And with that, I guess we can go ahead and dive into the nitty gritty of the second half of this
Starting point is 00:04:54 particular topic on monitoring distributed systems. So first up we have instrumentation and performance. And honestly, before we even jump into this, I kind of like it that we're hitting some of this stuff, because I know that in our professional careers, these are things that we've been dealing with. A lot of times it's just add as many things as you can find, right? Like, oh, there's some latency there. Well, we need to track latency. We need to alert on latency. And it's like, whoa, wait a second. Has it been a problem? Have we had a problem? If we haven't, let's not just make problems, right? Like we don't want to create things that we have to go chase for no apparent reason.
Starting point is 00:05:41 So that's out of the way. And we also don't want to just have to look at more dashboards, and panels on them, just for the sake of it, which is kind of the whole point of this chapter: to focus in on what you're going to monitor. So, just as a quick reprise of the previous chapter, though, where we ended with the four golden signals: if you were going to monitor nothing else, the four golden signals, according to Google, that you would look at would be latency, traffic, errors, and saturation. Yep. All right. So with that, what they say at the beginning of this is you need to be careful and not just track your times, like these latencies and things, as just medians or means, because, as we've talked about in previous episodes, if you're just doing the mean, then you could get some highly inaccurate things because your tails could be way off in another direction.
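The tail problem they're describing is easy to see in a quick Python sketch (the latency numbers here are made up for illustration):

```python
# Hypothetical latency samples in milliseconds: mostly fast requests,
# plus a couple of slow outliers way out in the tail.
latencies = [12, 15, 11, 14, 13, 16, 12, 15, 2400, 3100]

def mean(xs):
    return sum(xs) / len(xs)

def percentile(xs, p):
    # Simple nearest-rank percentile; real monitoring systems estimate
    # this from histogram buckets rather than raw samples.
    xs = sorted(xs)
    index = max(0, min(len(xs) - 1, round(p / 100 * (len(xs) - 1))))
    return xs[index]

print(mean(latencies))            # 560.8 ms -- the mean says everything is slow
print(percentile(latencies, 50))  # 14 ms   -- the typical request is fine
print(percentile(latencies, 99))  # 3100 ms -- the tail is the real story
```

Two requests out of ten blew the budget here, but you could never tell which story to believe from the mean alone.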
Starting point is 00:06:45 You're not going to know about it, right? Yeah. Your outliers can be lost, and they can throw off,
Starting point is 00:06:52 you know, what's really happening there in the system. They might, say, make it look good or bad, depending on what the thing is that you're trying to measure. Yep. Totally. And it can also mess up your median too,
Starting point is 00:07:04 depending on what's happening on those tails. So those you need to be very careful about. And this is actually something that I think I mentioned with Prometheus before: a better way is to bucketize data in histograms. And if you've never dealt with a histogram, first off, they're mind-bending the first time you look at them, because you're like, what are you doing here? But if you think about just making buckets and then counting how many times things happen in those buckets, it'll make a lot more sense. So the example, go ahead. Well, I was just going to say, I mean, you could easily think about this
Starting point is 00:07:38 in like a classroom setting, right? Like, you know, you take a test, you take, you know, you have a test in your class. This is the number of students that made A's. These are the number of students that made B's. These are C's, et cetera, et cetera, et cetera. Each one of those letter grades would represent a bucket. And now you've, you know, you put a number to that. And so now you could imagine a chart of those different buckets and what that might look like.
Starting point is 00:08:03 Yep. Now, the thing that's interesting about this is a lot of times histograms have to be predefined, at least in the truest sense of a histogram. So for instance, it's easy with grades, like what you said, right? Like you have A, B, C, D, F. You don't have an E. And so you have that fixed number of buckets and you know those up front, which is good. With histograms, like if you're doing something with Prometheus, if you're dealing with things like latencies, you kind of have to figure out what you want those buckets to be. And here, like I said, they gave an example of, like, 0 to 10 milliseconds would be one bucket.
Starting point is 00:08:42 And then they sort of did factors of three after this. So, um, from 10 milliseconds to, um, 300 milliseconds, 300 milliseconds to one second, et cetera. And so factors of 30, I think, is what that is, actually. So when you set up these buckets, every time a request comes in that was five milliseconds, you're going to put a tick mark in zero to 10 milliseconds; you have one, right? So these counters make it to where you don't have to keep all the low-level detail around, right? They give you quick counters where you can easily aggregate that stuff over time periods, and you can see the trends and you can see how these things are working. I think I see there was a mistake here in the notes. It was a factor of three, but the buckets that they gave in the example were
Starting point is 00:09:30 zero to ten, ten to thirty, thirty to a hundred, a hundred to three hundred. So it was roughly, like, you know, okay, so it was, okay, I jacked that up. All right, so 30 to 100 milliseconds. All right, so that's pretty good. Um, now the thing is, though, again, when you're defining all these, and this is a hint, I guess, on Prometheus as well, you can define all your buckets up to a point where you're like, anything over this, I just want it to go into a catch-all. Prometheus has that in their histograms, so that if, let's say, you wanted to cut off at 30 seconds, right? Like anything over 30 seconds, they have a plus infinity bucket that they throw in there.
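Here's a small Python sketch of the bucketing idea, modeled loosely on how Prometheus-style histograms work (cumulative buckets plus a +Inf catch-all; the bucket bounds are just the example's, not anything official):

```python
import math

# Upper bounds in milliseconds, roughly factors of ~3 apart,
# plus +Inf as the catch-all for anything past the last bound.
BUCKETS = [10, 30, 100, 300, 1000, math.inf]

def observe(counts, latency_ms):
    # Prometheus-style buckets are cumulative: a 5 ms request ticks
    # every bucket whose upper bound is >= 5 ms, including +Inf.
    for bound in BUCKETS:
        if latency_ms <= bound:
            counts[bound] = counts.get(bound, 0) + 1

counts = {}
for sample_ms in [5, 22, 250, 45_000]:  # the 45-second outlier only hits +Inf
    observe(counts, sample_ms)

print(counts[10])        # 1 -- only the 5 ms request
print(counts[300])       # 3 -- everything but the outlier
print(counts[math.inf])  # 4 -- every request ends up here
```

Because these are just counters, aggregating them over time periods is cheap, and you never have to keep the raw samples around.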
Starting point is 00:10:11 And anything that went outside the bucket ranges you got would at least hit that. So, you know, know your monitoring tooling systems and how those work. But, you know, hopefully that'll give you a little bit of insight. The next piece up that they have is choosing the appropriate resolution for measurements. Now, I don't know about you guys. When I was reading some of this, I don't know. It kind of jumbled up in my head how they were talking about some of it. But go ahead.
Starting point is 00:10:43 I was going to say, yeah, that for sure. Even just looking at Prometheus, like, I think I understand, you know, math with chickens and, you know, numbers and stuff like that. But in Prometheus, I'm like, wait, what's an irate? The way they kind of like put things together and like, you know, the terms they use and stuff are over my head in a lot of cases. Yeah. In here, I think what they were trying to get to at the heart of it was if you're looking to measure something, look at your service level objectives and agreements and sort of go from there, right? So they gave a couple of examples that I think help with this. They said if you're targeting 99.9% uptime, then there's no reason for you to check your hard drive fullness more than twice a minute, right? Like there's some monitoring systems that'll do it every second or every 15 seconds or whatever you want it to be. I mean, you could force them to be more granular, but the reality is
Starting point is 00:11:37 you don't need that much data. You don't need that many data points. So, you know, look at what your overall objectives are and work back from that. Um, go ahead. I mean, I understood what they were getting at, but it was also such a bit of a mind melt for me, because I was like, okay, yeah, I get that, but then also I don't want to monitor too late. But, you know, that's kind of their point: if it doesn't matter anyways for the objective, then you could afford to take a little bit of a hit there and it not work against you. So rather than alerting too often about something, it would almost be kind of like,
Starting point is 00:12:24 you know, the story of the boy who cried wolf kind of thing, right? Like if the alarm is going to ping you too often because it's too aggressive, and it doesn't really matter to your SLO or SLA that much, then why have that noise? Right. Yeah. I mean, it's going to be hard because, I guess, as probably all three of us are, we like data. And so the more the merrier, but not when you're actually trying to monitor the uptime or availability of a system, because the more data you have, the harder your CPU and everything has to work to analyze and aggregate that data. So fewer data points can actually be better for your monitoring solution. Well, also in this particular chapter, too, I mean, this is trying to focus the time of the human, right? Right.
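The arithmetic behind that 99.9% example is worth seeing once — a back-of-the-envelope sketch, assuming a 30-day month:

```python
# A 99.9% availability target leaves roughly 43 minutes of allowed
# downtime per 30-day month, so probing something like disk fullness
# once or twice a minute is plenty of resolution for that objective.
SECONDS_PER_30_DAY_MONTH = 30 * 24 * 60 * 60

def allowed_downtime_seconds(slo, period_seconds=SECONDS_PER_30_DAY_MONTH):
    # The error budget is just the fraction of the period the SLO permits.
    return (1 - slo) * period_seconds

budget_s = allowed_downtime_seconds(0.999)
print(budget_s / 60)  # ~43.2 minutes of budget for the whole month
```

Working backward from a number like that makes it obvious when per-second probing buys you nothing.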
Starting point is 00:13:16 You know, Allen's favorite term. So we're trying to make sure that when the human gets involved in whatever the problem is, that said human is focused in on the one thing. And so if you're getting alerts about the disk drive being full more often than you need to be, because it doesn't matter in terms of your SLO or whatever, then you're just wasting that person's time. I'm sorry, that human's time. Yes, it worked. It's expensive, too. So we got that coming up in the notes. I want to jump ahead a little bit here, but yeah, those measurements are surprisingly expensive, and it's hard to really figure out ahead of time how much stuff is going to cost you, because the way they price those things is just not very human-friendly. But when you get your first bill, you realize that it's a very real cost. Yeah. Oh, I shouldn't have been monitoring at per-second intervals.
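That per-second-versus-per-minute cost difference is easy to sketch, too — a toy version of the downsampling a time-series database might do (lossy, but sixty times fewer points to store and aggregate):

```python
# 3600 per-second CPU readings for one hour (a flat 1.0 for simplicity).
per_second = [1.0] * 3600

def downsample(samples, window=60):
    # Collapse each window of per-second samples into one average point;
    # assumes len(samples) is a multiple of window.
    return [
        sum(samples[i:i + window]) / window
        for i in range(0, len(samples), window)
    ]

per_minute = downsample(per_second)
print(len(per_second), len(per_minute))  # 3600 60: same hour, 60x fewer points
```

Every dashboard query over the per-second series pays to crunch all 3600 points per metric; the per-minute series carries the same hour in 60.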
Starting point is 00:14:18 Yeah. I mean, you get hit with both the cost of storing the measurement as well as when you're aggregating that stuff, instead of if you're storing per second, instead of per minute, you're now aggregating, you know, 60 data points for a particular metric. And typically on these things,
Starting point is 00:14:36 you'll have more than one metric that's being, you know, aggregated. So yeah, it's interesting. And then they also say a really good thing about these histograms is because you're not keeping the raw measure around and because you're just doing a counter in each one of these buckets, those are way faster to aggregate, which means it's way
Starting point is 00:14:58 less intensive on your monitoring system, right? So it could keep that thing from going down as well. Yeah, a lot of time-series databases are actually designed to scale the data down to some sort of resolution, so they'll actually compress the data. It's lossy, but much more efficient. Yeah, and I mean, the reality is typically you don't need that low-level, crazy amount of detail, right? Yep, but the heart wants it. The head doesn't want it, and if processing were infinitely fast, then it wouldn't matter. You know, I just caught something, though, that we've said. I think we've mixed some things here,
Starting point is 00:15:39 right? Because you were talking about Prometheus earlier, and then irate came up, but now we are mixing Grafana and Prometheus. So I'm sure somebody is, like, screaming at their iPod, because, you know, they're playing this on an iPod. No, irate is a Prometheus thing. Well, I know in Grafana you can choose to do a rate or irate.
Starting point is 00:15:59 That's the kind of thing that I was thinking about. But, um, yeah, I mean, Prometheus just kind of stores the stuff, right? Yeah, I thought it was just the time-series database for it all. No, it's also in the PromQL stuff. So irate is one of the PromQL functions that you can use: rate and irate. Oh yeah. I mean, I guess that's what drives it, right? Yeah, yeah. See, I suck at this stuff. And I only know this because I was dealing with it recently.
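For anyone else fuzzy on rate versus irate: here's a plain-Python sketch of the semantics. This is not actual PromQL — real PromQL also handles counter resets and range extrapolation — but it captures the difference between the two functions:

```python
# (timestamp_seconds, cumulative_request_count) samples for one counter,
# with a burst of traffic in the final 15-second scrape interval.
points = [(0, 0), (15, 30), (30, 60), (45, 90), (60, 600)]

def rate(points):
    # rate(): average per-second increase over the whole range --
    # smooth, which is what you usually want for dashboards and alerts.
    (t_first, v_first), (t_last, v_last) = points[0], points[-1]
    return (v_last - v_first) / (t_last - t_first)

def irate(points):
    # irate(): instant rate from just the last two samples --
    # reacts quickly to spikes, but is much noisier.
    (t_prev, v_prev), (t_last, v_last) = points[-2], points[-1]
    return (v_last - v_prev) / (t_last - t_prev)

print(rate(points))   # 10.0 req/s averaged across the minute
print(irate(points))  # 34.0 req/s from the final burst
```

Same counter, very different answers — which is exactly why mixing them up on a panel can be confusing.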
Starting point is 00:16:26 Um, so yeah, this is the next piece that they have here. And I really like this: keep it as simple as possible, but no simpler. Man, what a hard line to walk.
Starting point is 00:16:36 Like, that stuff is so frustrating, right? How do you know if it's, one, simple enough, or, two, too simple? You won't know until you do it. I kind of want to start with the perspective of: follow the four golden signals, and then for any product or service that you might be tempted to have a dashboard for, just start with those four things. Yeah, totally. And that really is the
Starting point is 00:17:01 answer. Yeah. You'll know it's too complex when you start to show someone else and you're like, okay, wait, don't freak out, this top half here is for whatever, and down here on the side... For sure. Yeah, it's really hard to keep it simple. Well, yeah, and especially in Grafana, where you can have so many panels and it's just super easy to be like, you know what, I'm going to add another one. But this one doesn't matter as much, so to not inundate the reader, I'm going to collapse it into a tab. But then every time I go to it, I'm going to expand that tab, right? And then, going back to the data crunching that Allen was talking about earlier, and how compute-intensive that can become, you know,
Starting point is 00:17:45 with Grafana, you get too many panels with too large a time range, and now you can start to crush Prometheus in the background as it's trying to respond to all the queries for the different panels that are on there. So, yeah,
Starting point is 00:18:01 I don't know. But I mean, it also begs the question, because I said to create a dashboard for each of the products that you might want to. So immediately you could interpret that as, oh, well, let's see, I've got Kafka, so I'm going to have a dashboard for the four signals for Kafka, and I've got Postgres, maybe, so I'm going to have the four golden signals to monitor Postgres, blah,
Starting point is 00:18:31 blah, blah, blah. But you know, that's probably not the kind of product that they're talking about here. You know, they would be talking about like, well,
Starting point is 00:18:40 what's the overall service doing? Like, what's the product that I'm delivering, that I'm making — not the products that I'm using to deliver it, but the product that I've made — and now monitor the four golden signals for my product. So that's really interesting. I mean, if you think about what you just said, and it's true, right, like if you've got Kafka and 12 other technologies, you want to monitor and measure all those things, right? It's so tempting. It is.
Starting point is 00:19:10 But the reality is all you have to care about is your SLO, right? Like, what are you trying to deliver? Is it, you know, I need to make sure that all my requests get back within a second? Then that's what you need to be tracking. And then, based off what we're about to talk about here in a second, I think you work your way backward from that over time, right? Like if something triggers, hey, things went up above a second, why? Now you go find out what it was that caused it. And maybe without those other dashboards, you won't know, right? And that's where it hurts, right? Like if you don't have a dashboard for Kafka over here showing its latencies, and you don't have one for your RAM and your CPU, maybe you wouldn't have seen that, well,
Starting point is 00:19:56 the CPU spiked to a hundred percent here and it went over five seconds, right? Like I, I don't know when you start introducing those things, but I think based off a previous conversation, you're already going to have those system metrics in place, right? But those aren't going to be what you pay attention to. It's going to be your service metrics that you're going to pay attention to, and you'll dig into the other ones when you need to. Yeah, I mean, that's kind of what I'm thinking, what I'm envisioning, because like, you know, take Datadog, for example, right? Like, you know, their whole pitch is like the single pane of glass type of experience, right? To monitor your thing. And so what I'm advocating for is like, well, that, that single pane that you go to
Starting point is 00:20:35 should you, you know, based on what we're reading here, I'm thinking that like, okay, you want that to be like the thing that you are building and that you are providing to the world, right? Whatever that is, you, that's not to say that you couldn't have other ones for deeper dives like you were getting at, you know,
Starting point is 00:20:53 for when you do need those, but that's not your go-to dashboard. That's not what you're watching. That's not, yeah, that's not the dashboard that you had the team like focus on when they're the on-call person, right? Because they talked about having it where you would rotate.
Starting point is 00:21:12 It was like a quarter of the time or something like that. Every two weeks, yeah. Yeah. Or every six weeks. Depending on how large the team was, right? Yeah. you know, I'm thinking that like you, you'd want to focus in on, you know, the overall dashboard for your product. So I don't know, you know, let's pretend you, you wrote a new email service, right? You know, yeah. In the background, there might be a database behind it and, you know,
Starting point is 00:21:39 you might want to care about the CPU and all that of the computers, but that wouldn't be your first dashboard that you would go to. Yeah. And thinking in terms of a Grafana type world, I think what you would have is you'd have those dashboards linked, right? So for instance, if you're looking at your service, your email service, right? And all of a sudden its latencies just jump real high. You highlight that timeframe and say, click through, and then it takes you to another dashboard and filters it to that time range that you selected on the previous one, that type of thing, right? And then that way you can start looking at all the system metrics that were in there. I think they were called the white box metrics, right? The ones that your CPU, your RAM, all that kind of stuff. then then you can see well
Starting point is 00:22:27 what might have gone wrong in this time frame. So information architecture is really hard. Like, if you have a strict hierarchical view, it's well organized, but a lot of times you really want to see things that are kind of at cross purposes. It's almost like you want, like, Minority Report: show me the database hard drives and also show me the topics. Okay, we can see there's a correlation here; let me drill into these things. You want to see that stuff at the same time. So if you have a strict view, starting at a product and kind of drilling out, it makes sense for some users and not for others. Like, you want a business-facing dashboard for your, you know, CEO or whatever to look at; you want a financial thing for your
Starting point is 00:23:06 CFO to look at. Your SRE is going to want to see stuff totally different. And the first kind of dashboard they go to doesn't necessarily fit in the same hierarchy. It could be multiple kinds of hierarchies. And it's hard. It's the same things that are hard about information architecture, the same things that are hard about coding: organization, getting the right level of abstraction correct. That's all stuff that's hard. It's good news, though, because this is a living system that you're going to
Starting point is 00:23:31 be keeping track of and working on alongside your real system. Stuff can evolve as you do. But I think that means, though, unless I misunderstood you, let's say the three of us have some company. Let's go with the email example that I gave. So we, we start up some new email service, right? And, and we have a dashboard that is showing like the overall health of, you know, using the four golden signals of the, of the email system. Right. And when I'm on duty, that's the first dashboard I go to and look at to make sure that like, there's no problems. And when Alan's on duty, that's the first thing that he goes to look at it to verify there's no
Starting point is 00:24:08 problems. But when you're on duty, you're like, well, that's too high level for me. I'm going to go look to see like what the database is doing and I'm going to track that. And so then you could on,
Starting point is 00:24:18 when you're on call for that, that third of the time, you could be missing overall problems, cause you're like, yeah, database looks fine.
Starting point is 00:24:33 But there might be problems elsewhere, right? So wouldn't that be wrong, what they're saying? So I was talking more about different people, different roles. But if we're all three SREs on the same team, then ideally we would be using the same dashboard and the same view, because otherwise, just like you said, that's a problem: you and I are looking at totally different things to achieve the same kind of job. Um, but what I'm talking about is more like an incident-response kind of thing, where you're trying to figure out why something's going wrong
Starting point is 00:24:55 and you're kind of trying to drill in. And then this hierarchical view, which is great for organization around those four kind of tent poles, it's just not so great to have that stuff separate. So you start having different tabs, or maybe you create ad hoc dashboards to grab this from there and this from there, and you're trying to correlate stuff. I'm just saying it's basically cross-cutting concerns that sometimes you want to see together and other times you really don't. And if you just have one big pane of glass with a lot of stuff, then it's hard to really have that information feel useful to someone who's not as familiar with it and doesn't live in there. Yeah. I think, for the SREs,
Starting point is 00:25:34 if we're focusing on that, I think that, that having those four golden signals that you have to watch is the key, right? Just, I get what you're saying about other people within the business will want different views of that. But I think for you maintaining a service that needs to have a certain SLO, then you need to have the measures on screen that matter to you, right? Like you don't even care if the database is getting pegged as long as your requests and your responses are coming back in time. Yeah, what I'm saying is, like, oh, sorry. No, no, you're good. I was going to say, like, you know, it makes sense to have one dashboard
Starting point is 00:26:15 that's kind of like, is my overall system healthy? And then you kind of drill from there. But do you have one dashboard with all of the CPUs from all of your services together? No, you can't. No, that doesn't make sense. It makes more sense to bring them up by service. So like, database is over here, and it's got its CPU,
Starting point is 00:26:29 and it's got its latency, it's got its saturation. And you go over here, and, you know, your, I don't know, your Elasticsearch or something like that, your web app. But sometimes you want to see that stuff together, and you want to say, like, hey, well,
Starting point is 00:26:39 the latency is bad on my web app and the CPU is high on the database; maybe there's a correlation there. And that doesn't work so well when they're on separate panes. That means a human has to know that there might be some sort of correlation between these two systems, and so you kind of have to pull them both up. And wouldn't it be nice if you had a single pane of glass that really focused on, you know, maybe just user experience, or something that was more purpose-built for the things that are more common? So I don't know, I'm just kind of thinking out loud about how hard it is to organize stuff. Because if you
Starting point is 00:27:15 just break it down on simple lines, you know, simple is good. You should absolutely start there. But you might find yourself wanting to kind of expand and maybe make more targeted views for like drilling into kind of common problems that you have based on your systems behavior. Well, I mean the, but the whole point here, right,
Starting point is 00:27:28 was as simple as possible. No simpler. So, you know, I kind of had this thought that like, you know, I think we've talked about like the, the screenshot plugin before that,
Starting point is 00:27:39 um, for Chrome where you can like take a, uh, picture of like the whole page, even if it has to scroll, it'll scroll the page for you and take the whole thing. So if your dashboard, if you needed to like share in like Slack or whatever your messaging platform
Starting point is 00:27:55 is, you know, a screenshot of it, and you had to use that plugin so that it can scroll the whole page, then maybe your dashboard has too much information. Yeah, totally. So imagine, like I said, with a database and a web app, you're talking about one, two, three, four systems. That's okay. But when you get
Starting point is 00:28:15 into, like, oh, we have an ingestion pipeline with 11 different nodes that do sort of processing and talk to different data stores or something, then suddenly it's like, well, I want to know where things stop in the pipeline, and that, you know, starts getting rough when you're talking about having 11-plus tabs open, trying to figure out stuff. So this is where I think we need to jump into what they say, because I think at the end of it, we're going to tie into what you just mentioned when you're actually trying to troubleshoot something, right? Yeah. Um, so when they say as simple as possible, no simpler, they say it's real easy for monitoring to become super complex.
Starting point is 00:28:51 You're because you're alerting on just all kinds of thresholds and measurements. You might even have code in there to detect possible causes. Then you've got dashboards that like what we were talking about, multiple dashboards up. So the monitoring can become so complex. It's difficult to change, maintain, and it becomes fragile. That sounds familiar. Remember, um, clean architecture. That's what they said about the code, right? Like when the code is too coupled, that's exactly what happens, right? Changes take way longer, all this stuff, it ties in directly
Starting point is 00:29:25 to that. Um, so what they said is there are a few guidelines that you can follow in order to sort of keep these things simple. Um, first you need rules that find incidents to be simple. They need to be predictable and reliable, um, data collection, the aggregation and alerting that is frequently used, and they said infrequently used, like less than once a quarter or something. You need to potentially think about cutting that out of your system. If it's never triggering, never hitting, then you don't need it. It's just noise. Data that's collected but not used on any dashboards or alerting, get rid of that too. No reason to be collecting that data. Um, and then they said, and this is what goes into kind
Starting point is 00:30:12 of Joe, what you were talking about. And I think even, even outlaw like this, this whole thing where you want to avoid pairing simple monitoring with other things like crash detection or logging or, or any number of other things, because then it gets very complicated, right? When you start chaining those things together to help you find the root causes of things, that's when your systems become super complex.
Starting point is 00:30:39 And basically what they said at the end of all that was try and keep those systems decoupled. It seems almost counterintuitive, but if you do keep those decoupled, you can change those and enhance those easier over time. I think it's actually further down where they talk about – yeah, we'll get into it in a minute. I won't bring it up. But so, so the problem is, is you want to keep your systems from being too complex because you won't be able to effectively change them and enhance them over time. Yeah. So, so really it was advocating for against what, uh, Joe was talking about then.
Starting point is 00:31:18 Like you wouldn't have that ability to do the deep dives for the SRE. Yep. Yeah. You'd, you'd keep it separate is kind of my take on it so uh my my read of it so you keep it separate so you can kind of break that stuff apart and if you need to kind of bring those things back together in order to follow some sort of trail uh then there's better tools for that like you know maybe you're looking at uh well distributed tracing is really the answer instead of like logs you know kind of so it's
Starting point is 00:31:44 a higher level but um you know maybe you have stuff in your playbook answer instead of like logs, you know, kind of, so it's a higher level, but, um, you know, maybe you have stuff in your playbook for kind of like how to track that stuff down or like, Hey, check this out. You know,
Starting point is 00:31:51 if this looks bad, go to step two or whatever, you know, so kind of help you navigate that and keep that stuff out of the monitoring, keep your monitoring dashboards and stuff like that. Just unimpeded stick to the four basics, keep it as simple as possible and no simpler. yeah which again is hard um all right so they have a section on
Starting point is 00:32:11 tying these principles together so google actually has a monitoring philosophy for their sres and they said that it's actually hard to attain but it sets a good foundation for these goals so they have some questions that you should ask before you set up alerts because they said, do you want to avoid this pager duty burnout, which is really easy to hit pretty quickly, right? So one, does the rule detect something that is urgent, actionable, and is actually visibly noticeable by a user. I think that's fantastic. Will I ever be able to ignore this alert?
Starting point is 00:32:55 And how can I avoid ignoring this alert? That's pretty interesting. Does this alert definitely indicate negatively impacted users? And are there cases that should be filtered out due to any number of circumstances like one of the ones that they gave was let's say that you were doing an upgrade on an app and so it was draining users off one section um if those users being drained off one then you should filter those out right like that shouldn't be a part of your metrics which is interesting i mean that's that's a whole nother layer of complexity, right? Like is, is making sure you're filtering data on, on nodes that are still
Starting point is 00:33:29 dynamically. Yeah, man. Like that's, that's fun. This kind of goes back into their previous, uh, the previous episode too, where they were talking about like the rules. And this was like one of those few cases where there might be rules. I think in the example that they gave at that part of the chapter, they were talking about like draining from a data center. Yeah. It shouldn't trigger an alert, right?
Starting point is 00:33:51 Like, yeah, man, that's a, yeah, that's fun. And then the last one that they have with these four was, can I take action on alert?
Starting point is 00:34:01 Does it need to be done now? And can it be automated? Will the action that I take be a short term or a longterm fix? Oh, that wasn't the last one. There were other questions. That's right. And are other people getting paged about the same incident?
Starting point is 00:34:17 So basically, am I accidentally repeating an alert that somebody else already has set up? I need to make sure I'm not. Yeah. Cause the last thing you want to do is waste two people's time. Right. Um, so those questions help you with the whole notion of like what you need to be thinking about when you're setting up a page because those things interrupt people. Well, a page in this case, like not a webpage, but like an alert. Yeah, a pager alert. Yeah, pager alert.
Starting point is 00:34:47 And they even call out pages are extremely fatiguing. People can only handle a few a day. And so the ones that you get hit with need to be real and they need to be urgent. You don't want garbage funneling through, which I think all three of us can attest to, right? Like we've all been hit with things that you're like, oh, the same comment's been applied to this thing 20 times in the past, you know, 12 days. Like, why am I getting this? Well, it quickly becomes that old, that old joke. Like, let's say, let's say that you, you don't take this advice, right?
Starting point is 00:35:22 And so you're paging too often and unnecessarily and whatnot. It really, it quickly becomes that old joke about like, um, uh, something important on your, uh, part doesn't constitute an emergency on mine. You know what I'm talking about? I'm messing up the exact quote, but you know, it ends up kind of falling into that kind of category, right? Where like, you're just like, whatever it's,
Starting point is 00:35:47 you think it's an emergency, but I don't think that it is. Well, when 90% of what you get is not urgent, right? Then it's real easy for you to just filter out that next 10%. Right. Like,
Starting point is 00:35:59 Oh, it's the same thing. You know, you start trying to take your alerts and figure out which of these ones you don't care about. It is, you know, like you come out from that perspective. Yeah. It's, it's not good. So, so to solve that, they say every page should be actionable, right? Like if you get an, if you get a page, you should be able to do something about it. If a page does not require a person's action or thought, then it shouldn't be a page.
Starting point is 00:36:26 Basically what they're saying is if this thing could have been automated, then it shouldn't be interrupting somebody else's time, right? Like unless you have to think about it, it should be done somewhere else. They say that they should be about novel events. I don't really like that term, but you know, something big I guess is really what they're saying. No new novel as in new novelism. You shouldn't, you shouldn't be paging it a second time for the same thing.
Starting point is 00:36:55 Okay. I got you. Cause cause they'd covered that earlier in the book too. Um, was that, you know, whatever you're going to alert on, it should be something new that, you new that you shouldn't be paging. An example might be like, oh God, let's say that you know that there's a physically bad cable plugged into the server and so it's dropping packets and so the latency is high and whatnot
Starting point is 00:37:23 and you're just ignoring those alerts right but yet it's still paging you every five minutes about it right right that would be an example of something like just take care of the issue or don't have the alert at all okay yeah that's fair um they also called out here it's not important whether this came from white box or the black box monitor and if you go back to our previous episode, the white box being metrics that the system gives you easily, right? CPU counts, RAM counts, that kind of stuff versus the black box stuff where, you know,
Starting point is 00:37:56 these aren't directly able to be monitored, but coming from somewhere else, they don't care where they come from. They need to be important. And so whichever they come from, it's fine. Now, this is the part that was interesting. It goes directly against what we were saying about trying to root cause, find some of this stuff is they said in this, actually, I had to read this like four or five times and make sure I read it properly. It's more important to spend effort on catching the symptoms over the causes. And I guess their thing there is trying to find the root cause is typically more difficult,
Starting point is 00:38:35 right? Like chaining together events that happened from a slow UI response to a database or an elastic search query or whatever, any number of technologies in between, I think that's why they said it, right? Is these measurements give you the symptoms, right? These measurements are the response is slow. All right, now you go figure it out, right? Like we have the dashboard for this this you go chain all the other stuff together and this is going to require some human interaction because you have the smarts to know how it all works yeah i mean this goes back to like last episode well no i was going to say like when we were talking about the uh keep it simple no simpler kind of thing about you know the dashboard being the overall product and not necessarily like um you know what's the disk space on my postgres server look like oh my postgres
Starting point is 00:39:33 server is having you know having problems because the disk is full like you know that's that's the cause of what the thing is but what you overall want to know is like well how how well is the overall system performing and that's when you want to alert thing is. But what you overall want to know is like, well, how, how well is the overall system performing? And that's when you want to alert on something. And then you go dive in to figure out like, why is it slow? Oh, Postgres database,
Starting point is 00:39:54 uh, or drive is full. Right. And we talked about this a little bit last episode where we said, uh, if the system tries to be too smart, it can often kind of bias people and what they look for and whatever. And so,
Starting point is 00:40:09 you know, if your system says like, Hey, the database is down instead of saying latency is up then the person might go and you know check the database and it's not down it's fine that alert stinks bye and not realize that the you know the there were facts behind that that made us say that and you know just kind of put you off on the wrong foot which can be kind of slow things down and just be inaccurate. Yep. Yeah. So the interesting takeaway here is less is more, right? Like monitor the product that you're providing and,
Starting point is 00:40:34 and the symptoms of it, right? The, the latency, the errors, the four pillars that they mentioned and everything else should almost be an investigation from there. I mean, that's really been my takeaway from some of this.
Starting point is 00:40:49 Yeah. This episode is sponsored by Retool. Building internal tools from scratch is slow. It takes a lot of engineering time and resources, so most companies just resign to prioritizing a select few or settling for inefficient hacks and workarounds for every other internal business process. So Retool helps developers build internal tools faster so they can focus on development time on the core product. Retool offers a complete UI component library,
Starting point is 00:41:18 so building forms, tables, and workflows is as easy as drag and drop. And hey, more importantly, Retool connects to basically any data source, database or API, offers app environments, permissions and SSO out of the box and offers an escape hatch to use custom JavaScript when you need it. With Retool, you can build user dashboards, database GUIs, CRUD apps
Starting point is 00:41:42 and other software to speed up and simplify your work without Googling for component libraries, debugging dependencies, or rewriting boilerplate code. Thousands of teams at companies like Amazon, DoorDash, Peloton, and Brex collaborate around custom-built Retool apps to solve internal workflows. To learn more, visit retool.com. That's R-E-T-O-O-L dot com. I think Joe Zach, it's his turn to ask for some sort of weird review. Wait, why is there going to be a weird review? It doesn't have to be weird. I'm just saying, if you have got a terrible review just been
Starting point is 00:42:24 sitting on, you've been hanging on you got in your pocket see this is why we don't ask him to do this for the right moment now is your chance oh you gotta drop it like it's hot on us i think that's what that expression is all about right leaving reviews uh well we try to make it easy for you you're gonna couldn't watch that slash review uh we'll have links up there i'll even maybe i could put some verbiage on there some kind of sample reviews some sample bad reviews in case you need some kind of some things to jog your memory or kind of you know get the ball started you know we don't want that uh blank page to so yeah if you've got terrible review we have one thing for you yeah but if now if you have a great review
Starting point is 00:43:08 a good review you know somewhere in there uh you know we'll we'll gladly take those too but uh either way just make sure you smash the five stars that's all it really matters in and lay it on us that's right oh man well that'll be the last time we ever asked Joe to do that. All right. Well, OK, we need a little bit of a separation here before we get into it. So how about if I ask you this? Because with what Joe just asked for, I'm sure we're going to get some bad reviews now. They're going to make us cry.
Starting point is 00:43:50 So how do you make Lady Gaga cry? I can't think of any of her songs. Poker Face. Poker Face, that's it? Yeah! There you go. All right.
Starting point is 00:44:09 Well, with that, we head into my favorite portion of the show, Survey Says. All right. So a few episodes back, we asked, how did I word this? What's most important to you when you're looking for another job? See, I worded it right. I had it right the first time. All right. Your choices were, it's all about that promotion. I need the title. Or work-life balance is what matters. I need to be able to enjoy my life and my work. Or dollar, dollar bill, y'all. More money, more problems, and I'll do anything to have more problems
Starting point is 00:44:45 or i need some flexibility in my schedule life gets hectic or whatever it takes to get away from this company or lastly whatever it takes to get into that company all right so 186. According to Tucko's trademark rules of engagement. Joe, you are first. Balance. 30%. We're good on balance. That's pretty good. Man, I hate it that he picked the same one
Starting point is 00:45:17 I picked. So I'm going to have to change mine. I'm going to have to change mine. I'm going to go dollar dollar bill y'all and 30% also okay uh
Starting point is 00:45:34 mathemachicken comes in with work life balance at 30% Alan dollar dollar bill y'all at 30% survey says you're both wrong whoa okay really it's flexibility whatever it takes to get into that company wow all right cool 79 of the vote oh wow that's awesome yeah all right cool i like that that's uh that's's a, that's a re reassuring. That's a positive way to go about this. Yeah.
Starting point is 00:46:08 Good job. Whoa. Second. I mean, there's only 21% left. It couldn't have mattered really. Um, yeah, it was, uh, work-life balance was number two. Okay. Yeah.
Starting point is 00:46:20 Wow. Okay. That's, that's kind of exciting. I like that. So we, we'll, we get on tap for this one all right so for this episode survey we ask did you intern or co-op while you were in school and your choices are of course i did no way school alone would have prepared me for the real world or who has the time i was focused on studying and getting my degree as quickly as I could.
Starting point is 00:46:49 Or, well, my school was the school of hard knocks, so it wasn't exactly called an internship, although in the beginning I was paid like it was. This episode is sponsored by Shortcut. Have you ever really been happy with your project management tools? Most are either too simple for a growing engineering team to manage everything or too complex for anyone to want to use them without constant prodding. Shortcut is different though, because it's better.
Starting point is 00:47:17 Shortcut is project management built specifically for software teams and they're fast, intuitive, flexible, and powerful. Let's look at some of their highlights. Team-based workflows. Individual teams can use shortcuts to fault workflows or customize them to match the way they work. Org-wide goals and roadmaps. The work in these workflows is automatically tied into larger company goals.
Starting point is 00:47:41 It takes one click to move from a roadmap to a team's work to an individual's updates and vice versa. Tight version control integration, whether you use GitHub, GitLab, Bitbucket, Shortcut ties directly to them so you can update progress from the command line. And a keyboard-friendly interface. The rest of Shortcut is just as keyboard-friendly with their power bar, allowing you to virtually do anything without touching your mouse. Iterations planning. Set weekly priorities and then let Shortcut run the schedule for you with accompanying burndown charts and other reporting. Give it a try at shortcut.com slash coding blocks. Again, that's shortcut.com slash coding blocks.
Starting point is 00:48:22 Shortcut, because you shouldn't have to project manage your project management. So, here we go into the final stretch of monitoring distributed systems. It's the final countdown. Of chapter six. Okay. Which is, you know, mostly the way through part one of this book. Wow. I'm sure I sounded exactly like it too.
Starting point is 00:48:47 Yeah, totally. That's good. Yeah, so this section is basically talking about monitoring for the long term. Like we said at the top of the show, monitoring systems are tracking ever-changing software systems, and so your monitoring systems also need some love to grow. They need to be maintained, and decisions that you make for it need to be made with to grow you need to be maintained and decisions that you uh make for it
Starting point is 00:49:05 need to be made with a long term in mind but sometimes you need to to do a couple things in order to you know get you through the day-to-day and get you through urgent situations because sometimes short-term fixes are important to get past the acute problems and buy you time for a real fix an example might be here if you've got um you that it's got a memory leak and every 24 hours or so, the service is going to get killed by Kubernetes or something is restarted or crashes or whatever. Then maybe that's something you figure out, you write a ticket for, it's going to take a couple of days to fix. And so you just restart that box every 12 hours until you get that fix in for the next couple days so you know that's not that's not a monitoring fix there but that's the kind of short-term versus long-term thinking i wanted to kind of give an example of and i gave
Starting point is 00:49:56 two uh kind of case studies uh which i hate uh about the trade-offs in this case they are pretty good uh and it's interesting. Both of these case studies were just interesting stories that exemplified the trade-offs that you're going to be faced with when you're making monitoring systems. The first one was about Bigtable.
Starting point is 00:50:20 What? Bigtable? The T is not capitalized. It drives me crazy. It always looks so weird to me. Bigtable? The T is not capitalized. Drives me crazy. It always looks so weird to me. Bigtable. Bigtable for those actually wondering what it says. For people who pronounce things correctly, it's Bigtable. Is it? I don't know. It's not how it's spelled.
Starting point is 00:50:41 The gist is that originally Bigtable's SLO was based on artificial kind of good clients mean performance and so they basically kind of mocked something and said this is what we want it to look like and they had some low level problems in storage that happened in very rare cases that basically you know the worst 5% of their requests were significantly slower than the rest
Starting point is 00:51:00 and what I'm kind of imagining here is like it's some sort of cash miss situation or maybe something, you know, a request exceeds some sort of threshold that's normally hit. And so it takes longer to kind of process these requests. And you see like a cliff in the graph that didn't match
Starting point is 00:51:16 their kind of artificial normal distribution that they came up with originally. These slow requests with trip alerts. But ultimately the problem was kind of transient because, you know, once that request is done, you wouldn't see it again. It wasn't repeated. It was something that just kind of happened,
Starting point is 00:51:30 you know, 5% of the time. And when someone would get the alert, they would go check on it, and there was ultimately nothing they could do about it. There wasn't like some switch they could flip to make that work. It was a systemic problem. So, you know, imagine what happened.
Starting point is 00:51:44 Like, people would get the alerts they kind of learned to recognize those alerts and they would ignore them sometimes they would get an alert that they would ignore thinking it was this turned out something real something else was going on so it's just a problem you can't have alerts uh that don't mean anything and that aren't actionable uh so what do you do about it so in this case google dialed back the service level objective to 75 percent uh 75th percentile didn't i don't think they said what it was before maybe 95th but basically it meant less alerts and they disabled email alerts and they did this until they were able to go in and actually fix the root cause problems so this is kind of a funny case where like you're actually
Starting point is 00:52:25 changing the objective based on uh the amount of alerts that you're getting not on what the business wants and business needs and so that's a a no-no it's a big no-no but they decided to do it because it was a better solution than what they had going on and uh you know as long as your team is disciplined enough to actually go and do that fix when the fire alarm isn't ringing and that's a good thing. Yeah. It allowed them to at least focus on trying to solve the problem rather than, Oh,
Starting point is 00:52:55 there's a new alert. We let's go spend some time to see if it's the same problem as the last alert. Yeah, exactly. And like we said, you know, those pages are expensive.
Starting point is 00:53:04 So it's taking these people away from the work they should be doing to fix it to go check and make sure that there isn't something going on. So that was a good case where they decided to do something short-term to kind of give them a little bit of breathing room and then ultimately did the right thing long-term.
Starting point is 00:53:19 I forgot to, this is kind of off-topic, a little bit of a tangent here. Whoops, tangent alert. Hold on, tangent alert. Hold on. Um, but since Joe is so consistent with his big table pronunciation, I did. I, I, there's a little Easter egg in the last episode specifically for you,
Starting point is 00:53:39 Mr. Underwood. Uh, anyone care to take a guess at what it might be was it in the show notes i don't know it's about costco nope should be nope that's not the one i have no idea man i i made sure to replace all words person with human. Did you really, man? That's such an evil thing to do. There's no mentions of person
Starting point is 00:54:12 on the page. That is awful. You're a terrible human, sir. I forgot to mention that earlier, because remember last episode I said that I would do that as a joke. I did. Oh, God. Remember last episode, I said that I would do that as a joke. I did. In case any monkeys get in there and start banging away on the show notes.
Starting point is 00:54:34 Well, they won't get the alert. Only a human. Only a human, yes. So bad. I hate words like that. Well, as I suspected, someone has been adding soil to my garden. And the plot thickens.
Starting point is 00:54:55 That's good. Joe's just stuck over there. I don't know. I think... Oh, we lost him. Yeah, I think't know. Do we, I think, uh, did, Oh, we lost him. Yeah, I think we did. The joke was so funny.
Starting point is 00:55:10 It knocked him offline. Yeah. He said, zoom crashed. Well, he can rejoin. Yeah. The show must go on.
Starting point is 00:55:19 So, uh, another story that they had in here was about Gmail. So, uh, Gmail was originally built on a distributed process management system called WorkQ, which was adapted to long-lived processes. And tasks would get descheduled, causing alerts, but the tasks only affected a very small number of users. The root cause bugs were difficult to fix because ultimately the underlying system was a poor fit, right? And I mean, I know we've all been there, or you picked the wrong technology to start something on, but you don't know that at the beginning. You're like, this will do good enough. And then once you get into it, you're like, oh,
Starting point is 00:56:03 now you see all the problems with it, right? Which is what happened here. So engineers could, quote, fix the scheduler by manually interacting with it. Like imagine if you were to restart a server every 24 hours or something like that, right? Should the team automate the manual fix or would this just stall out what the real fix, what should be the real fix, right? So there were,
Starting point is 00:56:30 there were two red flags here. Why have, what are we right here? Why have root, root toil? Oh, wrote tasks for engineers to perform, which is toil.
Starting point is 00:56:46 Why doesn't the team trust itself to fix the root cause just because an alarm isn't blurring? Blurring? Blurring. What we said? Oh, yeah, we wrote blurring, but I guess we meant blaring. Hey, Joe, did he guess we meant blaring. Yeah, I don't know. Hey, Joe. Did he? He locked up again. Yep. He came back.
Starting point is 00:57:10 Oh, there he is. Is it the alarm blaring or blurring? We'll never know. I think he just locked up again. I think so. This is hilarious. It's actually funny. We should take a screenshot of that. I'm going to assume an alarm blaring because alarms blare.
Starting point is 00:57:31 Yes. All right. So, yeah. Okay. So what's the takeaway? Do not think about alerts in isolation. You must consider them in the context of the entire system and make decisions that are good for the them in the context of the entire system and make decisions that are good for the long-term health of the entire system. So in this Gmail case, rather than trying to like automate a manual fix for it, just invest the time into going after the long-term fix, which is to, you know, take the honest approach that like, hey, maybe we started out on the wrong platform. We picked the wrong thing to solve this problem
Starting point is 00:58:09 and we need to re-architect. And that's a tough pill to swallow when you hit that. But when you do, you do. So we don't have it in the notes here, but I actually like, I'm just going to read the first couple of sentences of their conclusion because I think it's pretty good. So, quote, a healthy monitoring and alerting pipeline is simple and easy to reason about.
Starting point is 00:58:34 It focuses primarily on symptoms for paging, reserving cause-oriented heuristics to serve as aids to debugging problems. So, and they go on to say that the reason is monitoring symptoms is easier as you go up your stack. And so that goes back to what we said earlier, right? Like you're not trying to root cause things in here. You're literally looking at the simplest measures that you can to let you know if your service is running in the way that it should, and then leave the investigation for another path. Yep. So, uh, we did it. Hey, we did it. We got to the end of the book. Um, Oh, wait. Oh man. This is only chapter six. Yeah. How many, there were like 30 something.
Starting point is 00:59:24 There's so many how many are there they're literally are we on really chapter six oh we really are on chapter six we're chapter six and there are 34 chapters and a through f appendices so at this rate Rate, we'll finish, carry the one. Divide by pi. Yeah, I think 2093 will be done. Somewhere around there. That's probably not close. Wait, did you account for leap year? Probably not. Which calendar are we using?
Starting point is 00:59:58 The Gregorian? The divide by zero. Yeah, there we go. That always works out well. Joe's back. I think that's what happened. I think he divided by zero. His computer went down.
Starting point is 01:00:07 Yeah. Yeah. So we'll have a bunch of links to the resources we like for this episode. And you know what? If I had to sum up coding blocks, right, I would just say two guys walked into a bar, a third one ducked. So with that, I always love how like I could see the reaction where it like takes a second.
Starting point is 01:00:35 So with that, we head into Alan's favorite portion of the show. It's the tip of the week. Joe, did you ever get that one? Cause I never saw a reaction on your face. Get which one? What?
Starting point is 01:00:47 Two guys walked into a bar. The other one ducked. Oh, yeah. Yes. Yeah. I didn't hear that at all. I think I'm still having weird issues, but I appreciate it. Yep.
Starting point is 01:00:59 I'd like to think that I'm the one that will duck. No. No. You're definitely hitting the bar i think you're hitting the bar right now all right lovely so i uh i stole this one or i borrowed this one from murley so i appreciate it this is actually a really cool one so if you still do land parties which i haven't that in years, I don't even know how big a rig I would have to carry around to do a LAN party nowadays. But there's this really cool thing called lancash.net. If you go there, they basically have like a Docker setup to where instead of everybody having to download a game and getting hit with that,
Starting point is 01:01:46 you can download it once and share it with everybody in your LAN party. So this has instructions on how to do that. If you have data caps and stuff, this obviously would help. Like it's a really cool way of going about doing that. So thank you Merle for that one. And then I have to share this because I've shared this with
Starting point is 01:02:05 you two guys before and there are times that you just want to watch really cool stuff you don't really want to learn anything i guess like like this podcast we we teach you one or two things per show but you know there's times that you just really want to sit back and relax. Yeah, this one, I think sometimes. So I told OutlawJZ about this YouTube channel that I absolutely love. It's called Project Farm. And why I love this channel is this dude, he takes requests from people. So by all means, if you have some sort of tooling or some sort of home project type thing, then you're like, man, I wonder which is better. Which pair of pliers is better? These or these?
Starting point is 01:02:54 If you have any questions like that, submit it to this dude, because he goes scientific on all this stuff. And one of the ones I shared with these guys was the drywall anchors. Like, if you ever wanted to hang a picture up on your wall, you go into Home Depot or Lowe's or, or choose your store, Walmart. And you're looking at the 50 different packs of drywall anchors. And you're like, well,
Starting point is 01:03:17 why is that one 12 bucks and that one five? They both say they hold 50 pounds. Like, which one should I get? This guy... I always go for the one that can, like, hold a Toyota on the wall. That's the one I'm gonna... I don't care. It doesn't matter how small
Starting point is 01:03:33 the thing is that I'm gonna, like, you know, put on the wall. If it can hold it, it'll work. If it's gonna be a pound, I want the 75-pounder that's gonna hold it, right? But, but in all seriousness, I have a link to his main YouTube channel, but I also have a link to the drywall anchors one, just because this is the level of detail this guy goes in on everything. And it is so enjoyable to watch, like, him setting up the rigs on how he's going to do it, the measuring tools that he uses to figure out, you know, how many pounds of weight before it broke, and all. Like, it's just awesome. So at any rate, if you want to go back and waste a few hours of your life and be entertained, go check out his YouTube channel.
Starting point is 01:04:16 All right. And for me, uh, so I've got a nice little tip here for when you're trying to test in production. Have you ever been working on a Python system and you want to try and make some changes on that system and, you know, kind of see how they work? But the problem is that, depending on, you know, what you're doing, your situation, the kind of app you're working on, the way a lot of Python apps work, like Django, for example, is it loads up the Python
Starting point is 01:04:44 basically on startup. It's got all that stuff kind of, you know, sitting around in memory, and then it goes and executes whatever it does. What if you need to make a change on that system and you don't want to restart it? Well, uh, you can in Python actually dynamically reload modules. So I ran into this in a case where I had a system that wasn't in production, but it was in an environment. And I couldn't restart it because it was in a pod. You restart it, then it spins the pod back up with the old code. And so I was losing my changes.
Starting point is 01:05:12 And I wanted to run just some unit tests. And so what I ended up doing is just changing the code in the pod. And then in my unit test, I actually found some code that I'll link here that you can import depending on your version of Python. There's a couple different ways to do it. But the idea is that you use this library and you tell it to reload your module. So initially when I was running my tests, they kept failing because the code was not right. I went and updated the files on disk and ran it and the code was still not right because the module was already loaded in memory.
Starting point is 01:05:45 Then I found this little block of code here, and it's really simple: basically just import some sort of library. Or if you're on Python 2, you don't even have to do that; you just use a function, the built-in reload. But for higher versions of Python 2.x and on into Python 3, there's a
Starting point is 01:06:02 library that you just import. It's built into the language, and you can tell it to reload that module, which gets your changes. So I just did that at the top of my unit test file, used the main function, and there we go. So it saved me a lot of time, and, you know, ultimately that's not a great way that you want to work,
Starting point is 01:06:20 but sometimes you got to do what you got to do. And I just thought it was really cool. You know, it was kind of almost like having a script file that would say, like, you know, recompile my Java or something in this namespace. That's kind of the equivalent I was thinking there, which is a pretty cool power to have. Not something that you'd want to have on a production box; you know, you don't want to have your compilers installed in production, probably. But it was just cool to be able to do that, and so I thought it was neat that Python gave me the tools to kind of update that stuff in memory.
Starting point is 01:06:46 That's pretty awesome. I mean, how do you run a Python app in production without having the quote compiler on the box? Can't. Well, yeah, yeah,
Starting point is 01:06:58 yeah, no. Yeah. All right. So, uh, for my tip of the week, I have some,
Starting point is 01:07:04 uh, some Dockerfile words of advice. So, you know, welcome to the Dockerfile corner. Yeah. So I've been spending a lot of my time here lately focusing on, like, build optimizations and things like that. Um, which, you know, on a large-scale repo and application, you know, if you have dozens of different Docker images that you're building and whatnot, um, that can all matter. Especially if, like, on your build server, you kind of come at it with, like, the trust-nothing type of build motto, where you don't have a cache already on that build server at the time, you know, uh, like, some of these tricks can really matter. So one of them was: uh, if you have multiple RUN statements in your Dockerfile, right? Rather than having, like,
Starting point is 01:08:07 one RUN some command and then followed by another RUN some other command, just concatenate those into a single one. So like RUN some command, ampersand ampersand, some other command, right? And just, you know, you can do that as many times as you need to. And if you want to, like, break it out onto a new line, you could just add a space backslash at the end and then go to a new line. And you could have all of these things that you want to run as one giant RUN statement. And one of the big advantages to doing that is, if you had all of those run statements as multiple statements, then what's going to happen is Docker is going to persist each one as an individual, as a brand new
Starting point is 01:08:52 layer, right? To represent whatever the state of change is for that ultimate image. And, you know, those are going to be how many layers you end up with at the end. So when you need to, like, docker pull or push, you know, those are all the different images, or I'm sorry, all the different layers, that you're going to end up pushing, right? But if you just did the one RUN statement with a bunch of commands ANDed together, then it's all in one layer, and it's only whatever the final output, whatever the final state is, of all of those commands that matters. That is the ultimate layer. Does that make sense? So, um, now think about that when I say this next statement. So, um, oftentimes you might need to install some tools, right? In your, uh,
Starting point is 01:09:48 in your image. So maybe you base it off of Alpine, or you base it off of Ubuntu, or whatever, you know, whatever the base image is. But maybe you need to install, uh, you know, some other tool: curl, wget, or whatever. Uh, coming up blank on some other things. You know, oh, jq might be a good one that you want to install. Um, whatever the thing is that you want to install, you might have some things to install. The typical pattern that you'll see a lot of people do is they'll say, like, RUN apk update and then apk add some package. Or if they're using apt-get instead, they might say, like, apt-get update and apt-get install some package. Right. But the better way to do that, especially
Starting point is 01:10:44 if you're using apk, right, or, like, if you're on Alpine, which is already kind of optimized for Docker, is to skip the apk update portion and just do a straight-up apk add --no-cache some package, right? And then that'll tell the add command to not even look at whatever your cache is. So it's not going to try to update anything; it's just going to go get the thing, install it, and be done with it. Because otherwise, if you do the apk update, then you have a bunch of extra stuff added into your Docker layer that's going to, like, bloat the size of your layer and image in the end. And if you do,
Starting point is 01:11:26 um, if you're basing it off of, like, Ubuntu, for example, and you do that, like, something similar, an apt-get update and apt-get install, you need to be sure that the final part of what you have concatenated together is an apt-get clean, so that you can undo the bloat that happened from the update. Because, unfortunately, apt-get doesn't have the same no-cache option that apk has. But both of those are kind of, like, building on top of the previous thing that I said, where you would concatenate your run statements into that one thing, so that the final result of that change is what's being persisted. And that's why it's, uh, so important to do that apt-get clean at the end there. And apt-get doesn't
Starting point is 01:12:21 have that option. That's why the better option there, on apk, is to just do --no-cache. Okay. The last one here. We all like to add files, ultimately, from whatever we've been working on, like Joe's Python, for example, into our resulting Docker image that we're trying to create, right? And so there's a couple of different ways that we can add those files in, namely the ADD or the COPY commands. But here's the trick that you need to be aware of with those two things. So when it's your own files, that's probably, like, not a big deal, right? Because you can add a file or a directory. But what you need to be aware of is that for both of those Docker commands, Docker needs to be able to get the actual file or directory, compute the checksum of that, and then use that checksum to verify: hey, do I already have this layer in cache or not? And if I don't, then I'll build that layer. Otherwise, if I do have it, if the checksum computes to something I already have, then okay, I've already got the cache and I'll just move on, right? So that all sounds like, well, that sounds like a no-brainer, Michael. What are you even talking about?
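Before the ADD and COPY details, the run-statement and package-cache advice so far might be sketched as a Dockerfile like this (base images and package names are just examples, not anything from the show):

```dockerfile
# Alpine: one chained RUN = one layer; --no-cache fetches the package
# index without persisting it, so there's no separate "apk update".
FROM alpine:3.16
RUN apk add --no-cache curl jq && \
    echo "more setup steps can chain onto the same RUN" && \
    rm -rf /tmp/*

# Debian/Ubuntu equivalent: apt-get has no --no-cache flag, so chain
# the cleanup into the SAME RUN so the update bloat never lands in
# the final layer:
#   RUN apt-get update && apt-get install -y curl jq && \
#       apt-get clean && rm -rf /var/lib/apt/lists/*
```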
Starting point is 01:13:40 With the ADD command, though, you can add a URL, which sounds great, because then you don't have to apk add --no-cache curl or wget and then, you know, do some option there. Because you could just do it natively in the Dockerfile, right? But what you've got to be aware of is Docker's got to compute that checksum of the file. So it needs to go download the file, compute the checksum, and then know whether or not it needs to rebuild the layer. But guess what? You've already taken the hit of downloading the file, that maybe you didn't want to download, every single time you're trying to build your image. Especially if this is happening on a build server, you know,
Starting point is 01:14:31 and you're trying to, like, optimize this thing for speed and time or whatever, you know, you only want to download that thing if you're truly going to rebuild that layer. And so in that case, it would be better to use a RUN statement with curl or wget to download the file for you, because Docker will only compute the checksum of the RUN command string, and not the result of whatever the file is that you would have gotten. So you can avoid that. I have a question here, because, so I get what you're saying. Um, so in short, if you do the ADD with the URL, it has to download it. There's no option. So it can't even check to see if it has a cached version of that layer available until it downloads the file, right? So you're kind of hosed there. But how do you get around that using the RUN? 'Cause if you do a RUN with a wget or a RUN with
Starting point is 01:15:30 a, um, with a curl, it's still going to have to download that file in the RUN command. 'Cause I assume you're talking about putting the RUN command prior to the ADD or the COPY. No, no, no, no, no. Don't use the ADD or COPY at all. Let's say that you want to get some file into your Docker image, okay? Maybe, like, a SQL JDBC driver from, you know, Microsoft or whoever, right? You want to download their jar to put into your Docker image. And you definitely don't want to commit that binary to your repo, right? So you're trying to be a good person here. If you do the ADD and then you give that URL, it has to go download the file from Microsoft to put
Starting point is 01:16:17 it into your image. But it has to download it just to even determine what the checksum of the file is, before it even does anything. Versus, if instead you do a RUN wget or RUN curl to download that file, what Docker does is it computes the checksum of the string, the command, literally the RUN statement, right? It doesn't go get the file at all. It just looks: has this command changed at all? If it has, I will re-execute it. If it hasn't, I'm going to use a cached layer. So let me back up then, because I think this might clarify it for anybody else that's listening that might have been stuck in the same headspace I was.
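Side by side, the two approaches being contrasted might look like this in a Dockerfile (the driver URL and paths are just stand-ins for illustration):

```dockerfile
FROM alpine:3.16
RUN apk add --no-cache curl

# Cache-unfriendly: Docker has to download the file on EVERY build
# just to checksum it before it can even consult the layer cache.
# ADD https://example.com/drivers/sqljdbc.jar /opt/drivers/sqljdbc.jar

# Cache-friendly: Docker only checksums this command string, so as
# long as the line is unchanged, the cached layer is reused and no
# download happens at all.
RUN mkdir -p /opt/drivers && \
    curl -fsSL https://example.com/drivers/sqljdbc.jar \
        -o /opt/drivers/sqljdbc.jar
```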
Starting point is 01:16:59 So what you're saying is don't use the ADD with the URL ever, because if you do that, it's always going to have to download the file before it can even check to see if it's got a layer cached, which, you've almost sort of defeated the purpose at that point a little bit. Whereas if you instead didn't have that ADD of that file from the URL and you did it in a RUN statement, Docker would be able to look at that RUN statement and say, hey, I already have this thing cached, because it's just looking at the string of the RUN command. It says, hey, we're good. I don't need to download anything. Just use that layer that I've already got cached. Yes. Except one, uh, you know, asterisk that I would put on that, though,
Starting point is 01:17:44 is I'm not saying to never use ADD with a URL, because you might have a legit reason for it. I'm just calling out, you need to be aware of what's going to happen. And that, you know, if you're trying to avoid downloading the file every time, then you want to do the RUN instead of the ADD. Because imagine if this was, like, a really large file, too. You know, what if it was, like, a multi-gig file? Well, I mean, then that's a super large Docker image, I feel for you. But, you know, if you used an ADD, you're going to take the hit of downloading that file every single
Starting point is 01:18:26 time, even if you didn't need it, versus with the RUN, you wouldn't. So here's, here's the way to think about this. Going back to, you know, what I started on before, as it relates to doing, like, a RUN apt-get update or a RUN apk update. Like, we've all seen Dockerfiles that have statements like that in there, or even, like, you know, an apk add --no-cache some package or an apt-get install some package. And you've seen it where, like, it'll use the cache, right? Because it hasn't, like, actually executed that statement to see, um, you know, hey, what was the resulting change of the file system? No, it was like, okay, well, the command itself had this checksum, that checksum I already have in my cache, and it
Starting point is 01:19:15 hasn't changed. So I can assume that everything else is going to match. And so in the case of the downloaded file using curl or wget, it's the same thing. It's just computing the checksum of whatever that ultimate command string is. So, cool. Good way to save you some time there. Um, all right. So with that said, hey, we're at the end of the episode. Uh, you know, if you're not already subscribed to us, you can find some helpful links at, you know, the top of the page for iTunes, Spotify, Stitcher, wherever you'd like to find your podcasts. Uh, well, I guess we gotta check, we might not be there. We're there twice on a couple platforms, and maybe not there on some. Um, but if you're hearing this, you're probably fine, though. Yeah.
Starting point is 01:20:02 Well, I mean, you know, a friend could have been like, hey, listen to this crazy thing. And whatever you do, please don't listen to Joe from earlier. But if you haven't left us a review, we would greatly appreciate it. You can find some helpful links at www.codingblocks.net slash review. Yep. Hey, and while you're up there at the website, check out our show notes, examples, discussions, and more, and send your feedback, questions, and rants to our Slack channel. Head over to codingblocks.net slash slack and join the awesome community, if you're not already in there. Yeah, and make sure to follow us on Twitter at CodingBlocks, where we've got our social links
Starting point is 01:20:40 at the top of the page, including a link to, uh, the reviews page, which, of course, you can go there and leave a bad review. That was awkward timing for his Zoom to crash right then and there. Like, he is gone.
