Coding Blocks - Site Reliability Engineering – Service Level Indicators, Objectives, and Agreements

Episode Date: April 25, 2022

Welcome to the morning edition of Coding Blocks as we dive into what service level indicators, objectives, and agreements are while Michael clearly needs more sleep, Allen doesn't know how web pages work anymore, and Joe isn't allowed to beg.

Transcript
Starting point is 00:00:00 You're listening to Coding Blocks, episode 183. Good morning! The morning edition of Coding Blocks. Subscribe to us on iTunes, Spotify, Stitcher, wherever you like to find your podcasts. And you're probably saying, like, how do you know it's morning when I'm listening? And I don't. But if you can leave us a review, we would greatly appreciate it. Yep, and you can visit us at codingblocks.net where you can find our show
Starting point is 00:00:25 notes, examples, discussions, and more. Send your feedback, questions, and rants to comments@codingblocks.net. And you want more Twitters? We got more Twitters. We got Twitter, right, at Coding Blocks. Also, if you go to codingblocks.net, you can find all our dillies at the top of the page. With that, I'm Joe Zack. I'm Michael Outlaw. I'm Alan Underwood. This episode is sponsored by Shortcut. You shouldn't have to project manage your project management. All right. So we are back with another chapter on Site Reliability Engineering. Today we are talking about service level agreements, objectives, and indicators. So before we get into that, I think Outlaw has some things that he wants to talk about, that he's in love with here of late. Well, I don't know why you gotta start it
Starting point is 00:01:13 like that we're not in love with maybe yeah well it was just more the idea of like we've talked about the the the benefits of like monolithic repos before so i'm like yeah okay you know meh whatever you know because i could see like some pros and cons to either right but the current thing i've been working on has me like more and more and more like even if you go monolithic repo right so which can make a lot of sense like if your code needs to version together and whatnot you know there's benefits to monolithic repo uh you know different teams all using the same repo right if this stuff needs to version version together but the monolithic build though i hate like just build the little pit the little bits here and there as they as they change
Starting point is 00:02:09 and as they need to get built and then you like compose them all together for the final thing for the final deliverable you know i can totally get behind that yeah as your code base grows bigger like the smaller percentage your code changes are likely to be right so as you know uh if you've got a 20 gig repo chances are you're not changing very much of that with every commit or every build so building all of that every time is uh it's pretty nutty yeah and if you have a 20 gig gig repo we should talk because that's cray cray yeah well i mean we even talked about this in the past i think merle was one of the ones who mentioned basil.io right like there's there's tools out there to help
Starting point is 00:02:50 with this kind of thing and yeah do it doing a massive build of everything in your thing every single time seems wrong just just like in fairness making it to where you have to deploy everything at the same time is wrong, right? Preach. So I guess that's my thing. I still believe that if it versions together, then it should live in the same place. I still believe in that because if you break that apart, now you have to manage version compatibilities and and some sort of matrix of of how all that works right so that's dependency dependency management is hard it is so so i'm i'm okay with the mono repo but you don't just because it's in the same place doesn't mean you have to do
Starting point is 00:03:38 everything every single time it doesn't make sense yeah yeah and so that that's basically like I was just curious to throw it out there to see if like anybody else would have experiences where they would say, no, no, no. Here's the reason why monolithic builds rule and you should embrace them. And, you know, here's our experiences as to like, you know, how how the monoild solved our problems. So, hey, if you do have some stories to share with that, you can throw some comments on this episode. You'll be able to find it at codingblocks.net slash episode 183. And you leave a comment, you get a chance to win the book or a physical copy of the book, unless you want the free online version.
Starting point is 00:04:25 Or, I mean, hey, maybe you want a Kindle version. Hey, and also, we didn't get any reviews this last time. I think, I don't know, is anybody out there? Yeah, no, we got, well, not for this time, but we did get it last episode. Yeah, we got one last episode, but no comments either, man. Two, two, don't take away from them. So, somebody say hi something virtually something all right okay so dive in yeah let's get into this thing so
Starting point is 00:04:55 we're going to talk about slow today right slow service level objectives um actually we're going to talk about all kinds of stuff that involve service level objectives. Actually, we're going to talk about all kinds of stuff that involves service level objectives because people, they even talk about it in the book. I'm sure you guys saw this where they're like, yeah, people just kind of use this as the de facto thing that they say. They might say SLO, but they might've meant something else, right? So we're actually going to talk about three different things. There's the SLI, which is a service level indicator, and SLA, which I'm surprised they put this one second, but that's the service level agreement, and the SLO, which is the service level objective. I think we did that in our notes because in the book it's not in that order. Is it not?
Starting point is 00:05:40 Oops. I'm surprised I put it in that order. What you're saying basically is generally when people say service level objectives, it's like a blanket term. What they usually mean is like one of these other three SLI, SLA, or SLOs or maybe a collection of more than one. Totally. So SLI is the first one we have in the notes, not alphabetically. In case you're wondering, I does come after A in the alphabet that I use anyway. But service level indicators are a very well and carefully defined metric of some aspect of the service or system.
Starting point is 00:06:16 So an example might be response latency, maybe error rates, system throughput, and typically aggregated over some period of time and the idea here is that uh this is information that you can use to determine um you know what i'm trying to say it's kind of their health their indicators right so it's a it's some sort of metric or number that gives you an indication of how your system is doing yeah Yeah. I mean, it's the quantitative thing that you can, that you can put your finger on, right? Like you can actually measure it. So,
Starting point is 00:06:50 uh, I wanted to call out though, that one thing that we've, we've said in the past that, uh, there was a lot of similarity, a lot of overlap between this book and the DevOps series that we've covered. Um,
Starting point is 00:07:04 uh, be it the, the, uh, the, the DevOps handbook or what was the other series that we've covered, be it the DevOps handbook or what was the other one? No, that was the big... It's 12-factor app maybe or... No, I thought there was another one. It'll come to me. Minimum CD.
Starting point is 00:07:19 Minimum viable CD. No, no, no. There was another one. It was like the big machine or something. I can't remember. Phoenix Project? No uh at any rate the point is is that they're in this specific section as it relates to like indicators objectives and agreements there's a lot of overlap in uh the designing data intensive applications section so it's like specifically i think it was in like the maintainability portion of the book. There was like some overlap, but this is coming at it from Google's perspective.
Starting point is 00:07:53 So, you know, some of these terms, if you listened to that series, then, you know, there might be like, you might think like, wait a minute, I've heard this term before. Where did I hear it? I will say this particular chapter, I actually liked a minute, I've heard this term before. Where did I hear it? I will say this particular chapter I actually liked a lot, especially coming from Google, because they have so much data and they have so many services that they really had to focus in on what mattered. And I think that was super important. I mean, we'll get into more of that a little bit later, but like we'll wait for it because there's some stuff that I have comments on. So so one of the things that they pointed out here with the SLIs is. There are things that are really easy to measure in a system, right? But sometimes it's not possible to measure exactly what you want with what you have in front of you,
Starting point is 00:08:47 right? Like if you own the server farm or something, that's easy to do. But if, if let's say that you have a service and you're getting complaints from a customer, it might be that the client side is experiencing issues that you're not aware of. So that might be a little bit more difficult to measure. So sometimes you actually have to go outside of what your own purview is to, to try and measure things externally as well. Um, they also, one of the, they said probably the biggest SLI that they have is availability. Um, and what's interesting is we talked about this i think on the last episode is google doesn't necessarily um measure uh uptime the same way that other people do so theirs is basically yield a ratio of the number of requests that succeeded versus the total number right
Starting point is 00:09:40 that's how they do it so that they can measure things in different regions and all that. Yeah. I was curious. I went back to just find like where we did cover this and it was in like episodes 121 and 122, I believe related to like scalability and maintainability. And we were talking about how like using your SLIs and SLOs as like a to know how to deal with scaling your application and defining, well, let's do this by the numbers, but what does it mean by the numbers? And so in this particular chapter where Google's talking about the SLIs and having those metrics to know what that, to even define what that SLI is, to know that you those metrics, right. To know what that, you know, to even define what that SLI is to know that you're even doing it. Right. So the indicator might be like, uh, well, how long is, is a, uh, page taking to load or, uh, a query, like how many queries are you able to return
Starting point is 00:10:39 per second or whatever? Like, you know, those, those might be your indicators. Doesn't, you know, that number by itself doesn't mean anything bad or good. Right. And that's why you need to be able to capture it first and then trend it over time so that you can then know, you know, you can then make a decision as to like, well, Hey, we're, we're doing good or we're doing bad or whatever, you know, like, um, and, and, you know, going back to our conversations from the DevOps handbook about like the importance of visibility and observability and tracking and having those metrics and tracking those things. Like all of these things, all these concepts that we've been talking about for years now. And it's like they're they're tying together. Right.
Starting point is 00:11:20 So from multiple different perspectives. Yeah. And I think, you know, that what you just hit on, too, is really important. When you hear SLI, that's basically your measurements. Right. So from multiple different perspectives. Yeah. And I think, you know, that what you just hit on, too, is really important. When you hear S.A.L.I., that's basically your measurements. Right. Like that's if you're going to simplify things in your mind. This is the thing that actually goes and gets the numbers for you. What was my latency? What was my number of requests? Exceeded the ratio, all that. And so they even called out like for storage purposes. It's more about durability, right? We talked about the number of nines and actually it's funny because I may have said something wrong on the last episode. I can't remember. I think I did.
Starting point is 00:11:53 I think that I said 99.99 would have been two nines and that's wrong. It's the number of nines. Basically, if you take it away from a percent and just do the number of nines after the decimal, then that's how many nines it is. Right. So ninety nine point nine nine percent is the same thing as zero point nine nine nine nine. That's four nines. So it was actually a discussion on Slack around it. I was like, I don't know what I said. Yeah, whatever it was, it was probably wrong. So clear that up a little bit. So if we take all these indicators, right, like I said, like just random number, you know, like, hey, like how many new you like select count star of new users that have been created in like the last hour, right?
Starting point is 00:12:40 Like that number means nothing by itself, right? Like that number means nothing by itself, right? But you might have an, now you want to like take these numbers and put an objective on it to say like, well, I want that number to, you know, I never want my error rate to go above a certain number, right? Or maybe I want, you know, like think about it from like a sales or marketing point of view. Like I want new users, you know, coming to my page, I want like a certain number per hour or whatever i think that might be their kind of objective so this is where we take the indicators and now we start talking about service level objectives and how we can use those indicators yep so i mean go ahead jay-z i was gonna say one thing that was interesting is they mentioned sometimes you'll have like two ends. It'll be a range.
Starting point is 00:13:33 So you'll have like a minimum and a maximum and you'll want your service level objective to be in between those two numbers. I just thought it was kind of interesting because every example I could think of off the top of my head is generally one side or the other. So it's like you either want to be more than this or less than this. I couldn't think of an example where you wanted to be right in the middle. I think that goes along with what they were talking about, where you sort of have your internal SLO and then you have sort of your external SLO. So that range, I think, is in between those two, like meaning, hey, internally, we want to we want all of our requests to be served within 200 milliseconds, right? But what we want for our users in another department, we want them to never experience anything over 300 milliseconds. So as long as we're somewhere in that range, then we're good.
Starting point is 00:14:20 That could be the only thing that I could think of. But they did. It wasn't or yeah they mentioned uh yeah i just pulled it up they mentioned having a lower bound and upper bound but they didn't give an example the only game example they gave was for a search which you know presumably you're fine with search being faster than whatever so they didn't give an example of i just thought it was interesting yeah so what they say in here is yeah they even say it is the range of values that you want to achieve with your sli right so the latency would be one um so they say a range um yeah i guess
Starting point is 00:14:55 you can say like we want our response time to be between 100 and 500 milliseconds so like well why wouldn't you want less than 100 it's like, that means we're paying too much. They actually did call that out later, right? We'll get to that one too. Now, here's one that was really interesting. As I said, choosing the SLOs can be difficult, mainly because you might not have any input into what it actually is, right? Like it's the business that might be driving what your SLO is. So for latency, we just gave, sorry, go ahead. Well, I was just going to say like, that would go back to the example that I gave of like how many new users you want, right? Like, you know, that's something for, you know,
Starting point is 00:15:42 the business owners to decide it's out of your control. Yeah. And to decide. It's out of your control. Yeah. And some of these might be out of your control. And they actually like did call it out in some of these things, too. Like, you know, like queries, for example, I mentioned queries as an example a moment ago related to search. Google has no control over like how many people actually start executing searches on their, on their service. You know, that's going to be based on like popularity and, you know, whatever. But what they can control is like how many they can return within a given timeframe.
Starting point is 00:16:16 And so that's why they would target like the queries per second, not necessarily like, you know, what, trying, trying to to like increase they're not trying to increase the queries necessarily because it's out of their control i don't know that probably didn't make sense the way i said that did it no i mean you're right you can't control at any given time that's why they just try and make sure that their systems can handle a certain amount right yeah okay said better yeah you're right dog you're right it you're right i think we mentioned it's morning time um we all have the the groggy you know voice sound right now so yeah uh well one thing they mentioned that i thought was interesting is um or i just thought it was a good thing to call out was um that these slis aren't necessarily
Starting point is 00:17:01 independent so if you get more requests for example your latency might go up. If you have multiple SLOs based on these, you might have multiple alarms going off at once because these things are related, correlated. It's reminding me of that scene in the movie where someone's flying a plane and every dial is just going nuts and everything's going wrong. It's this kind of funny
Starting point is 00:17:19 example. I've definitely seen that in production problems. One little thing can cause a cascading failure there and then yeah it feels like everything everything is blowing up yeah i'm i'm thinking of scenes from airplane yeah was that what you were thinking of yeah okay yeah literally but you realize like that's such a dated reference though too like i know but the movie is really like predates all of us anyways. But then on top of that, anyone new listening is like, what? That movie?
Starting point is 00:17:51 Yeah. What? Rent it. Wait, can you rent movies anymore? Yeah. Anyway. I don't know. I think planes now just have like iPad and you kind of rotate it around like a Wii controller.
Starting point is 00:18:02 Oh, no. Another dated. So you can play Angry Birds. Oh, wait, that's also dated. Dang it. Dang it. Yeah, we're old. Yeah.
Starting point is 00:18:11 So one thing that they mentioned here, and I love this, I absolutely love this, is the SLOs need to really have a realistic understanding of what the availability and the reliability of a system are so that they can actually publish that information so that you don't get claims that, oh, the system's slow or, oh, this isn't working because if you don't define these well enough, then that's what you're going to get your feedback. And that feedback is nearly useless, right? Like when somebody says, oh, the system's not working. It's like, well, what part of it? Can you log in? Can you go to a page? Can you do this? Right.
Starting point is 00:18:46 Like, so, so well-defined is helpful. Just the publishing of it is also kind of crazy. Like, have you ever, I've never worked in a, in a environment where like, let's say that I was responsible. I worked on the team responsible for like the front end of the website and another team was responsible for the back end of the website. We never, neither team ever, I've never worked in an environment where either team was like, okay, here's our expected uptime. And, uh, you know, what's yours? Like, I don't know, a hundred percent. Like we can't, we need to keep the site running. Yeah, exactly. But that's also the key, too, though.
Starting point is 00:19:28 Some of this is kind of interesting because when you work for smaller companies, then being down for that period of time, that can be super costly to you. Percentage-wise, in of like how much it impacts the, the operations of the, the company, right. Versus, you know, a much larger corporation. Sure. The dollar amount might be, you know, more for any kind of downtime, unplanned downtime, but, um, you know, they can, they can kind of like offset that a little bit better. So it's easier for them to say like, Hey, you know, we're, they can kind of like offset that a little bit better. So it's easier for them to say like, Hey,
Starting point is 00:20:06 you know, we're going to have this planned downtime of this other percentage. And, you know, we can accept that we can eat that. Right. And, and so like,
Starting point is 00:20:16 that's where maybe, you know, having been at smaller companies where it's like, no, no, no, we, we need to stay up and running.
Starting point is 00:20:22 We're always up. Yeah. So there's like a careful balance there of like well okay even in the beginning of this book google was like hey don't don't follow this is a blueprint right this isn't going to be applicable to every company um but you can you can see what we did and see you know and apply it how you know how it works best for you. So there was a chapter that we skipped on the podcast, Chapter 2, which talked about Google internal services. And one of those services was named Chubby. And I just wanted to mention that because we're looking through the notes. I was like, wait, what?
Starting point is 00:21:00 But there was a cool little kind of breakout in the book, which I normally hate breakouts, I know, famously. I've said this, but this one was good, about Google basically having this service Chubby that internal teams had grown to depend on, and they built these services kind of assuming that Chubby would never go down because Chubby was so good. It had a great track record. And then whenever there was an outage with Chubby,
Starting point is 00:21:24 they noticed that all these other systems would go down uh and so um kind of it was kind of a cool example of what can happen if you um uh if you do too good i guess that's what i'm trying to say uh that people just come to respect it and so they started doing planned outages in order to kind of uh let those other teams you know get used to the the notion of this not always being there and having to figure out how to deal with it yeah and to shake out those dependencies yeah yep yeah but uh and that's a good example like somewhere where you want to range where you're not really aiming for 100 and where in fact 100 is is even kind of bad. Well, I mean, it's weird, right? Because like you said,
Starting point is 00:22:08 they did so good that everybody just expected them to always be perfect. And so they had no published SLO. So there was no way for them to indicate to people like, hey, you really shouldn't depend on this, right? Yep. They even had
Starting point is 00:22:23 a big plan to, what do they call it? I they even had a like a big planned out like what they call it like uh i forget what they call it but it was like chubby outage day or something like global global chubby planned outage yeah so really really interesting so now this leads us into the soas and this is you hear this term lot, I think in business, especially if you're using like cloud services or anything, you'll see what the SLAs are. These are the agreements of what's to happen if the SLOs aren't met. And this is why they said that people interchange these terms all the time, right? So really an SLA is the consequences that happen if the, if something doesn't meet the SLOs. If you're not, if there is no consequence, then you're likely talking about the SLO,
Starting point is 00:23:13 right? And that's, that's really the big difference. And if you've, if you've looked at your cloud services, whether you're using Azure or AWS or Google or whatever, typically you'll see that there are so many nines with something like, you know, your storage or whatever. If that's not met, there's usually a monetary consequence, meaning they're going to credit you back a portion of your bill or, or whatever the case may be, right? Like that's typically what you're going to get. There was a section and I have been hunting for it so far while we've been recording trying to find it like where they said we're like one of these was like technically a measure of the other do you remember that where it was like slas were a measure of the slos or slos were
Starting point is 00:23:58 i didn't see that when i was reading this but i don't know that SLAs would be a measure of anything because SLAs are really just your almost like your legal obligation to whatever you've promised the customer, right? Yeah. I mean, they called out like, you know, what's the difference between an SLO and an SLA? And, you know, you just ask the question, okay, like what happens if the SLO isn't met? If there is no explicit consequence, then you're talking about an SLO. But if there is a consequence, then you're likely talking about an SLA. Yeah. And what they say here is the SLAs are, they're, they're decided by the business. But as a site, site reliability engineer, your job is to try and make sure
Starting point is 00:24:46 that you don't trigger the SLO that will trigger that SLA. And sometimes there's interesting kind of time constraints built into SLAs too that are a little bit different. So like, for example, you might have to, if there is an outage, you might be contractually obligated
Starting point is 00:25:01 to respond within 15 minutes or some sort of level of support. Or, you know, if someone opens a ticket, it's sort of certain priority, then you might have a service level agreement. And that's,
Starting point is 00:25:12 that's getting more into the kind of business side of things a little bit different than we're talking about, but those are frequently kind of associated, at least in my mind. Yeah. I mean, and again, those are going to be like,
Starting point is 00:25:24 but you ever noticed how like some companies like they'll have that sla where it's like well we have to respond within like x amount of time doesn't mean they have to do anything with it they don't have to like solve the problem they just have to respond and their response could just be like an automated you know system email like yes we we acknowledge that there is a problem. Okay. You are down. Yeah. SLA, meh. Right. So the SREs, they also are sort of responsible or they're, I guess,
Starting point is 00:25:54 tasked with helping come up with the objective ways to measure this stuff, which basically means finding the right SLIs in order to make these SLOs something that they can work on. Right. And go ahead. I was going to go on the next point. It was kind of cool to see here to mention that Google doesn't have an SLA for most of the services, like consumers use consumers use directly.
Starting point is 00:26:18 Like there's no SLA on search, for example, they don't, they're not going to pay you if their search is slow, but for their business consumers, like companies that buy like, you know, the documents, whatever their business suite is called, they do have SLA. So if like, uh, you know, maybe internal searches down or Gmail for business users is, uh, is down, then that's where those SLAs come into play, but they still have SLOs for those other things we mentioned, like for general
Starting point is 00:26:42 consumers, because obviously they have an, uh, a stake in, uh, and instead of to make things fast and good for you, but they're not going to pay you for it. You didn't have to sign some agreement, you know? Well, I didn't, I didn't put all this stuff in the notes, but what was interesting about the, the search, not having an SLA is the reason they still have the internal SLOs is because for one, they want their search to be fast because that's what the customer trust is, right? Like that's, that's one of the big things. But the other is if their search is slow, that means their Google ads are getting served slower. And so they actually have a monetary hit internally because, you know, search is slower. There's, there's a lot of things that happen there.
Starting point is 00:27:26 Right. I found the statement I was looking for, and I'm just going to tease it right now because we're going to come back to it later, and we'll talk about it then. But I did find, I'm not crazy, the statement does exist. All right. That's it? Yeah, that was it. i was just teasing you with it yeah it was it was a straight up teaser not even any contest just teaser um all right so this this this section right here is the reason why i liked this particular chapter because it's very relevant
Starting point is 00:28:00 to the kinds of things that we've been working on lately. So what should you care about? And this very first statement is so important. You should not care about every metric you can find as an SLI, right? So an example I can give is, so we, we all use Kafka and both love and hate it at certain levels. But you have hate for Kafka.
Starting point is 00:28:29 Can we wait? Pause. I mean, I have hate. I hate, or I have hatred for anything that makes my life more complicated to a certain degree. Right.
Starting point is 00:28:40 And Kafka enables so many things, but it is also like, it's a, it's a decently complex enough system. I haven't gotten to that with it, but okay. Just for instance, we work in the Kubernetes world. If you need to resize the volume, you need to make it bigger for some reason, and trying to make it smaller is just not easy. So there's a lot of things, but it's with any system. So in general,
Starting point is 00:29:08 I'd say I really do like Kafka and it does enable a lot of cool stuff. But where I was going with this is there are some amazing dashboards out there for like Grafana and Prometheus for tracking every single metric inside Kafka. But how many of those are actually relevant to you meeting a service level objective? Yeah. There might only be two out of like 50 that they can provide you, right?
Starting point is 00:29:37 That's cute. You thought there were only 50. Oh, yeah. I don't even know. Like there's probably way more, right? There's so many. Way too many. Yeah, there's a lot. You can see how it's kind of tempting to say, okay, well, here are't even know. Like, there's probably way more, right? There's so many. Way too many. Yeah, there's a lot.
Starting point is 00:29:45 You can see how it's kind of tempting to say, okay, well, here are all my 50. Like, what should each of these be? And so you start setting up, you know, alerts or surface-level objectives around, like, CPU and memory and stuff like that. And those things don't make sense because it's okay if they go high. It's okay if they go low. You know, it's really the surface-level objectives need to be around, like, the business cases. Well, here's the thing, like we we've all seen this situation. So like, let's say you decide to start using Prometheus and Grafana for the first time. And you're like, Oh, where do I even get started? Like Alan was picking on Kafka for a minute. So,
Starting point is 00:30:17 so you're like, Oh, well, where do I do this? And you go and you find there's already like some dashboards that, um, like maybe, you know, if you're using Strimsy, like they've already put out, uh, some for their operator or, you know, you can find other people's dashboards that they put out and you're like, okay, I'll start here. And so you basically start with like everything under the sun, like, okay, it's all in my face now. Right. And you know, now you have a problem and you're like, well, I can't see the needle through the haystack. There's too many things going on. I don't know which one of these things really matter. Right.
Starting point is 00:30:46 Or you take the flip side where you're like, hey, you know what? I was listening to CodeBox. They were talking about this DevOps handbook thing and they were talking about observability and like getting metrics. So I'm going to do one of those for my custom web app. So you're like, OK, well, this might be a good metric to know. And so you spit that one out and you build a little dashboard panel for that. And then you're like, Oh, Hey, here's another indicator here or another metric that I can easily put out. Right? So now you build a panel for that thing. And, and before you know it, you, again, you've like recreated the, the, the other situation where you have too much
Starting point is 00:31:20 data happening in your face. And the problem is, is that, that the thing that I liked about this portion was they were calling out, like, just because you can put the metric out there doesn't mean that it's a metric that helps you in any shape or form. Like imagine if in your, uh, you know, okay, you mentioned Kubernetes. So let's pick, let's mention in your Kubernetes cluster. Um, if you, I don't even know if you could do this, but let's say that you had a metric that was being spit back out to Grafana on your dashboard there that showed you the temperature of the CPU for that pod.
Starting point is 00:31:54 Why do you care? Why do you care? Like, I mean, temperatures are like, you know, that's an easy thing. Like there's a lot of, you know, solutions out there for getting the temperatures of your, you know, the different components on your system in this situation is that a metric you care about is that like it might be easy to do but who cares right because in your kubernetes cluster in
Starting point is 00:32:14 theory if that node dies because that cpu got too hot you're gonna get moved to another node right like it it does not matter to you at all shouldn't and some of these good indicator it right you know after the fact, like, Hey, why did my pod die? And go in like, Whoa,
Starting point is 00:32:27 this temperature was way too high. Like good info. No objective needed though. Right. None. Yeah. And I was gonna say that like some of these things might even be like, uh,
Starting point is 00:32:36 the indicators by themselves you don't care about, but maybe combined with other indicators, then you do care about it to know that like, Oh, well the temperature rose and they're like, you know, this person, this small amount of time it to know that like, oh, well, the temperature rose in like, you know, this person, this small amount of time, that's a problem, you know, but in general, I don't care. Never show me that. So, so that's one end of the spectrum, right? Like you have so much there that it's like you said, it's a needle in a haystack, right? Like that's a problem.
Starting point is 00:33:00 There's the other end of it too, though, that could be a problem is if you just have one or two metrics you may be missing entire gaps in your observability of the system because you don't really have enough to give you the picture of what's happening so it's it's a balancing act i i would say though um my uh take on it now, right? Because we've gone through this for a minute or now, right? And I think this might be even consistent with some of the things that came out of the DevOps handbook. But to me, less is more. So start with, you know,
Starting point is 00:33:40 I know I need to track the 500 level errors coming out of my web app. So I want that indicator being presented, and then I can trend that over time, blah, blah, blah, blah, blah. And then maybe I can know that, oh, hey, there was a high level of them because we took the system down for an upgrade or whatever. So you start with one metric five hundreds, right. But then over time you're like, Oh, you know what? I also needed to know, uh, you know, I had, I had some crashing on my database because I ran out of space for the, uh, the right ahead log. So, uh, I need to monitor what's the size of that log, or maybe I want to know the size,
Starting point is 00:34:27 you know, free space available on that disc. So either way, like now you've learned something. And so you're like, Oh, let me add a new metric for that. So the point is,
Starting point is 00:34:34 is like, as you learn that you need some metric, add the metric, but don't start up front with like every metric known to known to man, and then try to like whittle down. These are the five I care about. Right. So along those lines, though, I think the important part is to get to what you just described.
Starting point is 00:34:56 You need to know the service level objective. Right. Once you have the service level objective in mind, then you can at least intelligently say, Hey, you know, these are some important metrics that I need to track in order to even be able to see if we're meeting these SLOs. Well, yeah. So said another way, like the way I was just describing it is like, you'd have to have the problem first to know that you needed to have the indicator. You need that indicator, which will happen, right? Yeah. And I guess I'm just saying like, I embrace that. I'm fine with that. And, and's doing here, though, with, you know, like thinking about the service level objectives up front is they're trying to get in front of that and say like, well, what are the things I care about to know whether or not it's working correctly or uh that it you know
Starting point is 00:35:45 incorrectness might not also be the thing either like you know if you could be correct 100 of the time but if you're really slow at being correct then nobody's going to care to use it right um so you know those kind of factors they're trying to get in front of those things by thinking through this. Which is cool because they sort of have some templated layouts for what their SLOs are. Um, Jay-Z, you want to grab a couple of them? Yeah.
Starting point is 00:36:13 Uh, so the examples they gave are availability, latency, and throughput availability. You know, we've talked about a bunch of times, but the service is up or not a latency here. Um,
Starting point is 00:36:22 their example is specifically talking about, uh, like web requests requests like how long it takes to you know basically how slow something is but in our world we talked about kafka latency also has a different meaning because it can mean how long it takes for something to get processed in your pipeline and what i like about this is that if you have requests that come in every uh say 30 seconds but it takes you more than 30 seconds to process them, then you've got an outage waiting to happen in just a matter of time because you're not fast
Starting point is 00:36:51 enough to keep up with the data coming in. So you're going to have a problem. So latency there, it just means something different. But that's kind of a great example of something where you might only have a serviceable objective on availability because that's what the customer sees. But if you don't have one on latency, it can take you out and get you in a really bad spot where you're hours behind or whatever. And also never able to catch up. And throughput was last one, which is how many requests were able to be serviced.
Starting point is 00:37:18 And this is a good example, too, where zero is not good. If you've got zero throughput on the system, you might want to have an alert on that. Just like you might want to have one if it's too much. Hey, so real quick, on those three that he just mentioned, the availability, latency, and throughput, that was an example of their template for user-facing services, right? Like those are anytime they stand up a service that an external customer is going to use,
Starting point is 00:37:44 those are the three SLOs that they target there. Yeah, and then they have... Go ahead. Nope. No. Storage systems. My
Starting point is 00:37:59 SLO was too slow, and so you beat me to it. All right. What's that? You, you, uh, my, uh, SLO was too slow. And so you beat me to it. All right. What's that? We did a bit. All right. So storage sections, uh, storage systems was the next one.
Starting point is 00:38:17 They had examples of, so latency, how long did it take to read? Right. Uh, obviously that makes sense for something like a S3 or Bob storage availability. Uh, were you able to retrieve it at all and then durability is still there when it's needed and uh yeah that's where all those nines come in uh are generally around uh you know i need to look that up but um the all those nines for s3 are they for availability or durability or both i think durability i remember we talked about wasabi and they had the 11 nines, but it was for durability.
Starting point is 00:38:48 Yeah. So not, not necessarily availability, meaning that like they could take the system down for maintenance and they wouldn't lose your data. So it's not available, but the durability isn't, you know,
Starting point is 00:39:01 it's still, it's still there on disc as soon as they didn't like boot it back up. Yeah. So, uh, they're, uh, they're a S3 website for Amazon literally says designed for durability. Yeah. And then for big data systems though, the, the template is throughput. So how much data is being processed and then the end to end latency. So how long, uh, from ingestion to completion of processing. so this goes back to your pipeline example you know like uh it's it's taking into consideration more of the overall process not just one piece of it so so you can almost think of like end to end latency as like a higher level objective because like there might be components within your objective like if you did
Starting point is 00:39:44 like we were picking on kafka for example. So let's say you were doing a Kafka pipeline and maybe you have, you know, something like a Debezium that's reading from one source and putting it into a Kafka topic. You have Flink that might be reading it out of that Kafka topic and maybe writing it out into like, you know, another topic or database or, you know, like elastic search or whatever. And so like, you know, you might have an SLO on like how, how fast, uh, Kafka can read and write into a given topic. And you might have an SLO as to how, like how, uh, the availability of elastic search and, uh, you know, new documents being updated and like when they're searchable again, but none of that matters when, when you're talking about, well, let me say, let me not say it that way.
Starting point is 00:40:29 Let me say that that doesn't paint the full picture of what was the end to end availability. I get a new, uh, document in my source now in, and you know, there's all those touch points that I mentioned. So it has to go through to BZM, Kafka, Flink, Elasticsearch, four different technologies, and that's the overall end to end so yeah when they talk about the big data pipeline or big data system there and template includes the end to end latency yeah of course everything should care about correctness and a little section here on collecting indicators too so like most of the things we've talked about have been server side
Starting point is 00:41:04 and so you know have something like a promet or Influx is going to kind of scan those and store those and get those from logs sometimes. But also there are client side metrics. So there's things you can do sometimes with like a mobile app or whatever, where you can kind of collect metrics or just kind of on a website. And what's important there is you might catch something where there's some sort of bug or something else is going on that's actually in the app that makes the customer experience bigger response times or latency than you're seeing in their server-side metrics. I do want to add, though, that related to the correctness, they say that that's typically like a property of the data in the system rather than necessarily the infrastructure.
Starting point is 00:41:47 Right. So they don't measure that. That's not something that the SRE is responsible for. Right. Like a database. Like it should work. Right. Yeah.
Starting point is 00:41:59 It did make me wonder there, like, because like, have you ever like written a query that returned back you know incorrect results but it did it fast right like i mean you could like select the wrong columns or like have some error in your predicate right and so that's an example of like well the ascent the correctness is assumed to be there so if you if your predicate was wrong or you're selecting the wrong columns i mean that's just a bug in the system. Yeah, in your application code. And remember, we're separating out the product development teams from the SREs in this Google world, right? And so that's why that would be an issue that the product team would be responsible for, not the SRE. Hey, so do you guys remember?
Starting point is 00:42:45 I don't even know if this is how web pages work anymore. You don't know? If you don't know, we're up the creek, man. So, I mean, the thing is I haven't done any UI work for the web in a while, but you guys remember when back in the day, and I say back in the day, as in a long time ago, you were supposed to put all your JavaScript in the head, in the head section of your website, right? And this goes to the client side latency. At some point, they told you to stop doing that because that would block the rest of the page, right?
Starting point is 00:43:20 So it would wait for all of the JavaScript stuff to be loaded up in the head section before the rest of the page could be rendered. Now, the reason I say, I don't know if it's this way anymore is I don't know if Chrome and Firefox and all those have gotten smarter and they do things a little bit differently now, but at some point they said, take that stuff out of the head. Like for instance, the Google AdSense or the ad tracker stuff, right? Like if you wanted to track your stuff in Google analytics, they would tell you to take that script block and put it in the body at the bottom of the page so that your page could render first before it fired off and loaded that JavaScript to let Google know that, Hey,
Starting point is 00:43:59 there's been a visit to the page. And this is like why they say tracking the client side latency actually matters because in the old days, and like I said, I don't know, maybe it happens this way today too, but with those scripts being up in the head, it might be three seconds before your page would render because it was loading up all this stuff in the head, even though it had the content for the rest of the body, right? Whereas when you move that stuff out of the head and you put it down at the bottom of the body, your page could when you move that stuff out of the head and you put it down at the bottom of the body, your page could start rendering immediately.
Starting point is 00:44:27 So within, you know, I don't know, 300 milliseconds of the request being made to the server, you can start painting the page and then that hit would be unnoticeable at the bottom. And so that's what they're talking about. Like, that's why sometimes it's important to go down to the client to find out what's going on, because there may be things happening that you're not even aware of that will require some investigation. Yeah. You ever like look,
Starting point is 00:44:50 gone to a website and you'll see the content come in, but then it like starts moving around and taking shape as it's, you know, as it's loading in. And that's why, because, you know, maybe the CSS or the JavaScript to fill in and define like what those were supposed to look like wasn't loaded until later.
Starting point is 00:45:08 But you might have already had some Hello World kind of messaging or whatever popping in. It probably wasn't Hello World, though. It was probably something... an image? From render image server side, just send one image and then you know it's going to look the same. Ordf even better there you go yeah so get rid of html altogether i like it yeah that's that's the web 3.0 right there there you go uh so next section was on aggregation uh which was really nice so um typically you're going to aggregate raw numbers and metrics and so example would be like web requests you know it doesn't really make sense to say like 13, like you, it's a rate.
Starting point is 00:45:45 So you would say 100,000 over the last 15 minutes. But aggregations are dangerous because they can sometimes like hide the true behavior. I was thinking about like sensor data here. Like if you're looking at like a large window, you might look and say, okay, well, the temperature over the last five minutes has stayed the same. Great. But what it might be hiding is that it may be spiking really bad so like you know i said i think five minutes there like maybe minute one was way high minute two was way low minute three and so you know it averaged out the same and maybe you might even have the same median i don't know but um yeah it's
Starting point is 00:46:21 just uh the resolution of those metrics can can hide what's going on and things you might care about. Same with latency. Well, hold up. So you use temperature. Temperature is kind of hard to equate into a machine-level type thing. But if you had something similar with requests, like what you were just saying, super high and then super low and then super high and super low. The problem is the average might look the same, like you said, but your system's getting taxed way more when there's those bursts that come in. And so that's the thing that if it's hidden you and you don't have metrics to actually look at that thing
Starting point is 00:47:00 properly, then, then you're just saying, hey, the average looks like this. So I don't know why the system's having problems. But in reality, what's happening, it might be two to three times worse, but it's hidden because of how you're showing your metrics. Yeah. And they use this example of like, if you had 200 requests per second on the even numbered seconds and zero otherwise compared to a system that had a constant hundred hundred requests per second for every second right they would both average the same but their burstiness is definitely different and you you know so they're basically calling out like the difference or the importance of not using averages and instead going after percentiles. And this definitely goes back to the conversations that we had in the
Starting point is 00:47:51 designing data intensive applications related to scalability, where they refer to it as like P 95, P 99. What was it? P 98. Or I forget what the different ones. Oh, it's P 95 deviations, right?
Starting point is 00:48:05 No, it was P95, P99, and P39. So P999. But the P marking the percentile and then the numbers being like the 95th percentile or whatever. So basically, how do you know how uh, you know, how well your system is doing. And so like, if you're going after the 95th percentile, then you're saying like, okay, 95% of the time it's in this, this acceptable range, but 5% of the time it's, it's bad. And like when it's bad, like, you know, it's really bad. Right. Um, and so, you know, using these percentile ranks for your, uh, whatever your metric might be in this case, like if we're talking about latencies, right? If you're going after if you're targeting like a ninety nine point nine nine percent latency, that would that would be kind of a person.
Starting point is 00:48:57 If that's your percentile, that would be an extreme one. Right. But let's just say ninety ninth percentile for the latency. Then, you know, one percent of the time, the latency might be unacceptable. But 99% of the time, it's fine, right? Well, here's something to be careful with these. And so I actually threw in some notes regarding Prometheus and how this stuff works here because I've actually been messing with this a little bit. So if you do that P99, right, like they call them quantiles in Prometheus. But if you do 99%, then that means that 99% of your requests all happened within the given amount of time.
Starting point is 00:49:34 Right. So so if we're talking about latencies. Right. So, for instance, let's say that your 99th percentile is five seconds. That means that 99% of all your requests happened or were serviced in five seconds or less. If you go up to 99.99, you just added two additional nines of tracking and it might jump from five seconds to five minutes, right? So that's, that's the thing that you kind of have to wrap your mind around is typically when you set up these things, one of the things that Google mentioned is right, like they want 95% of their requests for a particular service to be in 300 milliseconds or less, right? So they would set up a quantile of 95%. And then hopefully that number they see comes in under 300 milliseconds, right? Because then that means that
Starting point is 00:50:32 they are meeting their service level objective. You might put in the double nines, the quantile of 99, just to see what's happening for your long tail users, right? It might be that they are having an absolutely horrible experience, right? Like it might've jumped from 300 milliseconds to 10 seconds. And they may want to address that. I mean, they may not, but at least it paints the picture. But then also you want to drop down to your 50% Quantile to see what the average requests are doing, right? So it might be that your quantile of 50%, you're serving most requests in, or, or a lot of requests in under a hundred milliseconds, right? So I guess the important part is if you don't go all the way up to a hundred
Starting point is 00:51:18 percent in your quantile, you won't see what the absolute worst request was, right? You're only getting what the population is hitting in those things. Right. And that's, that's what you kind of have to wrap your mind around where it's different than averages. Well, yeah,
Starting point is 00:51:38 definitely different than average, but I think that's also like a difference of the tool though too. Right. Cause in this particular case, you're talking about Prometheus, but like like you could you could target 95th percentile but still show over time like oh we definitely went over 99 percent or over 95 percentile like in a graph form right you could show that you went over that metric i think the important thing, though, is that, I mean, this goes back to like the start of this book where, you know, in that 99 percentile that you just gave where you said that it went up to five minutes. That's definitely bad.
Starting point is 00:52:18 Super bad. Nobody's going to argue that, especially if your target is like, I think you said 300 milliseconds in your example, right? Yeah. So, but maybe that's stuff out of your control, right? Right. That user could be on a cell phone in a really bad area. You know, they could be, you know, trying to browse your site from the Amazon rainforest and, you know, cell reception isn't so hot there, right?
Starting point is 00:52:42 Or, so I guess my point is that where I was thinking as you were describing this is like, that's all fine and dandy. And I agree with it. Don't dare set an alert just because you crossed, you'd want to say like, okay, it's been, we crossed it for a period of time, you know, rather than just like one occurrence. Agreed. And that's why the percentiles actually work out to your favor, right?
Starting point is 00:53:14 Cause when you have that, you should never trigger on one, right? Assuming you have more than two events that happen in the system. It should only happen if it starts trending that way over a given amount of time. Yeah. All right. So where do we have it? So studies have shown that users prefer a system with low variance, but slower over a system with high variance, but mostly faster. So this kind of goes back to what I was saying before about like, you know,
Starting point is 00:53:46 if you're a hundred percent correct, but you're really slow, you know, people would prefer there some response time in that than necessarily the absolute correctness. Actually it's the inverse of what you just said. People would prefer consistency. Whether if it's now, if it's super slow, obviously nobody's going to like it, but people would prefer knowing that every time that they use a service when they do it again, they'd rather be relatively decent all the time as opposed to screaming fast and then sometimes really slow. If there's a road and you can go 35 miles an hour, people are fine with it. But if there's a road that's 45 miles an hour, but every once in a while there's a bus that stops and people
Starting point is 00:54:41 have to wait. And and overall it averages higher but people really feel that they noticed that time that stops and so they're going to complain about it more they're going to rate that lower and they're going to say it feels slower even though on average it ends up being higher yeah i'm replaying what i said in my mind and i'm like thinking that i need to go back to bed when you said it i was like yeah this is totally not i'm like i said that out loud i like the bus example that's really good yes uh just said and did you say at google uh they prefer distributions over averages just
Starting point is 00:55:19 like we said? Because they kind of let you get at those long tails; there's better representation of the data. Like, if you tell a data scientist, hey, the average of my numbers is 50, you're not telling him anything. If you tell him the median, it's a little bit better. If you can tell him percentiles, then suddenly you have a really great way of describing a set of numbers, though, you know, there's more overhead. My argument against averages has always been: if you want a really easy-to-comprehend example, one for anybody, they don't have to be in computer science, of why averages fail you at times, talk about wealth or just income, right? When you talk about averages there, you have some extreme outliers, right, like a Bezos or a Musk or
Starting point is 00:56:08 Gates or whatever, that totally will throw off the average, right? And so it's like, I don't know that that's so helpful, right? Yeah, totally. Yeah, it's like the worst way to describe a set of numbers. But, I mean, it's still helpful, and certainly it's got its uses. But, yeah, in my mind it's: average if that's all you've got, median preferred in almost all cases, and then percentiles preferred even more than that, by far. Yep. So the one thing that they say here is if you don't really understand your distribution of data, it could be a problem, because you might be taking actions that aren't right for the situation. Right. Like, as Outlaw mentioned, don't dare alert me if just one thing hopped over this number.
Starting point is 00:57:02 Right. Like, don't do that. Well, if you don't have the right distributions in place, you might be restarting systems because you think, oh, the CPU is too high now, right? And it just spiked for one thing. You might be doing things that are more harmful to the situation than helpful. And for those that are into statistics, they're saying: don't assume that the data is normally distributed. So, uh, the bell curve. Yeah. Don't assume it's a normal bell curve. It might be a skewed curve, like the income example that I gave. That's an example of a skewed distribution where the spike is going to be on the left-hand side and you're going to have this very long tail out towards the right-hand side, right? And so you need to understand what your data is, because if you go after it assuming that it is a normally distributed data set, then where the
Starting point is 00:57:57 tip of that bell curve is might not be where the median is, in the case of a not normally distributed data set, so it'll throw off all your metrics.
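To make the skew point concrete, here's a tiny Python sketch with made-up income numbers; one extreme outlier drags the mean far away from anything typical while the median barely notices:

    import statistics

    # Hypothetical incomes: 99 ordinary earners plus one extreme outlier.
    incomes = [50_000 + i * 500 for i in range(99)] + [5_000_000_000]

    print(f"mean:   ${statistics.mean(incomes):,.0f}")    # ~$50 million
    print(f"median: ${statistics.median(incomes):,.0f}")  # ~$75 thousand

If you sized anything off that mean, you'd be wildly wrong about the typical case, which is exactly the danger of assuming a normal distribution.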
Speaking of being thrown off: as a human, I'm often thrown off when, um, you ever see a page of graphs or charts where the units aren't the same? So maybe the top one on the right is days of the week and the one on the left is minutes, and the times don't line up. That can be a real problem for a human, especially time zones. I've seen that, where two different charts will have different kinds of time ranges. But, uh, go ahead. Well, I was gonna say, can I go off on a tangent for a minute? Because
Starting point is 00:58:45 when we talk about humans and readability and charts and things like that, one thing that super irritates me, and I actually appreciated that it was called out in this book, specifically in this chapter... so often, and we've talked about Grafana here, so let's pick on Grafana for a minute, because Grafana does this and it irks me at times. You draw a graph, and maybe the bottom-left corner is zero-zero, but also maybe not. Maybe we've scaled the graph, right, and we've super zoomed it in. Or, in the case of this chapter, the Y axis is logarithmic, meaning that the bottom half of the
Starting point is 00:59:35 graph might only represent, say, 50 points of data, but the top half of the graph might represent a billion. So it's totally scaled weird, changing the scale of that Y axis as it moves along the graph. And Grafana will do this thing where, depending on what the data is, it will just zoom in altogether, right? And so your bottom-left corner, instead of being 0-0, might be like 98, right? And so you'll see these large jumps in your graph, in the chart, right? And you'll think, oh, the world's on fire. Look what just happened.
Starting point is 01:00:17 Look how steep that climb is. And you're like, oh, wait a minute. The axis is super zoomed in. Yeah. It's showing two data points, right, like 98 and 99 now instead of 0 through 99, or maybe even like 98 and 98.005, and yet it looks like the world just caught on fire because of the way it zoomed in. And so that's why, when I look at these graphs and charts, so often I'm like, wait, what? And then I have to go back
Starting point is 01:00:51 in and be like, uh, hold on, what are the axes here? Have you ever seen it where sometimes the charts have defined inputs and stuff? So maybe you'll have a thing in the top right where you can shrink the time range to say, like, show me the last three hours, right? And all the charts on the page that take that input will adjust. But maybe one of them still shows, like, the total count per day or something. And so it wasn't set up to take that input, and there's not a good way to see that it's not respecting that field.
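If you want to see that axis effect for yourself, here's a small matplotlib sketch with hypothetical numbers; the same nearly flat series looks calm when the Y axis is anchored at zero and looks like a cliff when the axis auto-zooms:

    import matplotlib.pyplot as plt

    # A metric idling between 98 and 98.005.
    values = [98.000, 98.001, 98.001, 98.002, 98.005, 98.003]

    fig, (anchored, zoomed) = plt.subplots(1, 2, figsize=(8, 3))

    anchored.plot(values)
    anchored.set_ylim(0, 100)              # anchored at zero: looks flat
    anchored.set_title("y axis from 0")

    zoomed.plot(values)                    # auto-scaled: looks like a cliff
    zoomed.set_title("y axis auto-zoomed")

    plt.tight_layout()
    plt.show()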
Starting point is 01:01:19 The woes of Grafana. Yeah. Yeah, it's really user error, right? It's about setting up the charts properly, but still, it'll freak you out like you said. Yeah, it's definitely not cool. Why is this bad? Why does this look like this? It's definitely not a problem with the tool; it's a problem with the one using it. Yes, totally. Exactly. So, also, another good point: you can imagine if Google had just one dashboard, so that the CEO or whoever could just log in and be like, how are we doing on our SLOs today? And you can see the different teams using different measures.
Starting point is 01:01:56 So maybe, you know, the search team is like, hey, we've got requests per second. But then the office stuff is like, well, this is our email delivery rate per minute, and the next one is like, this is our uptime per hour. It just makes things difficult to read as a human, so the more you can keep those units the same and standardized, the better. Yep. Oh, that was actually leading into this last little bit here, which was the standardization. They almost don't even have to describe it when they're setting up new services, because they do have a standard set of things that they measure on every single service.
Starting point is 01:02:33 Right. So those are almost assumed, and the primary reason for that is so you don't have to convince or describe that same thing to everybody. Every time you set up a new service, these are the SLOs that the service has to meet. Done, right? Everybody is on the same page already.
Starting point is 01:02:49 So, um, you know, that's helpful for both the business and the SREs. And imagine having a dashboard where it's like, well, this one is latency,
Starting point is 01:02:58 but this one is latency-plus-plus, and latency-plus-plus also measures, like... well, I can't compare these two now, right? Right. This episode is sponsored by Shortcut. Have you ever really been happy with your project management tool? Most are either too simple for a growing engineering team to manage everything, or too complex for anyone to want to use them without constant prodding. Shortcut is
Starting point is 01:03:23 different though because it's better. Shortcut is project management built specifically for software teams and they're fast, intuitive, flexible, and powerful. Let's look at some of their highlights. Team-based workflows. Individual teams can use Shortcut's default workflows or customize them to match the way they work. Organization-wide goals and roadmaps. The work in these workflows is automatically tied into larger company goals. It takes one click to move from a roadmap to a team's work to individual updates and vice versa. Tight VCS integrations. Whether you use GitHub, GitLab, or Bitbucket, shortcut ties directly to them so you can update progress from the command line.
Starting point is 01:04:03 And a keyboard-friendly interface. The rest of Shortcut is just as keyboard-friendly with their power bar, allowing you to do virtually anything without touching your mouse. Iterations planning. Set weekly priorities and then let Shortcut run the schedule for you with accompanying burndown charts and other reporting. So give it a try at shortcut.com slash coding blocks. Again, that's shortcut.com slash coding blocks. Shortcut. Because you shouldn't have to project manage your project management. All right, who's doing the beg here?
Starting point is 01:04:35 Want me to do it? Well, I think Alan did such a great job last time. Well, didn't you do a funny voice last time? What was it? Oh, you did like a Johnny Cash voice last time or something. I think I did. I don't remember. We also didn't get any reviews last time. Yeah, we did. So maybe we don't let Alan do it. Yeah, no, I think maybe it should be Jay-Z this time. Yeah, I think he needs to beg. All right. All right. Hey, y'all, I would like to ask you for reviews, because
Starting point is 01:05:00 we're doing really bad on reviews lately. We haven't been getting any. We haven't been getting many. And yeah, I mean, even if you've got a bad one, just let us have it. We're so desperate. Whoa, whoa. Who let this guy talk?
Starting point is 01:05:17 Sorry, this is why I don't do it. So I'll take over from here. If you haven't left us a review, we would greatly appreciate it if you would, especially a positive review. But if you do want to leave a negative review, hit up Joe on the Slack. He apparently likes that. You can find some helpful links at www.codingblocks.net slash review. And also, I don't know how long we're going to keep reminding people of this, but I guess we'll
Starting point is 01:05:41 continue. But apparently this is a thing in Spotify too, so, uh, I guess it's just like a thumbs up or something? Or no, it's like a star or a plus or something, I forget. You see how often I use Spotify. I'm like the one out of 10 people that don't use Spotify. Yeah, that's crazy. Maybe that should be a survey, like, do you use Spotify? And everybody's gonna be like, yeah, duh. Yes. And right, it's just like sock, sock, shoe, shoe, and I'll be like, get off my lawn. Yeah. Well, I mean, sock, sock, shoe, shoe. Come on. Right.
Starting point is 01:06:13 Everybody does that. That's crazy. Sock, shoe, sock, shoe. We already established this. We had a poll. That's right. Yeah. And it was sock, shoe, sock, shoe, right?
Starting point is 01:06:24 Okay. I still think about that every time I put my shoes on. I'm still baffled that that was such a thing. I never would have guessed it. It never would have dawned on me that it was any kind of controversial statement. All right. The little things that we take for granted in life are sometimes funny, right? All right.
Starting point is 01:06:43 Well, we move on to my favorite portion of the show. Survey says! All right. So a few episodes back we asked: how awesome was Game January? And your choices were: I learned so much. Or, I forgot how much time I need to play other people's games. Or, I thought my game was good, but oh my, some of these are pro-fesh. Or, I now know that I want to be a game developer. Followed by, I now know that I do not want to be a game developer. All right, this is episode 183.
Starting point is 01:07:22 Alan, you're up first according to Tucko's trademark rules of engagement. Yeah, I'm going to say, I thought my game was good, but oh my, some of these are so pro-fesh, because that would have been my takeaway. I'm gonna go with 30 percent. All right, well, I'm gonna go with 31 percent. Oh man, because you felt the same way, huh? Well, I didn't say which answer, though. Oh, that was not assumed. Yeah, I was just trying to be funny. It totally didn't work. Yeah, well, I'll stick with it, though. Let's go with the pro-fesh. Yeah. Yeah, you're both wrong. Really? What was it, I learned so much? I now know that I do not want to be a game developer, with 75% of the vote. Oh wow. Oh my gosh. All right. Yeah. Awesome. Yeah, I thought it was pretty funny. So,
Starting point is 01:08:22 you know, we've been talking about measurements and everything, though, and it made me think, because here in America we use what's either referred to as the standard or the imperial system. Do you remember, well, we wouldn't remember it, it was technically before our time, but in the history books you might have come across that there was a time where America was trying to switch to metric, back in like the 60s, I think it was, or something like that. It was either the 60s or 70s. Various legislatures or whatever literally did make a concerted effort, like, we're switching America to the metric system to be like the rest of the world, right? But they didn't, and it failed miserably. And then it made me think, you know, Americans can't switch from pounds to kilograms overnight. It just can't happen. It would cause mass confusion. Ah,
Starting point is 01:09:17 excellent long lead-up, but poor execution. Whatever, it's morning. It's morning. Yeah, I'll take it. So how about this? We're talking about all these metrics and how to identify these things in SLIs and SLOs. So for this episode's survey: SLIs and SLOs sound awesome, but does your team use them? And your choices are, of course: how else would we track our error budget? Or, I mean, they sound great, but yeah, we don't have
Starting point is 01:09:53 those. Or, oh, wow, we have so many slow parts. Oh, it's an acronym. Never mind. Or, we're on our way and it's looking promising. That's for the optimistic people out there. Or I'm convinced and we'll implement them in the near future. And that one's for the procrastinators. Oh, man. This is a... I can almost guarantee you I know what it won't be. I'm so curious, but I don't want to like, I mean, I really want to know.
Starting point is 01:10:30 I mean, do you have an error budget, sir? I just want to know. Me personally? Sure. Sure. Right. Right. All right.
Starting point is 01:10:43 Well, let's get back into this. Just a quick reminder, though: if you want a copy of the book, Google made it freely available on the web, probably so that they can keep metrics and track who's reading the book and how often it's being read and things like that. But that said, I did notice this week there was actually an update available for the book; on the online version you just, you know, get it, which was kind of nice. So, objectives in practice. Yeah. I don't know that I like this one. No, really?
Starting point is 01:11:33 No. Find out what the users care about, not what you can measure. That's so much harder. Well, it is. But, I mean, this is similar to what I was describing earlier: everything that you can measure under the sun isn't necessarily what matters. And so this is kind of flipping it: how do you define what matters?
Starting point is 01:11:53 Well, what do your users care about? What's the user experience? Let's start with that. And I'm totally kidding, right? It should absolutely be driven by what the users care about, because they even say: just because it's easy to measure doesn't mean it's useful to your SLO at all, right? It doesn't mean anything. Yeah. Right. If you have a static website, right, and let's keep it in a Kubernetes world, a container world, then that website isn't changing until you do
Starting point is 01:12:23 a deploy. Like, why do you need to know how many free inodes you have available on that system, right? Who cares? Doesn't matter. Yeah. Now I like the next section, on defining objectives. SLOs should define how they're measured and what conditions make them valid. So here's an example of a good SLO definition: 99% of RPC calls, averaged over one minute, return in 100 milliseconds, as measured across all back-end servers. That's fantastic. Yeah, and it's up to you to type that into Prometheus or Grafana or whatever; you can define these things, but that tells you so much. So how many times have you seen something that's like, latency: five? Like, well, wait.
Starting point is 01:13:07 What does that mean? Is that good? I'm confused. I thought we said the averages were bad. Oh, no. All right. My head just exploded. That's right.
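Here's a rough sketch of checking that example SLO in code; a hand-rolled illustration over an in-memory list, not how Prometheus actually evaluates it, and the sample data is made up:

    # The SLO: 99% of RPC calls, averaged over one minute, return in
    # 100 milliseconds, as measured across all back-end servers.
    SLO_THRESHOLD_MS = 100
    SLO_TARGET = 0.99

    def slo_met(latencies_ms):
        # latencies_ms: every RPC latency seen in the one-minute window,
        # pooled across all back-end servers.
        if not latencies_ms:
            return True  # no traffic, nothing violated
        fast = sum(1 for ms in latencies_ms if ms <= SLO_THRESHOLD_MS)
        return fast / len(latencies_ms) >= SLO_TARGET

    window = [42, 87, 95, 99, 110] + [60] * 495  # 500 calls, one slow one
    print(slo_met(window))  # True: 499/500 = 99.8% within 100 ms

Every term in the spoken definition (the 99%, the window, the threshold, the scope) shows up explicitly, which is what makes a bare "latency: five" so useless by comparison.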
Starting point is 01:13:16 But hey, at least you know what it is. Yeah. And so, you know, as we mentioned before, it's unrealistic to have your service level objectives at 100%. Striving for 100% takes too long, it's expensive, and it's just not worth it. So do you remember when I teased you guys early in the show about, oh hey, this was the section? Yeah, this was the section where it made a reference to something else. We have previously talked about error budgets, I believe in the last episode, if I recall. And they made a point of saying an error budget is just an SLO for meeting other SLOs, and that was the one thing referring back to the other thing. Wait, was it this chapter? It
Starting point is 01:14:02 was that one. Okay, so it wasn't measuring measurements, it was observing observability. It was some sort of recursive call to itself. Yeah. All right, so we should probably Google recursion, and, you know, it'll all come back around. Well, here's another section I like, because this is instantly what I want to do, and I know it's wrong. They talk about when you're choosing targets, one thing you should not do, and this is something I desperately want to do, is to choose SLO targets based on current performance. And what I mean by that is, if I'm setting up metrics for the first time on a system that already exists and I'm trying to figure out what the numbers should be, the first dang thing I want to do is take a look at what they are now,
Starting point is 01:14:53 and that is just the wrong answer, right? Yeah. Well, because, for example, let's go back to that web server in Kubernetes example. If you were to say, okay, I just spun up my Apache or Nginx instance on this pod, how many times can I hit a static index HTML file, maybe even a default one, and whatever that number is, that's the metric for how well I want my web server to perform? In that example, that's not even relevant to what you're doing, because that's just a static webpage, whereas your real one is dynamic, has a bunch of API calls to make, and there's authentication to deal with, things like that.
Starting point is 01:15:42 So you're not really comparing apples to apples. So who cares what its maximum was in this one particular scenario? What's really more important is, well, what's realistic? And those crazy high numbers might not even matter to the user, going back to Google's point from the beginning of this book, right? They might not even matter. So let's come up with something that's more, what am I trying to say here, representative of what the users are going to care about. Yeah. And that's definitely what I'm thinking too: it's about what your users expect and what your users want, not about what you have now. But I guess you can make the argument and be like, nobody complained last week, so let me see, here's our average sometime last week. I guess
Starting point is 01:16:28 it's fine i mean so it's a shortcut but it's just it's coming out from the wrong direction i i you you three or two i can't count this morning um i'll average that out and it'll come out to like you one and a half um ah dang it wouldn't even be that, uh, you won. So I'm going nowhere fast. You might recall, like, do you remember, uh, way back in the day we had, uh, uh, uh, server environment where like we were trying to decide like, okay, well, what do we want to be able to serve? And like, how many servers do we need? Blah, blah, blah, blah. And we ended up with like a really really high number do you remember that i remember i totally remember this and and we were using like at the time we were we were trying to say like okay how many concurrent users do we want to be able to maintain on a given server right and so we had this formula of like okay here's the average think time that a person's going to stay on a given page. Here's the pages per second, you know, divided by the CPU times CPU, blah, blah, blah, equals concurrent users. And we ended up with, like, many more web servers than we needed. A few.
Starting point is 01:17:35 By orders of magnitude. Yeah. We needed it. I mean, that makes it sound like way worse. But, yeah. It was bad. It was a lot. Yeah.
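For the curious, the usual back-of-the-napkin version of that kind of sizing math is Little's Law: concurrency equals arrival rate times the time each user occupies the system. A sketch with made-up numbers; the lesson from their story is that confident-looking garbage inputs produce confident-looking garbage server counts:

    import math

    # Little's Law: concurrent users = arrival rate x time in system.
    pages_per_second = 200.0    # guessed peak throughput
    response_time_s = 0.3       # time the server spends per page
    think_time_s = 30.0         # time the user stares at the page

    concurrent_users = pages_per_second * (response_time_s + think_time_s)
    users_per_server = 500      # another guess, from load testing one box

    servers = math.ceil(concurrent_users / users_per_server)
    print(f"{concurrent_users:.0f} concurrent users -> {servers} servers")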
Starting point is 01:17:46 But this portion made me think back to that. That was an example of, you know, well, we at least tried, though, right? Like we
Starting point is 01:17:56 had a metric that we wanted to start with, and we used that metric to try to define how to build out from there, but then ended up overshooting and had to go back and revisit things. So I guess the point there being, even with these SLOs, you're going to try to come up with a target number rather than what the system is capable of, necessarily, but it's also okay to re-evaluate as time goes by, right? Yeah. And having
Starting point is 01:18:29 a calculator is fantastic, because you can go back and adjust those numbers. So I'll take a calculator that gets it way wrong any day over a, well, 10 is too much, let's try four. Right, exactly. Well, that kind of leads into the next thing that they said, which is keep your SLO simple, right? If you make it too complex, then it's hard for people to even understand, and when there are changes made in the system, it might be difficult to even see what the impact was on your original SLO anyway. So, you know, the simpler, the better. And avoid absolutes. I like this one, too.
Starting point is 01:19:05 Like, oh, and I'm sure we've heard this with Kafka and other things too, right? You can scale indefinitely, right? This thing will handle everything. Don't say that, right? Because as soon as you say that, you're going to hit a tipping point where it doesn't scale indefinitely without a ton of work, right? And making those statements means you're going to be spending a ton of time trying to make sure the thing will do what you tried to promise up front.
Starting point is 01:19:34 I like this too. They say to have as few SLOs as possible. You want to have just enough to ensure that you can track the status of your system, and they should be defendable. And I think that's really cool too. It's like, take away until you can't take away anymore. Yeah, I mean, this goes back to kind of what I was thinking before with Grafana. You can definitely find some easy-to-start-with dashboards out there for a given system that are totally generic, right? Like metrics for a Postgres or a
Starting point is 01:20:07 Kafka or ZooKeeper or whatever, right? And they obviously know nothing about what your business needs are in your application, so all of those metrics are super generic and not helpful. If you were to use those as your starting point, definitely start whittling away at it, and the things that you don't care about, get rid of them, because you don't want more things in your face than you need. Like, have you ever been in a situation where maybe you thought you had something really nice together that had a bunch of metrics, like, oh, I can know exactly what the health of the system is doing, right? And maybe some things are, you know,
Starting point is 01:20:44 you know, red and on fire you know alarming but you're like you have trained yourself like okay well you know i mean i kind of care about it but it's not like you know the end of the world or whatever and then like your boss's boss's boss's boss happens to stop by and it's like hey what you got there and you're like oh i see this i can like monitor the whole thing he's like oh my god why is it on fire and you're like oh well that one doesn't matter and then his immediate reaction is like why is it there yeah and you're like well because i want to get to it eventually it'd be you just said it doesn't matter yeah it kind of doesn't then you're not going to ever get to it because i'm never going to let you get to it right yeah because good point yeah yeah they they also say here perfection can wait right
Starting point is 01:21:27 It's basically what we just said with the web server thing, right? We started out with this crazy high number. Well, that wasn't right, so refine it, right? Trim down the numbers. What arguments did we get wrong here? Let's fix those and try again. Yeah, don't shoot for perfection, man. I've heard it so many times, and I actually like the statement: perfection gets in the way of good enough, and good enough is usually what you want and all you need. So, you know, go that way.
Starting point is 01:21:59 And this I actually liked a whole lot too. The SLOs should be a major driver for what the SREs are actually working on, because the SLO defines what the users care about. So if the users care about it, then you should make sure that you're meeting those users' needs, and that should be what the SRE is focusing on. Said another way, let's think back to the purpose of this book and what the SRE title was, right? This means that these are a group of people who aren't necessarily going after new features for the product or the service, and instead are saying, oh, I see this thing trending in a way that could become bad for us in, you know, a dashboard or whatever, and I'm going to
Starting point is 01:22:47 go ahead and get ahead of that. I'm going to put in a fix to address that before it becomes a problem, right? Right. To even automate the fix, right? Which is what we talked about earlier on with this. Like that's the whole goal. Hey, real quick before we go on to the next section. So I went looking for the S3 SLA because we were talking about that earlier and I was curious how they define it. They don't talk about durability at all in the SLA. The SLA is only for uptime. So I think around their SLOs, they may have durability, but I'll put a link here in the page. It was kind of interesting to me. I'm going to stick it down here.
Starting point is 01:23:27 Well, S3 definitely, I mean, it might not be part of their SLA, but they talk about a durability of, like, I can't even count how many nines it is. It's like five nines or something. No, it's more than that. 11. Yeah. The durability is 11 nines.
Starting point is 01:23:41 Okay. But that's what was interesting to me: in their SLA, the only thing that has a consequence is their uptime. I put it down there in the resources we like, if you guys want to check it out. But yeah, they actually have monetary returns, right? They have a service credit percentage. If the uptime goes below 99% but stays above 98%, then you get a 10% credit. If it goes lower than 98% but above 95%, you get a 25% credit. And if it's lower than 95%, then you get all your money back. So it's pretty interesting.
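Sketched as code, that credit schedule is a simple tier lookup; the tiers below are the ones quoted in the episode, so check the actual S3 SLA page before relying on them:

    def s3_service_credit_percent(monthly_uptime_percent):
        # Tiers as described in the episode, not verified against AWS.
        if monthly_uptime_percent >= 99.0:
            return 0     # SLA met, no credit
        if monthly_uptime_percent >= 98.0:
            return 10    # percent of the monthly bill back
        if monthly_uptime_percent >= 95.0:
            return 25
        return 100       # full refund

    for uptime in (99.95, 98.5, 96.0, 90.0):
        print(uptime, "->", s3_service_credit_percent(uptime), "% credit")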
Starting point is 01:24:18 But everything that I looked at on this page has nothing to do with durability. It's all about the service being available when called by the end application or whatever. Yeah. And I just saw this. I Googled it, because I had just read about the durability, and S3 does claim 11 nines of durability, but it's not part of the SLA, just like you said. Yeah. And that's kind of why I wanted to bring it up: again, as we mentioned earlier, the key differentiator between an SLA and an SLO is the consequences of missing it, right? And the only consequence here is if the uptime is not what they claim it to be. So it's interesting how people, or companies, define these things. Just to close the loop on where I was going with what the SREs work on:
Starting point is 01:25:12 I mean, this goes back to the maturity level of the type of company that you're going to work for, as to whether or not this is going to work for your team, because it takes a certain level of maturity for a company to be able to afford a team focused only on SRE-type initiatives, right, and not focused on new features and whatnot for the product. Yeah, totally. The next section was on control measures, which are basically the kinds of knobs that you can tweak in order to fix things when they go bad. So imagine you're monitoring your system's SLIs, and you've constructed SLOs over those SLIs to know when things are going wrong, when you have a problem. Then, when that service level objective is out of compliance, basically if there's an alert, something's wrong, and you need to take action.
Starting point is 01:26:10 Then control measures are the actions that you take. So, for example, if you see latency going up, and so your SLO is kind of in violation, and you've got an alert glowing, blaring, then you can go and see that maybe your CPU is too high. So then you can go increase your CPU capacity and you should see that latency go down, assuming that's actually what's going on there.
Starting point is 01:26:34 And so it was just kind of cool to talk about that. And I think that ties in with the playbooks, so knowing what to do about these things when those service level objectives are going wrong. Of course, you can't write out every permutation of what could go wrong, but even just knowing that CPU is one of the things you can tweak, and how to do it, is a good thing.
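A toy sketch of that monitor-alert-act loop; the three hooks are hypothetical placeholders for whatever your metrics store and infrastructure APIs actually are:

    def p99_latency_ms():
        # Stub: in real life, query your metrics store.
        return 450.0

    def cpu_utilization():
        # Stub: fraction of CPU in use across the serving fleet.
        return 0.95

    def add_capacity():
        # Stub: in real life, call your platform's scale-out API.
        print("scaling out: adding capacity")

    LATENCY_SLO_MS = 300.0
    CPU_CEILING = 0.90

    def control_loop():
        # One pass: is the SLO being violated, and if so, turn the knob.
        if p99_latency_ms() > LATENCY_SLO_MS and cpu_utilization() > CPU_CEILING:
            add_capacity()  # the control measure for this failure mode

    control_loop()  # with the stub values above, this scales out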
Starting point is 01:26:54 Also remember, though, that Google, specifically at the start of this book, said that rather than having a playbook of, hey, here's how you fix it, they prefer to automate that process so that it can fix itself, right?
Starting point is 01:27:09 So I forget how they referred to it. It was like, they didn't want automated. They wanted automatic. Or do you remember the phrasing that they used for that? Yeah. Yeah. It was the automatic over automated.
Starting point is 01:27:23 I forget. Yeah. I think that's what it was yeah i think that's what it was i think that's what that's the goal of the sre right is to automate the things that can be actionable based off some sort of playbook like if you have a playbook for it then you should be able to automate it more or less right i mean kubernetes is a great example of this right like think back to pre-kubernetes days right like how you go back to our example that I gave him about like think time and, and page per second and whatnot. And you're trying to decide like, Hey, how many web
Starting point is 01:27:51 servers do we want for our application? Right. And so you had to like, think about that. And then you're like watching your traffic load and you may like, Oh, Hey, we don't have enough. And so then you as a person had to go in like, you know, a watch that metric and be like, oh, you know what, based on CPU utilization or IO or whatever, I think I need to add in another web server. Right. And so then you would have to go in and handle the provisioning of that yourself and configurating and whatnot. Now, with Kubernetes, for example, you can just say like, hey, here's the health metric to watch. And if this resource limit gets above, you know, like a 90% CPU, go ahead and, you know, spin up another pod for it. Right. And, you know, that's an example of like, you know, something that can now be automated that I'm sure some SRE and Google you know was was uh tasked with figuring out or you know probably responsible for implementing and bringing to the world you know yeah kubernetes is the best until it's not and then it's yeah it's awesome there's room for it to be simplified
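For flavor, here's roughly what that scale-above-90%-CPU rule looks like when you create a HorizontalPodAutoscaler with the official Kubernetes Python client; a sketch that assumes a Deployment named web already exists, and uses the older autoscaling/v1 API, which only supports the CPU-percentage target:

    from kubernetes import client, config

    config.load_kube_config()  # or load_incluster_config() inside a pod

    hpa = client.V1HorizontalPodAutoscaler(
        metadata=client.V1ObjectMeta(name="web"),
        spec=client.V1HorizontalPodAutoscalerSpec(
            scale_target_ref=client.V1CrossVersionObjectReference(
                api_version="apps/v1", kind="Deployment", name="web",
            ),
            min_replicas=2,
            max_replicas=10,
            target_cpu_utilization_percentage=90,  # scale out above 90% CPU
        ),
    )

    client.AutoscalingV1Api().create_namespaced_horizontal_pod_autoscaler(
        namespace="default", body=hpa,
    )

That one declarative object replaces the whole watch-the-graph-and-provision-a-box ritual they just described.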
Starting point is 01:28:58 Yeah, Kubernetes is the best, until it's not, and then... yeah, it's awesome. There's room for it to be simplified, for sure, but right now I wouldn't want to work somewhere that didn't have Kubernetes, you know, unless you've got like a three-tier system with literally one server for each tier. I can't imagine. I had a thought on it, though. Another tangent, tangent alert. Because the one downside that I thought about this week with Kubernetes is that everybody has to become an expert at every layer of networking and security and how to deal with firewalls or whatever.
Starting point is 01:29:32 It's no longer as simple as, it works on my machine. You're spinning up a cluster, a server farm, everywhere you go, every time. It's got its goods and its bads. You know, it's all good. What are you talking about? Oh, you don't want to learn about networking? I totally messed up that part then. Yeah. Um, let me rephrase this. I'll figure it out. I totally
Starting point is 01:29:57 agree with you you're right it's just it i i love it but it's it is complex like there's there's so much and in fact there was that thing that I, well, I guess we'll share it with the rest of the world too. But Jay-Z shared this picture of the Kubernetes glacier, right? Well, like what I don't, all the things I don't know about Kubernetes and like how depressing it is to be like, Oh, look at how much stuff is really in there. Like, uh, yeah, it gets deep. So, uh, all right. Well, so going back to what we said before about publishing that
Starting point is 01:30:35 the SLOs: these SLOs set expectations. And so it's great for teams to be able to publish those so that other users can know what to expect. So going back to the Chubby example, which, by the way, can we admit that's a horrible service name? Yeah. These guys. Yeah. Because, you know, you don't want people to become too reliant on it. You want to set some kind of level of, hey, you know,
Starting point is 01:31:05 it's okay if we're down every now and then for maintenance or whatever. It's great that our service is good enough to where we only have to take it down when we want to, but, you know, don't be so reliant on it. Hey, so this next part is what I was mentioning earlier, and I forgot it was at the end of this: they were talking about having a safety margin. One approach to setting these expectations is you have your internal SLOs, right?
Starting point is 01:31:35 Like, hey, I want all of my requests for my service to come back in 200 milliseconds, but for the external, customer-facing SLO, we're going to publish that we want them to come back in 300 milliseconds, right? So you've got a buffer of 100 milliseconds there. And they were saying if you do this, it kind of protects you, because you can aim for your internal targets, and that way you're always pleasing your external customers. So that's one way to do this. And then the other is, don't overachieve, right? They talked about this early on: you are not trying to be consistently better than your SLO, because then people come to rely on that. Like the Chubby service that never went down, but when it did, there were outages everywhere
Starting point is 01:32:23 because people thought it would always be there. And so they actually said that you should consider doing failure injection, which is a thing they have there, where they introduce downtime on purpose some amount of time throughout a quarter. That seems weird to me, but I get it. But this isn't failure injection as in, like, Chaos Monkey from Netflix. No, that's totally different. This was just, I almost hate to even call it failure injection.
Starting point is 01:32:53 It's just, we're introducing forced downtime, right? Yeah. That's all it is. It's not necessarily a failure; the system didn't crash, but we're bringing it down for one reason or another. Yeah. It's going to be unavailable. And it may not even be planned, right?
Starting point is 01:33:08 Like, we don't want people to know that, hey, we're taking the system down tomorrow night at midnight. No, it's just going to be down at some point, and we're going to take it down so that people will understand that our reliability is what we published. So it's interesting. It's weird, but it sort of makes sense. The failure injection might really be for the other teams that have the dependency, right? Right. You know?
Starting point is 01:33:32 So, yeah, they need to make sure that they can work around that, and it'll call out the issues in their systems, in the case of the Chubby example. Yep. So, agreements in practice.
Starting point is 01:33:50 So we've pretty much covered SLIs and SLOs pretty well, but we've only kind of scratched the surface with the SLAs, right? But that's also fair, because the SRE's role is not to set the SLA. That's a business decision, going back to the consequences that Alan gave with Amazon and S3 and the cost there. Us as developers, we're not going to say, hey, guess what we're going to do?
Starting point is 01:34:17 If this service that I wrote goes down, I'm going to give you back this amount of money. We're not going to do that. That's not for us to decide. If you could, how much would you give back? Uh, zero. Zero? Wow, Joe's greedy. Jay-Z's Grinching it. That's awesome. But the SRE's role, while they may not be defining those, is to at least inform the business about how difficult it's going to be to meet the SLOs or the SLAs being put up there, right? Like, if somebody says, oh yeah, we're going to guarantee an SLA of 99.9999% uptime, the SREs are going to go, dude, you're crazy, we can't make that happen, right?
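To put numbers on why the SREs balk: the allowed downtime for a given number of nines is just one minus the availability, times the period. A quick sketch:

    SECONDS_PER_YEAR = 365.25 * 24 * 3600

    for availability in (0.99, 0.999, 0.9999, 0.999999):
        downtime_s = (1 - availability) * SECONDS_PER_YEAR
        print(f"{availability:.4%} uptime allows {downtime_s:,.0f} s "
              f"({downtime_s / 3600:.2f} h) of downtime per year")

Six nines works out to roughly 31 seconds of downtime for an entire year, which is why promising it gets you called crazy.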
Starting point is 01:35:06 And so that's where they kind of come into play. So they get to call people crazy all the time. Yeah, exactly. We could have summed up this book a lot easier: SRE is getting to call people crazy. That's it. I feel like, I'm sorry, maybe this should be my title. Though, you know, there were managers talking about the planned outage, and they were just like, wait, you're gonna take this down, mess with my stuff, give me a bunch of extra work? Like, I've got to figure out how to squeeze this stuff in with all my normal goals, because that might happen, right? So maybe, right, get out of here, you crazy person. It's crazy. So people think the SREs are crazy too. Yeah.
Starting point is 01:35:49 Well, just because you call me crazy, Alan, doesn't mean you get to be an SRE. There's more to it than that. Wait, that's not what that is. I could have sworn that's how you got the title. I'm not even saying that you're not within reason to say that I'm crazy, but, you know, there's more to the job than that. That was awesome. Can you imagine if Google wrote a book like, how to be an SRE: first, call Michael crazy? And like I was specifically called out. That was the end of the book, end of chapter one, here's the credits, thank you, and here's the copyright date. You've arrived. All right, so the last few that we've got here: you should be conservative with the SLOs and the SLAs that you make publicly available, because otherwise you're setting yourself up for pain. This goes back to your buffer, your SLO buffer.
Starting point is 01:36:33 Yep. Buffer, and also not trying to make it so unsustainably reliable that you can't work on anything else. Right. That's really what it is. Or, to put it in their terminology, the safety margin. But yeah. So the big thing that they call out here is you want them to be conservative, because as soon as you make that stuff public, it's hard to backtrack,
Starting point is 01:36:57 right? Like, if you publish your SLOs and your SLAs, then people are going to hold you to them, right? And if you slip on that, it's going to be a problem.
Starting point is 01:37:07 So we talked about this earlier; they call it out explicitly here in the footnotes of this chapter: SLA is typically misused when talking about an SLO. Basically, if there's an SLA breach, it probably triggers a court case, unless it's something like those credits from Amazon, where they're saying, hey, we were down X percent, so you're just going to see it back on your bill. Otherwise, if it's something big, then you're probably going to see something go to court. And then, I think Outlaw mentioned this earlier, if you can't win
Starting point is 01:37:40 a particular argument, or if you need to justify the SLO that you have and it's not really gaining any traction, then it's probably not worth having, because you don't want your SRE team to work on it. So if your boss comes by and is like, hey, why is that graph so bad? You're like, oh, take it off, right? It doesn't matter. Get rid of it. Yeah. Have you ever been in an environment where the development team had literally a stoplight where you could see, oh, the builds are green, the builds are red? You know, I don't know why the light would ever be yellow, but imagine your boss's boss's boss's boss,
Starting point is 01:38:22 like, somebody puts in a commit, you know, and it broke the build. And so it's red, and your boss's boss's boss's boss is like, hey, why is it red? And you're like, oh, there's a bug.
Starting point is 01:38:31 We're going to fix it. And he's like, oh my God, what's on fire? And you're like, do you know how many bugs we fix every day? I mean,
Starting point is 01:38:36 it was definitely a different world, though, that I'm thinking of, cause now I can't imagine not having a PR gate that would prevent the build from ever getting broken like that, you know. But oh, I remember having arguments about that stuff. You remember? People were like, it'd be too slow to merge code. And it's like, dude, I don't want you to merge code that's broken. Yeah. I don't have time for that right now, man. I just need to get this in there and be done
Starting point is 01:39:01 with it. And it's like, uh, but you're not done with it; it's broken. Yeah, but it's fine for now. But it's not fine for now, because I still need to do my job. Yeah. Yeah, I like PR gates, even if they add five minutes to an approval. Well, get ready for an hour. Oh wait, sorry. Okay, so now we go full circle, back to my monolithic build complaint. Right. All right. Hey, Mergify. Mergify.
Starting point is 01:39:29 Yeah, there you go. So, uh, yeah, so we'll have some, some links there in the resources we like. Uh,
Starting point is 01:39:35 obviously this book will be in there. There might be a link to the, uh, Kubernetes glacier. You'll probably cry when you see it. And especially when you realize how far you have to scroll to see all of it. And that's okay. Leave a comment. Maybe we'll send you a box of tissues to help wipe away the tears. Kubernetes
Starting point is 01:39:55 sponsored Kleenex. All right. Well, how about I ask you this, though? How does Darth Vader like his toast? Dark side. It's got you there on the dark side. Good. All right. Well, with that, we head into Alan's favorite portion of the show. It's the tip of the week. So it's usually my favorite portion of the show, except for when I can't think of anything.
Starting point is 01:40:27 Um, so I've actually got two; I'll probably type the other one in here in a second. So the first one:
Starting point is 01:40:37 we kind of forced somebody onto a Mac recently, and this is a, you know, 20-, 30-plus-year user of Windows. He complained nonstop, like, why did they switch the command and the control button? And I'm like, dude, you've got to realize Mac has been around for a long time too. They didn't switch it.
Starting point is 01:41:02 They didn't just decide that, oh, we don't like windows and we're going to do this. This is how they'd be doing it forever. It didn't matter. Yeah. Since the beginning. So, you know, whatever windows had control,
Starting point is 01:41:12 Mac had command and control, which is really confusing. Um, so, and then windows decided to add a windows button a few years back. So whatever the key is, if you find yourself switching from something like a windows to a Mac and you're finding like control C to be hyper frustrating, there are, um, software drivers that you can
Starting point is 01:41:35 download a lot of times for your keyboard for a specific OS. And like in this example, he was able, I think he likes, Natural Keyboard 4000. Like he's got a stack of them, right? Well, you can download software for the Mac to basically remap the keys. So the problem is there is a feature in Mac OS to say, hey, I want to swap my command and my control key. But the problem is a lot of the software that you use is going to map Command-C to control or whatever. So even though you try and do it at the OS level, your software is going to overtake it at some point. So if you do it at the keyboard level, then now you can stick with your Control-C or whatever, and it'll make your life easier.
Starting point is 01:42:20 So I highly recommend if you find yourself in that situation and find it very frustrating go look that up there's a lot of times mac software downloads for your keyboards just embrace change that was what i said i was like dude it takes like a week for you to mentally remap your thumb from one key to the other and he wasn't having it and i was like all right whatever i don't want to hear about it anymore. I don't know what to do. Yeah. But like, so, um, an example of that. So I'm using this Kinesis gaming freestyle keyboard. Right. And one of the annoying things to me, I don't, I don't know why they made this decision. I think it's ridiculous. The keyboard has an option so that you can put it into, quote, gaming mode. And that way, when you're in like your full screen game and, you know, if you were to accidentally hit the Windows key, it'll just ignore it, right? It doesn't like pop you out of
Starting point is 01:43:22 the game, right? Because especially like if you're in like a, a multi-level, sorry, a multiplayer game and you're like in an epic battle, like you're in the fight. Right. And you accidentally hit that key. Like you don't want to be taken out of it. Right. But for this keyboard, they have, uh, one of the options that you can buy is the Mac keys. So you can like swap out the, like it swaps out the size and position of the keys so that it is more Mac-like versus the Windows experience, right? So it's not just a reprint of the labels. It's actually changed the different size of the keys as well and then in the software you can tell it like hey i'm now i'm using the mac switches right awesome for the love of god i don't know why but they disable the gaming feature well because people don't game
Starting point is 01:44:18 on max come on man uh that's probably what their rationale was like, Oh, you're only going to use this keyboard on one computer. And I'm like, you know, I mean like 99% of the time I'm on the Mac, but when I am on my PC, I like to play games. And so, yeah,
Starting point is 01:44:34 it's, it bothers me that I can't have the Mac keys and have that gaming feature. Well, they got a, they got a figure. If you're trying to go over to the back, you're not gaming. And if you're going gonna go through the trouble of replacing the keys to be on the back then you're not gonna be gaming on the windows machine right like that's not true
Starting point is 01:44:53 0.001 percent of the population that buys that keyboard yeah there probably is a super small percentage of the population that has it for sure for that purpose for that that makes use purpose oh yeah so the the other tip i had was just being when you're doing the implementation of metrics and stuff and i know you guys have seen it they're not all free right so what i mean by that is we talked about picking your SLOs in a manner that makes sense, right? Like don't just put a ton of them out there. Have a handful of them and make it that way.
Starting point is 01:45:34 Well, with metrics, a lot of times when you're adding in metrics, you have a tendency to just start adding them everywhere. And the thing is, that stuff is collected by a server, whether it's Prometheus or InfluxDB or whatever it is that you're using, and so you could be adding a ton of data without even realizing you're doing it, right? So for instance, in Prometheus,
Starting point is 01:45:59 if you have, we were talking about the quantiles earlier, if you have a 95% quantile and a 99% quantile, that's just two, right? But then they're going to be bucketized as well. And if you start adding labels to those, every label that you add for a Prometheus metric essentially multiplies your series: each distinct value is a new series. So if you're adding a customer name as a label and you have a hundred customers, that's a hundred times the two that you had previously. If you add a super-high-cardinality thing, like the size of a request, now you could have a million different sizes that come in, and that's a million times 100 times two. So the number of metrics that are gathered grows by just insane amounts. So pick and choose the metrics, or the SLIs, that you're looking for, because it can actually
Starting point is 01:46:53 impact the systems that need to gather and serve up those metrics. So, you know, just a warning for when you're adding these things into your metrics gathering for your applications.
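A quick sketch of that multiplication with the prometheus_client Python library; the metric and label names are made up. Every distinct label value mints a brand-new set of time series, so the series count is the product of the label cardinalities:

    from prometheus_client import Histogram

    request_latency = Histogram(
        "request_latency_seconds",
        "Request latency",
        ["customer"],  # each distinct customer value is a new child metric
    )

    # 100 customers => 100 children, each carrying a full set of
    # histogram bucket series behind it.
    for customer_id in range(100):
        request_latency.labels(customer=f"customer-{customer_id}").observe(0.05)

    # Labeling by something unbounded, like request size in bytes, would
    # multiply the series count yet again: the cardinality explosion.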
Starting point is 01:47:30 All right, well, I have a tip for you. So, have you ever moved browsers and exported your bookmarks and then re-imported them in another browser? I mean, no, I've never. You know you can do that, though? Yeah, I know you could. Yeah, you can do it. Okay. You just sign in with Google, and it automatically brings it all over. Yeah, but if you wanted to switch to Firefox or something, then you can export your bookmarks and actually import them, because Firefox supports that, right? So I always knew that was a feature, but I didn't really think too much about it. Well, I have a little script that I use, and sometimes at work I have to switch environments, and it switches a bunch of different links,
Starting point is 01:47:54 maybe to logs, you know, stuff like that. And I always think, wow, wouldn't it be nice if I could generate these bookmarks on the fly? But sometimes these environments are not common or whatever. So it'd be nice to be able to either open them all up or just have a temporary bookmark, but it's not cheap; it's a pain to add bookmarks, right, especially a lot of them. So it got me thinking, well, you know what, maybe I can reuse this format that you use to export and import to generate a bookmarks file. And then whenever I'm working on an environment for maybe a couple of days, I could import this as a new folder
Starting point is 01:48:27 and have it around for a couple of days and then delete it. And I can do that if it's easy. And so I looked, and sure enough, the format is dead simple. It's basically an HTML file with just a couple of links, and it's got a weird DT tag, but I found an article that shows how you can use this, and you can easily generate it. So what I'm going to do is update my script so it generates this HTML file whenever I run it, so I can just go into Chrome and import it, and it's going to be easy. I never thought about this, but this is actually really nice for, like, onboarding, for example.
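Here's roughly what generating one of those importable files looks like in Python; the format is the old Netscape bookmark format that browsers still import, and the URLs below are placeholders:

    links = {
        "Team wiki": "https://wiki.example.com",
        "Build dashboard": "https://ci.example.com",
        "Log search": "https://logs.example.com",
    }

    items = "\n".join(
        f'<DT><A HREF="{url}">{name}</A>' for name, url in links.items()
    )

    bookmarks_html = f"""<!DOCTYPE NETSCAPE-Bookmark-file-1>
    <TITLE>Bookmarks</TITLE>
    <H1>Bookmarks</H1>
    <DL><p>
    <DT><H3>Onboarding</H3>
    <DL><p>
    {items}
    </DL><p>
    </DL><p>
    """

    with open("bookmarks.html", "w") as handle:
        handle.write(bookmarks_html)

Import that file from the browser's bookmark manager and the links land in a new Onboarding folder without touching what's already there.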
Starting point is 01:48:39 it's got like a weird DT tag, but I found an article that shows how you can use this and you can easily generate it. So what I'm going to do is I'm going to update my script. So it generates this HTML file whenever I run it. So I can just kind of go into Chrome and import it. And it's going to be easy. So I never thought about this, but this is actually really nice for like onboarding, for example.
Starting point is 01:49:01 So you imagine like someone new starts the company, you can be like, hey, check it out. Here is the HTML file of bookmarks that you can just import and it'll add so it's not totally wiping out whatever you've got it's not like a total import export but you can add like the 20 common you know bookmarks that most people should just probably have for working here i like that so that was pretty cool so i'll have a link to a script that somebody actually put together already too uh so you can see how to do that i'm also like really bad though with bookmarks i mean i love the idea of the onboarding example but i'm really bad about bookmarks because like uh i i am not a heavy user of them and instead i'm like well i could just easily like re-google that or go back
Starting point is 01:49:40 to like uh you know what the reason the thing was. So I rely on search capabilities either through whatever that site or services or the or like, you know, Google, if it's public stuff so bad that it's more like. Because then otherwise, like, I'll forget that I have a bookmark for it. You know, do you ever do that? And you're like, Oh, I'll just go and like, uh, query the wiki for that page rather than realizing like, Oh, I already bookmarked the link to that page in the wiki. Yeah.
Starting point is 01:50:14 It just, uh, like where I get, uh, where I stopped doing that as well with number one, wiki searches are notoriously terrible, but two, um,
Starting point is 01:50:22 like logs. So if you have a bunch of different environments in a cloudy environment and you're like trying to get this like certain pods or certain like queries that are common you know like oh it's such a pain to like navigate to the right project and the right cluster and then you know the normal kind of filters you add and so i abuse those yeah yeah you know what i do like about this so i'd never thought about that in terms of onboarding if you you did it like that, then everybody's bookmarks will be in the same place. So when you're communicating about what page to go to, you can be like, hey, go to this bookmark folder and this bookmark and you're good. Like that's actually really nice for communicating.
Starting point is 01:50:59 Yeah, it goes back to, which book was it where we were talking about having a common language? Ubiquitous language. Yeah, yeah, yeah. Domain-driven design, sir. There you go. It kind of adds to that, right? Because then you're all talking about the same thing. All right, so for my tip of the week, we got a remix! Yeah, so I didn't realize that we had already
Starting point is 01:51:27 called this one out before, but then Jay-Z saw it, and he's like, hey, you realize I've already talked about that, right?
Starting point is 01:51:33 So my tip of the week, well, I guess it's more of a reminder, a remix,
Starting point is 01:51:39 of using the Powerlevel10k theme for Zsh. And the reason why this became a thing for me, why I called it out, is, I forget which theme I was using for Zsh, but I'm working in a pretty large repo. Not the 20-gig repo that I think Alan mentioned at the start, or Joe, I forget which one of you mentioned it, but not quite that big. But it's big enough to where, every time I would execute any command on the command line while in that repo, the theme was trying to also report back, like, you know, Git status kind of information: what branch you're in, and if it's dirty or not, you know, if there are any changes in it or not.
Starting point is 01:52:28 Fairly simple thing, but literally it would add a few seconds of time to execute any command. Like, you just do a simple ls, and it was so frustrating. And Joe was like, hey, you should try this other theme, and that pain will go away. And sure enough, with Powerlevel10k, I no longer have that pain. So I didn't realize that it had already been given away as a tip, but yeah, so now you've got a remix of it.
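For anyone who wants to try the same remix, here's a minimal setup sketch, assuming Oh My Zsh is your plugin manager (Powerlevel10k's README documents other install paths as well):

# Clone the theme into Oh My Zsh's custom themes directory.
git clone --depth=1 https://github.com/romkatv/powerlevel10k.git \
  "${ZSH_CUSTOM:-$HOME/.oh-my-zsh/custom}/themes/powerlevel10k"

# In ~/.zshrc, point Oh My Zsh at the theme:
ZSH_THEME="powerlevel10k/powerlevel10k"

# Restart the shell, then run the interactive configuration wizard:
p10k configure

Part of why it stays fast even in big repos is that it computes Git status asynchronously through its gitstatus daemon, instead of blocking the prompt on every command the way many themes do.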
Starting point is 01:52:55 Looks cool too, right? It does do a bunch of cool things, to the point that I'm questioning, like, oh God, did I go overboard with how much stuff I have available to me on the command line now? Because, you know, you see the execution time of everything, and the path is all, you know, color-coded and pretty and whatnot, and the branch you're on. And yeah, I really do like it, but I also might go back and redo my configuration for it. But that also made me question something, though. There was another, um, we got a comment, I forget
Starting point is 01:53:31 how it came in, either email or on the site or whatnot. But someone was talking about, since I'm talking specifically about command line here, the person who wrote in was saying something along the lines of, hey, you know, you guys give so much love to the command line, but there's a lot of advantage to knowing tools like GitKraken
Starting point is 01:53:51 or something like it. GitKraken was specifically the example given in the comment. But, you know, just knowing the UI tools, because of how much faster you can move around in those tools. And I don't discount that,
Starting point is 01:54:07 right? Knowing the UI tools that are available to you, certainly, you know, you can definitely move faster, and that's their whole point.
Starting point is 01:54:15 That's their reason for being, right? But I was curious from your guys' perspective, though, because, I guess maybe it's just, you know, career-wise, what kind of things you've had to work in, or the luxury of what you've had to do. But I've never found myself being in a situation where, like, I can only ever use that tool, right?
Starting point is 01:54:37 I also many times find myself where, like, I need to be able to do things by the command line. Like, this is all the access that I'm given to a particular system. And so, I guess for me, I like to use the command line as a practice,
Starting point is 01:54:51 to keep those skills sharp, right? So that when I do get into those environments where that's the only thing I have available to me, I'm in the habit. Does that make sense? Yeah, totally.
Starting point is 01:55:04 And the argument, I mean, that's not even an argument, but I love k9s, the UI for using Kubernetes, and one thing, it's kind of a terminal UI. Yeah, terminal UI. So, yeah, it feels like a CLI, but it's not. It's kind of a mix of that and vi.
Starting point is 01:55:21 But, you know, there are definitely times when I want to script something and I realize I don't have that muscle memory built. But you can go so much faster, and cover so much ground, and it kind of teaches you too. So I have mixed opinions. You know, I think it's important to know both, and I think it's fine to use a UI if you like it. So I don't know, I'm torn on it. I 100% agree with what Jay-Z just said. The reason why I do the command line as much as I do is, as soon as you want to automate something, that's kind of what you have to do, right? So that's one thing. But I have found myself, and I think I mentioned this, I don't know, several episodes back, I have found myself, especially with Git, using things that are built into IntelliJ, right? So for instance, I did this with Jay-Z and somebody else recently: I had a fairly large PR that I needed to walk through, and looking at that inside GitHub's pull request was useless because there's no context, right?
Starting point is 01:56:28 So what I could do is go to the UI tools in IntelliJ and say, hey, show me the files that changed. And it would actually give me a list of all the files on the file system that changed, and I could click a file and it would show me the diffs between both of them on the page. And so I could easily navigate in a way that's easy for people to consume, and say, hey, I made this change here because it relates to this over here, right? And this is where it is in the file system, so I can show you that, hey, this was in this module, and these changes were made, and this is how it relates to this module. So, you know, I like UI tools, and I definitely find myself using them, especially when I need to communicate with other people. But I agree, man, doing things from the command line is ultimately how you end up automating things in a lot of places. So you need that muscle to even know how to go about doing it.
Starting point is 01:57:21 Yeah, I mean, I like that. You know, I guess, said another way, it's good to know and use both. There's a time and a place. And I don't want to say that we're harping on the command line so much as we're trying to keep that skill sharp. I think that's it for me. So, yeah. All right. Well, subscribe to us on iTunes, Stitcher, you know, Spotify, wherever you like to find your podcasts. And, as Jay-Z tried, but failed, to say earlier, and I had to go back and correct and make it better. I'm pretty sure that's what happened. That's the way it went down.
Starting point is 01:58:07 Yeah. Send your bad reviews to... wait. Send your reviews to... okay, here's some helpful links, available at www.codingblocks.net/review. Hey,
Starting point is 01:58:20 and while you're up there at www.codingblocks.net, make sure you check out our show notes, which are extensive, examples, discussions, and more, and send your feedback, questions, and rants, apparently, to joezack at Slack. And definitely check out our Slack channel. There really are just some awesome people in there,
Starting point is 01:58:38 which I have not been in this past week because I've been crazy busy. So, yeah. Yep. Also, we're on the Bird site, Twitter, at CodingBlocks. If you've got some mean tweets to sling at us, you can do it over there. And also
Starting point is 01:58:51 CodingBlocks.net. You can find all our social links at the top of the page if you want to hit us up on those dillies.
