Coding Blocks - Site Reliability Engineering – Monitoring Distributed Systems

Starting point is 00:00:00 And you're listening to Coding Blocks. Hey, guess what, it's episode 185. Subscribe. I gotta keep you guys on your toes. So, you know, uh, yeah, I think I had a heart attack. Yeah, there you go. You like that? It would come in, like, with some fire, some gusto. So subscribe to us on, you know, iTunes, Spotify, Stitcher, wherever you like to find your podcasts or your apps, uh, but mostly your podcasts in an app. And I hope we're there. And, uh, if not,

Starting point is 00:00:27 you can complain to Jay-Z about it on Slack. Um, you can reach him at Alan, and I think I got this right. Wait, I might've crossed signals. It'll get there. I'm pretty sure he threw himself off too. All right.

Starting point is 00:00:39 So you can visit us at CodingBlocks.net where you can find all our show notes, which are amazing, examples, discussions, and more. And you can send your feedback, questions, and rants to comments at CodingBlocks.net. And we've got Twitter at CodingBlocks. And we've got a website, www.CodingBlocks.net. And we've got a bunch of social dillies at the top of the page. And with that, I'm Joe Zack. I'm Alan Underwood.

Starting point is 00:01:04 And I'm Michael Outlaw. Wait, is that what I sound like to you guys? No, I can't. I'm terrible at voices. So I don't know. I thought about it as soon as I did. I was like, oh, man, I might get a harsh reality here that I don't want. Oh, am I strong enough to take this?

Starting point is 00:01:22 It's like, ooh, buddy, here it comes. You say you're nasally. I don't even know how to do that. I'm Michael Outlaw. I don't know. Oh God, it's worse. Why did I say it?

Starting point is 00:01:31 I should have stopped while I was ahead. No, it's all good. I can't, I cannot do voices. I can barely do my voice. This episode is sponsored by Retool. Stop wrestling with UI libraries,

Starting point is 00:01:44 hacking together data sources and figuring out access controls, and instead start shipping apps that move your business forward. And Shortcut, you shouldn't have to project manage your project management. All right, so we're picking back up with the site reliability engineering. I think Jay-Z should say that from now on. And we're doing the monitoring distributed systems this go around, and I actually went and found the link in the free version of the book online if you want to follow along. So that'll be up there in the show notes. Oh, good idea. I didn't realize you could do that. Right. Yeah. So, um, it was

Starting point is 00:02:22 an epiphany that happened this evening. You know, we're six chapters in, why wouldn't that happen now? So a link, like, you're just brushing your teeth and you're like, what if we put a link on the web page? Yeah, it's crazy. Uh, but before we dive into that juiciness, we like to go over some news and reviews and all that kind of stuff. And as always, I think we have Outlaw here or Alan here to read off. Thank you for recognizing. I'm so confused. Yeah.

Starting point is 00:02:59 So from Audible, we have just brie, and from iTunes we have, um, one two three four five six six seven seven eight eight eight eight nine nine nine nine nine zero zero zero zero zero. And then, um, I'm thinking this one would be like Manic, or maybe, either that or they're from North Carolina, so it'd be Man and Sea, one of those two, or just Man Sea, maybe. I think any one of those is good. And then my favorite, Good Beer Hunter. It's always good to be a good beer hunter, I believe.

Starting point is 00:03:35 I was assuming that was supposed to be like a play on the movie. Good Will Hunting, but Good Beer Hunter. Possibly. I could be wrong. Hey, and to be clear, SweetWater 420 is not one of those good beers. Oh, wow. We, yeah, we get IPA in it. It is an IPA. Yeah, I'm not a fan. So we can all argue about that if we'd like later. So, all right, if you like IPAs, you know, you like nasty stuff. Let's argue about, I'm not, I'm not a beer connoisseur, so I guess, you know, because I like Guinness and that's where I draw the line, and I'm like, there's no sense, you know,

Starting point is 00:04:13 bothering with anything else. I'm just, I'm done with that journey. That is done. Oh, that's no fun. You gotta try them all. Is that, see, but yeah, if you're a connoisseur, then you would say that. I wouldn't say I'm a connoisseur. I just know what doesn't taste good, and SweetWater 420 is on that list. Yeah. Yeah. So I mean,

Starting point is 00:04:32 that's an Atlanta, you know, beer. It is the shame that it's so bad. Yeah, it really is. I don't know. I it's,

Starting point is 00:04:39 I've had a, I've had some pretty rewarding SweetWaters after, like, getting off the mountain bike trail, you know. And I would imagine anything tastes good after that. I wasn't going to go there with it, but okay, I see what you did. Yeah. All right, all right. Well, hey, I got a bunch of news this time. So, uh, I'm sorry, I already mentioned several times, many times, uh, an actual SRE at Google. Probably during while this book was being written or at least not long after. But had really great feedback after the last couple of episodes.

Starting point is 00:05:14 Last episode was no different. And this time he actually went and wrote a full-length blog post on it. It was really great and went over a lot of things. He emphasized some parts. He challenged some parts. He had some additional things to kind of add. Just really great posts. And so you should go check that link out and read it. It's really good.

Starting point is 00:05:30 Also, I had some really great recommendations just on kind of posts. There were a couple things that kind of happened, you know, I guess earlier this year and the last couple months, recent history. The Slack outage from January 4th, and it had a really great post-mortem online so you can read about

Starting point is 00:05:44 what happened, how they figured it out, what they did to prevent it. Just really interesting. And also another one for Atlassian, you know, that they had that big outage in April, and they ended up actually losing data for customers, and what they did to find that, fix that, and how they tried to recover the data and put Humpty back together again.

Starting point is 00:06:04 So we'll have a couple of links to that. Before you go past that, that one that Merle had shared on the, on the Atlassian one, that was such a good read. And I actually really, even though as a customer of that company, you would be really upset that you lost data,

Starting point is 00:06:20 their approach and their openness and their frankness about how they do things was so refreshing to see from a technology company. One that companies kind of lean on for a lot of their operational things. So that was also a very, very good read. Yep. I got one more for you. So we've mentioned Simon Barker on the show several times. Also, we've mentioned Brandon Lyons several times on the show.

Starting point is 00:06:44 And Simon's got a podcast which should be mentioned, All the Code, and he had Brandon on as a guest recently in a really good episode, so I listened to that about Brandon's journey. He's working at Microsoft. They covered a lot of ground, so it's just some really cool stuff. Brandon's got an interesting history

Starting point is 00:06:59 and Simon's a great interviewer. It's just a really great show overall, even if you don't know those people. If you're in the Slack, uh, they're both kicking around, so you should check it out. And then, did I put this note in? No, no, I did, because it's funny. It's not real, is it, Outlaw? I don't think it was... It is, it is real. It is? Oh, okay. Why would you think that it isn't? This is something about Costco that's near and dear to your heart. Why would you assume that it wasn't real? It's so amazing. So for my fellow Costcoers out there, I know Micro G, I believe Catch, um, I believe there's several other people, right, that we constantly are talking about our love of Costco.

Starting point is 00:07:39 Well, there's a link here in the show notes that's amazing. Costco is selling a Kirkland Signature t-shirt that actually has the words Kirkland Signature on it, and it's $12.99. And Outlaw is like, man, there's no way you'd wear that. And I'm like, I'd totally rock that shirt. Like, I would wear that everywhere. That might be a close second to my Coding Blocks shirt. So, yeah, man, um, it was kind of funny. I've not seen this shirt, nor, maybe I buy it. I mean, for 13 bucks, why not? Um, it's a Supima tee. Supima, I don't know, am I saying that right? I guess. But yeah, that's pretty awesome. That is crazy. Uh, that's a whole other level of Costco fandom if you buy that shirt. When you know, you know. Yeah, I guess that's what it is. You know, it was actually kind of interesting, because there were people that were commenting on the, uh, the cake

Starting point is 00:08:39 discussion, right? And apparently, yeah, the popular vote was people talking about the Costco cake over the Publix cake. You know, and that's, oh man, I forgot, like, that should be a future tip of the week right there. Or not tip of the week, but survey. I really think, because, come on, a Costco cake over a Publix cake? Come on. You just don't know, man, I'm telling you. I mean, there's the quality of the Publix cake and then there's the quantity of the Costco cake. No, no, no, no, it's the quality in mass quantity. I'm just trying to trigger you. That's what I'm trying to do. I'm like, I'm going to say the right combination of words about Costco and Alan's going to have to leave to go take his medication because he's going to be twitching over there on camera.

Starting point is 00:09:30 We're actually rebranding the show to Costco Blocks. But, hey, in all honesty, last thing on Costco, if you've never had their chocolate chip cookies, like, they don't look special. But if you take a bite of one, like, it's dangerous. You really want to eat all of them. Man, I just love that place. Yeah, man, have you ever had their tres leches cake? Oh my God, dude. Oh, it's the best. Well, my wife and I do the same. We say uno, dos, tres leches, and that's like, you know, you shout that as loud as you can in the house and then, like, lets everyone know it's time for cake. That's amazing. So good.

Starting point is 00:10:10 Okay. Well, I guess. What are we talking about? I'm so, uh, I guess in this episode, we're going to dive into all things related to Costco and like all the good deals that are going on and, uh, good tips, you know, when you go to Costco, like what aisles you should, you know, uh, pay more attention to it. Cause I mean, let's, let's admit it. You're not going to say like which hours you should avoid. Cause that's just, that's crazy. Go down all of them. It's an hour and a half thing. Yeah,

Starting point is 00:10:39 for sure. Oh, that's too long. No, no. All right, all right, let's get it back on the tracks here. So, um, this first section that we're talking about here in monitoring distributed systems, there were a couple of things that they called out. There are the issues that you should send a page on, send an alert on, that actually interrupts a person. And I don't know why, but I could not type in the word human in the notes. Like, they want to say everything is for a human, like if you're interrupting a human. And I'm like, man, it's a person. This isn't Mars. This is a person. It was driving me crazy. But why can't a person be a human? I don't know, man. It was literally like one of those things, like, I would twitch every time I'd see them refer to it as a human.

Starting point is 00:11:26 Like what else was it going to be? Did it interrupt your cat and your cat had to go do something? No, it was a person. Anyways, all right, I'm done. No, hold on. Wait, no, we are not done with this topic. Because wouldn't everything you just said apply to person as well? No, cats are people.

Starting point is 00:11:43 Oh, are they? Okay, I must have missed that one. What do you think they are? If you have a cat, or at least if you have more than one cat, you know they're persons. You know, my son was telling me that there was a study about where

Starting point is 00:11:58 cats do recognize their name, but they just choose to ignore you. God. That's why I don't have cats. All right, so there's that, the ones that should actually send out an alert and, you know, take somebody off task, whatever they're doing. And then there's the type of issues that you should not send a page for, and you shouldn't even, you should know how to deal with them, right? Like, the applications maybe do something, but they should not interrupt the regular flow of the day. And those are two big things coming into this. So the one thing though that was like

Starting point is 00:12:37 going on in the back of my mind though, right, was, wasn't there one of the previous chapters where they were saying that, like, anything that you might want to alert on, like, it should try, like, uh, I think they worded it as they preferred automation over, um, they preferred, like, automated solutions over, what was the thing for, like, uh, like manual? Well, they had a cute expression for it. I think I remember which episode it was. I'm going to look that up. Yeah. But,

Starting point is 00:13:09 but, but the point was, is that like, if there was some kind of an alert that it was going to be sent out, that the system would like respond to it on its own. So the only things that you should be doing these types of alerts on were things that, um,

Starting point is 00:13:24 they're like, they're like, they're new, for example, and, and, and, you know, they haven't been seen before or,

Starting point is 00:13:30 and so therefore you, you haven't like, uh, coded the environment to be able to, uh, respond to it on its own. Right. Um,

Starting point is 00:13:40 and the quote, by the way, is we want systems that are automatic, not just automated. There you go. That was it. Yeah. So that was why it was like, you know, while reading this chapter, kind of like going on in the background in my mind, you know, like, but why are we monitoring? But why are we, yeah. Yeah, it's interesting. So they did throw out here at the beginning some definitions for things just so that you'd sort of have a baseline going forward. So monitoring, right.

Starting point is 00:14:10 That is collecting, processing and aggregating quantitative information about a system. Um, I mean, I didn't put this in notes, but that's things like counts, right? Like error counts, latencies, timing, like all that kind of stuff, right? That's your monitoring. Yeah, I think it's good that you define it because I remember the first time I remember hearing about that we had to be SOC compliant or something at work

Starting point is 00:14:35 and one of the requirements is that we had to sign off on those that a system was monitored. And I remember at the time thinking like, well, what does monitored mean? Does that mean somebody is like sitting there watching this stuff? Like, what does that mean? How that mean somebody is like sitting there watching this stuff? Like, what does that mean? How do we say yes or no something is monitored?
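
The definition above (collect, process, aggregate) can be sketched roughly like this. It is only an illustration: the `samples` list and the `aggregate` function are made-up names, not anything from the book or from a real monitoring library.

```python
from collections import Counter
from statistics import mean

# Hypothetical raw samples: one (status_code, latency_ms) pair per request.
samples = [(200, 120.0), (200, 95.0), (500, 310.0), (200, 101.0), (503, 450.0)]

def aggregate(samples):
    """Reduce collected samples to a few quantitative metrics."""
    if not samples:
        raise ValueError("no samples collected")
    # Process: count requests per status code.
    statuses = Counter(code for code, _ in samples)
    # Treat any 5xx status as a server-side error (an assumption for this sketch).
    errors = sum(n for code, n in statuses.items() if code >= 500)
    latencies = [ms for _, ms in samples]
    # Aggregate: counts, a rate, and latency summaries.
    return {
        "requests": len(samples),
        "errors": errors,
        "error_rate": errors / len(samples),
        "mean_latency_ms": mean(latencies),
        "max_latency_ms": max(latencies),
    }

summary = aggregate(samples)
```

In that sense a system is "monitored" once numbers like these are being gathered somewhere you can look at them, whether or not anyone is watching a screen.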

Starting point is 00:14:48 And so it's kind of nice to say, like, a system is considered monitored if you've collected, processed, aggregated that information. And I think they don't call it out here, but I think that's a really good point, is all the systems out there have all these values that are coming out of them. So the monitoring is actually bringing that data into a place where you can look at it, right? Or see how it's happening. So the next one up that they have, they call white box monitoring. And this is monitoring based on metrics exposed by a system. So JVM profiling, um, system event logs, right? Like, Windows has its own event logs, Linux has its own stuff. Like, that is all white box monitoring because more or less there's an API or a known way to get to it. And these are the things that you expose, like, you have access to, to make the kind of changes and expose the kind of things

Starting point is 00:15:43 that you want. Yep. And then you have black box monitoring. This one's kind of interesting. So the reason they call it black box monitoring is because anytime you talk about something being like a black box, it means you don't have direct sight into it. Right. And so this is seeing a system as a user would see it. Right. So if you were to go to a web page and click on something and it takes five seconds for it to load, that's what you're experiencing, right? You're not reading latencies and response times or anything from the system.

Starting point is 00:16:16 This is just what you're experiencing, seeing as an end user. So that's what black box monitoring is. Yeah, you're going after the end result of it. Another thing that struck me when I read that part though, is that, I don't know if you guys have heard of this, but I've heard of the white box and black box in regards to unit testing. Have you guys heard of it in that regard? Where it's the same concept, but the idea is that, uh, you know, with the black box, you're going after the end result. So it's more like, you can almost think of it as like black box unit testing would be like end to end, you know, kind of integration testing where you
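
To make the white box versus black box distinction concrete, here is a minimal sketch. `FakeService`, its counters, and `black_box_probe` are hypothetical names invented for illustration; a real setup would read metrics from an actual service and probe it over the network.

```python
import time

class FakeService:
    """Hypothetical service that keeps internal counters (white box data)."""

    def __init__(self):
        self.requests_total = 0
        self.errors_total = 0

    def handle(self, path):
        # Only "/" exists in this toy service; anything else is a 404.
        self.requests_total += 1
        if path != "/":
            self.errors_total += 1
            return 404
        return 200

    def metrics(self):
        # White box: the service itself exposes what it knows about its internals.
        return {"requests_total": self.requests_total,
                "errors_total": self.errors_total}

def black_box_probe(service, path="/"):
    # Black box: act like a user, observing only the outcome and how long it took.
    start = time.perf_counter()
    status = service.handle(path)
    elapsed_s = time.perf_counter() - start
    return {"ok": status == 200, "elapsed_s": elapsed_s}

svc = FakeService()
probe = black_box_probe(svc)   # what a user would experience
internal = svc.metrics()       # what the system reports about itself
```

The probe never looks at the counters, and the counters say nothing about what the user saw, which is why the two views tend to be used together.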

Starting point is 00:16:53 like did the page load? How long did it load? Like, does it have the right content on it versus, you know, the, the white box unit testing might be like, did this specific method return the value that I wanted to have returned under these conditions? Right. And, and so, but now they're using those same terms, but from a monitoring kind of point of view, but you know, similar kind of concepts, just, you know, monitoring. Totally. And then the last one is dashboard. And I think this is what ties it all together, right? The dashboard is what provides a summary view for the most important metrics for a service. So what Jay-Z was bringing up a minute ago with the monitoring, right?

Starting point is 00:17:33 Like, what does that mean? Well, if you don't have a place to actually go look at it and to gather that information and then to be able to alert on and all that, then you don't really have a monitoring solution set up. So this part, Oh, I think we're going to say something good. Well, probably not.

Starting point is 00:17:51 Uh, cause this part specific to the dashboards though, I kept thinking like in our own world where I'm like, maybe we have too much stuff going on in our dashboards. And like, uh, like as I kept going through this chapter, like I'm trying to apply this to my, my, uh, you know, day job kind of situation.

Starting point is 00:18:10 Right. And I'm, and thinking like, okay, well, they strongly advocate throughout this chapter about like, almost like a less is more type of approach. You know, if you put all that data in front of somebody, you know, then, you know, the real issues might get lost. You know, it's the needle in the haystack kind of thing, right? Yeah, it's nice to have when you need it. There's definitely metrics that sometimes can be really hard to go get. And so it's nice to just kind of have that at the ready when you want to go looking for it, but somebody who's less familiar with the system to go in and see like 18 graphs on something when,

Starting point is 00:18:48 when they actually have some recommendations on it, like the basics, it'd be nice to have something, maybe it's just kind of higher level that you can kind of get to first and then drill in if needed. It almost made me think of like, you know, I'm thinking specifically about like a Grafana,

Starting point is 00:19:07 for example, and that maybe you would have like one set of dashboards that are like, hey, oh, hey, here are the key overview dashboards. Like these are the things that I want, you know, everybody to pay attention to. And then maybe hidden off in another folder, you would have like, you know, a technical analysis or debug kind of folder. And that's where like, you know, you could see all the internals of some specific thing that you might be investigating and you might want to go look for like, hey, what was the trend of this particular, you know, what was the size of my database over this period of time or whatever, you know, but it's like, it's you had to go, you had to know that you wanted to go look for it and you went hunting for it. And so, yeah, you had the data

Starting point is 00:19:49 there to your point. Like it, there are times where it could be helpful, but the stuff that's in your face is like, you know, here's a dashboard that might have like four or five things on it. Period. Well, we talked about this, I think even on the last episode where like, you know, some of the just as an example, like the Grafana dashboards for Kafka, right? Like there's dozens of things that they chart on there. But for you, what you care about in your service that you've created is, is data making it from the user input into my system, right? And that might only require one chart. It may be two depending because you might need latencies and you might need throughput or something, but, but that's where, yes, it's important to monitor your infrastructure,

Starting point is 00:20:36 but your business level goals, you know, is probably less is way more, right? Like, is it doing its job? And that's super important. Yeah, I like an example there is like, imagine you hook up like a Spring app or just any web app for kind of a compiled language, like a garbage collected language. They're probably gonna have metrics on garbage collection.

Starting point is 00:20:59 How many things are in generation one, two, three? How often it cycles? What's the size? What's the utilization? How long does it take to do garbage collection? You can imagine like a whole page just of graphs and charts for garbage collection. That's such a lower level of detail

Starting point is 00:21:12 than you might want if you just want to know if the system's up. You know, like how are we doing? Is it crashing? Is it working? You know, and so if you kind of flood out the good stuff or, you know, that stuff's great when you need it

Starting point is 00:21:24 because you don't like, it stinks to go hunt for that information when you're really trying to find, you know, debug a problem and to not have it and to have to export it then. And so it's good to kind of have it in your back pocket when you need it, but it needs to be your back pocket. Right. Yeah, so when you see my pull request

Starting point is 00:21:40 to like reorganize all of our dashboards, you'll understand. Good luck, man. We'll see that in a bit in a year. Oh, why do you think I would like take so long? Like,

Starting point is 00:21:54 no, I thought that would be quick. I mean, I'm just saying, Oh, I mean, I thought you were trying to like, again,

Starting point is 00:22:02 like if you only said it in my voice, then it would have been even worse. Like double whammy. I got nothing there. Alan is on the attack tonight. It's been a long week, man. So the other thing that they actually mentioned, it's funny. Like when I read this, I was like, oh, that totally makes sense.

Starting point is 00:22:20 But I don't, I don't typically think like this is they said, Hey, on these dashboards, you might also include things like team information. So, um, yeah. How many tickets are in your queue? Um, what are the highest priority bugs that you have to go on? Who is the current on-call engineer? Like those are all things that would be great to see at a glance on the same screen where you're monitoring your application. Okay. Now, personally, like being able to see like the tickets that are in the queue and all that kind of stuff, like, you know, I mean, because there are platforms out there where you can assign the tickets

Starting point is 00:22:55 and you can already see dashboards for all that kind of stuff. I'm not concerned about that. But the idea of being able to see like in the dashboard, like who the on-call engineer is for that specific part. It's nice. That integration, I was like, oh, I want that in my life. But then I was also kind of thinking, maybe I don't want that at all.

Starting point is 00:23:15 It depends on if I'm on-call or not. Yeah. And so now they go on to define the alert, which we've already sort of talked about, but this is a notification that's intended to be read by a person. Now, they probably said human, but we already went over that. Um, these are tickets, you know. I'm gonna, in the show notes, I'm gonna do a find and replace for person and I'm gonna replace it with human. That's fine, as long as I don't look at it ever again. Just kidding. Oh, root cause. Now this is a good one. I actually like how they define this. This is the best I've ever actually

Starting point is 00:23:50 seen this defined, in terms of how I've ever heard it. It's a defect that, if corrected, creates a high confidence level that the same issue won't be seen again. I really like that. Usually when you hear root cause, it's like, well, what caused the problem? And that's it, right? Like you write up a paper, you write why it happened and what you're going to try and do to mitigate it, instead of treating it as the thing that can be corrected so that you never deal with it again. I love that. Um, they also then said that there can be multiple root causes for any particular incident, right? Um, like for instance, I don't know. The thing was running out of RAM.

Starting point is 00:24:35 Oh, and by the way, we didn't test it for this, right? So you didn't have any unit test in place and there was this thing that happened. So you would have never known that it was going to happen, right? So those could be two root causes that you actually have the ability to go out and fix. Sorry, I was looking at cakes on Costco. We talked about node machines, is that where we're at? Node and machine, yeah. All right, yeah.

Starting point is 00:25:00 So single instance of a running kernel. So yeah, I had to put this in here because I don't know. The thing is, we have a lot of different types of listeners to the show. Right. And if you're not a computer science major or you didn't go through a traditional computer science thing, you may not even know what a kernel is. Right. So I wanted to throw this in here just so that everybody's on the same playing field here. The kernel is the core of the operating system. So it's usually what controls everything. It's always resident in memory and it is what is doing all the interactions between the hardware and software. Like if you open up Task Manager in Windows, right. The, the ability

Starting point is 00:25:45 to go kill a task, your kernel is the thing that is sort of controlling all those running tasks, scheduling them for processing time and all that kind of stuff. Right. So it is the heart of how your operating system works for a single computer, for a virtual or, you know, regular computer, either one. Yes. A virtual machine or a real, you know, on the metal computer. And I think that was the important thing that they were calling out here is they're not calling it a computer. They're basically saying any instance of a kernel, right? So if you've got a bare metal system that's running 10 VMs, then you've technically got 10 kernels running, right? And each kernel is a node.

Starting point is 00:26:29 Or 11, right, the host. Correct, yeah, good call. So, yeah. And they actually said that, like, node and machine here are interchangeable. Yep. Yeah, the way they said that kind of bummed me out because it was like instantly my mind just conflicted and fell over. It's hard to tell now.

Starting point is 00:26:48 It's like, let me explain to someone with node, kernel, machines. Maybe they're a newer programmer. They've only been programming for a year. You start out by telling them first that back in the day, a computer used to just be a computer. And it wasn't always the case that you had multiple computers on a computer. That's new, kind of, in the grand scheme of things. Yeah, I guess it's not that new. I love when, like, uh, you know, when I try to explain, like, Kubernetes to somebody that's never touched it before and you start with the node and

Starting point is 00:27:15 you're like, well, okay, you know, so a node is the physical piece of, uh, hardware, you know, that's going to run pods and everything. And then you get through this whole explanation of, like, what it could be, and then you're like, okay, now that I've got that settled in your head, let me blow your mind a little bit. A node could be virtual. That node didn't really have to be a computer. Yeah, yeah, yeah. So yeah, containers and pods, which run on nodes, which run on virtual machines, which run on machines, theoretically. No one's actually seen a machine in many years now, so we're not really sure about that anymore. They're all locked up in data centers. Like, aren't those like the new age submarines? Like, people never actually come out of them. They sleep

Starting point is 00:27:53 in bunks down there. So I don't know. Um, but at any rate, now that we've level set on that, you know, that should help a little bit. Um, they say that there could be multiple services worth monitoring on the same node. They could be related or unrelated. I mean, they gave some examples of like a web server and a cache server, right? Those could be related. So you might monitor both those things. It could be two things that are totally unrelated, like your Git server. Yeah, Git repo, Git server, and, I don't know, your web server. Like, those are completely

Starting point is 00:28:27 unrelated. They don't have anything to do with each other, but you might be monitoring them. I don't even, honestly, when they called that out here, I was like, why? Like, why are you calling it out? Like, there's some things you want to monitor, some things you don't. Like, that seems to be simple enough. But whatever. I didn't write it, so I guess I can't give them too much garbage about it. It's mostly fine. It's mostly fine, except the humans thing. Yeah, humans. Yeah. So, uh, in this episode of Critics Corner, Alan discusses... Yeah, I just teased it, but, you know, I don't love this next term either. But hold up, hold up before we go there.

Starting point is 00:29:17 Is it does it happen to you guys like when you're reading somewhat mostly technical things like when when they call out things that you're like, man, this is already enough to consume, right? This is already enough for you to go. Okay, I read this five times. I think I got it this time because I was paying enough attention the fifth time to catch it why are you gonna throw in things that really just seem to not matter like that distract from the overall meat do you guys feel that way sometimes when you're reading things like designing data into or design driven yeah domain driven domain driven design like that was one of those books and i was just like there there's really good content here, but I'm having a really hard time digging it out of all the words. Now it was thick. Definitely. Yeah. Yeah. So anyway, yeah, I, I'm sorry. I'm not, I'm not trying to be mean. Thank you for the free resources,

Starting point is 00:29:56 Google. It's amazing. Um, but yeah, all right. I'm done. Asterisk. Yeah. And the last one's a push, referring to any change to a running service or its configuration, which is fine. It's just, you know, I like the word change. Change to me means it can be a change to a running service, but changes can also be like configuration. Like, we call it change management, not push management. So it's just kind of like, I don't know, I don't understand the distinction. I don't know why they went with that one, but I'll let it go this time. I don't know. That one made sense.

Starting point is 00:30:32 Yeah, I don't hate it. Yeah, I think they're just saying push the change to whatever that environment is, right? I don't know. Maybe just wording, right? Maybe things have changed since this was written and the wording is different to us now. I'm just thinking they get really confused when they listen to some Salt-N-Pepa. And that is a very topical reference that I've just made. And I'll,

Starting point is 00:30:55 I'm sure I'll have more. That's good. I think you just dated yourself, which is kind of weird. That's a weird phrase too, right? Anyway, we could do this all day. So why should you monitor? I've got a couple reasons here. Uh, analyzing trends, seeing things

Starting point is 00:31:15 whether they're going up or down um being able to compare them over time which is good of course you're going to alert when there's a problem which is nice because then you don't need someone watching the screens all day long. Hey, hold up on the comparing over time. So this wasn't just the, the things that were changing. They were basically talking about if you made a change in a system,

Starting point is 00:31:33 like what, what did, you know, before I made the change in the system and after I made the change in the system, did the CPU start doing better or did it do worse? Did the web page load slower after this new release than what it was before? Right.

Starting point is 00:31:50 So without that monitoring, you wouldn't be able to see what those changes were. You wouldn't be able to graph those trends between the two different things over time. It kind of implies there's this ability to correlate, you know, your releases as part of your monitoring. So, you know, I'm not sure how they would have that wired up, and I honestly haven't seen that wired up, but it sounds super cool. Other than like manually knowing, like, hey, this is when we released, and I can see this in Grafana, I know that, like, oh, that spike there, or that dip, is where we did a release. Or if you took something like Grafana, and let's say that you had enough stats backed up for a week or something, right? You could totally graph, you know, today and then today minus

Starting point is 00:32:40 seven days on the same thing and see what's happening, right? Yeah, but I guess I meant something in a more automated fashion, to where, on the chart itself, there would be like, you know, maybe a line that says here's where this version was released, and then here's another one where this version was released, and you could see them on that chart without overlapping, like a single line graph. You remember DevOps Handbook said that you should do that. You should actually graph and chart those releases when they happen and all that. Yeah. Yep. I haven't seen how to do that in Grafana.
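
The before-and-after comparison described here can be sketched in a few lines. This is only a minimal illustration, assuming you already have metric samples as (timestamp, value) pairs and know when the release happened; the function name `compare_around_release`, the window size, and the sample numbers are all hypothetical and do not come from the book or from Grafana.

```python
from statistics import mean


def compare_around_release(samples, release_ts, window):
    """Compare a metric's average before and after a release.

    samples: list of (timestamp_seconds, value) pairs.
    release_ts: timestamp of the release (hypothetical input; in practice
        it would come from your deploy tooling).
    window: seconds to look at on each side of the release.
    Returns (before_avg, after_avg, percent_change).
    """
    before = [v for t, v in samples if release_ts - window <= t < release_ts]
    after = [v for t, v in samples if release_ts <= t < release_ts + window]
    if not before or not after:
        raise ValueError("need samples on both sides of the release")
    before_avg, after_avg = mean(before), mean(after)
    if before_avg == 0:
        raise ValueError("cannot compute a percent change from a zero baseline")
    return before_avg, after_avg, (after_avg - before_avg) / before_avg * 100.0


# Page load time in milliseconds, sampled once a minute (made-up numbers).
page_load_ms = [(0, 200), (60, 200), (120, 250), (180, 250)]
before_avg, after_avg, change = compare_around_release(
    page_load_ms, release_ts=120, window=120
)
print(f"before={before_avg}ms after={after_avg}ms change={change:+.1f}%")
```

A fixed window on each side of the release is the simplest choice. It ignores daily seasonality, which is why graphing today against today minus seven days, as mentioned above, can be a better baseline.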

Starting point is 00:33:17 I've not seen how to do it in Grafana, but our friends over at Airbrake, who used to sponsor the show, they absolutely had the ability to do it, and they had an API, and you could also do it manually, just like clicking the chart and like noting. And it was really nice for like incident investigation. Cause like while you're trying to figure out something, you could just kind of add a little annotation and say like, okay,

Starting point is 00:33:33 this is where things went wrong. We don't know why this is, and this is a deployment and whatever. Yeah. There were ways to do it in things like Jenkins. And I actually talked to the group that handled that and they were like, yeah, we don't

Starting point is 00:33:45 give access to some of this stuff to be able to call these. And I was like, come on, man. Really? We can't do that? And they're like, no, we don't give you access to it. It's like, all right. So if you have a dashboard, like a page of dashboards, you can add a note on one and it'll put it

Starting point is 00:34:01 on all the others, which is really nice. Yep. Oh, yeah. So the dashboards are fantastic. They answer some basic questions and this is probably my favorite part of the chapter. They mention the four golden signals, but we're not going to tell you what those are yet, but

Starting point is 00:34:17 it's basically talking about kind of boiling down what's most important because you don't want to overwhelm and confuse the situation. And dashboards are fantastic for kind of targeting and being at an appropriate level for what you're trying to do. I think the four golden signals, like if you, you know,

Starting point is 00:34:34 each person that collects one, they get a free entry into the chocolate factory. Yeah. But if there's one flying around all the time and if you catch it, you somehow win the rest of the game, because it's so many more points than all the other Quidditches or whatever. And the Oompa Loompas. Yes. Maybe we mixed several metaphors there. It's great. I think we did. Yeah. Yeah, kudos if you got them all. Yep, gotta catch them all. Uh, ad hoc analysis when things change, uh, and also being

Starting point is 00:35:03 able to identify what causes this. The thing we talk about all the time is like with your observability, you need to be able to figure out if things are working and if not, what went wrong. I mean, kind of to Alan's point earlier, though, there was a lot in the beginning portion of this chapter where it was like, OK, I mean, let's get to the meat of it already. You know, some of this was like, yeah, it makes sense. But it was also, you know, setting the groundwork for like, okay, this is what we're gonna call these things: humans, and these things are machines. Humans are machines. And I also should say it lets you know when the system is about to break, which is really important. You know, things like disk space is probably my favorite example here. When you say like, hey, whoa, whoa, whoa.

Starting point is 00:35:49 At the rate you're going, you're going to be out of disk space in one hour or one week or whatever. Okay, this is where we're starting to get into my mind exploding a little bit as I was reading this. Because it was like, okay, isn't that something that we could make this system deal with that on its own? Why would that need to be an alert? Why is that an example? So check this out. So our disk space, it could just be something that we just need to add more disk. And maybe we can automate that.
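
The "at the rate you're going, you'll be out of disk space in an hour" idea is a simple linear extrapolation. Here is a minimal sketch, assuming disk usage samples as (hours_elapsed, used_gb) pairs; the function names `hours_until_full` and `should_alert`, the alert horizon, and the numbers are hypothetical illustrations, not something taken from the book.

```python
def hours_until_full(samples, capacity_gb):
    """Estimate hours until the disk is full by linear extrapolation.

    samples: list of (hours_elapsed, used_gb) pairs, oldest first.
    Uses only the first and last sample to estimate the growth rate.
    Returns None when usage is flat or shrinking.
    """
    (t0, used0), (t1, used1) = samples[0], samples[-1]
    if t1 <= t0:
        raise ValueError("samples must span a positive amount of time")
    growth_gb_per_hour = (used1 - used0) / (t1 - t0)
    if growth_gb_per_hour <= 0:
        return None
    return (capacity_gb - used1) / growth_gb_per_hour


def should_alert(samples, capacity_gb, horizon_hours):
    """Alert only when the projected time to full is inside the horizon."""
    remaining = hours_until_full(samples, capacity_gb)
    return remaining is not None and remaining <= horizon_hours


# Made-up numbers: 400 GB used two hours ago, 420 GB used now, 500 GB disk.
usage = [(0, 400), (2, 420)]
print(hours_until_full(usage, capacity_gb=500))
print(should_alert(usage, capacity_gb=500, horizon_hours=24))
```

Alerting on the projected time to full, rather than on a fixed "90% used" threshold, is one way to catch the fast-changing slope the hosts go on to describe while staying quiet for a disk that is nearly full but not growing.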

Starting point is 00:36:16 And that's great. But you can think of it like, what if that's a sign that there's a problem? Like something's going nuts. It's writing way too much. We're adding disk much faster than we expect to because, you know, it's out of proportion with our growth. Well then, your metric wouldn't be that you're running out of disk space. The metric should be how often are you adding additional space. Okay, I like, yeah, I mean, that's

Starting point is 00:36:39 that's modifying the thing, but same type thing goes, right? Like if all of a sudden you see some unexpected trend, you know, and this is where usually alerts aren't based off some linear thing, right? It's when you start seeing this slope that changes so fast that you're like, wait a second, somebody needs to take a look at this because this isn't normal right now. Obviously though, you know, I mean, you can tell I've kind of got a Kubernetes bias as I'm reading through this, though, because if you're dealing with physical servers, then you're like, well, hold up now, Michael. I mean, Alan, that's not so easy to just go and add a hard drive. The computer is going to do it for itself on the fly.

Starting point is 00:37:22 How's that supposed to work? You know, it's funny. I hadn't even thought about it because when you said that, I was thinking the same thing. I was like, oh, you just modify the PVC size. Yeah, it's totally different when you've got, like if you're managing a data center or something, right? Like that's a different ballgame.

Starting point is 00:37:37 I definitely wasn't talking about like trying to modify the PV size because that's not going to play well so easily. I mean, I guess technically it's possible, but I was totally thinking of horizontal and vertical auto-scaling that you could do in Kubernetes. Sorry, like managed

Starting point is 00:37:56 services for storage and Kubernetes for stateless. There you go. You know, that's like hardened and whatever, but yeah, it's just so nice to not have to worry about that. But that's also making some big assumptions about your managed service and that it can add disk dynamically, which is tough. So I think a lot of work probably goes into that. But you imagine, like, Gmail or, you know, whatever, they have to be

Starting point is 00:38:18 able to add disk drives as needed you know so that's something they're gonna have to figure out and that's uh something i'm glad I don't have to worry about. Well, you know, it's funny, like when you talk about automating that, that's also part of the reason why people's cloud bills are so high is because if you do just set something up to automatically be like, oh, it came in within

Starting point is 00:38:37 20% of capacity, it just keeps adding them. You go back and look at your bill and you're like, when did we have 10 petabytes of drive space attached to our thing, right? Like, when did that happen? So something went recursive, right? Right. Oh, this next one I actually liked too. They say you should never alert on something just because it seems off, right? Like don't try and call out every little thing that comes out, right? Like if it's not a problem,

Starting point is 00:39:07 let it be. Yeah. That's because you get too many false positives and humans hate that. And yeah, it makes them stop paying attention. And also one of the bullets that you have right there, if you finish that out, they said the problem with that,

Starting point is 00:39:22 it's just what you said, is if there are too many false positives or too many things that come through, humans have a tendency to just stop investigating because they feel like, oh, well, this is a waste of my time, right? They're not going to be as thorough. Yeah. And this is another example where you could actually lose the needle in the haystack, because when the real alert comes through, you can train yourself, and I've even caught myself doing this, to be like, oh no, I can just ignore these. And then you're like, oh wait a minute, no wait, I needed to go back and look at that one. Yeah, you get an alert and your first thought is, how can I relate this to something I know I can ignore? You know, from the wrong perspective.

Starting point is 00:40:06 It's later. I don't know that you guys had planned to talk about it, but there was later in the book where they were talking about examples of – they give some use cases. Jay-Z probably didn't read them. No, I'm just kidding. I know Alan did. I hate him.

Starting point is 00:40:23 Yeah, exactly. But, um, they were talking about like when they did get an alert for a specific thing, they're like, okay, well, while we're trying to resolve this thing, let's scale back, you know, what might cause that thing to alert, so that people aren't getting inundated with it and other engineers aren't, um, being pulled away from their work in order to try to figure this out when we already have people, you know, trying to resolve it too. So, um, you know, I kind of liked that in regards to trying to not inundate people with that alert. Yep, absolutely. And also, one thing I mentioned here is that paging a person is very

Starting point is 00:41:08 expensive. It takes them away from what they're doing. If they're sleeping, if they're, you know, whatever, like, whatever you're doing, like, people context switching, getting them off work, whatever, all that stuff. People are expensive, like, much more expensive than a computer in general. So, yeah, don't do it.

Starting point is 00:41:27 So, you talk about... oops, go ahead. Well, I was gonna say too, like, you know, going back to my whole thing about like, well, but it's supposed to be, uh, you know, what was that phrase again? Automatic, not automated. Automated. But they did say in this chapter, though, that a lot of these are like ideals to strive for and that even they didn't have it perfected across their teams. So, yeah, I thought that was a really good, like, honest revelation, that they were saying, you know, hey, even though we've written this book and we're sharing these ideas with the rest of the world, we're by no means perfect on it. Yep. And so it's just about building that good signal-to-noise ratio and kind of keeping the hay out of the needles, or vice versa, needles out of hay. Uh, but one thing I thought was kind of a fun question is like, how can you measure your signal-to-noise ratio? Like, how do you know?

Starting point is 00:42:27 Another thing I could really think of is like, you know, obviously if you've got alerts that go out, you need a metric on alerts that were false positives or false negatives. Well, I guess false negatives is a little harder. You can't really quite measure that the same, but it's almost like you need to have a meta dashboard there. JIRA stats is another way to handle that. Well, I would also say too, based on this chapter, that, uh, like a counter of how many times you've sent the same type of alert out, because

Starting point is 00:42:58 they strongly advocate in this chapter that, um, you know, these alerts are supposed to be novel, right? And so if you see that same alert 20 times, that's 19 too many. Yeah, right. And that would be noise. I feel like the Grinch when we talk about noise though. I always want to say noise, noise, noise. Yes, I get that. It reminds me of one of our past sponsors. I cannot think of the name right now and that's terrible. I want to say it was X something, but, um, like when something would happen... xMatters. When something would happen, it would automatically create tickets and all that kind of stuff. That would actually be a potential way to do it, right? To where if you created a ticket in your system and you constantly just say duplicate, duplicate, duplicate something, you could almost

Starting point is 00:43:45 do like a roll-up count of those tickets and the resolutions that were done and say, hey, we got this thing a hundred times and it was only ever something that mattered once, right? Like that's a 99% noise ratio, which is a problem. This episode is sponsored by Retool. Building internal tools from scratch is slow. It takes a lot of engineering time and resources, so most companies just resign to prioritizing a select few and settling for inefficient hacks and workarounds for every other internal business process. tools faster so they can focus on development time on the core project or product. Retool offers a complete UI component library, so building forms, tables, and workflows is as easy as drag and drop. More importantly, Retool connects to basically any data source, database, or API, offers app environments, permissions, and single sign-on out of the box and offers an escape hatch to use custom JavaScript when you need it.

Starting point is 00:44:50 With Retool, you can build user dashboards, database GUIs, CRUD apps, and any other software to speed up and simplify your work without Googling for component libraries, debugging dependencies, or rewriting boilerplate code. Thousands of teams at companies like Amazon, DoorDash, Peloton, and Brex collaborate around custom-built Retool apps to solve internal workflows. To learn more, visit retool.com. All right, who's begging?

Starting point is 00:45:19 I haven't done it in a while. Oh, gosh. Here we go. Oh, no. Let's do it. Oh. All right. Hey, there. Uh, it's Joe. Uh, Outlaw is in the restroom voice. Uh, yeah, Alan is... oh yeah, you ruined by what, oh, it's all messed up now. I was gonna ask for reviews because we need them, we love them. We got several this time, which

Starting point is 00:45:42 is really great, it makes it feel great. But now, I can't do what I was going to do. It was going to be super weird and awkward. I'm kind of glad that I stopped it then. So, yeah, it's for the best, really. I was just going to ask for a review because we know we love them. We try to make it easy. If you go to codingblocks.net slash reviews,

Starting point is 00:46:00 there's a bunch of links there that can help you help us with helping you find the right reviews. Look, I kind of did, I saved it. You're like, yeah, codingblocks.net slash review. Yeah, that's how I like it. Uh, yeah. Um, but yeah, it's all ruined, so I mean, if you could just hook us up, that'd be great. I give up. All right. Well, you know what, Joe?

Starting point is 00:46:29 Maybe I can make you feel a little bit better. Please. What does a clock do when it's hungry? Something ate. Something ate. Oh, wait a minute. Hold on. You thinking about it? Keep thinking about it, though.

Starting point is 00:46:55 We got time. Everybody loves this song. Hey, we can't go the full 30, though. Oh, okay. Well, that's fine, then. I realized we had to be... you got somewhere to be? This was a Final Jeopardy, man. This is the first question. Oh, you're right, you're right. All right, well, the answer is, uh, it goes back four seconds. Oh, geez. That's very good. Very good. Ah, it took a second. I gave you the signal. It didn't matter, man.

Starting point is 00:47:29 It didn't matter. I was trying to remember what the original question was. I was all up in the Jeopardy. See, yeah, okay, fine. I'll play the Jeopardy song again. You ready? No, just kidding. All right.

Starting point is 00:47:43 So, a few episodes back... Oh, I guess I should say, like, we now head into my favorite portion of the show, because I know that Jay-Z loves when I do this part: survey says. All right, you gotta like do the, uh, what's it called, the Doppler effect, you know. Oh yeah. All right, so, uh, a few episodes back we asked, how mature is your CI/CD pipeline? And your choices were: extremely mature, something like commits are made, bits are built, things are tested, PRs are merged, builds are deployed automatically every time. Or somewhat, as in we build and test PRs regularly, but deployments still require a person to initiate.

Starting point is 00:48:27 Or not even close. It's more like Leroy leaves his laptop running in a closet, and when we need a build, someone walks over and performs the 18 necessary build steps. So, this is 185. So, to TechCo's trademark rules of engagement, Alan, you are up first, man. I'm, I think I'm going to go the middle road here, uh,

Starting point is 00:48:50 because I'm not sure how many people work at large companies like an Amazon where it's going to be extremely mature. So I'm going to say somewhat, and I'll go 40% on the somewhat. Joe? Okay, hold on. I'm doing a little bit of math here. Okay. I'm going to go with the same answer. 40%. Wait, no. What?

Starting point is 00:49:23 That's my final answer. I'm pretty sure Price is Right rules you can't do that, sir. Is it true? Can you not do that? No, there has to be a difference. I don't know what to do. I mean, that's my answer. Negative 40%.

Starting point is 00:49:40 Okay, negative 40%. I'll take that. I don't think the math is going to work in your favor, though. It never does. When's the last time you attended a math class, by the way? Oh, geez. Hold on, let me count. Okay. 1, 2, C, D, E, F.

Starting point is 00:49:59 Minus the 14. All right. So, uh, Alan says somewhat as in we build test PRs regularly, but deployments still require a person to initiate with 40% of the vote. And Joe says somewhat as in blah, blah, blah, blah, blah,

Starting point is 00:50:18 but negative 40% of the vote. It could be, I'm pretty sure I know who's going to come out as the winner here. It's Alan, of course. Now, it would have been a super awesome thing if Jay-Z

Starting point is 00:50:36 I should have said Jay-Z just to trip you up, right? It could have happened. I thought he was going to be 41, honestly. I'm surprised he didn't. That's how close he came. Because had he outnudged you like that, by a single point, he could have walked away victorious. Because it was actually 52% of the vote.

Starting point is 00:50:57 Wow. Okay. That was my second choice, actually. I almost said 52. It was on the fence. You know, it's like, how do you take this guy seriously? Like, do I trust that answer? Do I really think that that was his second? Because I don't know.

Starting point is 00:51:14 That's so good. I don't think so. You never know. Just rewind and listen. I said it a few seconds ago. I kind of think Joe's lying to us. I don't know. Maybe.

Starting point is 00:51:22 Let me ask. Why are ghosts bad liars? Because you can see right through them. That's right. Look at that. Yeah. Was the number two answer, was it extremely mature? Yeah.

Starting point is 00:51:37 It really was. I think that's because there's definitely a number of people that work at the Microsofts, Amazons, those kind of places. Yeah, the FAANG. I guess now it's MAANG because it's Meta instead. I always thought that that was because Microsoft got added,

Starting point is 00:51:55 but then I realized, oh, wait, they changed their name. Yeah, MAANG. Hey, you know, we did that great resignation episode. Uh, Facebook, I've heard Meta, uh, hiring freeze, uh, Netflix looking at layoffs and, oh,

Starting point is 00:52:10 Twitter also, uh, hiring freeze. I think Twitter's like sort of this whole world of, yeah, that's a weird situation to be in. Yeah. Yeah.

Starting point is 00:52:22 Yeah. I haven't heard what's going on over there. Yeah, I heard something, and they were like no, and then they're like, oh, okay, and then he was like, no, no. Now they're like, yeah. I love that synopsis though. It's very dramatic. It's quite accurate though. All right. So you put a Twitter thing in here saying, ha, thanks. But by the way, that doesn't work.

Starting point is 00:52:51 So I'm not sure what that's supposed to be. Oh, yeah. So the reason the survey was inspired by a tweet that came from Kurt Frank. And so I was going to throw this into the show notes when I put the, when I was just going to add it in there. Okay. So, so then he already knows what this episode's survey is going to be, but for the rest of us,

Starting point is 00:53:12 this episode, we ask which book should we finish? You know, I mean, like it's a little bit of a self-own, I guess, you know, that we got to like take up here.

Starting point is 00:53:25 But, you know, I thought we like purposely were, you know, walking away from some of the books, like, you know, no, no, we're not trying to get into copyright infringement issues here. Like, you know, do the whole book, are we? But apparently people were like, hey, why don't you cover the whole books? They want to know if we can battle a dog in court. Probably not. I don't know. But, uh, it's actually illegal to do, uh, in the last chapter, basically, so as long as you stop right before then you're good legally. I'm not a lawyer. But we always buy these books and give them away too, so, I mean, they can't be too mad at us. Yeah, actually, and the authors that we've talked to have never been upset about it, right? Yeah, but don't tell them, right? Yeah, I'm also not sure that's how the law works.

Starting point is 00:54:13 i don't i don't know but uh yeah it probably doesn't know any of these authors don't tell them yeah should we have this our little secret if we're not law professionals, blah, blah, blah. Yeah. Please don't sue us. Yeah. All right. So which book should we finish? And your choices are, we got to start with site reliability engineering or site reliability

Starting point is 00:54:36 engineering. I should just say all of these names as Joe would. Yeah. Mush it up. Site-resolving engineering. And number two, designing data-intensive applications. I can tell what you're saying. Good enough. I'm not going to do it.

Starting point is 00:54:57 Maybe we should have Joe read these. But Designing Data-Intensive Applications was number two. Domain-Driven Design, number three. Next up, Clean Code and Clean Architecture. The Imposter's Handbook. The DevOps Handbook. Or any book, just finish one. Or, I actually like that you leave some of it for me to read on my own. Or just move on to another book, ain't nobody got time to go back to these old books. But you, you know,

Starting point is 00:55:28 what's going to be crazy is when people actually read these books on their own, they realize that it doesn't take four hours to finish a chapter. Oh my God. There were only 30 pages of that. Yeah. I actually read this chapter while I was sitting waiting for my Sonny's drive-thru.

Starting point is 00:55:45 Oh, Sonny's. I hadn't had that in a minute. Yeah. It's pretty good. Mine's got a drive-thru. It's like wicked fast. That's great. That's the series chapter. This episode is sponsored by Shortcut. Have you ever really been happy with your project management tool?

Starting point is 00:56:01 Most are either too simple for a growing engineering team to manage everything or too complex for anyone to want to use them without constant prodding. Shortcut is different though because it's better. Shortcut is project management built specifically for software teams

Starting point is 00:56:17 and they're fast, intuitive, flexible, and powerful. Let's look at some of their highlights. Team-based workflows. Individual teams can use Shortcut default workflows, or customize them to match the way they work. Organization-wide goals and roadmaps. The work in these workflows is automatically tied into larger company goals. It takes one click to move from a roadmap to a team's work to individual updates and vice versa. Tight VCS version control system integration, whether you use GitHub, GitLab, or Bitbucket, Shortcut ties directly to them so you can update progress from the command line. And a keyboard-friendly interface. The rest of the

Starting point is 00:56:59 shortcut is just as friendly to your keyboard with their power bar, allowing you to do virtually anything without touching your mouse. Iterations planning. Set weekly priorities and then let Shortcut run the schedule for you with accompanying burndown charts and other reporting. Hey, give it a try at shortcut.com slash coding blocks. Again, that's shortcut, S-H-O-R-T-C-U-T dot com slash coding blocks. Shortcut, because you shouldn't have to project manage your project management. All right. And we're back to, to page two of chapter six.

Starting point is 00:57:41 You don't know how many pages it is because it's on a web page. They didn't print it. That's right. Oh, that'd be hilarious if you did print it though. Like, no, I'm not that guy. We all know I like Kindle better than I do printed. So yeah. Um, so setting reasonable expectations for monitoring. This is kind of important, right? They actually call this out: monitoring complex systems is a major undertaking. So they mentioned that on their Google SRE teams with 10 to 12 members, they always had one to two people focused on building and maintaining their monitoring system for their service. So a fifth to a sixth of their team was already devoted to just making sure the monitoring was working the way that it should. They did say over time that they've reduced that headcount from two to one as they've improved and centralized some of their infrastructure and some of the things that a lot of their services share. So that's awesome, right? Everybody's getting to benefit from that.

Starting point is 00:58:45 But they still have at least one person that is dedicated to doing that for their service. Yeah, and that's crazy to me because that's a lot. You know, you're talking about like some percent, you know, 8% to 10% or something like that of your workforce imagining MNH. It's like working just on monitoring your system. Like that seems like a

Starting point is 00:59:05 whole heck of a lot and, uh, it got me wondering too, like, is that a DevOps role, the sand anyway? Well, you're saying that's crazy, and to me it was more like, okay, so there's a team for Gmail that's like 10 to 12 people, and only one of those is monitoring Gmail? Well, that's not very many people for Gmail. Yeah, I was going to say I think Gmail is probably a little bit bigger, right? Yeah. Well, that's on the 12th side because they did give the range

Starting point is 00:59:33 10 to 12. That's right. That's a large team. It's broken up into sub-teams. Imagine you have 100 employees, like 10 to 12 are dedicated to monitoring. It seems like a lot, but it's important. This is SRE teams, too, I should mention.

Starting point is 00:59:54 This is not normal dev teams. It's 10 to 12. Okay, so now I feel better. Never mind. Yeah, I changed my mind about SRE teams. An SRE team for Gmail. 10 to 12 people. That's believable.

Starting point is 01:00:05 Yep. And one to two of them are working on monitoring full-time. Hey, and what they said, though, and this is really important, they don't expect the SRE to be staring at that screen the entire time, right? There's alerts and stuff set up. They're not having to look at the screen and say, hey, the graph went up here, press the button. That's not how that works.

Starting point is 01:00:26 Yeah, I figure it's more like, you know, somebody has a problem. They say, you know, we have this problem. It took too long to figure out. I think if we had a graph that shows this, that would be much better than the ones that we currently have. But in order to do that, someone needs to go through and add some more metrics here and there. And then those are tickets for that person.

Starting point is 01:00:43 So they also said that Google has moved on to simpler and faster monitoring systems. That's interesting. Which means that they've provided better tools for ad hoc analysis is basically what they said, right? So that's good. This was really interesting to me, especially from a company like Google. They try to avoid building systems that try to determine the causality of something. They want to leave that up to the people. They don't want the machines trying to figure out why something went wrong. And my guess is because

Starting point is 01:01:18 it's usually complex, right? And you don't want the system trying to come up with some reason that's completely off base. Yeah, it's wrong. Yeah, I totally agree with that. This, they said, this doesn't mean that they don't monitor for major changes and common trends and that kind of stuff, right? They totally are. They just don't want it to be like, oh, well, this is what the problem is, right? So in other words, they wouldn't have a system that says like, oh, hey, query performance has

Starting point is 01:01:47 degraded. It must be related to needing more CPU. I'm going to automatically add more CPU to the system and hope that that resolves it. Well, that, and you can imagine a web app can't talk to a database that says database is down in the error message, right?

Starting point is 01:02:04 Or the alerts. Well, that can anchor someone into thinking there's a problem with the database and they go look at it and realize like, oh no, it's the name for the database is wrong in the configs and so it was like trying to talk to something nonsensical. But just by having the error message saying one thing, it kind of anchors you. I'm sure you've

Starting point is 01:02:20 had a problem somewhere where, by the time you figure out what the actual problem was, you were like, wow, I never would have guessed that based on this error message, because it had me kind of running down the wrong path, because it kind of made some assumptions about what was wrong that weren't true. Only every day, right? Yeah, true. Oh my gosh. So this also, to me, was very interesting. I'm curious what you guys think. They said that the SREs at Google almost never use tiered rules. By tiered, basically hierarchical type things, right? So they gave an example, and I didn't put it in the notes and I completely forgot what it was. Something about, like, if they were draining usage out of a particular data center, then you wouldn't alert on issues in that data center. Yeah, right. But that was the example of, like,

Starting point is 01:03:17 you know, one of the few examples where we do use that type of tiered thing. But the thing to me is, I was like, well, I never would have thought to write tiered rules like this for my alerts until you brought it up. And now that you brought it up, you're telling me I shouldn't do it.

Starting point is 01:03:37 Now I'm like, Oh, well, I mean, I'm not going to do it, but I, cause also like, I still wouldn't even know.

Starting point is 01:03:43 Like, I think the problem here that I had while reading this part is, like, it's a scale kind of problem, right? And maybe the scale of some of the things that they're talking about in this book is grander than the scale of things that I've had the pleasure to work on, right? And so I haven't run into a need where I would have wanted to do these tiered type of triggers. Well, they also call out the primary reason they don't do it is because they're constantly changing things so much that it doesn't make sense to create these tiered things. So they're changing their systems or the infrastructure or whatever.

Starting point is 01:04:26 And so putting those in place just complicates things. Right. And that's really what they were getting at the heart of: they like to keep their alerts simple. Yeah. Here, I found the example that they gave of what they would not do. It is:

Starting point is 01:04:41 If I know the database is slow, alert for a slow database. Otherwise, alert for the website being generally slow. Oh, right. So instead, they're just advocating for like, just say that the response is slow, period, and you can go investigate it. Yep. They did say when they do alert on these dependent types of rules,

Starting point is 01:05:04 it's when there's a common task that's carried out that's relatively simple, right? If it gets complex, they just don't do it. What else I got here? Oh yeah. When there is an issue, it's critical that the alert happens quickly and that it's easy for the person to follow, right? So that they can actually go troubleshoot and fix the problem. So again, simplicity was at the heart of everything that they're saying here. And that's pretty much it. The last thing was like, hey, if there is an alert, it needs to be representative of what the failure was. Like just what Jay-Z said a second ago,

Starting point is 01:05:45 it can't give you some sort of red herring. So you're off chasing your tail for an hour while production's down. Right? Like it needs to be a good one. I mean, you know, again, going back to my problem that I have with this,

Starting point is 01:05:59 though, was like, well, if the alerts are simple and you know what they are, then those are the things that you're trying to automatically fix, right? But I mean, I realize that there are like cases where that might not apply, but I kept having this like internal struggle with myself as I was reading this. Well, I think that's legit though. I think that if an alert came up and it was easy enough to go fix, that's when you find that it's the root cause. You fix that root cause and then you don't see that one again, right? I think it's an iterative approach to it over time.

Starting point is 01:06:33 So then they get into symptoms versus causes, right? This reminds me of medical stuff. You hear people saying they don't like going to a doctor because the doctor just gives them a pill and sends them home instead of trying to figure out what's going on. So you fix that. So you never have to come back to the doctor, right, for that same issue. It's the same thing here. Monitoring systems should ask two questions. What's broken?

Starting point is 01:06:54 That's a symptom. And why is it broken? The cause. And it says that drawing this line is really important, right, because this is the only way that you can create a monitoring system that has high quality signals with low noise. And we talked about those dashboards and kind of having the right level of detail. I think kind of what we're getting at there is that you kind of want one that just says, like, what is broken. And then from there, you want to be able to drill into further dashboards in order to kind of figure out the why. But you

Starting point is 01:07:25 need to have those tiers, or otherwise it's just noisy. And so they give an example here where, basically, if you imagine a web app is serving, you know, 500 error response codes, so you know there's a problem. And the cause might be that there's actually a problem with the database, but the symptom you're observing is over here in this web server. Yeah. It was also worth calling out, too, that later they talk about, like, one person's symptom, I'm sorry, one human's symptoms might be another human's cause. Yeah, yeah. So these kind of cascade. I'm gonna trigger him eventually. It'll happen. I'm not hearing the words anymore. Talk a little bit about Blackbox versus Whitebox.

Starting point is 01:08:09 Google SREs lean heavily on Whitebox monitoring, with much less Blackbox monitoring, reserved for critical use cases. So like we mentioned there, the Whitebox monitoring is the things that you have access to. These are the things that you choose to expose, and they tend to model the system. They're representative of the system. Whereas the black box

Starting point is 01:08:31 is more like customer facing. This is what your customers see and interact with and it's kind of treating the system as if it's something you don't have access to. This is what you can observe from the outside. This is more symptom oriented. It deals with more like unplanned issues or things that are just kind of more symptomatic. I mean, it's still valuable, though, right?

Starting point is 01:08:54 Because you can try going after all the causes with the white box monitoring and try to get in front of that with metrics and whatnot. But it's easy to like miss things or to overlook things or like, you know, there's so many complicated variables that, you know, you could easily, you know, miss one. And,

Starting point is 01:09:13 you know, that's where like the back, the black box monitoring could still show you like, oh, well there's still this symptom here that you missed. So. Yeah. Imagine this,

Starting point is 01:09:22 like imagine this scenario. I tell you 3% of all calls are failing. Like, ooh, yikes, we should look at that. What if I tell you 100% of checkouts in our shopping cart are not working? Like, whoa, that's terrifying, right? That's a different situation, though. Even though that might only represent 3% of the total calls, it's really important to our customers and our use cases and everything else.
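
A minimal Python sketch of that point, assuming a hypothetical list of (endpoint, succeeded) call records. The endpoint names and counts are made up to mirror the 3% versus 100% example; nothing here comes from the book itself.

```python
from collections import defaultdict


def failure_rates(calls):
    """Return (overall_failure_rate, per_endpoint_failure_rates).

    calls: iterable of (endpoint, succeeded) pairs. The record shape is an
    assumption made for this illustration.
    """
    totals = defaultdict(int)
    failures = defaultdict(int)
    for endpoint, succeeded in calls:
        totals[endpoint] += 1
        if not succeeded:
            failures[endpoint] += 1
    total_calls = sum(totals.values())
    overall = sum(failures.values()) / total_calls if total_calls else 0.0
    per_endpoint = {ep: failures[ep] / totals[ep] for ep in totals}
    return overall, per_endpoint


# Hypothetical traffic: 97 successful searches, 3 failed checkouts.
calls = [("/search", True)] * 97 + [("/checkout", False)] * 3
overall, per_endpoint = failure_rates(calls)
print(overall, per_endpoint)
```

The overall number looks like a small problem, while the per-endpoint view shows that every checkout is failing, which is the perspective a customer-facing check would give you.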

Starting point is 01:09:44 And so, yeah, that's just the kind of example, like, something where the black box might kind of have a different perspective or show you kind of a different interface for that sort of thing. Yeah. Hey, something I did like is they talked about the fact that some issues are sort of hidden unless you have the white box monitoring in place. Like one of the things that they said was retries. Like, for the black box experience, you go to checkout, right? And your checkout works fine.

Starting point is 01:10:13 You didn't see any problems with it. But behind the scenes, it failed four times and did retries. And that would only be caught by, you know, telemetry and system monitoring. And it's kind of interesting to know that you wouldn't have known that there was something going on if you hadn't had that monitoring in place. You know, here's a real world example of that. Hard drives and SSDs, because they have errors all the time. I mean, there's a portion of the disk that's just, you know, over-provisioned so that you can have additional sectors to write to in case of problems. And, you know, it just happily marks that. It might try it a couple of times and then

Starting point is 01:10:52 eventually mark it as bad and move on and write it somewhere else. How much IO is your computer doing? You know, you're not monitoring all that, but I mean, you can, right? There are systems to do that. But, you know, that's an example of where you could easily not know that that issue exists. And if you were tracking that ahead of time, especially if you think about large disk arrays and SAN type networks, then maybe that type of thing is a metric that you really do care to know about, so that you can preemptively know, like, hey, this drive is about to die and I should go ahead and plan to replace it beforehand. Yep.

Starting point is 01:11:37 So the thing that we talked about earlier, and I think Jay-Z sort of touched on it at one point, was they had an example of this white box monitoring being crucial for telemetry. So the example they gave was, hey, somebody says the database is being slow. If you have explicit white box monitoring tied into both the website and the database, the website thinks the database is slow, but you run these queries on the database and it thinks it's fast. So that might lead you to understand that, hey, there's a networking problem

Starting point is 01:12:11 between the web server and the database server, right? So you can draw improper conclusions without having the right telemetry being measured and monitored in different places. Yeah. And think about how long it would take you to track that down and figure out if you didn't have two graphs that you're able to kind of correlate and see like,

Starting point is 01:12:31 Hey, this one's as fast as it was yesterday. This one is way off. You know, what's the problem here? I think this is the one too, where they gave like an example, like it didn't even have to be anything wrong with either system.
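
The two-graph comparison can be sketched roughly like this in Python. It assumes you already collect query latency at both the web tier and the database; the function name, the 100 ms threshold, and the sample numbers are all hypothetical.

```python
from statistics import median


def locate_slowness(web_observed_ms, db_measured_ms, slow_threshold_ms=100.0):
    """Compare query latency as seen by the web tier with latency measured
    on the database itself, and say where the time seems to be going.

    Both arguments are lists of latencies in milliseconds. The threshold is
    an arbitrary illustrative value, not a recommendation.
    """
    web = median(web_observed_ms)
    db = median(db_measured_ms)
    if web <= slow_threshold_ms:
        return "ok"
    if db > slow_threshold_ms:
        return "database"
    # The web tier sees slow queries but the database reports fast ones, so
    # the time is being lost somewhere in between (for example, the network).
    return "in_between"


print(locate_slowness([400, 420, 390], [5, 6, 4]))
```

With only the web-side measurement you would blame the database; having both measurements is what lets you point at the path between them.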

Starting point is 01:12:42 It could, like, they gave an example of a crimped network cable that was intermittently dropping packets, and so therefore it was interfering with the performance. And so, you know, you could run queries left and right on the database and it's like, yep, just returned it happily, it's fast, you know, whatever. But the connectivity to it, maybe for this one particular route, doesn't even have to be for every client. It could just be for, you know, one or two clients. You know what, this actually reminds me of the age old interview question that I'm sure we've all asked people, you know, hey, you have a page in your application that's performing poorly.

Starting point is 01:13:27 How do you figure out what's going on? I've never, ever had a person say, well, I would go look at our dashboard that was monitoring, you know, web request latencies and all that and database. Like it's always, well, I would go look at the web page first and see what's happening there. And then I would go back and look at the server. Like it would be awesome if somebody was like, well, I would assume that I'd have monitoring set up for my various tiers and looking at this stuff. And then I could just, you know, correlate the things, right? Like that would be a shortcut. And it just reminds me that it's easy to forget about some of the stuff when you're going through it, but that'd be an amazing answer to start there.

Starting point is 01:14:05 Let me look at my monitoring tools. And then you'd be like, oh, okay. Yeah, well, we don't have any of those. Yeah, yeah. I mean, I don't know which Google book you read, sir, but. That's right. We can't all live in a perfect world, fella. Yeah.

Starting point is 01:14:20 Or, yeah. I mean, whatever. That's awesome. So the four golden signals. So these were the things that like, if you were only going to measure four things out of your system, these are the, the four things that Google considers the most critical things to go after.

Starting point is 01:14:42 So first is the latency, the time that it takes to your service to, uh, to, I'm sorry, let me rephrase this. The time it takes to service a request, right? And it's important to separate successful requests, latency versus failed request latency. But they did also call out in this chapter though, too, where it's like, um, you might want to have, I forget, like you didn't want to just factor it out altogether, but you, you do want to have a consideration for those failed requests too, because otherwise like it could be misleading if you included it. Yeah, it was, that actually was a really good epiphany to me, too. When I read that is like, you know, I don't think I'd really considered that.
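
As a rough Python sketch of tracking latency separately for successful and failed requests without throwing the failures away: the (latency_seconds, succeeded) record shape and the numbers are assumptions made for illustration.

```python
from statistics import mean


def latency_by_outcome(requests):
    """Summarize latency separately for successful and failed requests.

    requests: iterable of (latency_seconds, succeeded) pairs (hypothetical
    shape). Failed requests are kept as their own series rather than dropped.
    """
    ok = [lat for lat, succeeded in requests if succeeded]
    failed = [lat for lat, succeeded in requests if not succeeded]
    return {
        "success_mean": mean(ok) if ok else None,
        "failure_mean": mean(failed) if failed else None,
        "combined_mean": mean(ok + failed) if (ok or failed) else None,
    }


# Hypothetical sample: eight fast successes and two failures that timed out.
sample = [(0.2, True)] * 8 + [(30.0, False)] * 2
print(latency_by_outcome(sample))
```

In this made-up sample the combined average is dominated by the two slow failures, so it describes neither what successful users saw nor how long the failures took. Keeping the two series side by side shows both.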

Starting point is 01:15:31 Like if something's failing, they said a slow error is worse than a fast error. And that makes total sense, right? Like if it's going to bomb, you don't want to sit there and wait 30 seconds for it to fail. You want it to fail fast. And it's really interesting: you want to filter that out from your real latencies that people are experiencing on a regular basis. Do you remember, this is going to go back, talk about dating, do you remember the early days of the internet, like you wanted to FTP a

Starting point is 01:16:02 large file right and before like there were like easy to use tools that had resumable FTPs and instead it would, you know, you're trying to transfer and it failed. All right, let's start it over. Yeah. Right. And, and you, you know, you'd get so close to the end and you're like, Oh God, died again. Right. Like that, that's an example of like, cause you know, if you're thinking like, wait a minute,

Starting point is 01:16:27 why would I hate a slow error worse than a fast error? Because in like that type of an example, right, where it might take like, you know, 30 minutes to download something, you know, you'd rather know in 30 seconds or three seconds that,

Starting point is 01:16:39 Hey, it's not going to work versus you get 28 minutes into it. And then it's like, no, I just died. And I was trying to transfer a 10 meg file on my ISDN. Right, right,

Starting point is 01:16:49 right. Let me head out and go get a new modem. I'll be right back. I don't know what's better, this comment or your sound effect for it. I was like, yes, man. You've got mail. Oh, man. Okay, wait a minute, wait a minute, wait a minute. Did any... Of course I had AOL.

Starting point is 01:17:16 You didn't really? Yeah, yeah, you had to. They had all the best chats. Am I the only one that didn't actually have AOL? You didn't have the real internet, did you? Oh, man. You were on the sidelines of American culture. AOL was the beginnings of why the internet is even a thing today. Yeah, I viewed it as such. Like, at the time I viewed AOL and the people that were using AOL as like, okay, you don't really know computers. Then if you're like that,

Starting point is 01:17:52 that's AOL is for like, you know, the parents, that's what they need to use to get on the internet because they didn't know other ways to do it. Hey, what was your first version? You remember Joe?

Starting point is 01:18:04 Oh, first version of AOL I used? Of AOL, yeah. I don't know. Mine was 3.0. I remember 3.0. I remember they'd send you out the CDs and stuff, like, upgrade today. Oh yeah, those CDs were in everything. Like, I had a stack of them. They would send them to you in the mail. Yeah, no, it was great. It was amazing, man. Good times. That's where they had the best chat rooms. I mean, they did. That was the best. Yeah, even on my 640 by 480 monitor, I'd have like 12 of those things open, right? Yeah. And like the websites all were just terrible at the time. You know, it's all like Angelfire type stuff, which

Starting point is 01:18:40 is, you know, fun. It was cool, you know, if you want to, like, look up stuff on Banjelic or whatever. But aside from that, the web was terrible. And then, you know, what else was there? It was multi-user dungeons. CompuServe. Geocities and Angelfire. Those were the two.

Starting point is 01:18:57 Golly, man. Oh, my. So, AOL is still a thing. Yeah. Yeah. And they have an article that is current as of 2020 on how to order a CD ROM to install AOL. Yes, sir.

Starting point is 01:19:16 That's amazing. Yes, sir. I, wait, how are you going to download it? How are you going to, you can't,

Starting point is 01:19:23 you need AOL for the internet. That's right. How are you guys connecting now? CompuServe. I can't believe I'm the only one that didn't use AOL for real. I just had small ISPs. That's how I got on the internet.

Starting point is 01:19:44 Well, I mean, you are older than us. Whoa! Hey! Wow! I hunted again on the attack tonight! Whoa! Oh, God, that would hurt! That was like a little extra sting to it, too,

Starting point is 01:19:59 because it came from a spiteful kind of place. Wow! Wow! You've been sitting on that one, huh? He's earning a paycheck and I'm in high school. I'm like, hey, mom, dad, could you get AOL for me? It's not that much. Hold on now. It's not that much. You're making it out like, you know, I was in this a decade before you. You weren't.

Starting point is 01:20:34 I quit. I'm done. I think, I think he's for the record. I think he's like a year or two older than me. So it's really not that much, but yeah, totally.

Starting point is 01:20:43 I was asking my parents for internet, right? You know, I gotta, like, clear this before we can go on. Like, why did the A go to the bathroom and come out as an E? I don't know. Because he had a vowel movement. All right. Now that we've got a little humor in our lives and we've cooled the air, it's a little bit better. We can move on and talk about the next thing that you should be monitoring from your four golden signals. Yeah. Yeah, we took a little break there for a minute.

Starting point is 01:21:21 Yep. That's Alan's fault. Sorry. We need to like take him out and walk him or something more often. Get rid of some of that pent up energy. Is traffic. How much demand is being placed on your system? So this is like request per second for a web,

Starting point is 01:21:44 for a web request, as an example. Or for a streaming video or audio application, it might be the IO throughput that you might measure. So depending on your use case, you would measure different things there. Or let me rephrase that. You would define traffic in a different way depending on the use case. Yep. So first was latency. Second, traffic. Third is errors, which is the rate of requests that fail. So explicit errors, obviously, you know, like a 500

Starting point is 01:22:12 API response, something from like a REST API. Also implicit, so anything that you mark that took over two seconds, for example, to finish could be considered an error. So these are things that you're kind of making decisions about and deciding to treat as an error, whether they are or not. I really liked that one because, you know, later they're talking about, you know, you should sit down and define success for whatever your thing is, right? For your service, for example.

Starting point is 01:22:42 I think you gave an example of like an API, a REST API. So define success for like how your REST API works. And if you say that, hey, all of our requests should be within like one, you gave two seconds here as an example. So if you say that like all of our requests should be served in under two seconds. And you find out that like, hey, 10% of the time, they are taking three to four seconds. Then that means you have a 10% error rate according to your own thing. Even though they were technically a success, they returned the data successfully

Starting point is 01:23:17 or the result or whatever successfully, it still didn't meet your criteria of what success means. So errors. Yeah, I like that one. So then the last one, again, that you should go after as your four golden signals here is saturation. How full is your service? And my service is full of it, so I guess it's winning.
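
Going back to the errors signal for a second, here is a small Python sketch of counting both explicit and implicit errors. The two-second limit is the example figure from the discussion; the (status_code, latency_seconds) record shape and the function name are assumptions.

```python
def error_rate(responses, max_latency_s=2.0):
    """Fraction of responses that count as errors under your own definition.

    responses: iterable of (status_code, latency_seconds) pairs (hypothetical
    shape). A response is an error if it is an explicit 5xx, or if it
    succeeded but took longer than max_latency_s (an implicit error).
    """
    responses = list(responses)
    if not responses:
        return 0.0
    errors = sum(
        1
        for status, latency in responses
        if status >= 500 or latency > max_latency_s
    )
    return errors / len(responses)


# Hypothetical traffic: every response is a 200, but 10% take 3.5 seconds.
traffic = [(200, 0.5)] * 90 + [(200, 3.5)] * 10
print(error_rate(traffic))
```

Every response here is technically a success, yet under the two-second definition one in ten of them counts as an error.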

Starting point is 01:23:42 So, no, but this is measuring the resources that are the most constrained. So CPU or IO are things that usually start to degrade before a hundred percent utilization of your service, for example. So that's why having a utilization target is important. The latency increases are often indicators of saturation. So if you notice that your web requests are starting to take longer, you know, then that might be an indication of the saturation. So measuring the 99th percentile response time over a small interval can be an early signal of saturation. And I believe, if I recall, they actually referred to it as like one minute, was what they were going after here. So they also say that saturation also concerns itself with predicting imminent issues, like filling up hard drive

Starting point is 01:24:38 space and whatnot. So, I mean, we've all seen examples where, you know, things can start to take longer as your system has less and less disk space because, you know, maybe it can't use that for swap anymore. And have you ever run into it, you're like, why is it running like a dog all of a sudden? Like what's going on? And then you realize, oh, I just ran out of drive space and that's what it is. So, you know, that was just an example. So again, the four golden signals: latency, traffic, errors, saturation. And if you could measure nothing else, at least those four should be what you should go after. And they will give you a decent starting point for your monitoring of your service. Yeah.
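
For the saturation signal, a hedged Python sketch of a 99th percentile response time over the last minute. It uses a simple nearest-rank percentile, and the (timestamp_seconds, latency) sample shape is an assumption; a real monitoring system would more likely compute this from histograms.

```python
import math


def percentile(values, pct):
    """Nearest-rank percentile of a non-empty list of numbers."""
    if not values:
        raise ValueError("percentile of an empty list is undefined")
    ordered = sorted(values)
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]


def p99_last_window(samples, now, window_s=60.0):
    """99th percentile latency of samples in the window (now - window_s, now].

    samples: iterable of (timestamp_seconds, latency) pairs (hypothetical
    shape). Returns None when the window holds no samples.
    """
    recent = [lat for ts, lat in samples if now - window_s < ts <= now]
    return percentile(recent, 99) if recent else None


# Hypothetical data: 100 recent samples with latencies 1..100, plus one
# very slow sample that is too old to fall inside the one-minute window.
now = 1000.0
samples = [(now - i * 0.5, float(i + 1)) for i in range(100)]
samples.append((900.0, 9999.0))
print(p99_last_window(samples, now))
```

Watching this number creep up minute over minute is the early warning being described, before utilization ever reaches 100 percent.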

Starting point is 01:25:26 I think we're going to hit pause on this chapter here. We haven't finished it, but it's a good stopping point. Yeah, and if you would like, you know, in this survey, you could tell us if you'd like us to finish this chapter. Right. Instead of the book. That's awesome. Well, that was a coincidence.

Starting point is 01:25:47 Yeah, so we'll have some links to resources we'd like. Obviously, the link to this book that Google has made freely available to the world. And we might even, you know, it's six episodes in, we might include a link

Starting point is 01:26:03 to this specific chapter. I don't know, though. It's a little difficult to do, so maybe. And with that... Yeah, hey, I just want to mention too, those three articles I mentioned at the top of the show, the blog post and the two postmortems, are going to be in the resources. Killer. Yep. And with that, we head into mean old Alan's favorite portion of the show. It's the tip of the week. I think I got it all out of my system now. We should be good. I'm reserving judgment until

Starting point is 01:26:35 yeah, man, I think it'd be too much. So I didn't even plan on doing this particular tip, but the last thing that we said here at the show was this whole measuring 99th percentile response time over a small interval can be an early sign of saturation. For anybody that's using Prometheus specifically, I thought this was worth throwing out there because I learned this the other day. If you are trying to do rate calculations in Prometheus, the way that that works is it measures changes between two points that are scraped in time. And I say two points, it can be more than two points. So if you're doing a five minute range and you have a 15 second scrape interval, that's four points per minute times five, that's 20 points, right? And a rate calculates the difference between point one and two and then two and three and et cetera, right?
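
The arithmetic above can be sketched in Python. This is a simplified illustration of a per-second rate over counter samples, not Prometheus's actual implementation (which, among other things, also extrapolates to the edges of the range); both function names are made up for this example.

```python
def samples_in_window(window_s, scrape_interval_s):
    """Approximate number of scraped samples that land inside a range window."""
    return int(window_s // scrape_interval_s)


def simple_rate(samples):
    """Per-second increase of a counter across a window of samples.

    samples: list of (timestamp_seconds, counter_value), sorted by time.
    Returns None with fewer than two samples, since no difference exists.
    """
    if len(samples) < 2:
        return None
    increase = 0.0
    for (_, prev), (_, curr) in zip(samples, samples[1:]):
        # Simplified reset handling: a drop means the counter restarted at 0.
        increase += curr - prev if curr >= prev else curr
    elapsed = samples[-1][0] - samples[0][0]
    return increase / elapsed if elapsed > 0 else None


print(samples_in_window(300, 15))   # five minute range, 15 second scrapes
print(simple_rate([(0, 100), (15, 130), (30, 160)]))
print(simple_rate([(0, 100)]))      # a single sample gives no rate
```

The last call is the case to watch for: if the range only ever contains one sample, there is nothing to take a difference between.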

Starting point is 01:27:30 The important thing here is if you are trying to do rate calculations with Prometheus, they say you need to do at least two times the interval. So if you want to do a rate calculation and you have 15 second intervals, then you should do at least 30 seconds. But they go on to say that that's probably not even good enough. You want at least three times it because you need to make sure that you at least have two points in the thing, because sometimes Prometheus will drop data and all that kind of stuff. So I wanted to throw it out there. If you are somebody that is doing monitoring and you have these short interval things and you're using Prometheus as your stats getter, you'll at least

Starting point is 01:28:13 want to take a look at what the scrape interval is, and probably do that times three if you're doing these short intervals. It looks like somebody is being nice enough to type that in for me. That's awesome. All right. So the next thing, and what brought this up is Merleys log that Joe has shared the link to here. If you are somebody and you're listening to the show, you probably know a lot of people that are writing tech blogs and that kind of stuff. Me personally, I don't want to spend my time writing the engine that writes my blog, right? There's a reason why WordPress runs more than 30% of the internet, right? And it's because people want to focus on getting content out there and not having to write code to just get a message out.

Starting point is 01:29:03 So with that said, WordPress, if you want to set it up to be fast and all that kind of stuff, can be really involved. If you want to use something like Nginx along with a Redis cache and all kinds of other stuff, you're talking about a decent amount of knowledge to go into it. If you don't want to have to take that hit to get something fast that runs efficiently and all that, but you still want that setup, there's a thing called Webinoly. You can go to webinoly.com, and essentially what it is, is a group of people have gone and set up scripts to set up all that stuff for you in the proper way. So it will set up your Nginx server, it will set up your Redis cache,

Starting point is 01:29:45 it will set up your WordPress installation, and it'll do all that stuff in a way that mostly does it as securely as possible from the get go, with everything being done for you. And you can manage it with those same scripts, like you can upgrade, you can back up, you could do other things. So if you're interested, go check that out, webinoly.com. And then the last thing I want to bring up, only because I submerged myself in this rather nasty thing of encryption, asymmetric and symmetric and AES and RSA and all the different algorithms and all that stuff around it. When I did my research, there was not a ton.

Starting point is 01:30:34 This is one of those things that reminds me of the Java language in general, in that you go search for how to do something and you'll get 9 million different answers for the same thing that you're asking. For the language or the ecosystem? Yes. Shame on me. Honestly, that's my fault. I'm sorry. Oh, it's true though, right? Like, that is one of the things about Java that always drove me a little bit crazy is, you know, what should I use? And there's like nine million different answers, and it's a holy war and all wrong. Yeah,

Starting point is 01:31:06 exactly. And it's so frustrating. Well, I found something similar with cryptography. The first thing you'll read when you get into cryptography is don't roll your own, right? Like that is the number one thing that you'll find.

Starting point is 01:31:19 And that is true. Don't do it. However, when you start trying to go figure out how to actually make it work, there's no great path on how to do things in a way that makes sense. And I'll give you an example. One really well-adopted pattern of doing cryptography, especially with cloud systems, is what's called envelope encryption. And simply put, you'll have one key that's sort of your root level key that'll encrypt a second level key.

Starting point is 01:31:49 And then that second level key might encrypt some data, right? You might even have more layers. But the gist is this, if you're going to take a root key and encrypt another key, you need to know what key was used to encrypt that second key. And so you kind of got to keep track of this whole path of things. And I never did come across any really good documentation on how to do that stuff, right? And so more or less, I kind of tied my stuff together. And for better or worse, it works, but I don't know if it's the best way to do it. At any rate, somebody asked me the other day, like, Hey, why don't you just use one of the PGP, um, SDKs out there?

Starting point is 01:32:28 And I was like, um, I'm kind of annoyed that somebody has given it to me like that, because I asked so many people the best way to do this. I looked it up. I Googled 9 million different things and never came up with anything. But apparently there are ways to do some of this. And so I'm sharing a link to Bouncy Castle. They are one of the big ones out there that have written some cryptographic libraries in both C# and Java. And they've got a PGP implementation. And they've also got so many other things, and they are well regarded as one of the, you know, authorities

Starting point is 01:33:05 in the space. So if you are ever looking to get into some cryptographic-type things, and you need to worry about keys and encryption and AES and RSA and bit entropy and all that garbage, go check them out. Read. Your eyes will bleed, you'll fall asleep at times, but maybe you'll grok something. It's a fair question when somebody hits you up like that, though. But it's also like, I mean, I feel, I totally feel your pain there, where you're like,

Starting point is 01:33:33 you write something and you're like, yeah, I wrote this great, amazing bit of code. Look, you know, I couldn't find anything to do what we need to do at the time. So I, you know,

Starting point is 01:33:40 I wrote it this way, and it solves all our needs. It's been very useful and very helpful. And then somebody's like, but there's this Apache project, or whatever it might be, this well-known thing. Oh, I don't know why my Google-fu failed me, but when I was searching, that never came up. Yeah.

Starting point is 01:34:01 Well, the thing is, you don't know that you need to search for PGP. Like, I mean, that was obvious, right? Like, why wouldn't I have just searched for, hey, PGP SDK, instead of cryptographic encryption SDK or something, right? Like, man. Maybe the default go-to in those situations now, like, I'm rethinking how, you know, I could learn from past mistakes similar to this, where, like, maybe now you should just post the question out to Stack Overflow and let, like, a billion people throw their opinions at it. Maybe one of those will work. Try Slack. Yeah, or Slack. You know, I think that is the right answer: ask someone else. And you really got to pay attention to their answer, because if the answer is not, like, you know, oh, the obvious answer,

Starting point is 01:34:49 here it is, then maybe you're asking the wrong question. Maybe there's a different way to phrase the question that's more in line with the industry. In fairness, he did ask. But what I'm saying is that, like, maybe instead of going after, like, targeted people, you know, or even people that are in your own domain, like, why is that? Yeah. If you went out to Stack Overflow, then it's like you're going to get expertise and responses from all the corners of the internet, right? I think you just gave me the answer, Outlaw: I should reap some of what I've sown on this episode. And I should just put it on Reddit and be attacked. Oh, everybody on the internet. I was not wishing that on you, but you know,

Starting point is 01:35:29 maybe after some of the things you said tonight, maybe I should wish that. I could have, that's what I was saying to Slack. It's like, there's like, you know, 7,000 people signed up.

Starting point is 01:35:35 You ask on Stack Overflow, it's going to get closed as a duplicate and you're going to go look at that duplicate and it's going to have nothing to do with what you want to know. So much so. Yeah. But on Reddit, on Reddit, you'll probably get the answer,

Starting point is 01:35:49 but you'll cry yourself to sleep that night, right? So, you know, maybe it's useful. I don't know. Yeah, the top comment's going to be so snarky, and the second one's going to be lyrics to a song, and the third one's going to be the next line in the song, and the fourth one's going to be the next line in the song. And someone's gonna call you boomer. Uh, yeah. So, my turn? Yeah, I think so. So I have got the tip of the week. This is the winner.

Starting point is 01:36:20 Sorry, Alan, you did great. This is the winner. Uh, big thanks to Lars. Uh, the bicycle... From Metallica? Bicycle? No. You're gonna like this guy. So, uh, Lars is his name. Already, you're a fan. Huge fan. Bicycle Repairman is his name in Slack. Awesome. Yeah, I know. That's okay. We'll have to, we'll have to... Brake pads. Yeah, we'll have to compare, uh, bicycle repairman notes. Yeah. So here is, uh, what he shared with me. Um, so I'm just gonna tell you how to do it, and I'll tell you what it does. Go to File, Preferences, Keyboard Shortcuts in VS Code and search for workbench.action.terminal.runSelectedText and enter a keyboard shortcut for it. I'm trying Shift-Enter now. I'm still evaluating that.

Starting point is 01:37:09 But what this lets you do is run the selected line in your terminal in VS Code. So you know how, like, if you open up, like, a unit test file, sometimes it'll give you a little play button and you can hit play. If you have an HTTP file, you can hit a little play and send a request. Well, imagine you have a shell script, right? Have you ever in your life had a shell script, and you copied the first line down to the terminal and ran it, and then when that worked you went and copied the second one and ran it, and sometimes you go back to the first one, you change it, whatever? Why are you doing all that copying and pasting? VS Code has support for this natively.

Starting point is 01:37:45 It just doesn't have a shortcut for it. It's the craziest thing. This is amazing. Just make a shortcut for it. Just give me a play button. That's pretty amazing. Oh, yeah, I'm going to play buttons. Fine, too.

Starting point is 01:37:56 But this works. So, like, I tried it in a PowerShell file. I tried it in a Bash file. It doesn't care. Whatever line you have there, it's going to put it in your terminal and run it. So for me, I'm hitting Shift-Enter, so I'm working on a shell script,

Starting point is 01:38:07 down, down, Shift-Enter, just ran it. So is Shift-Enter the default key binding once you enable this thing? No, there is no default key binding. That's the one I tried. Okay. I think yours beats mine. Pretty good in practice, like, I've

Starting point is 01:38:23 set it up today and I've used it like 100 times since. It's... Yeah, man, that's awesome. Huh. Okay, well, uh, yeah, I guess I'm gonna have to, like, do it all that way then, instead of copying and pasting. All right, well, how about, before I give you mine, I'm gonna close it out with one last one, one last joke, and say, like: why did the human put his money in the freezer? Got to be something about a person. Because the human wanted cold,

Starting point is 01:38:59 hard cash. Nice. Okay. Pretty good. So, uh, here's my tip of the week. So have you ever found yourself in

Starting point is 01:39:05 a situation where, like, you know, for whatever your reason, whatever you're working on, you just want to know, like, hey, I've made a bunch of changes and I just want a listing of all the files that I've changed since I started this branch. Right. So you could figure out, like, hey, where did I start from? So Alan gave a previous tip in episode 182 about how you could, uh, use the -t or --track option on your git checkout command to, uh, have it track the branch. And you could see with a git branch list what your branch started from, originated from. But once you know what that commit SHA is, that commit ID, then you can do a git diff --name-only, that commit ID, space,

Starting point is 01:40:09 and then the word HEAD, assuming that you want, like, wherever you currently are in your current branch. So the whole command is git diff --name-only <commit-id> HEAD. And that will give you just a listing of the names of the files that you've changed since that previous commit. That's pretty awesome. I could have used that several times. Yeah.

Starting point is 01:40:25 Yeah. So, yeah. With that, subscribe to us if you haven't already. Maybe somebody just said, hey, here's a random link. You should listen to all the mean things Alan is saying. Michael, it's awesome. And, yeah. So subscribe to us on iTunes, Spotify, Stitcher,

Starting point is 01:40:43 wherever you like to find your podcasts. And if you haven't already left us a review, yeah, I mean, you can find some links at www.codingblocks.net slash review. I got to make it weird. I don't know because that's the way you said it. That's the way you did it. I just find you. I got to make it weird. Hey, while you're up there at codingblocks.net, check out our show notes, examples, discussions, and more.

Starting point is 01:41:05 And send your feedback, questions, and rants to our Slack channel at codingblocks.net slash Slack. Yeah, and follow us on Twitter at codingblocks or you can go to codingblocks.net. Find all our social links at the top of the page. You sure you don't want to mention, like, the email? Like, I know, Jay-Z, you like the emails. I love email. I love email. Right?

Starting point is 01:41:24 I checked it today. Oh, well, there you go. That's pretty cool. Pretty cool.

Coding Blocks - Site Reliability Engineering – Monitoring Distributed Systems

We haven't finished the Site Reliability Engineering book yet as we learn how to monitor our system while the deals at Costco are so good, Allen thinks they're fake, Joe hasn't attended a math class in... a while, and Michael never had AOL.
