Coding Blocks - Software Reliability Engineering – Hope is not a strategy

Starting point is 00:00:00 You're listening to Coding Blocks, episode 181. Subscribe to us on iTunes, Spotify, Stitcher, and more using your favorite podcast apps. If you can, leave us a review. We would greatly appreciate it. We do love to hear those new reviews. Yep, visit us at codingblocks.net where you can find our show notes, examples, discussions, and more. And send your feedback, questions, and rants to comments at codingblocks.net. And if you're on the Bird site, you can follow us at CodingBlocks,

Starting point is 00:00:27 and we don't tweet often, but when we do, it's super good. And if you're looking for other places to interact with us, CodingBlocks.net has all our social links at the top of the page. With that, I'm Jerzak. Jerzak? Did you put an R in your name? I did. I'm changing. I'm trying things.

Starting point is 00:00:45 Okay. Well, I'm going to try to say I'm Miracle Outlaw. Would that be right? That sounds too much like America. What I said, I was trying to say Michael with an R, like, to follow suit,

Starting point is 00:00:58 but I guess it didn't work. Yeah. I don't think that'll work. That's good. Yeah. So I'm Alan Underwood. I don't know. That's so much funnier than it should be.

Starting point is 00:01:16 That's good, right? It's because it's late, man. We never start too late. Well, it kind of almost has like a German kind of take on it. Would you say it? Say it again? Alan Underwood. Or Underbar. That's what it makes me think of.

Starting point is 00:01:29 I guess it's not German. Wait, wait. Or is it? I don't know. Whatever. Whoever we offended, we apologize. No offense meant. No offense meant. So tonight we're going to start in on a book that Outlaw was actually really interested in.

Starting point is 00:01:44 And it's Site Reliability Engineering, how Google runs their production systems. So that's where we're going to kick off this evening. But first, first we want to get to our news section. All right. And so, like I said before, we like to say thanks to those who left us a review. We love to get those reviews. And so we got a new review from Audible from Amazon customer. That's excellent.

Starting point is 00:02:15 Yeah. I mean, everyone knows immediately who I'm talking about. You're like, oh, wait, really? You left a review? Like, yeah, you know exactly who that is. And it was actually a very nice review. So thank you for taking the time to leave it. Um,

Starting point is 00:02:32 but that's it for our news. So I guess we'll go ahead and jump into this thing. Yeah. And, uh, the first thing you're going to see down there in the notes is a to do for me. Uh,

Starting point is 00:02:43 unfortunately. So I'm scrambling a little bit here. Hang on. So we're talking about a book. It's called Site Reliability Engineering. You just said something, I don't know that you actually said those words properly. I thought that was what we were doing.

Starting point is 00:02:57 You can't say his name right, you expect him to say these words right? The, sir, it's like the Swedish Chef of talk show hosts. This show is right off the rails. We're just getting started. Yeah, man. Oh boy. Okay. Well, uh, about that show. So what we're talking about is a book called Site Reliability Engineering. It's almost 10 o'clock, yeah. Site Reliability. Oh my gosh.

Starting point is 00:03:43 Oh my gosh. No, you'll get it. I swear you'll get it the third time. Oh, never again. Can you see how purple I am in the camera? Yes. It's gonna scare me. Uh, we're talking about the SRE book. You've never heard of SRE? Nobody says the words out loud because nobody can say them out loud, obviously. If I can't say it, who can? I mean, we're not laughing at you, Joe. All right. So where are you going with this? So this book is interesting. This is an O'Reilly book, right? But this book is, uh, written by a bunch of Google engineers about Google.

Starting point is 00:04:25 Hey, real quick though, before you get to that next bullet point, we're going to give away a free copy of this one. Absolutely. Okay. Yeah, we do that. Yeah. Okay. So we're doing that.

Starting point is 00:04:37 The book was published in 2016. And at the time, the term SRE, whatever it stands for, was pretty new. Maybe even literally new at the time of publishing the book. So no one had really heard of it very much. And it was written by a bunch of people who work at Google about what they do at Google. And even though the book is published by O'Reilly, you can actually go to sre.google and get the book for free. You can just download it. It's just a website hosted by

Starting point is 00:05:10 Google. They have a couple other things too, including a workbook that's meant to kind of go along with it. Another book called Building Secure and Reliable Systems, also free. The easy link to just the books portion though is sre.google/books.

Starting point is 00:05:26 And then there's three books there that you can read online for free. Yeah, it's pretty cool, right? And so anyway, each kind of chapter of the book is essentially an essay that deals with kind of one aspect of what they're calling SREs, which we're going to be diving into. And so I just want to kind of get that out there. And one other thing before we kind of dive in, I just pulled some stats. I figured we'd throw them here at the beginning because we just talked about kind of salaries and stuff last episode. So I went and tried to look at like basically the career trajectory of SREs.

Starting point is 00:05:57 You know, we said it's kind of a new term, new field. And if you just do any sort of Googling on SREs, you'll see that people are pretty bullish on it, I guess you could say. I found some stats that kind of summed it up well from a site called GlobalDot.com. I have no clue who that is, but they seem to agree with the other sites, and I liked how they kind of put it. Wait, GlobalDot.com? Dot com, yes, GlobalDot. Terrible name.

Starting point is 00:06:24 GlobalDot.com. We'll have a link in the show notes. And so they give you a median base salary for this position: $200,000. Hold up. That link does not take me to a site, just so you know. It's down in the resources we like too, so I'll grab it from there

Starting point is 00:06:46 and then we go down there. Yeah, yeah. And the subject of this article is "Why is SRE becoming 2021's hottest hire?" It's GlobalDots.com, D-O-T-S dot com. Still a terrible name. All right, yes. All right, moving on. There we go, keep going. So median salary of $200,000. Remember, median, that's a good number. That means the person in the middle is making $200,000. It's not the average, so it's not like totally skewed by the numbers at one of these companies. Although I will say, if a company has SREs, they're probably pretty big and mature. And we'll get into that.

Starting point is 00:07:22 And they're also pretty current. Pretty current. Yeah, I mean, this didn't even exist six years ago. It's pretty new. This podcast is older than this job title. That's true. We've never heard of it before. Career Advancement Score. They give it a score of

Starting point is 00:07:37 9 out of 10, which is pretty hot. Job openings: year-over-year growth, up 72%. They've got 1,400 new job postings. So yeah, pretty good. So this is a hot field. This is something that you might be interested in looking more into if you like this sort of thing. That's pretty good info right there. So let's go ahead and jump in. In this particular episode, we're going to cover the preface and the first chapter, just so that we can sort of set the baseline of what this is.

Starting point is 00:08:12 So the thing that makes this one unique, and Joe already mentioned it, is this is written by people at Google and it's only one company, right? So you don't have this mixture of pie-in-the-sky type stuff. These are things that they actually did and their experiences doing it. And so it's nice to get that from a company that deals with the kind of scale that they do, right? Um, and I can't wait for you to get to later in the book, because this is just about Google, so they get into some super interesting weeds as to how Google runs behind the scenes. Now, with that said, you know, as we go through this book, there might be terminology or things like, hey, this is how Google did it, blah, blah, blah. And there might be current Googlers that would say like, oh, well,

Starting point is 00:09:05 that's not how we do it anymore or whatever. So we're coming at it from the perspective of this book, though, at least at the time of the writing. That's how it was said, like, hey, this is how we do things. So I'm sure that they would have advanced their own practices in the last five years, seven years, whatever. Totally. And just because Google does it doesn't mean it's appropriate for your company to do. You're not Google,

Starting point is 00:09:31 but maybe it could be. Maybe there's some things in here that are good for you. And there's a lot of other companies that now have SRE teams. If nothing else, you get some ideas of things that can help improve your business, right? And one of the things that they said about this is they were interested in scaling the business process, not just the machinery, right? So, okay, yeah, that's huge, because honestly it's the business processes that seem to get in the way a lot. Um, the communication around those processes, right? And then just what Joe said a second ago about, you know, hey, not everybody's the size of Google. They actually called out, hey, this tale should be for emulating and not copying.

Starting point is 00:10:12 Right. Like tweak it to whatever suits what you have in your business. This is not necessarily a blueprint that everybody should have to follow. You want the results, not the process. They did say, uh, 40 to 90 percent of your effort is, uh, what's the term I'm looking for? Um, cost is the word I was looking for. 40 to 90 percent of your cost for delivering software happens after you've deployed a system, which is funny because we usually talk about kind of the first launch and how much effort and how long things took to develop the first time. And we so rarely talk about how long it takes after it gets launched,

Starting point is 00:10:50 how long it takes to maintain what new features. And so we've got this kind of like industry focus on this kind of first period, even though the second period, the second half goes on much longer. Well, they literally refer to it as the labor involved. And they make the analogy that software development has one thing in common with childbirth, and that is that the labor and delivery

Starting point is 00:11:24 up front is painful and difficult, but it's the labor afterwards that is where you as a parent spend the majority of your time. Right. And that's what it's like with software development, too, is that, you know, we have this like industry practice of, you know, that's where we put all of our focus. Like, hey, let's develop this greenfield app and we're going to put all of these best practices in from the front, blah, blah, blah, blah. And then magically we're going to deploy it and we'll never look at it again. But the reality is that once you do put that thing out there in the world, you know, from that point forward is where you're going to spend the bulk of your time. Yeah. I mean, look at Google, for example, right? Like they started out in the nineties, right? But they're still constantly, um, you know, iterating on and advancing their search engine that started them.

Starting point is 00:12:17 Right. Yeah. And one of the key takeaways here for me was just when you call a system stable, that doesn't mean that you're not dealing with it. Right. And that was one of their key call-outs: stable doesn't mean you don't touch it. You're still putting time and effort into the thing. It just means that it mostly runs the way it's supposed to at that point. Right. All right. So I guess the next thing is, what is site reliability engineering? And they actually put together a definition that I really liked. And it's engineers who apply the principles of computer science and engineering to the design and development of computing systems, usually large distributed ones. So software engineers doing what they do, except you're applying it to the operations side of things. You write software for those systems and you're building all the additional pieces those systems need. So like backups, load balancers, all the other things that come into play that you need to run your system. And then also how to apply solutions to new problems.

Starting point is 00:13:28 So, you know, taking the software engineering approach to doing those things. And they mention that they consider reliability to be the most fundamental feature of any product. And that I think kind of flies in the face of what you see on like Hacker News or Reddit or whatever, where you hear a lot about kind of innovation. So it's kind of refreshing in a way to hear a company talking about reliability being more of a priority. And that's something you're going to see echoed in this book throughout as they keep coming back to reliability.

Starting point is 00:13:59 Because software doesn't matter much if it can't be used, and it needs to be reliable enough. That doesn't mean perfect, but once you've achieved reliable enough, and we'll get into defining that, then you can spend more time developing new features or new products. Yeah, they actually, um, later in the book they get super into like defining enough, you know, what constitutes enough. Because I think there's this perfectionist part in us where we want to develop something that's, quote, perfect and bug-free and not have to worry about it, right? That's always our goal. Talk about

Starting point is 00:14:45 all the unit testing that we've ever talked about and trying to abstract things perfectly and put interfaces to things so that it is low-maintenance kind of things. But the reality is that if you ever... Let's suppose that

Starting point is 00:15:02 you did create a system that was perfect. Well, for one, it couldn't have been very complex, right? I mean, what's the chances that it was complex and you made it perfect? But also too, how useful would it really be? And like, how much cost did it take you in time and effort in order to make it 100% perfect? And they go into numbers where like, uh, later in the book, where like just adding, like you've heard terms like three nines of reliability or four nines of reliability. And if you haven't, um, if you're new to software field, then I mean like three

Starting point is 00:15:37 nines of reliability would be like 99.9%, or four nines would be 99.99%. And they talk about the cost later in the book of adding that additional nine, uh, you know, after the decimal point, and like, you know, how do you decide whether or not it's worthwhile to go after that additional nine, you know, things like that. It gets a lot more expensive with every nine you add. Going from 90 percent to 99 percent, uh, you get a lot of value there for not as much effort. Going from five nines to six nines, it's gonna cost you a lot to get there, and is it really gonna be worth it? Especially when your customers are, uh, you know, communicating to you over the internet and they've got their own outages. You know, they're not even going to receive the advantages of that, and so

Starting point is 00:16:24 you got to figure out where to draw that line, and that's one of the things that SREs do. Yeah, it's actually a repeating thought too, though, that was in Designing Data-Intensive Applications, because they talked about it over there, about how just increasing that nine was like, there's a point of not even diminishing returns but just negative returns, right? Because you're going to spend a ton of development money on trying to get that, and it may not even matter. So there's going to be some crossing of the streams between Designing Data-Intensive Applications maybe,

Starting point is 00:16:55 but I was thinking more about the DevOps handbook in this book. But yeah, they actually do later in the book, like, hey, here's how you can quantify some of this stuff to see if it's even worth going after that additional nine. But, you know, to Joe's point, like, yeah, they, they actually call out in the book that like, you know, your users may not even recognize the need or that, that you added that extra nine, for example, or, or let's, let's just say that you were able to get to 100% reliability, right? But it took you an exorbitant amount of time and effort and money to get to

Starting point is 00:17:33 there, right? They said the reality is that the way your customers are even getting to use your service, they likely wouldn't even notice the difference between 99.9 and 100% reliability because the phone that they're using might be slow, the connectivity of their phone to their cellular service at that time, or whatever, because they're traveling and swapping cells as they're trying to browse it, they might not even notice it. So you put all that effort in for this, you know, fictional character that could use the system, that already has the perfect, you know, access to it, that just doesn't exist.
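To put rough numbers on the "nines" discussion above, here is a minimal sketch. It assumes a 365-day year and treats the service and the user's network path as failing independently; the 99% figure for the network path is a made-up illustration, not a number from the book, and the function names are hypothetical.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # assumption: ignore leap years for simplicity


def availability_for_nines(nines: int) -> float:
    """Availability fraction for a number of nines, e.g. 3 -> 0.999."""
    return 1.0 - 10.0 ** (-nines)


def allowed_downtime_minutes(nines: int) -> float:
    """Minutes of downtime per year permitted at the given number of nines."""
    return MINUTES_PER_YEAR * 10.0 ** (-nines)


def user_perceived(service: float, client_path: float) -> float:
    """Availability the user sees, assuming the service and the user's
    own network path fail independently of each other."""
    return service * client_path


if __name__ == "__main__":
    # Each extra nine cuts the yearly downtime allowance by a factor of ten.
    for n in range(1, 7):
        print(f"{n} nines: {allowed_downtime_minutes(n):12.2f} min/year")

    # Hypothetical client path (phone, cell network) that works 99% of the time.
    path = 0.99
    for service in (0.999, 1.0):
        print(f"service {service:.3%} -> user sees {user_perceived(service, path):.3%}")
```

Under these assumptions, moving the service from three nines to a perfect 100% changes what the user experiences by less than a tenth of a percentage point, which is the point being made about chasing the extra nine.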

Starting point is 00:18:18 Yep. All right. So one of the other things that they tacked on, like some of the other focuses of these SREs are are like managing storage or an email service or a search service. Right. Like they they also try and keep these things alive. And this was specifically Google, obviously. Right. Like they were talking about their Gmail and their and, you know, their storage platform and all that. Um, so reliability is regarded as the primary focus of the SRE. And it, this was really cool to hear them call this out.

Starting point is 00:18:55 They said that they wrote the book largely to help the community as a whole by exposing what Google did to solve their problems, right? Their post deploy problems. And they also did this to help them define what they believe the role should be. Right. So it was like this, this multifaceted thing that they were doing here that was helping everybody else and themselves internally to, to fix some of their processes and things that they had going on internally. That was one of the things that I thought was super cool about it was that they were trying to help the industry by solving, there's this need for this role, but we don't know what to call it.

Starting point is 00:19:33 But here's what we think are the responsibilities, and we're going to throw it out there to the rest of the world. And the rest of the world can, you know, buy into it or add on to it or, you know, tailor it to their needs. But, you know, just Google's way of trying to help. And the guy who actually coined the term at Google, who is one of the Google VPs of like 24-7 operations, he's one of the contributors to the book. His name was, uh, Ben, uh, you know how I am with names. So get ready. Hear me out. Ben Treynor Sloss. Is that,

Starting point is 00:20:15 I don't remember. That was it. That was it. Hey, I won one. I got one. I feel like I'm in Ghostbusters. We got one. All right. So, but, yeah, so, so, you know,

Starting point is 00:20:32 I also found that kind of interesting about this book, though, too, is that like, you know, a VP took time out of his busy day to help write this book. No doubt. So it's kind of open source, almost. Like, they kind of released something out in the wild, they got some feedback, and they incorporated it, and other people are using it.

Starting point is 00:20:46 Yeah, that's awesome. And they called out too, like we mentioned, that if you're a small business, you're not going to necessarily be able to do everything here, but you should be able to take away some of these concepts, and they may be able to help you in your business. Um, this next part kills me because I feel like with everything I read, this is the case: the earlier you care about reliability, the better. Right? And what they mean by that is it's way less costly to implement some things up front, right? Like even if it's just a lightweight support capability, than to do it after you're way further along,

Starting point is 00:21:25 you have nothing in place and it's going to be really expensive to try and get those things in. But the reason why this is crazy to me is don't we hear this about security and just about everything else in software development? Oh, you got to do it up front. Otherwise I was just going to say the same thing. I was going to say like,

Starting point is 00:21:41 it's literally like pick a topic in computer science or in software development and it's like, yeah, it's much easier if you care about that up front. Like unit testing: it's so much easier to test your application up front if you've coded for it. Otherwise you'll have leaked-in dependencies and whatnot, and, you know, it'll be harder to abstract those and mock those out. So you gotta do it up front. Or like, oh man, if you don't start with DevOps up front, if you don't have like an automated pipeline, then it's so much harder to, you know, come back behind the scenes and add it. So yeah, that was the one rub here that I have, was like, yeah, okay, I get it, it's easier to care about the

Starting point is 00:22:21 reliability up front. Yeah. Okay. But also that's the same truth for everything. Yeah. It's easier to care about pointer math up front as you're iterating through your array than it is to later find out, oh, I went too far. I mean,

Starting point is 00:22:37 also MVP, right? It's, you should get your thing out to the customers and start selling before you even build it. But yeah, they're all competing. I mean, they're not wrong, right? I mean, what they're saying is absolutely correct, but I

Starting point is 00:22:50 mean, how much stuff can you build up front for your MVP? It's frustrating. Yeah. So, uh, who was the first SRE? Uh, cool story here. So, um, they kind of picked on Margaret Hamilton here, who was an Apollo program director from MIT. And there's a story that they told in the book about basically how this woman's kid came into the office one day and ended up pushing some buttons while they were running a simulation. And the Apollo rocket ended up crashing during simulation. And it wasn't supposed to happen, so they looked into it and found out that the kid had ended up triggering some sort of sequence that wasn't supposed to happen at that point. It lost the navigation data. And, uh, so, you know, Margaret Hamilton ended up trying to write up a defect and documented it, and, uh, they kind of pushed back on her and said, well, this isn't going to happen, that wasn't supposed to happen, the

Starting point is 00:23:44 astronaut would never do this. Right. This is not the thing. This is something that would never happen in production, essentially. And so they ended up not fixing the bug. They didn't prevent it. And sure enough, guess what happened? But, you know, the cool part, though, is she knew like exactly what you said.

Starting point is 00:24:02 She was like, okay, so maybe it's not supposed to happen, but I'm gonna write up some steps to recover from it anyways, and I'm gonna put it in some documentation so that, you know, hey, even though it's not supposed to happen, if it ever does, at least somebody will have something to go back to. And that was super important because, guess what? Yeah, like Joe Zack said, somebody screwed up. Yeah. And did it. So what's the takeaway? The takeaway is it's easier to write your readme up front than it is to do it after the fact.

Starting point is 00:24:35 That's right. That's right. You have a good read me up front. Then you're okay. How many times have you heard that? Like, this is something that would never happen in production. So we're just going to manually do this, whatever.

Starting point is 00:24:44 And we're not going to try to code around it. Nobody's ever going to try to drop the table, so we don't need to worry about those permissions. It's fine. So that's why you need a big if statement that says, if you're trying to drop the table, don't do it, right? Too specific. It's hard.

Starting point is 00:24:56 Software is too hard. It really is. Let's all just agree to stop. Yeah, we should. I mean, computers are smart enough to write it all, right? No. Yeah, pretty much. They've never gotten the design requirements from product management. Yes. All right. So we're not better? Not at all. No, no. I mean, it's just the way it is. Um, so the SRE way, right? Thorough dedication, belief in the value of preparation and documentation.

Starting point is 00:25:29 So what we just talked about. I told you that readme would get you. The readme is important. And awareness of what could go wrong and the strong desire to prevent it. So that's the SRE way. And that is the end of the preface, the opening to the book. Now, I will say chapter one, I loved the opening quote. You want to tell us?

Starting point is 00:25:58 Yeah. Hope is not a strategy. And they later go on to explain, because you read that at first, and you're like, wow, that sounds really dark. Like, okay, I guess. But I mean, rebellions were built on hope, but okay, fine.

Starting point is 00:26:16 Hope is not a strategy. But they later go on to say that the point that they explain is that from an SRE perspective, you don't rely on hope as your answer to anything. Like, well, that will never happen in production. I hope that will never happen in production, so we're not going to deal with it. Instead, just ensure that it can't happen in production.

Starting point is 00:26:38 Yeah. Another expression you've probably heard is basically saying, we don't even know if this is a problem in production yet, so let's not put any effort into solving it yet. And there's times and places, like, absolutely those are totally fair comments and you can make decisions, you know, it's totally fair. But those are kind of at odds with the SRE's kind of mindset about things. So I think an SRE in a situation where somebody said that's never going to happen in prod or, um, you know, we don't know if this is a problem yet, they would say, okay, well, let's document it somewhere, let's get some thoughts together, so if it does happen we're prepared, just like the Margaret, uh, Hamilton example. But so you know what's really

Starting point is 00:27:16 cool about what you just said there is when you call that out specifically from two different perspectives, right? Like the developer of the product, the person who's doing the product features, might say that like, hey, this is never going to happen there. They say it all the time. But they have different, we've got different concerns than an SRE does, right?

Starting point is 00:27:36 Like somebody who's tasked with keeping the systems running is going to look at that and be like, oh, well, you say that's not going to happen, but I can clearly see a case where this could be a problem and it might cause X, Y, and Z. And so I'm going to put some attention on that. Right. So you have two different perspectives, you know, the product developers, they're trying to get something out the door, um, for the customers

Starting point is 00:28:00 and the SREs are trying to keep it running for their customers, right? So you have two different perspectives, and so it's really good to have those two different visions on it, I think. Well, and the SRE is not a blocker. The SRE, uh, we'll get into setting budgets for disruption and all sorts of stuff, but, um, the idea isn't to block changes or to not take risks. It's okay for systems to go down. It's about how fast to recover and managing that downtime and managing your reliability. I do want to be careful, though, because the one thing that you said, Alan, with the SRE, it almost made it sound like they were just purely in charge of operations, just purely in charge of making it run. And that's not,

Starting point is 00:28:45 that's not their role. And so like in this specific chapter, like heavily, like I was thinking of this book compared to the DevOps handbook and, you know, the, the, um,

Starting point is 00:29:01 the clash that the dev, the DevOps handbook and specifically the Phoenix project book, you know, the companion book that went along with it, you know, illustrated a good story, you know, clash of like developers versus the operations team.

Starting point is 00:29:17 Right. Right. And this book, you know, as you're reading this first chapter and they're like laying the groundwork for the introduction into SREs, right? You know, there were some heavy comparisons running through my mind there. Yeah. They are not purely the ops team, right? So I think that's what you're drilling out there. Well, yeah. Cause, cause, okay. So let's, so let's step into this. So

Starting point is 00:29:42 the old way is that you would have this sysadmin. You'd have the sysadmin approach to systems management, right? So you'd have a system administrator to run services and respond to events and updates to those systems as they would happen, right? So that person was, you know, like, let's go back into the 90s, for example, you know, or early 2000s. And, you know, that would be the person that's like, hey, they manage this rack of machines. And if you want anything installed on it, they do it. If there's ever an update that comes out from a vendor, um, for any of the software running on those machines,

Starting point is 00:30:26 they take care of it. And like how they ever determine like which patches they were going to apply, they're willing to apply and which ones they weren't like was always like magic to me. Cause it was like, how do you really, do you know, you're just going to like,

Starting point is 00:30:42 you're just guessing, right? Like you're just going to do it, but you're guessing, right? But at any rate, uh, yeah, that was the old school way. Right. And, you know, you would have teams of these people. Depending on the number of machines and different software packages that needed to be maintained and the skill sets that were required for those, then, you know, those teams would grow as that capacity was needed. So you might have one group of sysadmins that's just in charge of one particular database technology.

Starting point is 00:31:14 Another group of sysadmins is just in charge of one particular operating system. These are my Windows sysadmins. These are my Linux sysadmins. Right. And, um, you know, you could have like multiple sysadmins that are responsible for the same physical machine, right? Like one guy who's the OS guy and another guy that's, you know, whatever software package happens to be installed on that thing. Right. Um, and, uh, yeah, what did we have here? Usually the skills for a product developer and a sysadmin are different. Right. So I think I've even said this before on the show, that like, I always had this kind of mindset that a good software developer was a decent, uh, admin and a good admin was a decent

Starting point is 00:32:08 software developer. Right. But, but that's two ends of the spectrum. And, and I just kind of like have always had this vision that they kind of crossed somewhere, you know,

Starting point is 00:32:17 but therefore that's why they would end up on different teams though, because there's different strengths and weaknesses among those two skill sets. And there's conflicting interests, too. Like the sysadmin wants to keep things stable. The developers want to introduce new features. And so there's always been this kind of pushback with sysadmins or operations teams. And that's one of the downsides. And you think about, well it doesn't sound great

Starting point is 00:32:45 Right? You know, to keep something stable sounds ideal in one mindset. But another point that they make in this book, though, is that if something is truly stable, you know, because the ops team or the sysadmins aren't letting you put new things on it, then that also means that it's just growing stale and it's going to be boring and it'll eventually not be used. Well, that's what they said, right? The two are constantly at odds with each other because what makes the system unstable is changes to the system, right? And the whole role of a product developer is to make changes to the system. So obviously your sysadmins don't want to change anything and your developers are wanting to shove something in there every other minute, right? So they are at conflict. And what sucks

Starting point is 00:33:37 about it is they actually talk about some of the disadvantages to splitting up these teams. And one of them is the direct cost. They said, it's actually really easy to see these costs because, um, you know, as outlaw said, when the systems grow, so does your need for more sysadmins. And so you actually are growing your team and you can see these costs as they happen. Right. Um, and it doesn't scale well because you have to have more manpower. So if you're going to add a hundred more machines, um, with, you know, however many operating systems are going to be on those in today's virtualization environment, it's not necessarily

Starting point is 00:34:14 a hundred. Now you've got the need for well more manpower. So it doesn't scale well. Um, but then they talked about the indirect costs, which was interesting. They said, this is subtle and it's not quite as easy to see, right? But it usually costs more than the direct cost. So like the manpower and all that, you don't realize how much money you end up spending here. So it's like communication, people building the wrong things, processes.

Starting point is 00:34:41 Yeah, the communication is key because they said that developers and sysadmins often use different language, right? So they're not even communicating on the same level, right? Like what the developers say may not commute, may not translate exactly to what the sysadmin needs to hear. And so you're going to have miscommunication there. And that's big. That can cause a lot more problems. And because communication is hard, that's why we need a site reliability engineer and everyone's done. Oh, and here's another thing too. This goes back to the whole conflict and the two being sort of

Starting point is 00:35:20 at odds with each other is you have different assessments about risk and the possibilities for technical solutions, right? Like the developers, like I just put it in, it's not gonna be a big deal. And the, and the sysadmin is like, no way, dude, I'm going to be here all weekend dealing with this thing if, if you mess with that. Right. So that's a thing. I mean, this is exactly everything that the DevOps handbook was talking about, right? It was like how these two groups are, you know,

Starting point is 00:35:49 they're at odds with one another. They have different metrics for success and those metrics are, you know, opposing, you know, are opposed to one another. So how, how do you win in these environments?

Starting point is 00:36:04 So it was interesting that, like, because the DevOps movement had already started by this point, by the time that this term came to be a thing, or at least, uh, from the book, uh, came to be a thing. Um, did they even say, like, when SRE, the term and job title, was created? They just said the book. They Googled it. It's funny. But yeah, in the book, the 2016. Well, you're right.

Starting point is 00:36:29 No, you're right. Okay, so the book was published in 2016, which is I think the first time most people had heard it. I couldn't find a reference sooner. DevOps first coin was 2007. It's almost 10 years prior. So, okay, so then we suspect that maybe DevOps had existed before Google had this term. We don't know. Google might have had this as like an internal thing, you know, for some time before, but we suspect that it was DevOps first. But the point is, is that it's still interesting that even after the advent of DevOps, they still felt the need to like take it to this extra step.

Starting point is 00:37:05 Yep. And so what's interesting, after all that, like the being at odds with each other, they did call out some of the things that happened from this, and this is where some of the other costs come in and the inefficiencies in the company is, hey, because you don't want things broken, operations introduces launch and change gates, right?

Starting point is 00:37:27 So now it's going to be harder for you to get your stuff into production. They want to check for every problem that's ever happened before they approve something to go into production, right? So you have a new button that you're going to put on the page, and all of a sudden they're like, well, is this going to break the 500 other problems that we've seen before? Okay, that's going to be rough. And then that causes the dev teams to introduce fewer changes

Starting point is 00:37:50 because they're like, wait a second, every time I go to push something in production, I got to go through this change gate process? No, I ain't doing that. And so they end up putting in more feature flags. And this is interesting. I hadn't actually heard of this, but I could totally see it happening is dev teams will start sharding their features into separate

Starting point is 00:38:10 branches. So they don't even have to talk about them when they go to release this change, right? Because it wasn't part of the code base. So they don't want to have to go through this review process. So all of this is a lot of added cost, both time, money, everything, just to try and get things released. Okay, so one, I love that they actually referred to it as trench warfare. Right. You know, it was the way the two teams would operate against each other. But I don't know about you two.

Starting point is 00:38:42 As I was reading this, I thought back to a shared experience that the three of us had in a previous life, right? Where we used to work in a situation where we would do deployments as needed. We might do three a day, you know, if that was the thing, or, you know, we might not do one at all, but you know, depending on what the situation was,

Starting point is 00:39:10 we would do what was needed. And then we got a new director that came in and one of the first change gates that he put in place was, he was like, we're going to only do deployments twice a week. And on these days, and that's it. And you're going to have 12 miles of documentation and,

Starting point is 00:39:33 and everything else behind it before you ever like it, the process of releasing took a day. So if you were, if you were doing two releases a week, you spent two of those days literally just documenting things to be able to get things out the door. And it made it to where people just didn't even want to release anymore. And it really made me think back to that experience with his bosses, he had some kind of incentive to ensure that from a reliability point of view that the site didn't go down and whatnot

Starting point is 00:40:14 and they kept making money. And every time that we would do a deployment, then that technically meant risk of us introducing something that might bring the site down. And so from he, he might have had unbeknownst to us a very, you know, in, you know,

Starting point is 00:40:35 a cost incentive that would affect his own wallet, you know, to keep us from wanting to do that. Maybe. Yeah. I don't know if you guys thought if that came to your mind when you guys were reading that part of the book. No. Try not to think about the past.

Starting point is 00:40:51 Oh, yeah. That's probably a healthy outlook on life. Is that your tip of the week? That should be, you should have used that one. That was good, yeah. All right, well, oh, boy. I i guess i guess i will i will do this

Starting point is 00:41:11 review thing because like the last time you guys made it weird so listen we've never i think like didn't you last time go into like a super like subtle voice like it was like super NPR kind of voice or something or was that Jay-Z that did it? No, that's why we got a review though. Dang it. When you're right, you're right. That's right.

Starting point is 00:41:35 Alright. Blame Amazon customer because here it comes. Hi, listener. If you haven't already left us a review, we would greatly appreciate it. You can find some helpful links at www.codingblocks.net

Starting point is 00:41:55 slash review. And if you are a Spotify listener, you can also leave a rating within the Spotify app. We greatly appreciate all of those reviews and everything that you can do to help spread the word about Coding Blocks. This one going out across the line to Delilah in Kansas. Thank you.

Starting point is 00:42:23 Shouldn't it have been Delilah in New York? Yeah, it might have been. Isn't that the song? Maybe. Hey there, Delilah. What's it like in New York City? Okay. All right.

Starting point is 00:42:38 A few episodes back, we asked... It was definitely a while back. We're behind on some of these surveys because this is from new year's or no maybe this was from like uh no this is probably from like direct right after new year's do you stick with your new year's resolution is the question your choices were for the first couple of weeks or I'm pretty good until spring ish, which I guess for like,

Starting point is 00:43:11 this is about the time that everybody would stop being good about it. Right. Or I'm like a machine. Resolutions are rules that are not meant to be broken or wait. Those things are to be taken seriously. They're broken by new new year's day or what are resolutions? All right. So this is what?

Starting point is 00:43:33 181 Alan, you're up first. Yeah. This one's clearly wait. Those things are to be taken seriously. I'm going to go with 50% here. Like I'm fairly certain. Most people are like,

Starting point is 00:43:44 I ain't messing with these 50% here. I'm fairly certain most people are like, I ain't messing with these. 50%, okay. Okay, and just for fun, I'm going to go with, I'm pretty good until spring-ish with 22%. Almost got me there. Almost got me. Okay, math my chicken. Strikes again. All right.

Starting point is 00:44:11 Well, you're both wrong. What are resolutions is the far and away winner with 48% of the vote? Oh, I was close on the percentage. All right, good. And I had the right notion. Yeah. Yeah. Yeah.

Starting point is 00:44:28 Those were two. Uh, wait, those things are gonna be taken serious. All right. Yeah. Like, um,

Starting point is 00:44:35 I guess nobody really follows through on, on their resolution. I've done terrible last couple of years. So yeah, that's why I don't even try. All right. So here's the survey for this episode. You ready?

Starting point is 00:44:48 So since we're talking about SRE and we've already given some throwbacks to the DevOps handbook, the question is, so DevOps is a culture, but SRE is a job title? And your choices are, wait, what? Or, yeah, I get it.

Starting point is 00:45:11 Or, meh. Nice. Hey, a reminder, if you drop us a comment, we'll send you the book. I should mention to you like the – Whoa, whoa, whoa, whoa, whoa. Leave a comment and you have a chance to get the book what did i say what is going on here we're about to go broke oh yeah there's a chance to chance to win yeah well uh i was gonna say the digital copy is free so i'm assuming if you leave comment that you're going to want the physical.

Starting point is 00:45:47 So yeah. Yeah. Well, that's why I was joking at the start when I asked like, Hey, are we going to give a copy away for free? Because the very next line of the notes is, Hey,

Starting point is 00:45:56 this book is available for free. Nice. All right. That's funny. Yeah. That's true. I don't know. Yeah.

Starting point is 00:46:01 We can give away a physical version too. So, you know, it's fine. Yeah. If you're dying to have a copy let us know there is an audiobook version maybe maybe it would be nice to give that away yeah totally it's awful i've been listening to it yeah so if if yeah if you leave a comment on this episode you can let us know if you want a physical copy or if you just want to leave a comment or if you'd like an audiobook you know there you go. Before we start this next chance section though,

Starting point is 00:46:28 I just have like one small rant. I need to get off my chest. If you, if you would let me, I think circles are pointless. All right, we can go on. I got that out.

Starting point is 00:46:42 Oh boy. Okay. Let's talk about Google's approach to this problem. So, site reliability engineer. The idea is to focus on hiring software engineers to run their products, not sysadmins. They create systems to accomplish the work that would have historically been done by sysadmins. And I remember I used to work with a guy that was like a sysadmin slash DBA. This person had an amazing superpower: they could sleep sitting up, just in their chair. And then you'd be like, hey, George, can you put this in production? George? George? I email, can you

Starting point is 00:47:24 run this query? And then he would eventually do it. And they're trying to get away from that by building systems rather than having kind of people

Starting point is 00:47:31 sitting in those spots, sitting idle often, or vacillating between sitting idle and freaking out all the time because something's wrong.

Starting point is 00:47:39 So a nice quote they had here is that SRE is what happens when you ask a software engineer to design an operations team. Isn't that awesome? Yeah.

Starting point is 00:47:51 That is pretty cool. I like that. And they've got a nice breakdown coming up right now. They say the SRE role, the responsibilities can be broken down into two main categories. 50-60% are Google software engineers, or people that were hired during the standard hiring procedure, and 40-50% were candidates who were very close to the Google software engineer qualifications but didn't quite make the original cut. This isn't the breakdown I thought it was. All right, so, so hold up, hold up. Right? It's a hard breakdown. Yeah, it's very hard.

Starting point is 00:48:21 This is not what I thought it was. Gosh. Like, yo. Um, so this is, so, Outlaw, you did pretty well on the interview, but you're not quite software engineer material for us. But they say this: but I think you'd be great at being an SRE. Like, you want to come work for us? Like, that sounds pretty good, right? I guess. I mean, I'm sure they don't frame it like that when they call you back, right? But that is kind of brutal that they say it. Although Joe did mention at the top of the show what their compensation is, and it's not terrible. So being second best and still making the cut is not terrible. But what a weird thing to say. And I do have a feeling they're talking about, like, the initial kind of wave of SREs here. Uh, and they mentioned additionally that they had skills that would be very valuable for SREs but that were not as common for typical software engineers, like

Starting point is 00:49:22 being good with, uh, you know, knowing Unix or whatever, networking knowledge. And so Google has tracked the progress of these two kind of career paths. And they say there was very little difference in the performance over time. So whether they were more kind of traditional software engineers or candidates that were a little lighter on software engineering but had those other skills like Linux or networking that kind of bolster there. Yeah, so the key takeaway here is if you were in that 40% that didn't quite make the original cut, like the top tier cut, it didn't seem to really matter. And so total tangent here, we've talked about this in the past, right? Like just some of the interviews at some of these larger companies, right? Like the FAANG companies, they can be really difficult, right? And some people aren't going to make that cut, but does that mean that they wouldn't have been a great fit and a great

Starting point is 00:50:18 developer for that company? Not necessarily. It doesn't mean that. Like, I mean, we've laughed about the fact that you have to do the traveling salesman algorithm in an interview, and then you get in there and you're shifting pixels around on the page because you're doing UI stuff. Yeah, or you get questions like how many golf balls can fit into a 747. Here's one. I don't know if you've heard this one. Why can't your nose be 12 inches long? And be a foot.

Starting point is 00:50:45 There you go. Joe gets the job. Yes. But, but yeah, so it is good to know that they did call this out. Like right after they talked about the fact that, you know,

Starting point is 00:50:58 the first part are the top tier hires and the second ones are like the, the ones that didn't quite make that cut. They both tracked almost identically in terms of career growth and their path and all that. So that's really good to hear. Yeah, and this isn't what they're doing today. This is what, like, their kind of initial, you know, launch. Who knows what they're doing today. But when they first started hiring SREs, they took, you know, half that passed and half they took a chance on, and figured out that they worked out the same. Yep. So one of the things that is cool about this is they looked at these software engineers and these ones that are going to be automating these old sysadmin tasks.

Starting point is 00:51:40 And there were some things that stood out to them in this hiring process. Software engineers get bored doing repetitive tasks. That's so spot on. I know, like, give me a boring task and I have a hard time staying focused on it. Like, I seriously struggle with it. It is terrible. But software engineers, when they get handed these things, they think, well, how can I get rid of this repetitive task? Right. Like, how can I make this go away?

Starting point is 00:52:09 And that's a really interesting hiring perspective. I never even thought of it as more of like, you know, our nature to get bored by repetitive tasks. It's just more like, OK, how can I just make my life a little bit easier? And I don't want to ever have to do that thing again. So I'm just going to like write something so I can write it the one time. And then the next time you asked me to do it, I'd be like, oh yeah, I'm going to spend the next three hours working on it. And then instead I'm going to click this button and then I'm going to go

Starting point is 00:52:35 ride my mountain bike. And then I'm back and be like, well, still slaving away on it. Boss man. Yep. Just finished. So yours is inconvenience. Mine is straight up boredom. I cannot do repetitive tasks.

Starting point is 00:52:49 No, see, my deal with that is I automate the first time. So the second time when I'm asked to do it, I can quickly push the button and run it because I'm actually like three weeks behind on the other stuff I was supposed to have done. So I need to be automating some of this stuff. Otherwise, I'll really never get done. And the sad reality, though, is like as we get older in life, well, I mean, we will get older. I'm 21 at the moment, so I don't

Starting point is 00:53:09 have to worry about this yet. But, you know, eventually, like, you know, the memory cells aren't what they used to be. So you're like, you want to, like, not necessarily automate it, but it's automated as a form of documentation, so you could just ensure that you don't have any typos anymore in your execution of that thing. So yeah, I got to get this written down because I don't got much thinking left. The site room loves

Starting point is 00:53:35 the engineering stuff. Is that a boom hour? Yeah, boom hour. There you go. You know that show's coming back. Oh, for real? Yeah, it's coming back. That's awesome.

Starting point is 00:53:55 Anyway, so SRE teams must be focused on engineering. Traditional ops groups scale linearly by service size. The bigger the service, the more people you hire. By contrast, SRE teams don't. More efficient way to cap, yeah, I had to cut that off there. Hey, this is too late for me, y'all. It's late. You know what's so good is I think that was actually a great presentation tact that you did right there. That was, leave him hanging for a minute. Make sure he's still awake.

Starting point is 00:54:39 I'm sorry. I'm so sorry. So in order to encourage the kinds of behavior that Google wants to see, they put a 50% utilization cap on SREs doing traditional ops work. They say they do not want you spending more than 50% of your job doing things like backups, deploys, monitoring, learning production support type stuff. They want to cap it at 50%. This ensures that the SRE team has the time to automate and stabilize the software through automation. Because if they don't give you time to automate,

Starting point is 00:55:16 you're just going to be fighting fires all day. This was my favorite part of this chapter far and away, was the fact that they just automatically time box it to say like, you can only work on this, but so much of the time we need you to like actually focus on making things better.
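As a rough illustration of the 50% cap being discussed here, a team could log hours and check the split each week. This is only a sketch under assumptions: the `WeekLog` structure, its field names, and the hours are made up for the example and are not from the book or the episode.

```python
from dataclasses import dataclass

OPS_CAP = 0.50  # the 50% cap on ops work mentioned in the episode


@dataclass
class WeekLog:
    """Hours one engineer logged in a week (hypothetical structure)."""
    ops_hours: float      # tickets, on-call, manual deploys, and so on
    project_hours: float  # automation and other engineering work

    def ops_fraction(self) -> float:
        total = self.ops_hours + self.project_hours
        # Guard against an empty week so we never divide by zero.
        return self.ops_hours / total if total else 0.0

    def over_cap(self) -> bool:
        return self.ops_fraction() > OPS_CAP


week = WeekLog(ops_hours=26, project_hours=14)
print(f"ops share: {week.ops_fraction():.0%}, over cap: {week.over_cap()}")
```

A check like this only reports the split. Deciding what to push back on when the cap is exceeded is still a management call.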

Starting point is 00:55:35 Isn't that cool? I mean like imagine they say, all right, Monday, Tuesday you are on ops calls, right? Um, Wednesday through Friday we want you to automate the things that you dealt with Monday and

Starting point is 00:55:47 Tuesday. And then basically what they say is, hey, after a little bit of time, you're no longer spending any time on those ops calls because you automated all that stuff. Yeah, this requires strong management when you think about it, though. What you're kind of saying is like, hey, you know what? We've already spent 50% of our week on deploys this week, so we're not doing this deploy for you on Thursday, and we're not going to check on your backups for you on Friday. We're not going to look in your logs. We're going to work on automating. And that only works

Starting point is 00:56:18 if your management is willing to say, you know what, they're right. They spent the time they were supposed to, and you've got to leave them alone to automate this or else we'll never catch up. Well, my guess is they'd probably have a rotation, right? Like, yeah, you know, there'd be a few guys doing it or gals doing it Monday, Tuesday, and then a couple doing it Wednesday, Thursday, whatever. Kind of, kind of. Oh, you read an asterisk on that. Yeah, we'll get there, because later in the book there's a whole section on, like, error budget. And one of the things is that, like, you could decide that, like, hey, I can only afford to have, we'll get into this in more detail when we get to that chapter, but you might decide like, hey, I can only afford 15 minutes of downtime, you know, for whatever given time interval you have, right? However you spend that 15 minutes, you know, if you spend it all, then,

Starting point is 00:57:09 you know, you might not be able to do any more new releases because those new releases could potentially add more downtime. And so, therefore, like once your downtime budget is used up, it's used up, period. And they actually made a point of calling out that it might not be you or any member of your team that's responsible for it. It might not even be anyone in charge of the physical machine. It could be something like the power distribution to the rack that died, but it spent your entire error budget. And so now for the quarter, you might not get to do a new deployment, right? Interesting.
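To make the error budget idea above concrete, here is a minimal sketch, assuming a fixed downtime allowance per time window. The `ErrorBudget` class and its method names are hypothetical, and the 15 minutes is just the example figure from the conversation, not a recommendation.

```python
class ErrorBudget:
    """Tracks downtime against a fixed allowance for one time window."""

    def __init__(self, allowed_minutes: float):
        self.allowed_minutes = allowed_minutes
        self.spent_minutes = 0.0

    def record_downtime(self, minutes: float) -> None:
        # Any cause counts, even one nobody on the team is responsible for.
        self.spent_minutes += minutes

    def remaining(self) -> float:
        return max(self.allowed_minutes - self.spent_minutes, 0.0)

    def can_release(self) -> bool:
        # New releases are only allowed while some budget is left.
        return self.remaining() > 0


budget = ErrorBudget(allowed_minutes=15)
budget.record_downtime(15)   # e.g. the rack's power distribution died
print(budget.can_release())  # False: the budget is spent for this window
```

In this sketch the budget does not care who caused the outage, which matches the point made above: once it is spent, releases wait until the next window.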

Starting point is 00:57:52 Yeah, it sounds totally crazy to me, but it's interesting. And so I'm keeping my mind open. Well, so that's why his point about the strong management is so relevant though, because it really does require that everybody agree. Everybody buys in to this and they agree to this right in there. So,

Starting point is 00:58:14 yep, we, we've spent our budget and we have to wait until whatever our time interval is before we can do it again. Yep. Next quote here too is they said they want systems to be automatic, not just automated. The only thing I mentioned here is that SREs tend to build up like a playbook of basically, you know, common things, scripts, um, tasks, you know, things that need to

Starting point is 00:58:36 happen on a regular basis, and that's a good stepping stone to fully automating that stuff. But the ultimate goal is to have that stuff be built into the system itself. So whatever, you know, human decisions are being made there, get kind of built in. Yeah.
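The step from a playbook to a system that fixes itself can be sketched as a lookup from alert name to remediation. Everything named here (the alert names, `restart_worker`, `clear_tmp`, `handle_alert`) is hypothetical and stands in for whatever a real team's playbook contains.

```python
from typing import Callable, Dict

# Hypothetical alert names mapped to the fix a human used to run by hand.
Playbook = Dict[str, Callable[[], str]]


def restart_worker() -> str:
    return "worker restarted"    # placeholder for a real remediation step


def clear_tmp() -> str:
    return "temp files cleared"  # placeholder for a real remediation step


PLAYBOOK: Playbook = {
    "worker_unresponsive": restart_worker,
    "disk_almost_full": clear_tmp,
}


def handle_alert(alert: str, playbook: Playbook = PLAYBOOK) -> str:
    fix = playbook.get(alert)
    if fix is None:
        # Only alerts without a known fix should reach a person.
        return f"page a human: no automated fix for {alert!r}"
    return f"auto-fixed {alert!r}: {fix()}"


print(handle_alert("disk_almost_full"))
print(handle_alert("mystery_failure"))
```

The design choice is that a known alert never pages anyone; only the unknown case falls through to a human.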

Starting point is 00:58:51 I mean, I love this because think about like how many times do you have an alert get triggered because some, some condition happened and then you have a playbook of like, Hey, if there's ever this alert, here's how you go and fix it. And their point was,

Starting point is 00:59:03 well, then just automate the fix, right? Yeah, exactly. That shouldn't be an alert anymore. Um, so one of the key callouts here too is, hey, the only way to make sure that they are spending 50% on development instead of, you know, pure ops work is you have to measure it. Right. And that again goes back to good management and making sure that that's actually happening. And this is my least favorite part of the book. I don't want, I don't want to be checking time.

Starting point is 00:59:31 Yeah. Yeah. Same, same. Um, but they also called out that they, Google has looked at this and SRE teams are cheaper than ops teams because the SRE,

Starting point is 00:59:44 the SRE teams know the product well and they find out ways to prevent the problems that come up. And so you don't have people doing the same thing over and over and over, over months and years. And there are a couple of challenges with SREs that they brought up here. So I mentioned hiring being hard. First of all, the kinds of people you're trying to hire, the kinds of people that everyone wants to hire, the people who know how to do stuff, whether it's product work or even networking stuff, it's kind of the middle position. So it's competitive even with your own org, let alone like other work. And at the time, this was a new title.

Starting point is 01:00:22 So how do you tell somebody, hey, I'm trying to hire you to be a flibbity flobbity flibbity. They're like, wait, what is that job? You're like, site, site, bid, and shibbity shibbity. They're like, what? Yeah. And the book doesn't mention it, but also pretty sure SREs all wear pagers. So that to me is a big downside. So it's like, you're telling me you're going to pay me this or pay me that.

Starting point is 01:00:43 Both the same. Great. One has a pager. It's like, you're telling me you're going to pay me this or pay me that. Both the same. Great. One has a pager. One doesn't. Hmm. I didn't see. Now, come on now.

Starting point is 01:00:52 They didn't say anything about pagers. No, they didn't. That's because they're biased. That's the dirty secret truth. I mean, they honestly don't, but they don't talk about any kind of pager duty or anything, at least not in the portions that I've read so far. They talk about on-call schedules and how many defects you should be looking into per on-call rotation. Yeah. So, yeah.

Starting point is 01:01:19 On-call usually implies that. And then, like he has here, it requires dev skills as well as system engineering, right? Like, that was that thing that Outlaw was talking about, you know, good software engineer with decent, um, system skills or vice versa. You need that mix. And I didn't mention, uh, here either, but, um, also I think good communication is of course important for everyone, but especially good for SREs because, uh, because of the postmortems. I think it's really tricky to kind of get those right. And we'll be talking about postmortems later as well. Also, one last thing we already touched on, this is requiring strong management in order to kind of protect those boundaries and being able to kind of have the back of the team in order to say no.

Starting point is 01:02:08 So, all right. So this is just another DevOps title, right? Is that? Yeah. Yeah, I mean, the book goes there. They say that DevOps is kind of a generalization of several core SRE principles applied to kind of a wider range of orgs and stuff. But basically they say that an SRE is a specific implementation of DevOps with some idiosyncratic extensions.

Starting point is 01:02:32 And they are saying that SRE is a role and that SRE is DevOps. It's part of DevOps. So in a way they are making the argument that DevOps is a role. Is a role. Is a culture. This reminds me of Monty Python, right? Yeah. What else floats? Ducks. Does she weigh as much as a duck? Yes, she's a

Starting point is 01:02:54 witch. And it's something we struggle with all the time because, you know, the reason why we fight about this is we say that it's very important that the DevOps people be, like, intimately aware of the product and how it works and how it needs to work and, you know, the things that are important to it. They need to know the product. And that's why it's important that your people working on the product know how to run their own stuff. And the SRE, I think, accomplishes that by having that 50% kind of budget where they work on the product and automate the work. And so I think that's how they, uh, reconcile that, at least in my mind. Well,

Starting point is 01:03:26 I was thinking of it as like, uh, if you, if you adopt DevOps as your culture, then SRE is the position that comes out of it. I like it. Yeah. That's,

Starting point is 01:03:39 that sounds about right. So we agree. DevOps is a culture. I win! With positions. With DevOps, everybody wins. Yes, I believe so. So here are some tenets of SRE.

Starting point is 01:03:56 We're going to just kind of blast through the list here, and then we're going to focus on each one a little bit. Roughly, roughly. So, availability. And you got this first one. Latency. Performance. Efficiency.

Starting point is 01:04:12 Change management. Ooh. Yeah, I beat to that one. Monitoring. Emergency response. This one is definitely Joe's. I think the vowels on my keyboard aren't working very well. Or in my mouth, I guess.

Starting point is 01:04:33 Emergency. And finally, a capacity plan. What's the first letter of that word? I don't know what's going on. So first let's talk a little bit about availability, which they refer to here as basically a durable focus on engineering. So in order to keep the time for project work, we said

Starting point is 01:04:58 SREs should receive a maximum of two events per 8 to 12 hour on-call shift. That's a very specific and small number. I'd be good with that number. Yeah, it's pretty funny. They say if you have less than two, then you've got too many SREs or you're not doing enough publishing, right? You're not taking enough chances. It's kind of like gold plating, right? It's kind of funny. But two is still a pretty small number. I mean, I've had days where I would have loved for it to only have been two. Yeah. And there's days when, you know, one outage basically takes

Starting point is 01:05:37 more than one day too. So sure, they don't really say how big it is, but, uh, the idea is just that the low volume allows the engineer to really get in there, spend the adequate amount of time in order to fix the problem, and then write up a good post-mortem about what happened. That's the really time-consuming part, though. That post-mortem can be consuming. Yeah, if you do it well with timestamps and everything and exactly what happened, how you found the problem, that's not fun. That's not fun.

Starting point is 01:06:13 So sometimes I think I want to be an SRE and then I think about postmortems and pagers and I think maybe no. Yeah, but I don't know. I think I still like it. I like production support stuff. So, all right, so, uh, we'll just give all the production support stuff to Joe and let him be the on-call person. Yeah, but, uh, well, I only do like

Starting point is 01:06:37 half the work as anyone else anyway. So a big part of those postmortems, uh, is that they have to be blame free, which is something we've talked about on the show before, which is just the idea that you're not really attaching people to it. It's just more about the processes and where things went wrong, and you're not looking to try and blame someone, which is very easy to do. Another concept from the DevOps Handbook, yep, was the blame free. Yeah, and they also say that the postmortem should be written for all significant incidents, whether paged or not. So even if you saw an issue and it didn't alert anything, you should

Starting point is 01:07:16 still write the same postmortem for it. Yep, which is pretty disciplined. Yeah. I guess where I struggle with some of that too, and this is where the limitation of the two events could come from, because I could see spending a pretty good amount of time just on some of the write-ups for those things, depending on the level of event, obviously, and the severity of it.

Starting point is 01:07:44 And so I kind of wonder like, you know, if they have any kind of guidance, maybe they'll get to it later in the book, but if they have any kind of guidance on that postmortem, like you shouldn't be spending more than like 30 minutes max writing that thing. If you can't write it in that amount of time, then it's, you're either putting too much detail into it or it's like a much, much larger problem, you know?

Starting point is 01:08:11 Yeah. And I will say too, so they mentioned 8-to-12-hour on-call shifts and doing these postmortems. You're talking about hours of work. These on-call shifts are happening during your kind of work hours. Basically, this isn't your on-call from

Starting point is 01:08:26 you know, Saturday night, whatever, type stuff where you're writing these post-mortems and doing, you know, these hours of work. So, you know, I've been joking about the pager thing, but I think a big part of it is that these are kind of your normal work hours. You're just the person with the bat phone, you know, that needs to kind of take point on it. Max change velocity. So this section is referred to as latency. And normally when we talk about latency, you talk about like the amount of time, idle in a browser or something,

Starting point is 01:08:57 waiting for input and output. But in this case, we're actually talking about limiting the amount of change. And this is where we talk about things like an error budget, which is an interesting way to kind of balance innovation and reliability, because, uh, if you are pushing out new features, breaking things, doing things that require maybe scheduled maintenance, right? And let's face it, like most changes can be done, even to databases, a lot of times these days without any sort of downtime. But you might choose to take the database down because it's a lot faster and easier than trying to migrate in such a way where you, like, I don't know, bring up a replica, spin it up, sync it over while you make changes.

Starting point is 01:09:52 You know, it's just expensive and time consuming. A couple of factors we're going to talk about in a second: how much tolerance your users have for disruption. Yeah, this goes back to where that 100% uptime is generally considered not worth it. And it gets more expensive as you get closer to that 100% mark. And your customers might not even realize it. And so therefore, trying to get to that 100% is just wasteful. Yeah. And what is the right number? It's a business decision. It depends on a lot of factors like what the users expect, how important your –

Starting point is 01:10:24 I shouldn't say important, how critical your service is. Like, do they have a workaround? How well does the experience degrade if part of it is working but another isn't? I was terrible with the typos today. And have you ever had a manager push back when you're talking about, uh, technical debt? You're talking about what you need to do and, you know, it's important that we refactor this and that because it's taking too long to, uh, actually make changes here, and the manager pushes back and says it's fine for now, we've got things that need to go out the door because it goes

Starting point is 01:10:56 we're waiting and the business is going to die if you don't have this done by Friday night. Um, I thought one thing that's kind of nice is just having the measurements here to kind of show what your actual disruption is and your disruption budget. It's kind of a nice way to say like, hey, this is what we're losing by pushing stuff out faster or this is what we're losing

Starting point is 01:11:17 by not pushing things out faster. The problem is, though, like when you're a large enterprise company, it's easier to have these types of tenets in your company and for management to buy in. But when you're a small business and maybe there's 20 developers total in the company, it's much harder for the management to buy into some of these things. Yeah. Tech debt is especially hard to manage or to measure because you're talking about measuring something that is an idea to a certain degree. Right. Unless you take measurements early, like it's much easier to do at the beginning if you do this. But if you measure, hey, how long did it take to get a feature out and deploy?

Starting point is 01:12:12 You know, when we first started this versus, you know, a year in now, these releases are taking two weeks longer. You know, if you measure it all the way through, then you can do that. But when you're just talking about things that people have short-term memory loss on, it's hard to throw those metrics at a manager and be like, look, if we don't spend time on this tech debt, this is just going to get worse and worse. And that's just an idea. They're like, work harder, you know, and that's hard to fight. Yeah. There was like going back to like, what's the amount of uptime, you know, and that being a business decision. Later in the book, they talk about an example where, you know, even for the same product, the same service, you might decide to have two versions of that thing that run at different levels of uptime. And, you know, you charge differently for them, right?

Starting point is 01:13:09 Because maybe they're actually targeting different customers. Like maybe different customers want to use it for batch-type jobs, where, you know, they don't need it to be super reliable, you know, 24-7. But when they do, they want to slam it and they want it as fast as possible, versus another customer who might want, you know, five nines of reliability 24-7 because of whatever their need might be. And so that ultra high reliability use case, you know, we're talking about the same software, but we're going to configure it in two different ways for different purposes. Right. So that's where that business decision comes into play that, you know, we, as the software developers won't,

Starting point is 01:14:13 won't care. We won't decide that. Doesn't S3 have two levels of durability? I think it's like six nines or 11 nines or something. It doesn't matter. But yeah, good call. So fun question.

Starting point is 01:14:30 What could a team do if there's no more room in the budget for downtime? Yep. As far as like they don't get to do anything, you mean in terms of like new deployments, right? Or did you mean like what can they do to solve that for next time? I just meant like what you do is basically nothing. Like, you don't do anything risky, right? You calm it down. It's just, you know, a hard pill to swallow sometimes. And that was Google's answer, by the way, was that, well, then, you know, they talked about it on like quarterly budgets and they're like, well, you know, unfortunately that, um, networking switch or that, uh, power distribution block, whatever the case might be,

Starting point is 01:15:12 you know, it just happened to die in that rack and so it spent the entire budget. All right, well, you don't get to do any more deployments for the quarter, you know? Yep. That seems so unlikely to me, but I mean, that's the premise. I'm trying to keep an open mind. Just imagine, like, S3: someone spills, uh, champagne on January 1st, a little too much drink in the server room, and they knock a rack out and, uh, you know, there's an outage for one day and that's the entire budget for the quarter. Seems crazy to think that, you know, they would accept that. But that's the premise we're going after.
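To put rough numbers on the error budget idea discussed here, a minimal Python sketch follows. The 99.9% target, the 90-day quarter, and the function names are assumptions chosen for illustration; they are not figures from the book.

```python
from datetime import timedelta


def error_budget(slo: float, window: timedelta) -> timedelta:
    """Downtime allowed in `window` for an availability target such as 0.999."""
    return window * (1.0 - slo)


def budget_remaining(slo: float, window: timedelta, downtime: timedelta) -> timedelta:
    """Budget left after `downtime`; a negative value means it is overspent."""
    return error_budget(slo, window) - downtime


# Assumed numbers: a 90-day quarter and a 99.9% availability target.
quarter = timedelta(days=90)
budget = error_budget(0.999, quarter)

# The hypothetical one-day outage from the conversation.
left = budget_remaining(0.999, quarter, timedelta(days=1))

print("allowed downtime this quarter:", budget)
print("deployments frozen:", left < timedelta(0))
```

Under these assumed numbers the whole quarter only allows a little over two hours of downtime, so a one-day outage overspends the budget many times over, which is why the "no more deployments this quarter" answer follows.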

Starting point is 01:15:47 So I guess, you know, in an extreme case, they would maybe make an exception. But that is kind of the assumption we're operating under. Here's a more fun question. What do you do if you're at the end of the quarter and you still got budget? You just release everything. You just turn it off. Just take a break. No,

Starting point is 01:16:11 be back on Monday. That's awesome. Yeah. I don't think it works like that. I think, I mean, the real question is, yeah,

Starting point is 01:16:19 I mentioned like database migrations and stuff. Like, there's lots of times where a rolling upgrade is the right answer, and then you apply a migration, and then once everybody's transitioned you do like another rolling upgrade, and then you do whatever. Or you just take the site down for three hours and have it all done, and everyone is there in one room, and, you know, you could do something in three hours that would have taken three weeks otherwise. So I think, you know, that's the kind of stuff you can do, but you've got to be careful with that budget. It's good. Yeah. But even in that scenario, though, it's going to depend on what the service is that you're doing, because like if the thing that you're trying to do is like google.com, then you're probably not going to want to take any, like you're not going to want to like

Starting point is 01:16:58 purposely introduce that downtime. But some of these examples are like hard to discuss too, as it relates specifically to Google, because we're talking about like such a massively, uh, distributed worldwide system. And they even call this out in, like, a part of the book where they're talking about, like, you know, I've at least twice now given an example about like losing a power distribution or network switch within the rack, right? But even in Google, the way Google is set up, everything is distributed across so many systems and across so many different regions and across so many different areas around the globe that even if they were to lose an entire continent, there's a possibility that some portion of that service might be serviced by another country or, you know, something like that, right? Right. So here's a question I

Starting point is 01:17:51 didn't ask but should have. Uh, as a, like, I don't know, owner of a company or a director of an organization, how do you enforce your budget? You tie it to bonuses, that's how. You tie it to raises. You tie it to whatever you say is your error budget, and if you stay within this budget, you get this much more bonus. That's a way to kind of encourage everyone to take it very seriously. Yeah, I could see that working. All right, so next we're on to monitoring, and this one's pretty important. So monitoring is how you track the system's health, right? Like that's, everybody's probably pretty familiar with that. Well, the classic approach to this, and this is probably what most companies out there do, because this is how people have operated forever, is when there's a problem, an alert gets sent out. And, and that happens when

Starting point is 01:18:40 there's like some sort of event that happens in the process or some threshold was crossed, whatever, but you know, the, the gist, right? Some alert was sent out and, and typically when that alert gets sent out, then somebody goes and handles it somehow, right? Like going to go look at production or whatever the case may be. Well, they say that this is a flawed approach because anything that requires human intervention is by its very definition, not automated. So they're saying that software should be interpreting whatever is happening and people should only be involved if software can't take care of the problem. And honestly, this, this makes a lot of sense, but this isn't how most people have thought about it over time. Right. Um, and so they say that there are three types of valid

Starting point is 01:19:34 monitoring that you have. You have an alert and that means that a person needs to take immediate action, right? That they have to get involved. There's a ticket. So this is when the system can't automatically handle whatever happens, but it doesn't need to be looked at immediately, right? Create a ticket. Somebody can get it done within the next couple of days. And then logging. Logging, nobody needs to do anything. Probably never even look at them unless there's some sort of event that says hey you need to go look at these logs so that's what they call out and they really want to minimize the amount of human interaction that happens in the system at all and then the next piece that we got here is

Starting point is 01:20:20 the emergency or emergency response. It's actually the same thing, just pronounced differently. It is the same thing, that's right, emergency. So I have a quote in here because you just couldn't have said it any better. Reliability is a function of mean time to failure, which you've probably seen as MTTF before, and mean time to repair, so MTTR. So the best metric for determining the effectiveness of the emergency response is the mean time to repair, how quickly you got things back into a healthy state. Um, people add latency.
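One common way to make "reliability is a function of MTTF and MTTR" concrete is the steady-state availability ratio MTTF / (MTTF + MTTR). The sketch below uses that textbook formula, which is an assumption here rather than something quoted from the book, and the hour values are made up to contrast a human-paced repair with an automated one.

```python
def availability(mttf_hours: float, mttr_hours: float) -> float:
    """Fraction of time the service is up: MTTF / (MTTF + MTTR)."""
    return mttf_hours / (mttf_hours + mttr_hours)


# Assumed numbers: the same failure rate, but two different repair times.
human_repair = availability(mttf_hours=1000, mttr_hours=2.0)      # someone gets paged
automated_repair = availability(mttf_hours=1000, mttr_hours=0.1)  # software recovers

print(f"human in the loop:  {human_repair:.4%}")
print(f"automated recovery: {automated_repair:.4%}")
```

With the failure rate held fixed, shrinking the repair time is the only lever left, which is the argument for taking people out of the loop where software can handle the problem.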

Starting point is 01:21:00 This is really good. This is why they want systems to handle everything, because people are slower. Um, yeah, you're away at dinner when you get the alert or you're driving your car, so you can't possibly, you know, get to a computer within the next 30 minutes to even look at it. Like, that's latency that the people are adding. Yeah, and usually there's even communication on top of that, right? Like we've talked about in the past, an alert goes out. All right, well, now I need to coordinate with this person, that person, whatever. Um, but what they're saying here, and this is pretty cool, is if you can avoid having a person be involved at all, even if there was a

Starting point is 01:21:43 problem and it takes a little bit of downtime, if the system can handle it, ultimately it'll probably be more available than it was if a person had to touch it in the first place. And that's pretty cool. Now here's another part, and this is pretty good. Thinking through problems before they happen and creating the playbooks, they said, resulted in a three times improvement in the mean time to, um, repair as opposed to winging it. So like what Margaret Hamilton did, right, where she had written that thing up. The fact that she had it there meant that the people could go look at the instructions on how to get the thing back into a healthy state instead of people

Starting point is 01:22:30 going, oh, I think you do this. And maybe if you do that, it just goes a lot smoother. And then they also said that their on-call SREs, they always have playbooks when they're doing things. And then they also go through these exercises they call wheel of misfortune that allows them to prepare for these events. So my guess is they probably simulate some sort of failure and say, all right, go fix it. Right.

Starting point is 01:22:57 And, and that's probably what happens. I really liked the idea of that, but the wheel of misfortune. But then I was also thinking like, okay, wait a minute. Is this just like part of the hiring process? You give a new candidate the wheel of misfortune, see what they do to fix the environment? Or do you like really play this like, hey, on Fridays, we're going to play wheel of misfortune.

Starting point is 01:23:18 The chaos monkey, but you're part of it. Yeah. So change management. Interesting stat here. 70% of outages are due to changes in a live system, which I guess makes sense. You wouldn't expect outages to happen in dead systems, but it was still kind of surprising to me to see that. A couple of best practices we got here. Progressive rollouts.

Starting point is 01:23:42 So we've talked about canary deployments before, just rolling upgrades where nodes or instances or whatever kind of go out one at a time. And if things are not looking good, you can stop and roll back, or use like a blue-green environment. There's all sorts of different ways of doing this. But that's the gist of it. In order to do that, you need to be able to actually detect problems accurately and quickly and then

Starting point is 01:24:05 actually be able to do that rollback if possible. So that's the kind of stuff that SRE is going to be building into systems because that stuff is not trivial, especially once you start talking about data migrations. And the idea here is just to remove people from the loop. So automate, automate, automate. Another fun one.
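The progressive rollout described above (one instance at a time, detect problems, roll back) can be sketched roughly as follows. This is an illustrative outline only: the function name, the callbacks, and the fake three-node fleet are all hypothetical, and a real system would need real health signals and data-migration handling.

```python
from typing import Callable, Dict, List


def progressive_rollout(
    instances: List[str],
    deploy: Callable[[str], None],
    healthy: Callable[[str], bool],
    rollback: Callable[[str], None],
) -> bool:
    """Upgrade one instance at a time; undo every upgrade on the first failure."""
    upgraded: List[str] = []
    for name in instances:
        deploy(name)
        upgraded.append(name)
        if not healthy(name):
            # Undo in reverse order so the newest change is reverted first.
            for done in reversed(upgraded):
                rollback(done)
            return False
    return True


# Hypothetical fleet: node-2 is made to fail its health check after the upgrade.
versions: Dict[str, str] = {"node-1": "v1", "node-2": "v1", "node-3": "v1"}

succeeded = progressive_rollout(
    list(versions),
    deploy=lambda n: versions.__setitem__(n, "v2"),
    healthy=lambda n: n != "node-2",
    rollback=lambda n: versions.__setitem__(n, "v1"),
)

print(succeeded, versions)
```

Because the rollout stops at the first unhealthy instance, node-3 is never touched and the two upgraded nodes are returned to the old version without a person stepping in.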

Starting point is 01:24:23 This is one that I especially enjoy: demand forecasting and capacity planning. So the idea here is that forecasting helps you ensure service availability and keep your costs, uh, you know, in check and kind of within budget. And the idea here is to account for both organic, which is like your normal usage patterns, but also to try and account for inorganic, uh, growth, which is things like, uh, major launches or marketing events or maybe, uh, you know, some celebrity tweets about your product or something, getting some sort of unnatural spike. And, um, that's, you know, really hard to do, but you can kind of imagine what that would look like

Starting point is 01:25:06 with like 10x or 100x growth on, um, any sorts of numbers that you come up with. Uh, three steps here. So I like this one. Uh, you need to have an accurate organic forecast, and the important, um, bit here is that you need to have your forecast extend beyond the lead time for adding capacity. So if it takes you three months to order a new server, then you better be forecasting out more than three months. If it takes you an hour to add a new node or to, I don't know, add a new load balancer or whatever it is, then you need to forecast at least that much. So I thought it was kind of an interesting way of saying, like, figure out what your lead time is for capacity, whether it's disk, instances, nodes, like all that stuff, because you may need it.
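The "forecast beyond your lead time" rule can be sketched with a toy compound-growth model. Everything numeric below (600 requests per second today, capacity of 1000, 10% monthly growth, the 2x spike) is an assumed value for illustration, and the function names are hypothetical.

```python
from typing import Optional


def first_period_over_capacity(
    current: float, capacity: float, growth_rate: float, horizon: int
) -> Optional[int]:
    """First period (1-based) where compounding demand exceeds capacity, if any."""
    demand = current
    for period in range(1, horizon + 1):
        demand *= 1.0 + growth_rate
        if demand > capacity:
            return period
    return None


def must_order_now(
    current: float,
    capacity: float,
    growth_rate: float,
    lead_time: int,
    spike_multiplier: float = 1.0,
) -> bool:
    """True when demand would outgrow capacity before new capacity could arrive."""
    hit = first_period_over_capacity(
        current * spike_multiplier, capacity, growth_rate, lead_time
    )
    return hit is not None


# Organic growth only: capacity runs out in month 6.
runs_out_in = first_period_over_capacity(600, 1000, 0.10, horizon=12)

# A 3-month lead time is fine today, a 6-month lead time is already too late,
# and an assumed 2x inorganic spike breaks even the 3-month case.
print(runs_out_in)
print(must_order_now(600, 1000, 0.10, lead_time=3))
print(must_order_now(600, 1000, 0.10, lead_time=6))
print(must_order_now(600, 1000, 0.10, lead_time=3, spike_multiplier=2.0))
```

The shorter the lead time for adding capacity, the shorter the forecast horizon has to be, which is the same point the hosts make about ordering servers versus adding a node.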

Starting point is 01:25:57 Also, you need to try and incorporate inorganic demand. They didn't go into this. Maybe they do in a later portion, but I imagine this is just kind of like trying to say like, well, this is what a spike would look like if you try to imagine what would happen if you sign on a big tenant or a big client or Taylor Swift tweets out your product or something. Imagine there's some just multiplication of numbers there. And then the final piece here of the Triforce, regular load testing. This is not something you do just once when you launch a product. This is something that needs to be ongoing because you're making changes to the system all the

Starting point is 01:26:30 time, so you need to have that be a part of it. That stuff takes time. But also consistent with the DevOps Handbook. Yeah, totally. Totally. At least in terms of, like, the regular load

Starting point is 01:26:44 testing. Yeah, absolutely. This is very similar. So provisioning, and just kind of like we said earlier, the faster your provisioning is, the later you can do it. So if something takes three months, you need to figure out three months, you know, ahead of time in order to order it.

Starting point is 01:27:04 If something only takes five seconds to provision, like maybe adding more RAM or something or adding another node, then you can do that like five seconds before you need it. And what that means is you can be much more efficient with it, right? So if you're ordering servers three months ahead of time, there's a good chance you're going to get that server before you need it. It's going to sit around idle. It's going to take a while to get plugged in. You may not even need it at all. It's just less efficient overall to bite things off in bigger chunks like that. So another way of saying

Starting point is 01:27:34 this is the later you can do it, the less expensive it's going to be. And they also mentioned that not all scaling is created equally. So like adding a new instance to a stateless workload, like, trivial, right? You can set up an autoscaler and you don't even have to think about it anymore. Uh, adding another, uh, partition to a Kafka topic, like, oh, that's, uh, you know, there's a lot of implications there that have effects on producers and consumers and all sorts of stuff. Maybe even replication. And it's going to take a little while to roll out. It's going to be a process.

Starting point is 01:28:11 And so it's just about kind of figuring out everything you need to care about and what it takes to maintain it. Looking at the next section. So the last one here is efficiency and performance. So basically the SREs are in charge of provisioning and usage, so they're close to the money. They're responsible for it. People are going to ask them: how much is this going to cost? How much do we spend? How much can you save us next quarter? And so it's important that you know how to maximize your resources, which, uh, fundamentally affects the success of a project, which is pretty cool when you think about it. So you were joking earlier when you said that we could tie it to the bonuses, but maybe not so much.

Starting point is 01:28:58 No, I really you to save 9% operational costs. And if you do, you get 100% of your bonus. If not, it's going to start scaling down or whatever. And it sounds a little cold to say that, but I think the idea is that by associating it with a financial incentive, you're really sending a strong message to the whole organization that this is important. And people are going to really get upset if your organization or your developers aren't taking these goals seriously.

Starting point is 01:29:40 So people are going to come after you. And it just keeps the org on track. Well, I mean, it might sound cold, but at least it puts it into, uh, something that's within your control, you know, totally.

Starting point is 01:29:52 Like how, how many jobs have you ever had where like, uh, you know, anything about bonuses or whatever, or, you know, uh,

Starting point is 01:29:59 pay raises or anything like that were like completely outside of your control? It's like, well, it depends on how well sales did, but I'm not in sales. Yeah. But sales drives everything, and so, you know, clearly if you did a good job making the product, then they would have no trouble at all selling it, right? And so therefore, you know, that's why your bonus is tied to how well they do. Yep. Or you've got your bonuses tied to, uh, net profits,

Starting point is 01:30:27 but we're having a good year this year, so we're going to do a stock buyback, we're going to slow down on infrastructure, or do whatever we can in order to make sure that profit doesn't show up on the books. Yeah, totally. Uh, so, uh, you know, kind of a cool little balance here they mention is that systems get slower as load is added. It's never going to speed up as you add load to a system, right? Uh, and slowness can also be viewed as a loss of capacity. So your system starts blank slate, zero users, is full capacity. Every user, every, you know, system that you bring, all your traffic is reducing that capacity. So you're trading off how much money you spend to set that stuff up and have it available and the speed at which your system runs. So it's just kind of a cool way to think about systems basically being a balance between capacity and usage and what that means to your cost and how much things cost to run. So you're excited to be an SRE, right?

Starting point is 01:31:29 Like this is next in the career path? I think I would be happy. I tend to kind of like production support stuff too. If you want me to focus on a problem, just tell me that there's something wrong with it. I want to know. So I think I would want, I would, I would want to be a SRE. I actually liked the, um, I don't know the feeling of accomplishment that comes with automating a task, right? Like there's, there's almost, I don't want to say instant gratification

Starting point is 01:32:00 here, but fairly quick gratification if you were to automate things that people had to be involved in previously. And seeing that happen is rewarding, right? Like that's part of what I like about doing software in general. So yeah, I think that that would be kind of interesting, right? Like this self-healing, self-reliable system, that's pretty cool. I think it'd be cool to have goals like, uh, can I make builds, uh, fail 10% less often, or can I save, uh, you know, 10% time, or, you know, whatever, like increase uptime. That all sounds like cool stuff to like come up with and go after to me. Yeah. Yeah, I was just gonna say, like, you know, automating this stuff is also fun too, right? But yeah, so Alan said it better though. So, ditto. So, uh, yeah. So we'll have a lot of links in this episode. Uh, you know,

Starting point is 01:32:54 namely we'll have links to the book itself, which you can find for free, uh, at sre.google/books. Um, but we'll have that and other links in the resources we like section of the episode. And with that, we head into Alan's favorite portion of the show. It's the tip of the week. All right.

Starting point is 01:33:16 So this one's kind of apropos for this particular episode because we're talking about Google and, and their SRE stuff. Well, I had something, an interesting thing come up the other day that had to do with caching and evictions at certain timeouts, like seeing if something had come in before. And if it had, not doing it again, right?

Starting point is 01:33:38 Like trying to de-dupe stuff in sort of a smart way. And so in my mind, I was thinking, okay, well, if I had some sort of cache or some sort of hash table, and if I could evict those members that were put in 15 minutes ago, right? Like anything that's older than 15 minutes, kick it out because I don't care about it anymore. Then that would be interesting. It'd be a nice way to handle this particular caching type thing that I wanted to do. And so I got thinking about, I was like, man, there's got to be something out there. Right. And I, and I'm working in Java or more specifically, I'm working in Kotlin, but I can use some Java, um, to make this happen.
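The de-dupe idea described above (remember what came in, forget anything older than 15 minutes) can be sketched by hand in a few lines of Python. This is a hypothetical, hand-rolled illustration of the concept, not the API of any particular library; the class name, method name, and injectable clock are all assumptions.

```python
import time
from collections import OrderedDict
from typing import Callable, Hashable


class ExpiringSeenSet:
    """Remembers keys for `ttl_seconds`, then forgets them."""

    def __init__(self, ttl_seconds: float, clock: Callable[[], float] = time.monotonic):
        self._ttl = ttl_seconds
        self._clock = clock
        # key -> time it was first recorded; oldest entries sit at the front.
        self._seen: "OrderedDict[Hashable, float]" = OrderedDict()

    def _evict_expired(self, now: float) -> None:
        # Keys are never refreshed, so insertion order is also age order.
        while self._seen:
            _, inserted = next(iter(self._seen.items()))
            if now - inserted < self._ttl:
                break
            self._seen.popitem(last=False)

    def first_time(self, key: Hashable) -> bool:
        """Return True (and record the key) if it was not seen within the TTL."""
        now = self._clock()
        self._evict_expired(now)
        if key in self._seen:
            return False
        self._seen[key] = now
        return True


# A fake clock makes the 15-minute window easy to demonstrate.
fake_now = [0.0]
seen = ExpiringSeenSet(ttl_seconds=15 * 60, clock=lambda: fake_now[0])

results = [seen.first_time("order-42"), seen.first_time("order-42")]
fake_now[0] = 15 * 60 + 1  # just past the window
results.append(seen.first_time("order-42"))
print(results)
```

A sketch like this is not thread-safe and has no size bound, which is part of why reaching for a battle-tested library is usually the better call in production code.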

Starting point is 01:34:18 And Google, I've seen this library come up in a number of projects that I've looked at. I want to say maybe even Kafka stuff, Flink possibly. I don't even know. But Guava. So Google has a library called Guava. I've put a link directly to the wiki. I didn't put a link straight to the project because there's not a bunch of information about what it offers in here, but this is a whole set of utilities, collections, graph capabilities, like all kinds of stuff that can help you in your regular Java application. So the one that I'm talking about that would have solved the problem that I was just mentioning

Starting point is 01:35:01 is they have a caching thing. And in this cache, they have the ability to populate and evict from the cache on an automated type basis. And that's fantastic. They have immutable collections. So like if, you know, typically when you're looking in Java, you have, you know, your hash map or your map or whatever, and those are mutable. Well, they have immutable collections. They have these graph libraries, they have all kinds of things. So it's worth looking at this library because it solved a lot of problems that, that Google uses in their own distributed systems to solve a lot of issues that you may encounter. So I'd say, you know, I say it all the time with folks I work with and in general, maybe even on the podcast is I'm not opposed to writing something myself.

Starting point is 01:35:51 Right. And I'm not opposed to other people writing something themselves. But a lot of times, if there is something out there that is already battle tested and it does what you need it to do, it's probably well-tested and been proven, so maybe it's worth looking at that. So definitely check this out if you are in the Java world because it may help you out in a number of different ways. Well, I got a tip from my go-out-law today. Oh, bring it. Have you ever heard of, uh, git cherry-pick? Well, what it lets you do is, uh, pick a commit

Starting point is 01:36:30 uh, over from one branch into another, and this is great if you have like multiple current releases, or sometimes if you just goof something up or you need to bring something over from, uh, another branch into your branch. You just do git cherry-pick and pass the commit hash, which you can look up in any variety of ways, and you'll get it. Well, today I did something on accident. I accidentally pasted the branch name that I was going to cherry-pick from instead of the commit. I copied the wrong thing somehow.

Starting point is 01:36:59 And it worked in the way that I expected. It grabbed the commit that wasn't in the branch I was cherry-picking it to, uh, which is very surprising to me. And so I went and I looked at what all you could pass to cherry-pick, and surprisingly the docs, uh,

Starting point is 01:37:17 not great. Like it doesn't actually mention that you can, uh, pass a branch name. It's the very first example they give. What the current branch head points to. Oh, okay. Yeah, so that was going to be my question, JZ: if you'd had multiple commits in your other branch, would it have grabbed those multiple commits? Yeah, so, uh, I did a little bit of reading on that and it would have grabbed, uh,

Starting point is 01:37:46 the top commit, the latest. Yeah, the latest. So just the one. But you can cherry-pick multiple commits at one time, which I did not know. You can actually pass a range. You can say like this commit hash, dot dot, it's just two dots, to that one, and it's going to grab a range, which is something I didn't know about. Many times I've gone and cherry-picked a bunch in a row, and so that was kind of nice to know. There's actually a bunch of other different flags that are pretty cool too,

Starting point is 01:38:15 like, there's even one for no-commit, which is pretty interesting. It applies the changes from the cherry-pick, blah blah blah, without actually making the commit. So it gives you a chance to kind of slice it up and do it a little bit differently. But I just thought it was pretty cool to see that there were other things there. There's also a signoff flag in case you want to kind of

Starting point is 01:38:37 update a commit message. Yeah, the takeaway though is that if you're going to use that git cherry-pick and then the branch name pattern, you have to be careful because you have to know that you only want the one single commit that is the tip of that other branch. Because if you only specify the branch name and nothing else, that's all you're going to get is that single tip commit. And I'm fairly certain that this would not work if that commit, well, no, that branch couldn't be a merge. So, well, no, I guess what I'm trying to say is, if the tip of that branch was a merge commit, I still think you would have to apply the -m to specify the mainline. I believe I,

Starting point is 01:39:28 the point is, like, your mileage may vary, and you may run into trouble if that branch, if the tip of that branch is a merge commit. You can also do a dot dot in your branch name. It'll get all of them. But so I don't think you should do any of these though. These are all

Starting point is 01:39:45 terrible. It's all really confusing. Like, you really have to know this command and all its various flags in order to do this correctly. So this is a terrible tip and you should just not do this if you can avoid it. I wasn't trying to go there at all. Please don't take that away from it. No, no, no, I think, I seriously think that you should not be... it's such, like, a weird behavior that you're relying on, so I don't think anyone should actually use this. This is an anti-tip. Don't do this. The real tip is you should read the documentation for commands.
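For anyone who wants to try the behaviors described in this tip, here is a minimal sketch that builds a throwaway repository and runs git cherry-pick three ways: with a bare branch name, with a two-dot range, and with --no-commit. The branch names (feature, pick-tip, pick-range, pick-staged), the file names, and the commit helper are all made up for the demo; nothing here comes from the hosts' actual repository.

```shell
#!/bin/sh
set -eu

# Scratch repository in a temp directory (all names below are hypothetical).
repo=$(mktemp -d)
cd "$repo"
git init -q
git symbolic-ref HEAD refs/heads/main   # make the first branch "main" regardless of defaults
git config user.name "Demo"
git config user.email "demo@example.com"

# Helper: create one file and commit it, using the argument as file name and message.
commit() {
  echo "$1" > "$1.txt"
  git add "$1.txt"
  git commit -q -m "$1"
}

commit base
git checkout -q -b feature
commit one
commit two
commit three

# 1) Bare branch name: only the tip commit of feature ("three") is applied.
git checkout -q -b pick-tip main
git cherry-pick feature >/dev/null
git log --format=%s pick-tip

# 2) Two-dot range: every commit reachable from feature but not from main.
git checkout -q -b pick-range main
git cherry-pick main..feature >/dev/null
git log --format=%s pick-range

# 3) --no-commit: the change is applied and staged, but no commit is created.
git checkout -q -b pick-staged main
git cherry-pick --no-commit feature
git status --short
```

One detail worth knowing about ranges: in A..B the commit A itself is not included, so picking an inclusive run of commits by hash is usually written as A^..B.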

Starting point is 01:40:11 Even the ones that you've run a million times before. And just see. Because sometimes there's stuff in there that might help you out. That you've kind of overlooked. And there's good stuff in there. Not in this case though. These are all terrible. But sometimes there are.
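On the merge-commit caveat raised a moment earlier, the sketch below sets up a branch whose tip is a merge commit and shows that a plain cherry-pick of that branch is refused, while passing -m 1 (treat the first parent as the mainline) goes through. As before, the repository, the branch names (topic, release, target), and the commit helper are hypothetical and exist only for this demo.

```shell
#!/bin/sh
set -eu

# Scratch repository in a temp directory (all names below are hypothetical).
repo=$(mktemp -d)
cd "$repo"
git init -q
git symbolic-ref HEAD refs/heads/main
git config user.name "Demo"
git config user.email "demo@example.com"

# Helper: create one file and commit it, using the argument as file name and message.
commit() {
  echo "$1" > "$1.txt"
  git add "$1.txt"
  git commit -q -m "$1"
}

commit base
git checkout -q -b topic
commit topic-change
git checkout -q -b release main
commit release-change
git merge -q --no-ff -m "merge topic" topic   # tip of release is now a merge commit

git checkout -q -b target main

# Without -m, git cannot tell which parent the change should be computed against.
if git cherry-pick release 2>/dev/null; then
  plain_pick=succeeded
else
  plain_pick=refused
fi
echo "plain cherry-pick of a merge tip: $plain_pick"

# -m 1 picks the first parent (the release side) as the mainline, so what gets
# replayed onto target is the change the merge brought in from topic.
git cherry-pick -m 1 release >/dev/null
git log -1 --format=%s
ls
```

Which parent number is the right mainline depends on how the merge was made, so this is something to check against the real history rather than assume.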

Starting point is 01:40:24 You said RTFM is your tip. Yeah. Okay. Just sometimes. They're all swell. You were shot. Yeah. Okay.

Starting point is 01:40:32 Okay. RTFM. I'll have to make a note of that. What does that mean? Yeah. I can't say it. Yeah. I can't.

Starting point is 01:40:40 Okay. Well, so for my tip of the week, then I had this one because like the three of us, we do a lot of stuff just using the keyboard, right? And so like I know like iTerm, for example, we're users of iTerm, right? And right? Yes. Wait, whoa, right? I love it. Yeah, definitely. So, so like, you know, you can create multiple tabs in it. Like I have this habit of like, I'll create tabs for different things. And, and it's in like, I have the habit like on a Mac of

Starting point is 01:41:16 going like, uh, well, I guess iTerm would only be on the Mac, but like, uh, command, you know, one or two or whatever the number of the tab is that I want to navigate to. And I was going back to Chrome where I have like 18 billion tabs open in Chrome. And also Chrome introduced a feature a few releases back where you can group your tabs. I don't know if you've been doing that. Maybe that should have been my tip of the week. But yeah, you can like... Was it? Okay.

Starting point is 01:41:51 Because you could, like, right click on a tab and create a group, and you can give it a number, so you can have, like, different things that you're working on in different groups of tabs together that are all color coded together and everything. And it just makes, like, if you are like me and you have to context switch a bunch, you know, sometimes it's easier to have just a big group of tabs already together that, you know, or whatever that context switching is. But I also tend to have, like, some tabs pinned. And when you pin a tab in Chrome, they automatically go to the front, right?

Starting point is 01:42:25 So like for my Gmail, right? I'll typically have Gmail as one of my first tabs. I'll have Slack as another tab. I'll have my calendar as another tab, things like that, right? And with all these 18 billion tabs open, sometimes it's handy. Like, I might be on, you know, tab 123, and yet I want to quickly go back and check my email because I'll see, like, the notification, or I'll see the Slack. I'll see that one of those two things has a thing. And I'm like, you know, you can just, like, control your way through, like, Control, Function, I think, uh, and arrow keys on the Mac keyboard to

Starting point is 01:43:06 like, scroll one direction or the other through all your 18 billion tabs that are open until you get to that one. But I found out you could also just, on the Mac, press Control and the number and, boom, navigate to that tab. Now, that's great for your first 10 or nine tabs at least. But in my case, that was first eight. Yeah. Sorry. That was good enough because, uh, you know, like I said, like, Gmail and Slack were like the two big ones, and those are pinned. So they're always in that one and two position. Right? So at any rate, I'm going a long way around saying, like, Control plus a number, and then you can navigate to that tab. But so I'll also have a link there to just the Chrome shortcuts in general

Starting point is 01:43:55 that apply to both Windows, Mac, and Linux. But hold up, hold up. You said you have Slack open in a tab. Why do you not have the app installed, sir? Because I'm in Chrome more than I am anything else, so why would I? Because then you can Command-Tab to get to your Slack. I mean, no, but then actually that's more annoying. So, uh, I will advocate for this. We will fight. So, um, no, because now, like, especially if you're working on your laptop, and, like, when I work on my laptop, I tend to, like, take things into full screen mode on the Mac.

Starting point is 01:44:41 So like if I have Chrome in full screen mode, that's all I see. And, um, but you can see all the tabs, you know, the headers for the tabs. And if there was a notification in Slack or if I had emails that came in, like I can see that notification while I'm in some other, you know, like while I'm reviewing a pull request or I'm, you know, reading a build log or whatever, you know, I can see that and be like, oh, what is that? Control one or control two. And I can automatically go and check that thing, you know, as time allows for it. Right. Whereas if it was in another window, well, then until I happened to go click to that window, I might not even notice it. Yeah, that might not even notice it.

Starting point is 01:45:25 Yeah, that's full screen on Mac. I almost never do full screen on Mac for that reason. It annoys me. So that would be the difference. Even when you're just working solely on the laptop, you don't do full screen on the Mac? Nah, I hate full screen on Mac. Okay, we found our survey. Okay, so forget that other survey I asked.

Starting point is 01:45:42 What? That is insanity. I straight up hate it for that very reason that it hides so many things from me. And there's one other reason. There's one other reason. And this one's fully fair is because I use the Kinesis Advantage and trying to freaking contort your fingers to do something to switch between screens is like I mentally have to jump through hoops to do that. But you wouldn't be using the Kinesis advantage when you're on your laptop. And I'm saying when you're on the laptop,

Starting point is 01:46:12 you said you don't have any other peripherals connected to it or else you're not on the laptop. So, so, uh, I mean, there's two things. One is you reclaim some real estate,

Starting point is 01:46:25 which is valuable if all the monitor you have is the laptop. But number two reason is actually exactly opposed to your number one reason. I want, I go full screen because I want the focus of whatever that app is. I don't want the distractions of the other thing. Yeah. Right? I can't have that. That's the downside of having,

Starting point is 01:46:51 of seeing that there is like an email or a Slack notification when I'm in Chrome full screen because then I'm like, oh, well, all right. Control one. What was it? Okay.

Starting point is 01:47:00 Nothing big. Go back to my other window. And I can't, like, Control 18 billion, so I gotta, like, fine, I'll use the mouse to click on that tab. It's like dialing an old school phone number for you: Control one, two, four, five. Right, right, I need a rotary dial to get through all my Chrome tabs. That's right. All right, well, uh, we'll argue about why Alan is wrong later. But in the meantime, you can subscribe to us on iTunes, Spotify, Stitcher, and more using your favorite podcast app. And if you haven't already left us a review, you can find some helpful links at www.codingblocks.net slash review, where you can also let Alan know why he's wrong.

Starting point is 01:47:47 I'm not wrong. Nobody likes full screen on Mac. So while you're up there at codingblocks.net, check out our show notes, examples, discussions, and more. And leave a comment on this episode at codingblocks.net slash episode 181. And send your feedback, questions, and rants

Starting point is 01:48:03 to the Slack channel at codingblocks.net slash slack. And hey, make sure to follow us on the bird site at CodingBlocks, or head over to codingblocks.net and you'll find all our social links at the top of the page.


It's finally time to learn what Site Reliability Engineering is all about, while Jer can't speak nor type, Merkle got one (!!!), and Mr. Wunderwood is wrong....
