Coding Blocks - Site Reliability Engineering – Embracing Risk
Episode Date: April 11, 2022
We learn how to embrace risk as we continue our learning about Site Reliability Engineering while Johnny Underwood talked too much, Joe shares a (scary) journey through his mind, and Michael, Reader of Names, ends the show on a dark note.
Transcript
You're listening to Coding Blocks, episode 182.
Subscribe to us on iTunes, Spotify, Stitcher, wherever you like to find your podcasts.
I certainly hope we're there by now.
And hey, if you can, leave us a review.
Yep.
Visit us at CodingBlocks.net where you can find our show notes, examples, discussions,
and more.
And send your feedback, questions, and rants to comments at CodingBlocks.net.
And see what tweets we got for you. We got lots of tweets over at @CodingBlocks, and if you go to codingblocks.net you can find all our other dillies at the top of the page.
With that, I'm Joe Zack.
I'm Michael Outlaw.
And I am Alan Underwood.
This episode is sponsored by Retool. Stop wrestling with UI libraries, hacking together data sources, and figuring out access controls, and instead start shipping apps that move your business forward.
Lost you guys.
Okay.
All right.
So I guess this episode,
we are going to continue on with the site reliability engineering book that we
started with from Google.
And this one is going to be all about embracing risk.
But before we do that, we want to talk about our news and some of the reviews that we got. So, Outlaw, Reader of Names.
Is that my new job title?
That's right. That's like the Lord of the Rings style.
I wasn't prepared for that. But yeah, I would be awful at that. That would be like the worst job title ever.
So fortunately, this one works out to where it's not so bad.
So thank you to Richard Hopkins and JR for their new reviews.
Both very good.
Thank you very much. All right. So before I get into this next one I have in there, I do have to mention our Slack channel. I love our... it's not a Slack channel, it's our Slack group, right? With lots of channels. Well, there's one channel in there... there's a few of us that are a part of the whiskey channel. And I have to mention this one that my wife got me. I think MicroG, you will like this. Devin, you'll probably like it. Sean. Garrison Brothers. My wife bought me one called Honeydew. And we all had a conversation in the past where people like Four Roses, and I was like, you've got to take a sugar cube, melt it up, and put it in there, and it's amazing. That's what this tastes like. It is so good. So I need to post a picture up in there, but again, my wife got it for me, so I thought I would bring it up. It's a good reason to go up to Slack anyways and get involved in just an amazing group of people up there. So if you haven't already joined the group, go to codingblocks.net/slack, and that's it for that. Now, hey, before the next thing, while we're
talking about Slack, go ahead and mention we have an episode discussion channel, and an SRE who worked at Google during some of the period that we talked about was in the channel, sharing a lot of really great notes and tips and corrections, and just really great perspective on everything. And we had just really great discussion in general after the last episode. So you should pop in there and see what that SRE has to say.
That's awesome. All right, now for a bit of sad news that we got.
Jim Humelsine has been just amazing and has shared with us over the past probably three years, I want to say.
Yeah, maybe. Three years only gets you to the beginning of COVID. Good God, it's crazy.
So he shared with us a long time ago that if you joined the ACM, then you got the entire O'Reilly library of books and audiobooks and all kinds of stuff that came along with it for 99 bucks a year. Sadly, both Outlaw and I received an email this past week saying that basically O'Reilly didn't want to renew their partnership with the ACM.
And so that feature is going away, sadly.
So if you signed up for the ACM like we did so that you could partake of all of it, you're going to lose a big chunk of it at the end of June of this year.
So very sad.
Don't know what to do about it.
That's the thing. I get that they couldn't get whatever contracts set up between them and O'Reilly. I just hated that, like, midway through my membership year, I'm losing benefits. I would have felt a little bit differently about it if at least for the current year you got all that benefit, and then they were saying, hey, next year you won't get that if you decide to renew, right? But midway through, after you've already paid whatever you paid for the subscription, they're like, hey, yeah, it goes away. It stinks.
I mean, there's still a lot of good content out there. It's hard to say whether or not I'll keep mine. I don't know. Like the deal, right? Yeah, maybe not a big one, but I don't know. It does make me sad. But at any rate, just be aware of that. So if you have listened to a bunch of the past episodes where we said, hey, go get an ACM membership, and this was the reason why you were going to do it, you know, you might want to look elsewhere. So moving on from there, I guess let's go ahead and dive into this. Yeah.
Let's talk about embracing risk as it relates to site reliability.
So, man, I actually love the opening of this thing, right?
Like they were talking about, you know, we all use Google.
Everybody probably on the planet at this point,
if you're in a country that has Google, probably uses Google, right?
And you probably assume that they aim for 100% uptime reliability on everything, right?
No, not at all.
Because this is what's really interesting. You'd think, well, increasing reliability has always got to be better for the service, right? Like it has to be. No, it's super expensive, super cost prohibitive to add one more nine of reliability, right? I think Outlaw explained it in the past. When we're talking about nines of reliability, you usually start from 99% uptime, and then every nine after the decimal is another nine of reliability. So one nine after the decimal is 99.9; two nines and you're at 99.99, et cetera. Well, every one of those decimal points that you add increases the cost. And they say it not only increases the cost, sometimes it can go up more than a hundred times the cost just for getting one more nine of reliability.
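To put rough numbers to that point, here's a small illustrative sketch (not from the episode; the targets are just example values) showing how each additional nine cuts the allowed failure fraction by a factor of ten, which is part of why every extra nine gets so much more expensive to engineer:

```python
# Illustrative only: each additional nine divides the allowed failure
# fraction by 10, so the engineering bar rises sharply with every nine.
targets = [0.99, 0.999, 0.9999, 0.99999]

for availability in targets:
    allowed_failure_fraction = 1 - availability
    print(f"{availability:.3%} uptime -> {allowed_failure_fraction:.5%} may fail")
```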
Not only that, but the hoops that you might have to go through to get that additional bit of reliability
might actually impede the user experience.
So, you know, you have that.
And then even on the upside, even if it doesn't impede the user experience,
let's assume it doesn't,
like, are the users even going to notice it?
Right?
Yeah, I think, I don't know if you have this in the notes anywhere,
but they actually said that internet service in general has between a 0.1 and 1 percent failure rate.
So just from that alone, you may get blamed for it, you may not, but that's the reliability the average internet user already lives with.
Yeah. So one of the things that they were basically saying is when you are focused solely on increasing reliability, that means that you're not able to iterate on the features that you want to
add to the product, right? Like, customers might want... take Gmail. If you were on that way back in the day, there's way more features in Gmail now than there were when it first started. And part of that is because they spend time developing features and not just trying to keep the thing up 100% of the time.
Oh, yeah. You know what? We didn't say at the start.
We are skipping Chapter 2. We're not talking about Chapter 2.
We've moved on to Chapter 3.
A way to talk about what we weren't going to talk about.
What we're not going to talk about?
Oh. Yeah, apparently not.
Nothing we're going to talk about. It's just that Chapter 2
is a lot about Google services.
It kind of sets some stages for some stuff.
And it talked about how they did things internally, which was interesting and good.
And you should read it, but it's just not really podcast material.
It was a super interesting chapter.
It's just not necessarily a lot of takeaway lessons that you can learn from, because it's more about what they do and how they do it. The biggest takeaway that I had from it was from a terminology perspective, as it relates to the word server, for example. Like, how many times have you walked into... well, I mean, I'm about to say server room, right? Or you see some computers on racks and people are like, oh, that server over there does blah, blah, blah. In Google terminology, they would never call it that. They call those machines. And a server is the software that might serve up something like Gmail or a webpage or a database query or whatever.
Yep. So server was, yeah, it was software.
All right. So now we're done not talking about Chapter 2.
Totally not going to talk about it.
Not going to talk about it.
So, Jay-Z hit on the fact that the reliability of the systems that are actually using these other systems is usually lower. And so their whole point with that is, because those systems aren't 100% reliable, the chances of you even noticing something on Google services not being reliable are really low, because you might chalk it up to your cell phone service or your internet connection being slow or whatever. So it's not even that important, because nothing's perfect is what it boils down to.
And so what they say is, really, what SREs are trying to do is balance the risk of unavailability with innovation and features, right? You don't want to stop innovating, you don't want to stop releasing features, but you also need to make sure the stuff runs. And so it's a fine line, right? And that's really what they're trying to do in this. When they talk about embracing risk, it's just accepting that it's a reality. It's a fact of life. It's going to happen. And you need to accept that and learn how to manage it, which is a great segue for the next section, called Managing Risk.
Yep.
And they make the most obvious statement here, right?
When you have an unstable system, it diminishes user confidence, right?
Like that's, as a software developer, you know that, right?
Like if you have to keep going back to your customer and being like, yeah, I know this isn't working.
We'll get this fixed.
Next day, something else isn't working.
We'll get it fixed.
Then eventually they're going to be like, you know, do we just need to use another piece of software?
Because like this is this is getting ridiculous.
And you don't want to be in that in that mode.
It doesn't even have to be your thing that you're offering to your customers. It could be even internal things. How many times have you had something as simple as a set of unit tests where you're like, oh, those always return an error, so we just ignore those? And then over time you'll see people just ignore all the unit tests and won't even run them, because, oh, I always got some errors on it. And it's like, yeah, that's why it's important to fix these things as they go. Otherwise people are going to lose confidence in it and not rely on it.
Yep.
It doesn't take a lot to lose confidence, you know. Something could fail a small percentage of the time and it feels like it happens all the time, especially if it fails when you need it.
Yeah.
That's a super important point, right? Like it doesn't have to be all the time. Yeah, it's true.
So one of the things that they point out here is the cost does not scale with improvements to
reliability. So basically what they mean is as you're increasing the reliability of your system,
the costs go way up. Like, it's not like, hey, I made this a little bit more reliable.
It just costs a little bit more money.
No, I made it a little bit more reliable, and it cost me 10 times the amount of money.
Go ahead.
Yeah, well, I mean, we'll get into it later, but I mean, they actually get into the math
of like, yeah, I mean, we just saw stuff we kind of hinted at, too, in the last episode.
You know, we're like quantifying that and measuring it.
You know, like, hey, is it even worth your time to deal with that?
Right.
And so they have what they call their two dimensions of costs in here.
There's the cost of redundancy and compute resources.
Right.
So the computer, the CPU that you're using, and then the opportunity cost, which we already talked about a little bit, which is you're basically trading features
that you could be developing for reliability, right?
So you only got so many developers,
you only got so much money,
you got to choose where you're going to put that.
Even if you try to say like,
oh, we'll just hire somebody else to do that thing,
that's still money that you're paying somebody else
to do that instead of paying somebody
to develop new features.
Yep. Yeah, I was really happy with how they phrased things and thought about the structure, because they really aligned the SREs with the business interests and the business costs. And I think that's a really good thing, especially for such an infrastructure and kind of DevOps-oriented role. It's kind of odd to pair those together, things you don't usually think about in business, you know, goals and expenses and profit, aligned with infrastructure. But here we are.
Yeah, I think this chapter specifically got into knowing what the value was going to be, right? Was it this chapter? Yeah. Okay, so they actually hit on this, right? What he was talking about is that the SRE's goals sort of align with the business's goals. So if the business goal is 99.9% uptime, it's not like they go for another nine of reliability; they've got to hit that 99.9, we'll say, right? They try to treat that business goal as their minimum and their maximum, because that's how they can maximize their effectiveness and what they're going to do. Right. If they spend too much time trying to go past that 99.9%, then they're wasting time, right? They just need to meet the business goals. Another word for that, you can call it gold plating. That's true. Yeah.
If you're going too far. Yeah, absolutely. All right. So the next thing that we have is service risk, measuring service risk. Is there a chapter on this? I haven't read ahead yet, but I kept thinking, well, what if you work on internal tools? This whole section is basically about identifying an objective metric for a property of the system to optimize. Google, for example, they talk about how they actually focus on measuring availability as a success rate: the total number of successful requests over the total number of requests. Because something like counting downtime in terms of minutes and hours doesn't really make sense when you're this global service that, you know, never really goes down totally, right?
So this is a better metric for them. And I just kept thinking, they were tying these services to dollars earned and dollars cost, and coming up with ways to really measure that. I just keep thinking, if you work on internal tools or something, it's really hard to map those to dollars sometimes. Or if you have a business where, you know, if your site goes down, you don't necessarily lose any money because your customers are all, let's say, subscription-based or something.
Whereas Amazon.com, if Amazon.com is not available or you can't check out or something, it's very easy to say, well, this is how much we made yesterday at this time.
So that's how much we lost.
So they did talk about that a little bit in this chapter because they went through exactly what you're talking about.
When you have a customer facing type thing or something that is revenue tied, then it's
real easy to measure what you're doing, right?
Like X amount of uptime equals X amount of dollars, right?
There's that type of thing.
But then they said that they do have internal systems that maybe a bunch of different teams
rely on, but that you can't tie directly to a target revenue or something. But they do have SLOs for that as well, because it does impact so many other ones. So there are owners of that, and then they have to negotiate what their SLOs will be within the organization, and they have to measure it, right? Because it does impact everybody.
It's hard to come up with numbers. If we say, well, it's all internal customers, and so if we go down we just annoy people, then, well, okay, let's just take it down all the time if the cost is zero, right? If the benefit is zero dollars, why do you even have this at all? And so obviously that's not the number, but it's hard to find a number. So maybe you take people's salaries and kind of do some math, like how much time does it save, you know? So I don't know what the right answer is.
It's just tough.
I mean, I don't recall that to the level of what you're getting at. I don't think I recall seeing that so far in what I've read, but they do call out that that's where the business owner for whatever that thing is comes in; part of that person's responsibility would be to know what that value is. And they didn't distinguish between an external versus an internal type of service. Yeah. Because even an internal service has some external value. Sorry to interrupt you, Alan, but let's say, for example, you were running the help desk for Google employees, just an internal help desk for Google, right? Where they can say, hey, there's something wrong with my machine, I need service or whatever, blah blah blah. That developer, whoever has that downtime because of that need, that's costing the company money if they're not productive.
So just because it's internal doesn't mean it doesn't have a value.
Yeah, it's just harder to calculate.
Let's say you're working on an HR system.
An HR system goes down, and because of that,
people can't change their tax withholdings, marital status, um, you know, new employees, hiring stuff, stuff goes
down, but the business as a whole keeps going. Customers don't know at all. So, you know,
presumably you need less nines on an HR system than you do for like a 24 seven storefront or
something. Um, but trying to come up with a dollar amount for that and trying to figure out how many
nines you should have, like how long you can go without it becoming a big problem.
It's just kind of hard to,
to come up with a number.
I think.
Yeah, I would agree. Internal money is always harder to measure, but there's got to be some way to reflect the business value, right? It's almost like, hey, how many nines can you give me for the HR system? It's like, well, how much money do you have to spend? How much time do you have for me to spend on this? That's how many nines you get.
Right.
So, I mean, going to all of this, this actually hits on the next point, which is what you have to do when you're trying to measure the service risk is you have to identify an objective metric to use.
Because only by identifying that metric, can you start to
measure and optimize for that thing? Right? So like you guys, I know this has always driven me
crazy. Like somebody will come to you and be like, Hey, the system's broken. And it's like,
well, what do you mean? Like, what do you mean the system's broken? What's broken?
Can you not log in? Can you not click on a list? Can you like, please explain because there's a vast difference between,
between not knowing exactly what you're looking for and some vague thing.
Right.
You know what you just reminded me of?
We've talked about this before,
but you remember that website,
the website is down.
Yeah.
I don't remember that one.
Really?
Come on,
Alan,
man,
it's been a long time,
but that's what it reminds me of though
it was like when the guy calls in and the website's down and he has the guy the sysadmin reboot it
but it turns out like there wasn't anything wrong with the website it was you know his computer was
the issue that's awesome yeah i mean that's exactly what it is. Maybe it's wrong. It's working for me. I keep typing in the wrong password and I can't log in.
Right?
Yeah.
So by having this metric, this one metric that you're going to focus on, you can measure the improvements and any degradations that happen over time, right?
That's important because just because you're measuring doesn't mean things are all going to be glorious, right? Yeah, that's a good point too. The HR system, we've said it's
hard to tie back to a real cost. Well, just start with anything. And then if you find out it's inaccurate or a problem... it's not worth spending days or weeks coming up with a cost center there. And it's also not worth not doing it because you can't get it perfect.
You can't say, like, let's just not worry about the HR system because it's too hard
to come up with a number.
Just start somewhere and adapt.
Yeah, so to that point, it may not be a dollar amount, right?
It might just be, hey, what is the uptime of the page where somebody can change their
surname, right?
Like that might be the metric that they use to report on. And then that way they can find out if they need to do anything to fix
things later. So one of the things that is really interesting that they called out here is Google
focuses on unplanned downtime. So it's really important to know the difference, right? Like
if you've, if you've worked for a software company for long enough, there's going to be planned maintenance windows, right? Where, Hey, we need
to upgrade the OS on these servers. We plan on having them down between one and 2 AM on these
days, right? That's planned. That's okay. You did it because you knew you're going to do it.
What's bad is when the system goes down in the middle of the day, because just something went sideways and you don't know what it is. So that's unplanned downtime,
right? And Jay-Z already hit on the fact that Google doesn't use time as a thing because
you mentioned it briefly. They're distributed, right? They have servers all around the world. So while Google search might be up here in Atlanta, it may be down there in Florida where Jay-Z is. So it's hard to measure uptime when you have sporadic services all over the world. So instead of focusing on that, they do what he said earlier, which is the number of requests.
Have you read the article, "Your nines are not my nines"?
No.
Rachel by the Bay. Oh man. She has a wonderful blog. It's amazing to me how often this blog comes up on Hacker News. This is an article from 2019, basically talking about how just because the cloud service has a green check mark doesn't mean that your business is operating well, because different companies count things different ways. Just because they're what they're calling functional doesn't mean that something's not broken for you, because that number usually reflects the service as a whole. So just because S3 isn't down doesn't mean that the rack that your stuff is on isn't down.
Right. That's a really good
point. Hey, I'm assuming you guys have looked at the notes here. If you haven't, do either one of you know, for 99.99% uptime, how much downtime you're allowed to have in a year, without looking?
Well, I mean, I have it right in front of my face.
So you looked. All right. It was a shocking number to me. I never really thought about it. It's 52.56 minutes a year. So four nines of reliability means you're allowed to be down less than an hour in an entire year. That's really good uptime, man. Yeah.
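For reference, that 52.56 figure falls straight out of the arithmetic; here's a quick sketch (illustrative only, not from the show notes) of allowed downtime per year for a few availability targets:

```python
# Illustrative only: allowed downtime per year for a few availability
# targets. 99.99% (four nines) works out to roughly 52.56 minutes.
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes

for availability in (0.99, 0.999, 0.9999, 0.99999):
    allowed_minutes = MINUTES_PER_YEAR * (1 - availability)
    print(f"{availability:.3%} uptime -> {allowed_minutes:,.2f} minutes of downtime per year")
```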
I was curious, so I went back and looked for that video, The Website is Down. We talked about it in episode 122, Designing Data-Intensive Applications; coincidentally, the subtopic was maintainability.
Oh, nice. Hey, in fairness, that was about two years ago. How am I supposed to remember that?
It was December of '19.
Yeah, you know.
Yeah, it's been a minute.
Remember how good life was back then?
December of 19?
We didn't know anything about the pandemic or anything.
We would go out all the time to restaurants.
No way we thought about it.
Life was grand.
I was heading to London in two months.
You wanted to travel? You're like, let's get on a plane. It was a big deal, man. I miss it. You would stand in line right up against people, you know, while you're waiting to get on the roller coaster.
Yeah, you're bumping into each other in line for the roller coaster and everything, and that's great.
It's great. Yeah, I love having people just breathing on you. It was just such a nice feeling.
Yeah. Anywhere, you know, drugstore or whatever. We didn't know how good we had it.
Well, you know, you've noticed that Joe Zack no longer invites people to kick his shins. He doesn't like people to touch him anymore.
Yeah, that's right. Yeah. It's done.
Oh, I'm going to... hey, you know, I just got an Elastic meetup.
Yeah, I just... I have some... I'm owed some kicks.
Wait, wait.
You said something about Orlando?
What's going on?
Orlando has an Elastic meetup now.
Oh, really?
Are you actually meeting?
Maybe I'll be there.
I don't know.
There's nothing scheduled yet.
It just started up.
I'm excited about it.
It's fictional.
It's a fictional meetup is what you're saying.
We'll see.
We'll see.
I guess one of the dev rels just moved here.
Okay.
So it may just end up being virtual or who knows how it's going to go.
It's cool.
I joined.
All right.
Well, let us know.
So back to availability.
So, less than 52, let's just call it 52 and a half minutes, if you're going to measure it based on time. So if your four nines of availability is based on time, you can afford to be down roughly 52 and a half minutes in an entire year before it becomes a problem and you're no longer meeting your objective.
Yeah, that's a crazy number to me.
So, about 13 minutes a quarter.
Right.
Another way to look at that. It's, again, insane. Well, so Jay-Z hit on it. Google does it by request, right? So if they have a service like an S3 service, which Google doesn't have, but like GCS, right? If you hit their GCS API, they may be aiming for 99.99% reliability on that. And here's where things get interesting when you're doing it at a rate level. If they have 2.5 million requests a day, let's say, for GCS, then 250 of those can fail.
When you look at it like that, that's a little bit more palatable, right? That's sort of easier to swallow. And for them, it makes a lot more sense because, again, they have this service hosted across the world in multiple data centers and all that kind of stuff. So it's easier for them to measure that type of reliability versus the uptime thing.
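A quick sketch of that request-based view (the request count and failure count here are hypothetical, echoing the 2.5 million example above):

```python
# Hypothetical numbers: availability measured as a success rate over
# requests rather than as uptime minutes.
def availability(successful: int, total: int) -> float:
    return successful / total

requests_per_day = 2_500_000
target = 0.9999  # four nines

# Error budget expressed in requests: how many can fail and still hit the target.
allowed_failures = requests_per_day * (1 - target)
print(f"allowed failed requests per day: {allowed_failures:.0f}")  # ~250

# Measured availability for a hypothetical day with 240 failed requests.
print(f"measured availability: {availability(requests_per_day - 240, requests_per_day):.6f}")
```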
That's still not a lot of failures, though.
It's not. It's really not. And that's only four nines of availability so far, so, you know.
Yeah. I mean, I don't even know that I've seen anybody say that they go beyond three nines on most things. Oh no, there's an Amazon storage that goes beyond that, isn't there? Or is it one of the... maybe it's a Backblaze or somebody like that? Or, no, Wasabi. I think it's Wasabi. They have a... oh gosh, what's their uptime? I'm pretty sure there was one that's like six nines or something crazy.
Sound effects are dinging.
And well, oh, here it is.
Durability is what they go after.
Eleven nines.
That's crazy.
Durability.
Now, that's durability, not necessarily reliability.
So, you know, you're talking about storage.
So you're just saying, hey, we guarantee your stuff's going to be on that disk.
But yeah, I mean, depending on what your metric is, it's really insane to think about. Actually, I was going to see what the next nine of reliability, 99.999%, would be for 2.5 million requests. It looks like you can have 25 failures. That's insane to me. 25, if you're going for that many nines of reliability. So, I mean, again, my God, to get to that point, the amount of engineering effort and redundancy and failover and everything else you have to put into play for that is hyper expensive. You know what the nine after that would be, right? I mean, if anybody's following along, it's 2.5, right? You're just moving the decimal point.
Yeah, you're just moving it one more.
Yeah, man, it's it's insane.
So this is what's interesting. We're talking about this right here with these crazy numbers, but the reality is not all services should be judged the same. And they give a perfect example. So in their thing, I don't want to give away too much of the book because we want people to read it, but there's a big difference. If you're not already reading this book, you need to go to sre.google and you can get the book for free. I think it was slash books, right? I don't know.
Yeah, I think it was something like that: sre.google, and then you have slash books. Yeah. Although that's not all they have up there. Google has other books. This is just the SRE books.
Yeah.
So one of the things... yeah, I guess we could give it away and we could talk about every sentence in it, but we're not going to. Speaking of it, we are giving away a physical copy of the book if you drop a comment here.
Yeah. On, what is it, codingblocks.net slash episode 182.
That's right. All right, so what I was saying is, there's a big difference between having a sign-up form for a new customer, right? Like, you probably want that thing to work; you don't want to push away new customers. However, if we take Gmail for instance, Gmail all the time behind the scenes is checking for new messages, right?
Like if you've ever been sitting in your Gmail inbox, you'll see it pop up a new message at the top.
It's because it's, you know, polling for info.
If some of those back-end service calls fail, that's not quite as big of a deal as that new user that's trying to sign up
for the site.
Right.
Or for some subservice.
Really? Did that throw you off? He's got a pick stuck on the middle of his forehead. I didn't mean a guitar pick. Yeah, I didn't mean to completely derail the conversation.
No, Jay-Z. He's got on his Saint Patty's Day version.
So, yeah, all right. See, I actually look at you guys while we're talking.
It's funny how quickly that derailed. I totally didn't intend for it. The point being, the uptime of this conversation was really important, and I totally blew out our SLO on that. You know, so, yeah. Jay-Z's got an alien pick now. That's amazing.
All right, so the other thing that they say is Google checks this success rate for non-customer-facing systems as well. This goes back to what Jay-Z was saying earlier about, hey, what about systems that are only internal, right, that don't matter to the customer? Google sets quarterly availability targets and may track these weekly or daily.
I'm so sorry. We are jerks.
Yeah, you guys are. You're just sticking stuff to our heads now at this point. This is what happens when it gets super late at night and you've already had long days, right? So, at any rate, by tracking these things daily and weekly, that means you could
address these issues quicker, right?
Like you don't have to wait a whole quarter to be like, yeah, things kind of went south this time.
We should focus on it next go around, right?
But there's an implicit requirement there that requires that you have strong observability.
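As a rough sketch of what that daily tracking against a quarterly target might look like (the target and per-day counts below are made up for illustration):

```python
# Illustrative only: roll up daily success/total request counts and compare
# quarter-to-date availability against the quarterly target, so a bad trend
# surfaces early instead of at the end of the quarter.
QUARTER_TARGET = 0.999

daily_counts = [  # (successful, total) requests per day, made-up numbers
    (998_900, 1_000_000),
    (999_950, 1_000_000),
    (999_400, 1_000_000),
]

successes = sum(s for s, _ in daily_counts)
total = sum(t for _, t in daily_counts)
quarter_to_date = successes / total

if quarter_to_date < QUARTER_TARGET:
    print(f"Quarter-to-date availability {quarter_to_date:.5f} is below target {QUARTER_TARGET}")
else:
    print(f"On track: {quarter_to_date:.5f} vs target {QUARTER_TARGET}")
```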
Yeah, there's got to be some metrics.
And that's always the problem with it, right? The problems that you're looking for, that you're watching for, they tend to not happen, because you're watching for them and you take care of them, right? It's always those problems where you're like, I didn't even know that was a problem, we weren't looking for that; that's the thing that always bites you. Right.
I mean, you know what's funny is, when we covered those DevOps books that we did back in the
day that tie really nicely into this, that's probably the one takeaway that I've truly
gotten on board with is having metrics for things. Measurability within your system is so key to actually knowing what the heck's
going on that I want to put metrics on everything, which isn't quite possible, but it really does help you overall when you're developing software.
It comes in handy for a lot of debugging too. Just times when you're like, well, let me go see, maybe I have something. Or you don't even have a question; you just go look at a dashboard and you're like, huh, that's weird, this number is way higher than it was yesterday. And it just goes on from there. It's great.
I mean, it's funny.
Just an example, one of the things that I got shot at with this week was, hey, something failed over here. And because we didn't have metrics around certain things, I had to go digging through logs, right? And I had to start aligning timestamps with when things happened in the system over here and over here. If we'd had metrics to say, hey, what was the latency between when this thing hit this particular service and when it hit this service, I could have looked at it and said, oh, this thing went backwards in time; that's not even possible. There's a problem in the pipeline here, right?
And it's that kind of stuff that once you start getting it in place,
you really miss it when it's not there.
So, all right.
So the next thing, risk tolerance services.
Somebody else want to take this?
I'm tired of talking.
My throat hurts.
The idea here is, you know, we talked about that. SRE should work directly with the business to define goals that can be engineered.
And sometimes it can be difficult because measuring consumer services is clearly definable.
So this is kind of the example we talked about.
And the idea of the SRE working with the business is really cool, as we kind of mentioned.
And I was kind of thinking I'm still obsessed with the HR example.
So I kind of thought if I go to the HR director and said, hey, would it bother you if the system was down one hour a day?
You just take a long lunch or whatever.
And, you know, in my head, that sounded like a reasonable thing, you know, like seven out of eight hours.
And but then the HR director might say, well, are you kidding me?
Here's another way to think about it.
Imagine if one percent of our employees couldn't do something they need to and sent me an email. There's no way I
could deal with that traffic. It needs to be far less. And that's a much more reasonable number,
I think. And I think that's the kind of conversation that might happen between
two people. And that's the kind of conversations I think you'd be doing.
Yeah. I mean, there was one extra part though that you skipped over, which was that it can be difficult to do the measuring because consumer services can be clearly definable, but infrastructure services may not have a direct owner.
So going back to your HR example, those internal ones can sometimes be a little bit more difficult to deal with.
Yep, absolutely.
Also, just identifying the risk tolerance of consumer services.
Sometimes a service will have its own dedicated team, which is really nice.
And I mean, this is basically what you just said.
Sometimes there is no owning team.
And I was thinking like Jenkins might be an example here.
Like if a build isn't working, is it, you know, who's responsible for it?
Maybe it's a problem with the stage of the build.
Maybe it's, you know, a problem with one of the services that's, you know, you're pulling from an artifact or something.
Maybe it's a problem provisioning a build agent.
But someone has to take the initiative to go look and start.
And, you know, I would have mentioned it might be three different teams there.
Like, who starts with it?
Yeah, I mean, here's an example.
Imagine you live in a world where multiple teams are contributing to an ultimate site.
And let's pretend this was in, like, a Kubernetes or some kind of, like, you know, shared space like that, right?
And maybe one of the services that is being deployed in this environment is a Redis. Now, all the teams are using this thing, right? Who's responsible for it when it goes down, right? And an even better example that came up, if we speak strictly to Kubernetes: what happens when etcd gets full, right? Yeah, like everybody in the cluster is relying on that thing, right?
And now nobody can write like secrets or config maps or stuff, you know, because like, you know, there's something like more overarching.
So that's the point is that sometimes like deciding who is that owner can be a little bit more difficult.
But what they say is a lot of times, if there is no clear-cut
owner, the engineers will end up taking ownership of it and then defining the reliability requirements
themselves. Because, I mean, think about it, right? I know Jay-Z, I know Outlaw, we're all
sort of like this. If you encounter the same problem enough times, like I'm not manually
handling it the third time, right?
I'm, I'm going and finding a way to write something to take care of it. So I don't have to deal with
it anymore. And I think that's ultimately what's, what ends up happening here, right? Like if,
if developers look at it and say, man, it's costing me two days a week to keep Redis going,
we need to figure out how to fix that. Then they're going to start figuring out how can we make it to where we're not having to even touch this thing.
Right. And so they start defining what they want the reliability of Redis to be so that they never have to go look at it again.
This also goes back to the strong culture, though, within Google, where the people are kind of empowered to do that and to make those decisions. So you have to have buy-in from the management, which was something that we had mentioned in the previous episode as it relates to this topic. This can't just happen because you alone as a developer on your team want it to happen, you know? And so because they have that kind of culture where this is a thing, then they can do that.
Yep.
And so, some vectors in assessing the risk tolerance: what level of availability is needed?
We've talked about that.
Do different failures have different effects on the service?
Redis is a good example because maybe it just slows things down or maybe it leads to a database failure.
So you got to figure out what's going to happen there.
Use the service cost to help identify where on the risk continuum it belongs.
Okay, so hold on.
We need to pause here because this was in the book and I didn't really go into the notes and describe this very well. They call this risk continuum. It's sort of like this line, like where does this thing fall?
Right. And, and you're trying to align it with, with the objectives versus the reliability stuff.
Right. And so this risk continuum, they're basically saying, Hey, if you can align this
thing on the line using cost, then you could sort of help figure out where it should belong.
Yeah, I think we talked a little bit about DREAD scores in terms of security many years ago. DREAD was an acronym, with things like discoverability and reproducibility. It basically had to do with classifying, coming up with a single number for a vulnerability. So you could say, well, it's a five on discoverability because someone would have to have a lot of knowledge, but the reproducibility once you know about it is really high, it's really easy to reproduce, so that would factor in. In the end, you kind of average those things up and get a single score. So that's what I kind of imagine with there being a continuum here.
Well, I mean, it was kind of a weird one, though, because are they literally saying,
hey, it cost us, let's say, a million dollars to build this service, so on the continuum, if it's completely down, then that's a million dollars that we lost to build that service, right? Like, that's what the service cost us.
So what I was imagining is, you know,
the HR example I keep coming back to
is saying that that's hard to come up
with some good metrics and good numbers around.
But one way to look at it is to say,
well, how much does HR system cost us?
It costs us $3,000 a day.
So if it's down for an hour,
it costs us this much money.
So it's kind of a way of saying the service,
this is how much money we just kind of burned on having stuff running that wasn't
helping anybody. Whereas compare that to like Google search, right? Where it's like, well,
that costs us a million dollars a day to run. And so an hour there is much more valuable.
So it costs more. So it's just kind of saying like a way of saying like,
this service costs this much money to run.
So if it goes down, it's that much more important.
Just because of how much it costs us.
It's just one factor.
It's not the only one by far.
But if you're having a hard time coming up with things, you know, it's something to consider.
Yeah.
So in fairness, this one is identifying the risk tolerance of consumer services.
So I think it might be the cost to the consumer.
So, you know, if it's a super high expensive thing, like, I don't know, big table, then maybe you want to make sure that it's more reliable depending on what it is.
So I think that's where they were going with this one.
It makes sense.
So if I'm paying for Kubernetes and a whole bunch of nodes and it's down, and I just paid you a bunch of money for this thing, then that stinks. Whereas Netflix, if Netflix goes down for an hour, it's like, okay, I paid $8 a month for it or whatever, so it's kind of okay.
So said another way, you can create a Gmail account for free and check your Gmail. Yeah. So if it's down, eh, not such a big deal, right? But if I'm a big customer of, let's say specifically, Google Cloud, and now Google Cloud goes down, then that's a much bigger service cost. And so on the risk continuum, they are on an opposite extreme from the free Gmail account.
Totally. Yes, I think they use the example of Google Apps, business apps. They do, later on. Yep.
So, yeah, next up we've got the target level of availability. This one's kind of interesting. What do the users expect? Is the service directly linked to any revenue? So this would be, are you paying for a Gmail account or something, right? Like if you're a business customer? Is it a free or a paid service? Is there a competing service, and what is their level of service, right? So somebody like Google may take a look at their GCS product and say, hey, what is Azure Blob Storage's reliability SLO, or what is AWS's S3 SLO, right? That's probably something that they all shop around to make sure that they're being competitive there. What's the target market? Is it consumers, is it enterprises, whatever. And then this is where, Jay-Z, if you want to jump into this one, you just kind of brought it up a second ago: the apps.
Yeah. And so, kind of the idea that if I'm paying for something, I expect a better service. And there's also, even amongst the Google Docs apps... like if you think about maybe a PowerPoint presentation, or, you know, whatever they call it, Slides, it might be more valuable, because the chances of it going down while someone's actually about to do a presentation, the severity might be worse compared to email, where you just refresh. And remember, Google measures stuff in terms of percentage of request failure, so we're not really talking about total outages, we're talking about the number of requests failing. But it can be really scary if you can't bring your presentation up right before a big presentation, whereas email, you know, so what?
And so that's the kind of thing.
Same with Maps.
Google Maps is a free service.
But if I can't get directions when I need it, if I'm running late and I can't get the service I've been relying on,
that's going to sting me more as a customer than maybe something else like email.
It was interesting, though, when they talked about their apps, because you know, when companies buy into the like G suite of products,
their company business is sort of running on Google's infrastructure at that point. Right.
And so they prioritize that kind of stuff super high because they know that if you're, um, you
know, like you said, your slide decks are down or if your internal email or internal calendars are down, like that can actually cost a business a lot of time and money.
Yeah.
And sometimes it's more than just the time, too. Like, you know, you're doing a presentation to the board of directors and you can't bring up your stuff.
Like that's a big problem.
You may not get the opportunity again.
Right. I mean, they've got to be pretty good at it, right? Because would you care to take a guess at how many times Google has been down? Google the... I mean, Google services.
Let's say Google services. How many times do you think Google services have gone down?
What does Google services mean?
Like Gmail or YouTube or Drive, whatever.
Any of the Google products.
Like down, down, like can't reach it down?
Down, down.
I'm going to say single digits.
It's probably ridiculous.
Are we talking all time?
Yeah.
All time. Give me a number, man.
20.
20? Okay. Do you realize Google was formed in the late '90s, right? 20 is an incredibly low number, right? Agreed. Jay-Z went even further extreme, to cut that in half and say somewhere in single digits.
He said nine.
Are you ready for this number?
I don't even, I'm not going to believe it.
It's four.
Wow.
Four.
That's incredible.
There was an outage in October of 2018 that took out YouTube for a period of time.
And then the other three were all in 2020.
Oh, wow.
There was a six hour one that took out Gmail, Google Drive, Google Docs, Google Meet and
Google Voice.
Then that was August of 2020.
In November of 2020, there was an outage that took out YouTube and Google TV.
And then in December of 2020,
there was an outage that took out Gmail,
YouTube,
Google drive,
Google docs,
Google calendar,
and Google play.
And that one... I forget how long that one was. That one was for like 40 minutes or so.
Wow.
Four outages.
They blew those budgets for like years,
right?
That's impressive.
There's a Wikipedia page that lists all four of the outages.
It is only four.
Yeah, that's impressive.
When you are is older than that.
But in fairness, in fairness, they got a little bit of compute power, right?
But the point is, is that like what we're talking about here with the SRE, right?
It is no joke.
I mean, like they obviously know what they're doing.
Not only did they write the book, but, I mean, they have the...
They practice it.
They have... the proof is in the pudding, right?
Like you can see here, they only had the four outages.
Hey, so there was actually a cool thing in here that they're talking about.
Like you mentioned, YouTube went down and all that. When Google first purchased YouTube, they didn't care
as much about the reliability side of things. So if you went to hit a video on occasion and it was
down or whatever, they were okay with that because they were way more focused on adding more features
to the platform.
I mean, you've got to imagine, they wanted to get some ads on there real quick because they knew that was going to be a moneymaker for them.
So they were willing to take the hit on the reliability side. So it's interesting that even within their own company, they look at things and they say, hey, yeah, this is okay.
And we're going to iterate on this as fast as we can,
and then we'll come back to it later.
Yeah.
I just looked it up to see; the one that I remembered was in 2013.
It was down for five minutes and they say that global internet traffic went
down 40%.
Wow.
Wow.
You know, it was like for Google search going down.
That's awesome.
I remember when Amazon went down.
When we worked at Amazon, I remember hitting the page and going, wait, is my internet down?
No?
NBA.com's loading.
Remember that S3?
There was an S3 outage when someone had a misconfigured router or something around the holidays.
Yep. Woo! Fun times. Bad things.
So, when we talk about failures, what kind of failures are we talking about?
There's a little section here on the shapes of errors, and they talk about what's worse: a constant trickle of errors throughout the day, or a full site outage that happens for a short amount of time? The answer, of course, is that it depends.
Some services, you just go to lunch and try back later.
And there's other times when it's really important that people keep trying.
Like the map example, if you're trying to figure out how to meet somebody in 30 minutes and it takes about 30 minutes to get there and you can't pull up the restaurant, that's
something that you want.
A trickle would probably be more desirable.
You don't want that to be gone for two hours.
You'd rather have it just try a few times and have it eventually work
Yeah, but they did share one that was really bad, where you would probably rather have an outage: if there was a potential bug in some service that you had out there that could allow people to get private information that they shouldn't have access to, then they were like, you know what, it'd be worth having the unplanned outage, taking the thing down, so that nobody could get that private data.
Right, right. Yeah, private data was a good example of something where, yeah, it's fine to be down. And they'd always prioritize security over... I think it's like security number one, reliability number two.
Yeah.
Oh, there was another interesting one that they brought up, too, is their ads.
So they said that typically their ads were accessed during working hours, right?
Which is not surprising.
If you have marketers or whatever, they're hitting that stuff between eight and five, nine and six, whatever. And so they were okay with taking the service down later at night, when they knew there weren't going to be that many people on or impacted at that time. So they even take a look at what they're doing and say, hey, you know, we don't necessarily have to have 100% uptime. Depending on the service and depending on the usage patterns, we're totally fine with setting up
planned timed outages, right?
Yep.
And as far as cost goes, it ranks very highly on the deciding factors: how much money you make, how much money you cost. So, just a couple of questions you want to ask to help determine the cost-reliability kind of ratio: if you built in one more nine of reliability, how much would revenue go up? Essentially, this is a recurring theme in this book. It keeps coming up. It could be very expensive. This is where that business owner knowing the value of whatever that business is, you know, for the company, comes into play.
Yep. And does that additional revenue offset the actual cost of the reliability?
And they have an example of basically if, you know, getting that extra nine costs you $900 a year, but it brought in $1,000 a year.
Yeah. You know, in that case, it's a good example. It's a hundred bucks, so, yeah, sure, go ahead. But you can imagine it could definitely be more expensive to get that extra nine than it's ever going to be worth.
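Using the episode's example numbers, the decision boils down to a simple comparison (a sketch only):

```python
# The episode's example: an extra nine is only worth pursuing if the revenue
# it brings in exceeds what it costs to achieve.
cost_of_extra_nine = 900         # dollars per year to engineer the extra nine
revenue_from_extra_nine = 1_000  # dollars per year it is expected to bring in

net = revenue_from_extra_nine - cost_of_extra_nine
print(f"net ${net}/year ->", "worth it" if net > 0 else "not worth it")
```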
I mean, well, it depends. How long is it going to take me to get that nine? Now we're talking about my time, not the company's time. Right.
Yeah. I think I would have gotten failed by that interview question right there. Did not get the job.
So, other service metrics. Knowing which metrics are important and which ones aren't helps you make better decisions, of course. They mentioned the example of AdSense and Search. Search's primary metric is speed to results; we want the lowest latency possible, and of course we want the best search results up top. AdSense's primary metric was making sure that it didn't slow down the page load. So this is kind of an example where they work in tandem, they work together to make for a good user experience, but ultimately we don't care as much about AdSense being late or being slow as long as it doesn't affect the primary search. And so because we have a looser goal on AdSense, it's okay to basically pop those in later and just go ahead and show the search results first. No one has ever visited a webpage and, as they're trying to read whatever the blog or the site is, thought, you know what, I need to leave the site right now because all of the ads targeting the last guitar I looked at are not showing up in the right rail, and I'm fed up with it if I can't see it again.
That's right. Right. Like, that's never happened.
But also, people love when the ads actually pop in after you've started reading the page and they bump the text; it moves your text all over. Oh man, drives me crazy.
Hey, but you know what was really cool about this? What they explained behind the scenes on this particular one is, with AdSense, because they don't care about it loading later into the page, what they said is that reduces the cost of their infrastructure, because they don't need to have that stuff running in as many data centers and regions and whatever else. They're not trying to get sub-second craziness like they are on their search results, right? So their search has got so much compute power and is redundant all over the globe, where with the AdSense stuff they're like, ah, it doesn't matter if it loads from somewhere two states away, it's not going to kill us. And so they can save on costs, because they know the metrics that they're aiming for there, and they're not going for the fastest numbers. So it's really interesting that they use those metrics to drive decisions on how they configure their internal infrastructure and all that.
Yeah, it's funny. If you think about it, if the ad doesn't show, it's a better customer experience, right?
It's a better user experience.
And the person, you know, presumably who bought the ad didn't pay for it because it wasn't
shown and wasn't clicked.
So it's like, hey, it's a win-win.
So just don't show ads, Google.
You fixed it.
Yeah, I'm fixed.
There you go. Now, a million dollars a day... step four, profit. That's right. Yeah. Oh man. So, the last little thing here to note is just the different requirements with consumer services, typically, because they often have... yeah, sorry.
Wait, help me out here. What? Different requirements than consumer services, typically, because they are serving multiple clients.
So infrastructure services are different than consumer services.
Okay.
Because they're serving multiple clients.
Okay.
The header was part of it.
The header was part of it.
Yes.
Yes.
I didn't say that right either.
So I didn't help you
there at all. We cleared that up, we're good. Sorry, it's like, as we know, sometimes I have a hard time talking about things, like, um, Shia LaBeouf. I'm pretty sure that's not how you pronounce his name. I'm the one who's the... you know, what was my new job title?
A sayer of names or something.
What is it?
What is this book? Uh, Shia... Shia... yeah, that one is difficult. It's like, what's the difference between a poorly dressed man on a tricycle and a well
dressed man on a bicycle?
A wheel.
Oh,
Oh no.
A tire.
A tire.
That's good.
Very nice.
That's good.
Told you.
Yes.
One last one real quick.
When does a joke become a dad joke?
When Outlaw tells it?
I don't know.
I like that.
I'll take that.
All right, we'll just end it there.
No.
That's when the punchline becomes apparent.
Oh.
That's good.
Thank you, Gregory.
This episode is sponsored by Retool.
Building internal tools from scratch is slow.
It takes a lot of engineering time and resources.
So most companies are just resigned to prioritizing a select few and settling for inefficient hacks and workarounds for every other internal business process. Hey, but Retool helps developers build internal tools faster so they can focus development
time on the core product. Retool offers a complete UI component library. So building forms,
tables and workflows is as easy as drag and drop. And that's no joke. We saw that in the, you know,
the UI, we were given a demo of it.
And it is silly simple how easy you can construct these pages.
And I like easy, but more importantly, Retool connects to basically any data source, database or API.
They offer app environments, permissions and single sign on out of the box.
And they offer an escape hatch to use custom JavaScript whenever you need it.
With Retool, you can build user dashboards,
database GUIs, CRUD apps,
and any other software to speed up and simplify your work
without Googling for component libraries,
debugging dependencies, or rewriting boilerplate code.
Thousands of teams at companies like Amazon,
DoorDash, Peloton, Brex collaborate around custom-built retool apps to solve internal workflows.
And when Jay-Z was talking about all the integrations, whether it be a database or an API, he's not joking there.
You can go to retool.com slash integrations to see everything they have there.
If you want to hook up to GitHub, for example, or maybe you just have some GraphQL query that you want to connect to
our friends at Datadog. They've got integrations with them, CircleCI, PostgreSQL, my favorite
database. Or, you know, maybe you want a SQL server or, you know, we talked about Redis.
Whatever your integration is, they've got plenty of integrations there to help you out. Learn more, visit retool.com. Oh, all right, here we go. Okay, so seeing as I naturally have the late-night DJ voice right now because of pollen in the Atlanta area... Ah. So good. Pollen. Man. God.
So I am going to do the bag.
So if you have not had a chance to leave us a review and you would really like to give back to us and put a smile on our faces, we have a nice little link set up at coding blocks
dot net slash review where you can go, and we have links to leave a review on either iTunes or Audible or I don't know what else we have on there, but we have stuff. So again, we really do appreciate
it. We, we super love reading that when, when people leave nice little messages for us. So
if you wouldn't mind, please do that. And now you can even see reviews in Spotify,
which Outlaw has graceful, grace... graciously put a link up there for, all that. Gratefully, gracefully, and graciously did he do this, the sayer of names. So yes, I really think that Alan should be talking to us right now about how he fell into a ring of fire. Like I really want to hear him say it right now. I don't have the Johnny Cash... isn't that Johnny Cash? Yeah, come on, just one time, Alan, the ring of fire. There you go. Oh my God, it was better
than I thought it would be. That's not Johnny, because that's Social Distortion. Gosh, well played, sir. All right, well, okay, so let's get into my favorite portion of the show: Survey Says. All right, so a few episodes back we asked, hey, for this year's game jam, you are... and your choices were: super prepared, been practicing all year, I'm ready; yeah, I'll figure something out; or oh my God, I have no idea what I'm doing. So I see you there, Johnny Underwood, sticking something on his forehead, trying to get back at Joe and I to distract us as we now try to talk. But I am a professional and I'm going to stay on topic here. So, right, this is episode 182, so it's an even number, so per the trademark rules of engagement, Jay-Z, you are up first.
all right i'm gonna say first. I'm going to say I'll figure
something out. 10%.
That's cheap.
Is that enough?
He shot the moon.
Is that
chicken strikes again?
He's counting on his fingers and his toes.
You can't see it off screen.
You can't see it.
All right. So this is, oh my God, I have no idea what I'm doing.
We'll go 11%. I mean, we're going crazy high here.
And, of course, it is, oh, my God, I have no idea what I'm doing.
61% of the vote.
I have no idea what I'm doing.
That's awesome.
That's good.
And yet great stuff came out of it. So that's awesome. Absolutely, really good stuff came out of it. Yeah. All right, so for this episode's survey, I thought, you know, last episode we kind of tied DevOps into the topic, and even in this conversation so far we've kind of tied DevOps into the conversation. There's a lot of overlap there, I think we could all agree, right?
So, you know, in general, we've spent several years now talking about DevOps,
gone through several books and whatnot.
So just as a general rule of thumb, like, how do we feel about DevOps?
And your choices are: love it, it's the greatest; or it's great, when things work; or it's okay, it's overrated; or I wish we had a good DevOps pipeline; or it's a dream, nobody does that.
Should we, should we lead the witnesses on this?
no no no
I know the answer
it'll be 10%
I was going to say 10% of the vote
I think he learned his lesson
he's going to go with like 13 or 14% next time
maybe you never know
also
don't forget we're giving away a book so drop a comment
here and we'll hit
you up. Yep, yep. And, you know, those comments really do help, anything that you could do to... well, I was thinking more about the comments, like leaving the review comments, but yeah, Alan was referring to leaving a comment on the episode too. We appreciate that too. But those comments, the reviews that Alan referred to in his not-Johnny-Cash pollen voice, help because they bring in more listeners, and then we can have ads that we get paid more money for.
We have expensive costs, right? We have to wear these headphones
all the time. Now I'm going deaf, so I sent my hearing aids in
for repair two weeks ago.
That's right.
I haven't heard anything since,
but
you know,
I can't even work without headphones on anymore.
Like I don't even have to have anything playing.
I don't have to be in a meeting.
I just,
I'm like,
where,
where they block out sound.
Yeah,
for sure.
No... do we need to go back to the shopping spree episode? Because of these beautiful Calis.
That's right. That's right.
I'm trying to block out the world, man.
No. Thank you, Derek, for that joke, by the way.
All right. So let's talk about target level of availability.
So one approach, I think we kind of just hinted on this, though.
One level of reliability may not be suitable for all needs, right? So the example that we just gave was related to the business apps, the Google Apps versus AdSense. They also gave an example of Bigtable in the book. And it was actually a really cool story because they were
saying that like, depending on what the usage is of the application, you could actually even like
have different tiers of service that you could charge different levels of reliability for.
So they gave this example where they're saying, hey, if you want super high reliability, and I think I hinted at this even in the last episode, if you need super high reliability, then it's going to be hyper expensive, right, due to the additional compute that you're going to get for that. But if that's what your use case is and you're willing to pay for it, then here's a service level for you, here's a cost that you can pay. But if what you want to do is more like offline batch analytical type programming or processing, then you might not need that higher level of reliability. And so therefore you can be in a different tier of service and pay less
for it. And, you know,
but as a result,
you know, you're willing to take that hit of,
uh,
you know,
it might not be quite as fast as the other one,
but it's always running.
Right.
Yep.
Oh,
I should mention too,
that,
actually,
uh,
you, you mispronounced Bigtable there.
Bigtable.
Yeah.
It's not a capital T.
That's always bothered me about Bigtable.
And they actually do pronounce it as Bigtable.
But it's little t, one word.
It really is.
So it looks like Bigtable.
That was your tip of the week.
A little journey through Jay-Z's mind.
Kind of scary, if I'm honest.
Me too.
So, different types of failures. Real-time querying wants request queues to almost always be empty, so it can service requests as soon as possible. Offline analytical processing is more about getting right answers and just throughput in general, so we care less about latency and more about always getting the work done. So it's two different queues with different goals: one queue we want to always be empty, and the other we want to stay full so there's always work to process.
And you know,
what's weird about that is it's literally the same technology stack,
right?
Like it's same exact stuff,
but what would be successful for one would be viewed as failing on the other.
Right.
And so it's,
it's pretty interesting that you can't even look at the same freaking thing
and say,
uh, uh,
Hey,
what's my success criteria,
right?
It's not,
it's the use case that you have to go after.
Yeah,
that's funny.
Um,
so cost,
uh,
can you partition the services such that,
uh,
different clusters can have those different needs.
And we kind of talked about that a little bit.
Um,
you know,
maybe for some Bigtable,
uh,
customers, they care more about, let me fix this, low latency and high availability.
And others may, you know, care more about throughput.
And exposing those cost savings, giving the customers the leverage to make the right decisions for their business is fantastic.
And it kind of takes some of the decisions away from your SREs, although it does complicate things, right?
So you've got to kind of split your – you'll have different levels of objectives for those.
But I don't know that it complicates things too much because what they said is typically it's the same exact stack.
It's just configuration levers that they change, right?
So does it really complicate things?
I guess you have more things you need to test out to make sure they operate
properly in those different environments.
But technically you're spinning up the same software,
just,
you know,
changing variables here and there.
Yeah.
I guess I was expecting to have different SLS based on,
on,
uh,
the kind of the,
the service tier,
but yeah, I don't know. Maybe you wouldn't. Yeah. It is kind of interesting service tier but yeah i don't know maybe you wouldn't yeah it is kind of
interesting. I'm guessing, with what they were talking about, like always wanting empty queues versus always-full queues, or queues that have things to be processed, would they set up separate metrics for those different types of environments? I'm guessing they would. I would assume so, yeah. I mean, because like you just said with the queue thing, for example: the real-time querying, you wanted to always have an empty queue so that it can be real time, versus the offline, where you're trying to do as much as you possibly can, so you always want it doing something. So you,
you'd want to know like, Hey, is the real time queue backing up?
Because if so, I need to address that somehow you,
you'd have to have different observability for it.
I think you'd probably, you might have the same metric, right?
Like the queue size. Um,
but you might have different alerts set up for,
for different sets or something. I don't know. Yeah. It's, it's definitely interesting. Um, but you might have different alerts set up for, for different sets or something.
I don't know.
Yeah.
It's, it's definitely interesting.
Um, yeah.
So, oh, this, this was actually my favorite part of this entire chapter here.
I don't know about you guys, but this, uh, this motivation for error budgets.
So when I read this title, I had no idea what this meant.
I still didn't know what it meant even until I got down to like another couple paragraphs.
But basically what they get at here is there's tension between SRE teams and feature development teams, right?
We've talked about this in the past, too, right?
Like SRE, they want to keep things going.
They want things running.
And development teams want to release features.
And those two things are at odds because every time you release something, you're introducing risk and potential downtime.
But I jumped way too far ahead here. So there's a few things that they need to look at here. So
software fault tolerance. How fault tolerant should the software be, right? Like how well
does it handle unexpected events?
Testing.
If there's too little testing, then it could be a bad user experience, right?
And we're talking about unit testing.
We're talking about end-to-end testing, all kinds of testing, not just one particular type.
If you have too much, you never ship, right?
Like if you're trying to make everything absolutely perfect, hit every edge case, you're not going to ship the software.
Let's go back to that DevOps Handbook, right? Remember, there was this whole pyramid of testing that you might do, like you might have unit testing, integration testing, end-to-end testing, performance testing, user acceptance testing. If you were to say, well, we can't even ship it until we get to that top tier, you might never ship, depending on what your product is, and you've got to be willing to accept some... Totally.
and the worst thing is you know we say you'll never ship which i mean chances of that happening
are slim to none but but what you could do is you could miss your opportunity right like if the
market is positioned in a way to where if you get your software out the door, you can profit on that.
If you're trying to wait to get to perfect, then you may miss that opportunity, right?
So you could have missed the boat.
Push frequency.
Code updates are risky.
We've talked about that.
Anytime you push to production or wherever, you're introducing potential risk, right? Cause if it's been running fine for the past month and then you change
something,
there's a chance that it might not run,
run fine for the next day.
Um,
so should you reduce the number of pushes or should you,
um,
like work on getting more features out there?
Like that's a question you have to ask.
I still want to live in a world where like we've read stories about you know like i remember facebook had a article out like well over a decade ago where the developers could
um you know they weren't done until they saw it in production and they could literally do their own deployment for their thing as part of the effort. Right. So, you know,
you get a ticket, you're going to start like, oh yeah, I need to like move the pixel, you know,
the, the image three pixels to the left, or I need to make the logo on fire and you could like
do it and then deploy it. And that was all within your capability, because they had so much automation in play and so much testing in play; they kind of act as, like, automated gates to protect you.
But because of that,
there was like this huge confidence that they could just get it out there as
soon as possible.
Right.
So I definitely want to live in that world where like,
you know,
don't reduce the pushes to production,
get the feature or the bug fix out as soon as,
as soon as it's ready.
And there's also something to be said for like smaller deployments too.
So totally smaller deployments are safer.
We've talked about it in the past,
but for sure.
This last part that they had here in this section for the motivation for error budgets was canary deployments, right?
The duration and size.
Now, this is interesting.
I hadn't really thought about it in these terms before.
So canary deployments, you're typically trying to see how something will go, but you're usually doing it on a subset of the workload.
So you're sizing it down, right?
Hey, does this thing operate well in this environment? And the questions that they asked
were, how long do you wait on canary testing to see if something does go wrong? And how big do
you make the canary? Meaning, what size of a subset of the data do you do it on, right? We have talked about this in different terminology, most notably in the way of feature flagging, where you could use feature flags so that a portion of your traffic gets directed to that new feature. So specifically, the example that we did talk about, bringing up
Facebook again was how they had like the messaging was already out there. The Facebook messenger was
already out and deployed in the wild before, you know, you know, everybody realized it and they
were able to like slowly add on people and see, you know, how well it was working and get some metrics about it beforehand.
And then over time, you could keep increasing that, in this case, canary size
until you're ready to make it a feature that everybody can have.
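(As a toy illustration, and not something from the book: sizing a canary with a feature flag might look like bucketing users by a stable hash and sending a small percentage of them to the new build. The percentage, names, and hashing choice here are all made up for the sketch.)

```bash
# Toy sketch: send roughly CANARY_PERCENT of users to the canary build by
# bucketing a stable hash of the user id into 0..99.
CANARY_PERCENT=5

route_request() {
  local user_id="$1"
  # cksum gives a stable numeric checksum; mod 100 buckets users into 0..99
  local bucket=$(( $(printf '%s' "$user_id" | cksum | cut -d' ' -f1) % 100 ))
  if (( bucket < CANARY_PERCENT )); then
    echo "user ${user_id} -> canary"
  else
    echo "user ${user_id} -> stable"
  fi
}

route_request "alice"
route_request "bob"
```

Growing the canary over time is then just raising CANARY_PERCENT as confidence builds.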
And of course, I assume they're referring to canarying as the canary in the coal mine
kind of thing where you're just seeing like,
which is kind of a gruesome,
uh,
you know,
way to approach this,
you know,
like,
you know,
you put the canary in the coal mine and see if he dies like that.
Right.
Or it doesn't come back. Yeah. An awful way to talk about this.
So yeah.
Um,
yeah.
What'd you do?
Hey,
we talked about tracer rounds in a previous episode.
So you're shooting your code, right?
Yeah.
So, you know, why not fly a bird into something to get her set?
Yeah, we did.
All right.
So, Hey, there was actually a quote in here and I think it's from somewhere else too.
I don't remember, but we talked about it.
Yeah.
Did we?
Oh, I thought it was literally the title of the last section of chapter one, I believe.
All right.
So we're skipping this.
Um,
We've got to tell them; now people are going to go back and be like, what are they talking about? That's our motto: hope is not a strategy.
Um,
all right.
So this, this is the part that I thought was really good.
So forming your error budget.
Now,
what in the world is this?
So both teams, the SRE
team and your, your feature development team should define a quarterly budget based on the
service's SLO, right? So whether it's 99% uptime or 99.9 or whatever, what's cool about this is
this determines how unreliable a service could be within a quarter. And it removes the politics between the SRE and the product development team.
So you say, hey, we want our service.
If you're Google in this case, right,
the number of requests have to fall within our 99% SLA, right,
or SLO, service level.
Objective.
Objective, yeah.
So going back to what we said earlier, 99.99 percent on 2.5 million is 250 failed requests, right? So that's how many you get for the quarter, if there's only 2.5 million requests made in a quarter. I got 99.99 problems but a failed service request ain't one. Ain't one. That's right. Um,
maybe it is some of the times it depends 52 and a half minutes out of the
year.
It is,
but God,
that's still crazy.
That's so crazy to think about.
Um,
so here's,
here's where this gets interesting,
right?
So this removes politics,
but product management sets the SLO, right? So this removes politics, but product management sets
the SLO, right? So, Hey, I've, I've got my new Gmail thing out there. I'm going to set the SLO.
I want it to be 99% uptime because you know, who cares if some of the background polls for
new emails fail, that's fine for the quarter, the actual uptime then needs to be measured.
But the important part is it needs to be by an uninvolved
third party, right? And Google says basically they have their own monitoring system out there. So
that's like the third party. Now, this is where things get interesting. The difference between
the actual downtime and what the SLO was, is your error budget.
So if you said that you're allowed to have, just for easy numbers, let's say you're allowed to have 10,000 failures for the quarter for your service.
If you've only had 10 failures, then you've got 9,990 failures left in your budget.
That's kind of a cool way of looking at things.
And as long as you have budget left,
then you can do a release.
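(To make the arithmetic concrete, here's a rough sketch of the budget math being described, using the hypothetical numbers from the discussion: a 99.99% SLO and 2.5 million requests in a quarter. The variable names are just illustrative.)

```bash
# A 99.99% availability SLO means 0.01% of requests -- 1 in 10,000 -- may fail.
REQUESTS_PER_QUARTER=2500000
ALLOWED_FAILURES=$(( REQUESTS_PER_QUARTER / 10000 ))   # 250 failed requests allowed
ACTUAL_FAILURES=10                                     # measured by the monitoring system
REMAINING_BUDGET=$(( ALLOWED_FAILURES - ACTUAL_FAILURES ))
echo "Budget: ${ALLOWED_FAILURES}, used: ${ACTUAL_FAILURES}, remaining: ${REMAINING_BUDGET}"

# The time-based view mentioned earlier: 99.99% of a year leaves roughly 52 minutes
# of allowed downtime (525,600 minutes in a year / 10,000).
echo "Allowed downtime per year: about $(( 525600 / 10000 )) minutes"
```

As long as the remaining budget stays above zero, feature releases can keep shipping.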
I love this approach.
We actually did a hint on this last episode.
I kind of like got ahead because there was an example that they gave where
they were talking about the rack where you might have like the switch at the top of the rack,
right? And they were saying that like, hey, you know, sometimes your error budget can be shot,
not because of anything that you did, right? That switch goes out, that networking switch goes out
and takes out everything in it.
And if that's the only place where your, uh, application was, well then for the quarter,
your entire error budget is spent. You don't get to do any deployments.
And that's another one of those examples that we talked about where it requires strong buy-in
from management to be able to say like, Hey, Hey, I know it's only January 2nd,
but we're not going to do another deployment until April 1st.
So I did want to say, I'm sorry, I talked to him about this, and he mentioned that security fixes are still getting out. Bug fixes are still going to go out. It's really about those feature releases that are not going to go out. No one's going to say
like, ooh, log4j
exploit, you know, but...
Sure. Heartbleed comes out.
You're going to go ahead and fix that.
He also mentioned just how important the management
angle was. Yeah, they
all have to have buy-in. But how cool is that, though?
I mean, when you think about that, really,
that's a really nice
approach, right? Like, hey, as long as the development team is doing a good job in making their software reliable and when it deploys, there's not problems when it deploys.
They can keep deploying every day if they want, right?
Like, you're not burning your budget. However, if you did something particularly nasty that took you down for a while and you ate into seven thousand of your ten thousand budget, you just blew through 70 percent of your budget. And you're at the beginning of your quarter. You now have to think about what you're going to do over the next three months because you had a particularly nasty release. I think that is a really good way
for teams to sort of do a good job themselves, making sure they're putting out a quality product.
It's also a super mature way of addressing your service, right? Like, you know, there's a maturity level there. The
management's bought into it. You have the observability and the metrics for it. You know,
you were able to calculate what is a reasonable error budget and you're able to track on that.
And, you know, yeah. So, I mean, the point is that there's a lot of buildup in order to get to this point.
Well, I think the first building block,
and we've talked about it before, is the metrics.
If you don't have metrics,
how are you going to measure anything, right?
Like, how do you know how successful you are
or how bad you are or how you don't, right?
So getting that in place allows
you to start making decisions. Without it, you're flying blind. Well, where I was thinking about this, though, as I was saying that is, okay, let's take our current work life, right? We couldn't just say, hey, here's our error budget. We'd have to sit down and think about that. That's not something you can just decide arbitrarily, like, you know what, a thousand errors feels right, that's what my gut's telling me and I'm going to go with it. No, you can't do that. You really need to take the time to sit down and figure out, okay, what's realistic? What does it matter to my customers? What's the perceived loss, or the actual loss, with any customers for outages and things like that. So there's a level of maturity there to be able to get to this point.
Yeah. It's pretty awesome to think about. So the benefits we'll touch on real quick here. I mean,
we've already talked about some of them, but this approach actually provides a good balance for
both teams to succeed, both the development, the product development team and the SRE team,
right? It's, they can look at it and they can see what their budget is. Um, and if the budget's
nearly empty, then the product developers start spending more time testing and hardening their
product, right? It's, they don't stop working, but they're not going to be releasing more features.
And so they can take that time and make it to where the next time they
release,
it goes a little bit smoother,
right?
With,
with fewer problems.
So that's pretty awesome.
Or write more testing around previous errors to make sure that they don't
happen again,
or,
you know,
they're caught ahead of time.
Yep.
And then,
so outlaw hit on the switch thing,
right?
Like if it goes out,
you know,
it's like,
well,
sorry,
that's just what happened. You know, you guys are gonna have to eat it with the budget.
But what they said that this can actually do is it can bring to light some, some overly aggressive
reliability, um, targets that people have hit or trying to hit, right? Like, let's say that,
you know, something happened that you couldn't control. And all of a sudden your entire budget's eaten. You can look at it and be like,
yeah, you know what? Our reliability goals are way too high. We should back off this a little
bit because otherwise we're never going to be able to release another feature. Right? So again,
I think that goes back to what outlaw said with the level of maturity, you have to be able to
reevaluate that stuff and say, and be honest with yourself and the, and the group and the product and be like,
yeah, I think we were, we were a little too aggressive here.
Yep. So, uh, we'll have links to the resources we like in this episode. Clearly, uh, sre.google will be one of the many in there and uh yeah with that we head
into alan's favorite portion of the show it's the tip of the week yeah so i've actually got a couple
here today. So I'm going to lead off with the one that is a follow-up to the one that I did last episode with the Guava library from Google.
So Michael Warren wrote us on the previous episode's show notes and let us know that there is actually an update to the cache library. So Google has even pointed to this other one, this GitHub library from Ben Manes; it's called Caffeine.
And it's a,
I guess it's more of a standalone
caching implementation that they've done. And Google even recommends it from their own Guava
pages. So if you are looking for some of those caching features, go take a look at that instead
of necessarily pulling in the Guava stuff, which I'm assuming if they're, if they're pointing to
this other library, they don't plan on building this up or maintaining it. Um, they're probably deprecating it eventually. So, um,
excellent. Thank you for the tip there, Michael. And then the other one I wanted to share because
I get into this flow where I'll be working on things and I don't know, as a developer that
gets involved in a lot of stuff, I get pulled off tasks a lot. I don't know about you guys, does that ever happen to you? Do you get off task a lot? You get pulled off your task to do another task? I don't know what you're talking about.
sometimes i get to work on the tasks i'm supposed to oh that's that's kind of it's kind of like it
yeah i think that's kind of like it maybe that's what happens to me maybe sometimes i actually get to work on it
But so here's the gist of it, right? So I check out a branch and I start working on it, and I get pulled off on something. Two or three days later I come back to it and I'm like, oh man, what did I branch this off of? Where is this supposed to go?
Is this supposed to go into my release branch?
Is it supposed to go into the dev branch?
Like, I don't remember where I started or what I was doing.
Well, there's something that I do when I do git checkouts that helps me with that. The actual command is git checkout -b <branch name> -t. The -t tells git to track the branch that you branched off of. So let's say that I created a branch called ABC, right? And I told it to track the branch that I branched off of, which might have been dev. What I can do after that, there's another command in Git where you can type in git branch -vv, and it's a verbose version of the branches you have. Because if you just type in git branch, it'll give you a list of all the branches you have locally, right? If you do git branch -vv, it'll give you a list of those branches, but it'll also show, if you tracked another branch, the branch that you tracked off of. So I can look at it and say, oh, okay, my branch ABC, I branched off dev. Cool. When I do need to do a pull to get new code in, then I'll get it from the dev branch. Right. So, um, and it also
makes it easier when you do the tracking like that.
If you happen to switch back over to your dev branch
and you do a git pull to get in the latest changes
from your origin,
and then you switch back over to your other branch,
it'll tell you, hey, this branch is, you know,
25 changes behind dev, just do a git pull
and it'll pull it in
because it was tracking your local dev branch.
So that's all really nice.
There are tons of little caveats to when you don't have to do this.
I'm not going to go over all those because, I mean, it's just too much information.
But there is a way to set this up to where you never actually have to do the dash T
if you don't want to have to remember it.
You can do a git config --global branch.autoSetupMerge and set it to always,
and it will always track the branch that you branched off of.
So that's a nice way to do it.
I recommend that if you're doing it locally, that's fine,
but I still always do the -t,
because if I get on another system, another computer, another environment,
then I don't have to check to see if that global config is set. It's nice to know that it exists. So all that will be in the show notes. That is, that is my Michael Outlaw tip of this episode. I want to clarify, that was branch.autoSetupMerge? Yes, yeah, totally, I didn't read it verbatim. It's dash dash global branch.autoSetupMerge, set to always.
Yep.
Very nice.
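(For reference, here's a rough sketch of the Git commands described above. The branch names ABC and dev are just the hypothetical examples from the discussion.)

```bash
# Create branch ABC from dev and set dev as its upstream (tracking) branch.
git checkout -b ABC -t dev

# Verbose listing of local branches, including which branch each one tracks.
git branch -vv

# Because ABC tracks the local dev branch, a plain git pull on ABC will bring in
# whatever you last pulled into dev.
git checkout ABC
git pull

# Optional: always set up tracking when branching, so you can skip the -t flag.
git config --global branch.autoSetupMerge always
```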
All right.
So just check my watch and it's been 10 minutes since I mentioned MSUR.
So ding, ding, ding.
So two tips from him.
And the first one, I'm just going to read part of this talk description here.
And y'all can let me know if this rings any bells.
So, your job title says software engineer, but you seem to spend most of your time in meetings.
You'd like to have time to code, but nobody else is onboarding the junior engineers, updating the roadmap, noticing the things that got dropped, asking questions on design documents, and making sure that everyone's going in roughly the same direction. If you stop doing those things, the team won't be as successful.
Now someone's suggesting that maybe you'd be happier in a less technical role. If that describes you, congratulations, you're the glue. If it's not you, have you thought about who is filling this role on your team?
I'm going to skip the next paragraph and just end with this.
Let's talk about how to allocate glue work deliberately, frame it usefully, and make sure that everyone is choosing a career path that they actually want to be on.
Cool.
That's just really cool to talk.
It's from an ex-Googler.
They're a lead at Squarespace now.
And so there's an excellent talk that I listened to that's really nice.
And I think that it makes a lot of really good points, you know, of course about career management and, you know, making sure that you're doing visible work.
But also just about recognizing when you are doing these kinds of tasks and recognizing who is doing it and kind of what that means for the team.
So I thought it was really interesting.
And that could be a source of frustration for senior engineers a lot of times where you feel like you're not getting any work done.
But really, you're doing this really important stuff that isn't always seen or noticeable.
So, great talk.
Yeah, definitely.
I'm going to have to get that one.
I'm going to watch that.
Yeah, for sure.
And I got a second one here for you. So
I mentioned earlier, I hinted that there are other books
from Google. And actually, this one's even listed on
the sre.google site, but it's not under the slash books section for
some reason. There's a book called Anatomy of
an Incident. Google
Site Reliability
Engineering. You know I can't say that.
You can't say that.
You want to give it one more try?
Cyber...
It just can't do it.
You could never have that job title.
No. What's your job title?
I really want to. I really want to.
I really want to as well.
I'm an SRE.
That's pretty good.
SRE.
So it's all about Google's approach to incident management for production services.
So not only managing with the incidents and dealing with them proactively and preventing them essentially and being ready for them, but also doing things like postmortems and whatnot.
This is a free book.
It's really useful if you're getting into that or you just want to get better
at it.
Awesome.
Yeah.
And if you want this free book,
just leave a comment.
I don't know if this one... it's an O'Reilly book. I assume you could buy a hard copy of that, or no? Yeah, they've got PDF, EPUB, and Mobi. No reason. That's nice. So now you want a physical copy? It feels good. It smells good.
Yeah.
You know,
yesterday I had a clown open a door for me.
What?
I thought it was a nice jester.
So, for my tip of the week: one, you know, I mean, we do a lot of Kubernetes stuff in our day-to-day.
And if you are too, then, you know, if you aren't already using Minikube, if you, well,
first of all, if you're using Docker Desktop for your Kubernetes work, I guess this is
like a PSA, like there's better ways out there.
No offense to Docker Desktop, but, you, but I'm a fan of Minikube
because you can very easily specify
the version of Kubernetes that you want to use,
which to me is critical
if you want to be able to test your infrastructure
against a prod-like environment
going back to our DevOps handbook.
And so being able to specify that version is critical.
And, you know, Minikube allows you easily to do that.
But now that you're using Minikube,
I've convinced you of all of its wonders.
Such a salesman.
Thank you.
Sold it in like 30 seconds.
That was really good. Yeah. So, you know, you might want to be able to see, hey, of all my pods, do I have any heavy pods? Which ones are doing the most work and whatnot? So go to your favorite terminal and enter minikube addons enable metrics-server. And then you can do something like a kubectl top pods, or if you have it aliased to k, k top pods, and you can see your pods based on memory and CPU, you can see how they're performing and whatnot. So that's pretty cool. And also,
hey, here's another really cool thing that you can do with Minikube. If I didn't just sway you
already with that 30-second salesman speech that I did before, you can type in Minikube dashboard
and that'll bring up a page that, if you're using something like GKE, Google Kubernetes Engine, will be kind of like that, where you can see the nodes and the pods and all the different Kubernetes resources that are in that cluster of one VM on your machine. You can see some cool stats that are going on. It might be helpful to you to know, hey, am I requesting too much for all of these pods that I want to dev on locally, or maybe I'm well above the limit and that's why my local cluster keeps crashing every time I'm trying to dev on it.
You know, things like that.
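(If it helps, here's a rough sketch of those Minikube commands. The Kubernetes version number is just an example; pick whatever matches your production cluster.)

```bash
# Start a local cluster pinned to a specific Kubernetes version (example version).
minikube start --kubernetes-version=v1.23.3

# Enable the metrics-server add-on so resource usage can be queried.
minikube addons enable metrics-server

# Show per-pod CPU and memory usage (or "k top pods" if kubectl is aliased to k).
kubectl top pods

# Open the Kubernetes dashboard for the local cluster in a browser.
minikube dashboard
```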
I do want to call out one thing on the Minikube thing, only because it confused the ever living
heck out of me when I first started dealing with Kubernetes.
So Docker desktop nowadays, as Michael mentioned, has Kubernetes built in.
You can easily turn it on.
When you, at least back then, when you'd start looking into Kubernetes things, they'd tell you to use Minikube. And what confused me is I thought that Minikube was using Docker for its images and stuff. They're two totally separate products, right? So Minikube is running a Kubernetes cluster in its own little configuration, right? It's its own little world. It doesn't care about Docker Desktop, or it doesn't have to, generally speaking. So what you can do with Minikube is you can run a Kubernetes cluster and do all the kubectl stuff there. What you can't do is a docker run and expect everything to work there.
Even though it has a Docker daemon, it won't let you do it exactly the way that you think you
should. So my whole point in saying this is if you're confused, if you want to get started with
Kubernetes, you could totally just use Docker desktop and turn on Kubernetes there, but you
are stuck with the version that they bundle with Docker desktop, which is what outlaw was saying. You can't specify a version.
And if you allocate resources to that thing, let's say that you give it 10 gigs of Ram and whatever
that's taking that. Now, if you wanted to run Minikube as well, that is going to start another separate VM that is going to require its own resources.
So if you give it 10 gigs of RAM, it's going to be its own cluster.
Docker is going to have its own 10 gigs of RAM, and they don't operate together.
So it was confusing to me when I first started out because a lot of tutorials would say,
hey, use Minikube, but then Docker desktop's like, yo, use Kubernetes here.
And I didn't understand it.
So just know that they are two totally separate things.
Let's be honest.
If you're just starting out,
then everything involving Kubernetes is a bit to take in.
It's a lot.
It's a lot of information.
Just do an episode on it.
You're in good company there.
And you're not the only one. Well, you know what's funny? The reason I got so frustrated is I was following a tutorial and it was like, hey, go to the metrics dashboard, or this metrics add-on thing, right? And I'm like, it's not working. I'm using the Docker Desktop thing and it's like, it's not there. I can't get it to work. I spent hours trying to figure it out, and then I was like, okay, I don't get it. And it was just the fact that the tutorial was from the standpoint of Minikube, where I was using the built-in Kubernetes in Docker Desktop.
So yeah, it is, it's a mind wash. We should totally do an episode on it. I think we have
enough information at this point to, to spend a minute or two on it.
I'm all for it. Yeah.
Sounds good.
Um,
yeah.
Reminds me that some people are like slinkies though.
Yeah.
They're not really good for much,
but they bring a smile to your face when you push them down the stairs.
Uh,
thank you,
Jesse.
Everyone. That's the best one. It's really dark though, so sorry for ending the show on such a... you know, I feel a little bit like Dexter saying that one, you know, like, here's the clean room and we're going to tell you something funny, but... Yeah.
So, yeah, subscribe to us on iTunes, Spotify, or don't.
Maybe you heard that last joke.
You're like, you know what?
No, that's too far.
You went too far.
But I really wish that you would.
And if you would, you can find us on whatever your podcast platform of choice is.
Maybe a friend gave you, like, hey, go check out these crazy guys, and you didn't realize that we had a podcast. But yeah, we are there. And so leave us a review if you can
as Johnny Underwood
said before. We greatly appreciate it. There's some helpful links at
www.codingblocks.net slash review.
In the ring of fire... While you're at codingblocks.net, check out our show notes, examples, discussions, and more, and send your feedback, questions, and rants to our Slack channel. And make sure to follow us on Twitter and send us some, I don't know, Social Distortion-related trivia from Mommy's Little Monster. Head over to codingblocks.net and find all our dillies at the top of the page.
He says dillies.
That's what they are.
That's what cool kids call them now.
You say dillies. You say it like real quiet.
Cool kids.
He talks about Social Distortion, that's been around for like over 40 years. I think, yeah, that's about right. No, I know, because I was at the 40th year. Uh, really? Yeah. Nice.