Coding Blocks - Site Reliability Engineering – More Evolution of Automation

Starting point is 00:00:00 You're listening to Coding Blocks, episode 188! We're going 88 miles per hour. Going back in time. Alright, so, but if we're going to 188 though, does that mean we go forward? Then 88 is to go backwards and 188 is to go forward? That's the old 188 paradox. Yeah, okay. Paradox?

Starting point is 00:00:25 Really? Okay. So, all right. As you can see, we are definitely starting the show on the rails. Like we are definitely, you know, following the right path here. If you haven't already subscribed to us, we would greatly appreciate it. If you did, you can find helpful links, uh, at the top of the page. No, uh, yeah, whatever. You know the routine by now. There's 188 episodes. You don't know what I'm

Starting point is 00:00:52 going to say by now at the start of this thing, right? Yeah. So you can visit us at CodingBlocks.net, where you can find our show notes, examples, discussions, and more. Send your feedback, questions, and rants to comments at CodingBlocks.net. It's almost like we have that written down. Somewhere. Somewhere. And also follow us on Twitter at CodingBlocks. And if you tweet out like, hey, love the CodingBlocks show.

Starting point is 00:01:11 Y'all are awesome. Then I will respond with a GIF of somebody dancing. What if they respond back in the negative there? Oh, then somebody crying. Plus he meant GIF anyways. I'm good. I'm a child of the social media revolution, you know. I know how to work this stuff. I was gonna, I thought you were

Starting point is 00:01:30 gonna say you were a child of Giphy, because I was like, my Giphy game is this tight. It's really good. Yeah. We also have a website, and the links are at the top of the page. With that, I'm Joe Zack. I'm Michael Outlaw, who might be on time or late to that announcement. And I'm Alan Underwood. And can you truly call it Giphy if you just called it a GIF? I mean, the company is Giphy. I can't help that. Fired. Wow. All right, so with that, we are on part deux of evolution of automation at Google. And this particular show was going to be a little bit more difficult because this is all like stories from inside Google about how things played out and how automation helped and how it hurt and everything. So there are excellent stories, but it was a little bit harder to put notes together. So hopefully this will be pretty good and you guys will all get something out of it.

Starting point is 00:02:28 Guys and gals, everybody. Indeed. So first section is on automating yourself out of a job, which, you know, might scare some people, but not programmers. We're into that sort of thing. They had an interesting story about automating MySQL, which is a database they use for the Google Ads. At some point, I'm sure they have multiple databases in there for that. Were you not already just a little bit mind-blown when they read that? And you're like, oh, man, something the size of Google relied on MySQL?

Starting point is 00:02:57 They, too, use this stuff? I mean, this is decades ago. Yeah, I seriously had the same reaction when I first saw it. Their ads was running on MySQL. That's pretty awesome. Yeah, I'm totally naive here, but is there some sort of advantage to using MySQL over Postgres at this time? Just the time frame?

Starting point is 00:03:16 Because this was like early, mid-2000s? What you're talking about today? Probably still. I think it's still one of the most used open source databases out there. Like probably even way more than Postgres would be my guess. And some of the tooling around MySQL is amazing. Like MySQL Workbench and some of the stuff that you get for free destroys some of the open source stuff for Postgres.

Starting point is 00:03:43 So maybe that's good enough reason. Plus, you know, everything that's on WordPress basically uses it, so that's, yeah, 30% of the internet. So a lot of things that, like, I didn't like about MySQL, like you just kind of had to deal with, like their storage engine stuff and transactions, like all that stuff is fixed now. I just don't know much about it anymore. I was going to say there's no way that MySQL is still more heavily used than something like a Postgres, but I was wrong. Like, yeah, way wrong. Right. Yeah. According to DB-Engines, which we've mentioned before, db-engines.com, their ranking of it, MySQL is the number two database. Postgres is number four, with Oracle

Starting point is 00:04:24 and then SQL Server filling in the blanks in between for one and three. Yeah, and it's a pretty big drop. It's like half, right, on the score down to Postgres. I mean, Postgres is awesome, though. It's just MySQL is pretty ubiquitous. That's crazy. I wouldn't have guessed that.

Starting point is 00:04:40 I just assumed Postgres was number one. I would have thought Postgres had overtaken it too. And so this tale starts out with basically failovers. So if you kind of think about like early days of Google, they had far less computers and failovers were manual processes. And this is going to happen sometimes if you just need to kind of move one server to another, like you need to do a kernel upgrade or something. Also, it could happen if something goes wrong

Starting point is 00:05:06 and, you know, a server gets in bad shape. There's a couple different reasons why you might want to failover. And this would take, I don't think we wrote it down, but I think they said it was like 30 minutes, 30 to 90 minutes. It could take some time. Yeah, that was for the manual intervention, right? Like, if somebody had to go in and do it. They said that the actual.

Starting point is 00:05:27 For the master node, right? Right. Yeah. Not for the replicas. Right. And so one of the first things that they automated was this replica replacement thing, right? And that was their start into this world. But then they talked about they were going to migrate over to their Borg.

Starting point is 00:05:48 Do you call it an engine? I don't really know. It's like their cluster scheduling thing, but, but Borg, they were migrating to Borg, right? And Borg,

Starting point is 00:05:56 like it was kind of the predecessor, predecessor to Kubernetes, which, um, I think Borg and Kubernetes are still kind of different things, but Kubernetes kind of like, that's what it kind of came out of. Or at least that was my reading of it.

Starting point is 00:06:10 Yeah. And we actually have a link to what Borg is in the show notes here. So if you were interested in what exactly that was at the time, and still, I imagine it still exists, right? They probably still use it, because I doubt they pushed everything over to Kubernetes. But isn't the Borg supposed to be a bad thing? Like, can we address that? That would be like saying that you created the Death Star for your company,

Starting point is 00:06:34 and now it's a good thing, and, like, everybody's happy that you have the Death Star. But wait a minute, it didn't work out so well for Alderaan. But it's a good thing if you're the Borg. Right. Yes. Yeah, exactly. It's good to be the Borg. Okay. Uh, yeah. So, uh, they wanted to be able to get this faster, obviously,

Starting point is 00:06:54 and as Google's scaling up, you can't have people spending 30 to 90 minutes, or whatever it was, failing these machines over. So they're trying to figure out how to get faster. And I am struggling with how to even relate this story. This is to say, sorry, the whole back half of this chapter was basically stories that emphasize the first part of the chapter, about the evolution of automation and how things grow from manual processes into autonomous systems. And I do feel like there's really good nuggets here that make some of the weirder points from last episode make more sense, but it's just kind of hard to figure out how to talk about it in a way that's sensical. So this one for me, this first one with the Borg and the migration and all that, the automation, kind of what they were

Starting point is 00:07:46 getting at is it wasn't like painless. They, they first started doing it and, and they saw benefits, right? Like they started automating out the, the human interaction, the person interaction, they were getting rid of that stuff and things were great. But then when they tried to take it to the next level, the problem with MySQL is they said, you can't do the same thing with the master nodes, with the master MySQL nodes that you can do with the replicas. And so they ran into problems with Borg there, right? So during their automation, they ran into a bunch of brick walls that they had to overcome. And so while they were trying to automate stuff, it all sounded great,

Starting point is 00:08:25 but they actually caused themselves a lot of pain in the process, right? But ultimately, after they got over those things, they saw the fruits of all the efforts come to life. And then in the end, basically now they've got a system that is sort of hands-free, right? Like it, it, it kind of takes care of itself now. Yeah. And well, and they commented to that like they saw a large reduction in like mundane tasks because, because they had, you know, figured out how to solve the difficult problem was, you know, how to handle failovers for the master node, master database nodes.

Starting point is 00:09:01 Once they figured that out, I think it was like 95% of their, uh, you know, mundane tasks were gone. Like their toil, right? The crap that they just had to do. And then they also said, after they finally got to the end of the tunnel with all their efforts to make this happen, not only did their mundane tasks reduce by 95%, their total operational costs for managing those MySQL clusters also dropped by 95%. But I want to address another elephant in the room kind of thing. Because this portion of the chapter was titled automating yourself out of a job. And that kind of has a negative connotation to it. But number one is, well, is it negative? Because, like, did you really want to spend the rest of your career failing over MySQL master database nodes? No. Like, you want to get out of that job.

Starting point is 00:09:58 Number one. But number two is, like, we shouldn't think of these things in, like, such a negative context, like I'm automating myself out of a job, because that's never going to be the case. And I think we described this before: with this type of automation, you're always going to be like, well, what's the next thing that I can automate? You're going to just iterate onto the next thing. And even in some of their story here, like, you know, that's what ended up happening, right? Yeah, totally. And, I mean, we kind of just blasted through that whole section, but a couple of points worth talking about. There was one main thing that they ran into when they automated all of this, right? Because typically, for anybody that's been doing this stuff for a long time, if you're in your standard three-tier architecture, right? Where you've got a front end, a middle tier, and your backend database, we'll call it. You almost always

Starting point is 00:10:57 expect that backend database to be, like, the thing that's strong in your system, right? They called it out, like, developers always assume it's the strongest part of the stack, right? Yeah. Yeah. And so the problem is, when they started building all this automated failover, these things would migrate nodes, and the masters would change over and the replicas would change over. The problem is the code wasn't built to be fault tolerant. Right. And so they had to go in. So they automated the MySQL stuff to do the failovers, but then they had to go in and touch all the code as well. And they had to do that because they needed to make sure that this thing would be able to come back up in a state to where it

Starting point is 00:11:41 could operate again. So this whole notion of automating yourself out of a job, you're never finished, right? Like there's always more stuff that you've got to do. And ultimately what you're trying to do is make your life better as a developer, as a person who has to support these systems. So, you know, like Outlaw said, and we've said in the past, right? Like even if you automate a job, I know there's, there's definitely

Starting point is 00:12:05 been arguments over like people in fast food industries, right? Like they've already started putting kiosks in the stores, right? Like if you go into a Taco Bell or something, you might be ordering at a kiosk. Well, I can guarantee you. Okay. So now there's not people standing in front of a register that are going to be taking your order necessarily in every place. But now you have created jobs for people that are going to be maintaining these systems and taking care of these things when they go wrong, right? So you're always going to have new things to do. There were two thoughts I had on that.

Starting point is 00:12:39 One was the software that they mentioned. They specifically called out they had to make changes to JDBC in order to support this idea that the master nodes might not be as reliable or as you previously once thought of them because of this. Because the Borg, like we said, was an early predecessor to what maybe later

Starting point is 00:12:57 spun out Kubernetes, but it was basically doing container management, right? So a big part of the reason why they were able to reduce their operational costs is because, now that all of the MySQL infrastructure was being maintained by the Borg, they were able to better stack MySQL instances onto fewer machines. And I think this was the portion of the chapter where they were talking about, like, suddenly all their SREs had a huge abundance of time because their mundane, their toil was reduced by 95%, but they also had a huge abundance of hardware, because I think they said they reduced the hardware by like, you know, 60% or something. It was massive, right? All because of, you know,

Starting point is 00:13:39 them being able to, like, run MySQL now as containers that were managed by the Borg. So that was thought one. But then, uh, you know, you were talking about like this building upon thing. Like, you know, I made the joke about the Death Star a moment ago, but you know, if you think about how you would build the Death Star, right? Like at some point it just started out as the International Space Station, right? Something silly, you know, something like a weird stick-looking thing floating in

Starting point is 00:14:05 space, right? You know, and then they just kept building upon it. So it was like, hey, we have the International Space Station, woohoo, we're done. No, you're not. You got to keep building, man. Right? Like, so, yeah, you're never automating yourself out of a job. No. Yeah, I wanted to mention too, like, if you think about, um, when they got rid of the toil required in failovers, that time went into improving software, like JDBC drivers, that benefited the whole organization. If they pushed that stuff into the main JDBC, you know, repositories, then they made things better for the world, which is pretty cool. So, like, would you rather be doing repetitive, boring work, or making the world a better place for everybody? It's a tremendous trade-off in

Starting point is 00:14:43 value there. Well, now that you put it like that, I guess I'll just do the toil. Right. Also, I don't think we mentioned it, but they eventually got this process down to 30 seconds or less downtime, which I thought was really cool. And they mentioned that it was actually a requirement. So I was kind of curious. They didn't say whether that was a business requirement. No, they did. I remember.

Starting point is 00:15:04 Oh, it was a business. Okay. Yeah. It was the Borg. That was part of the problem with the 90-minute time for the previous versions of the failover for the master node, that it took too long. And so once they put it onto the Borg, then because the Borg would, like, reschedule these containers, you know, a couple of times a week, then they were blowing out their, uh, their error

Starting point is 00:15:25 budget. Okay, cool. So they had to reduce that time in order to fit within it. And so for anybody that wants some behind the scenes on this show, like, typically we mostly go in order. We totally did that entire section out of order. So if you're looking at the show notes, trying to line it up with what we said, it's not. And again, it's partly because of how these stories are done, right? Like, we don't want to read the stories out to you, right? Like this isn't a bedtime type thing. So, you know, we're talking about it in the way that seems to make the most sense.
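Editor's note: the error-budget point above can be made concrete with some rough arithmetic. This is only a sketch under stated assumptions: the 99.9% monthly availability target is hypothetical (the hosts never give a target), and "a couple of times a week" is taken as two failovers per week. The 30-minute and 30-second failover times are the figures mentioned in the discussion.

```python
# Assumed numbers: the availability target and reschedule rate are
# illustrative only; they do not come from the episode or the book.
MINUTES_PER_MONTH = 30 * 24 * 60  # 43,200 minutes in a 30-day month


def monthly_error_budget_minutes(availability_target: float) -> float:
    """Minutes of downtime allowed per 30-day month at the given target."""
    return MINUTES_PER_MONTH * (1.0 - availability_target)


def monthly_failover_downtime_minutes(failovers_per_week: float,
                                      minutes_per_failover: float) -> float:
    """Downtime per 30-day month caused purely by scheduled failovers."""
    weeks_per_month = 30 / 7
    return failovers_per_week * weeks_per_month * minutes_per_failover


budget = monthly_error_budget_minutes(0.999)          # hypothetical target
manual = monthly_failover_downtime_minutes(2, 30)     # 30-minute failovers
automated = monthly_failover_downtime_minutes(2, 0.5) # 30-second failovers

print(f"budget: {budget:.1f} min, manual: {manual:.1f} min, "
      f"automated: {automated:.1f} min")
```

Under these assumptions, two 30-minute failovers a week use far more downtime than the budget allows, while 30-second failovers fit comfortably, which matches the reasoning the hosts describe.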

Starting point is 00:15:59 It's neither a forward nor a reverse index. Yes. If you leave a comment, let us know why you like MySQL over Postgres or what you think about any of these things. We'll send you a copy of the book. You have a chance to win a copy of the book. You can read some of these stories yourself. Don't forget, the book is also free online. You can be reading all this right now.

Starting point is 00:16:18 That's true. Second story, automating cluster delivery. So they had a story about a particular setup of Bigtable, where there was some sort of efficiency kind of change put in place where they didn't use the first disk of a 12-disk cluster. And then some automation came around at a later date and saw that if the first disk wasn't being used, it assumed that none of the disks were being used, and it would wipe it. And so this had, uh, I think these were cache servers, so it wasn't, like, catastrophic, but it ended up causing this kind of cascading deleting of data, which put strain on other systems and took down, I think, one of the data centers, uh, briefly. So they mentioned that they had, like, real-time, you know, replicas of it, so it wasn't, like, a huge problem, but it caused some panic. It does make you wonder, like, was that the guy's first day on the job? That was like, well, I guess I can assume that if the first disk isn't being used, that none of them are being used, so just rm -rf *. And then he commits the code and, like,

Starting point is 00:17:20 a week later, it's like, hey, uh, tap on the shoulder. Did you know that you just deleted the entire data center? This is dangerous because this is, I mean, this has happened to all of us, right? So basically you had people that were, so this automating cluster delivery, this was actually talking about the infrastructure for clusters, right? This had nothing to do with Bigtable itself. And then you had the Bigtable group that did an optimization, because apparently not using the first disk in the 12-disk cluster made it way faster for some odd reason, right? Who knows what it is. But so you basically had two teams doing things

Starting point is 00:17:58 that made sense for themselves. And so this automating of the cluster delivery, they're like, oh well, they're not using the first disk, it's gone. So you had this sort of hidden optimization in the Bigtable delivery that the teams automating the cluster delivery didn't know about. And so they call out, like, you have to be careful about these sort of hidden safety signals, because, like, how is my team supposed to know that your team just doesn't use the first disk because, you know, it was slower with the cache retrieval or whatever? Like, it's a bizarre thing, but it's easy to see how that could blow up on you. Yeah, it was for latency reasons that they would have a 12-disk system and they would not use the first disk.

Starting point is 00:18:45 So, one, right away my mind was just kind of blown. Like, wait, why did the first disk matter or have such a huge impact on it? That's just bizarre. Yeah. But also, yeah, to your point, like, hey, I gave you 12 disks and you're only using 11 of them. Like, you know, how would I even know that? Why would I even think that? You know, if I saw the first disk is completely unused, why wouldn't you think that the others aren't either? And you imagine, like, a lot of times you don't write a script like this from top to bottom, like, every line doing something. You combine other tools that people have. So someone might have said, like, hey, here's a tool called disk checker that checks if the disk array is uninitialized, and you use it, and you don't realize that the way it checks that is by kind of this implicit safety signal of, like, figuring out whether or not the first disk is

Starting point is 00:19:28 empty. And so you don't really think about how it works necessarily, because you're at a higher level of abstraction, and it works great in test, and then you roll it out and, whoops, you didn't realize that one of these other tools you're using kind of took a shortcut on something. Right. I was questioning, like, how this worked, because, like, you just mentioned the 12-disk array, but I'm thinking, like, no, I don't know that it was. Because if it was a 12-disk array, then all 12 disks would be used, right? Like, whatever the array controller is would be controlling the usage. So in my mind, I was thinking, like, okay, it's a rack of, like, you know,

Starting point is 00:19:59 maybe like a storage, uh, controller that houses 12 disks, and they made an array that is the last 11. So if you were to go look at the storage rack, you'd see the first drive light in every storage controller is solid, never being used, but the others are blinking from all the usage, right? I don't know. Weird.
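Editor's note: the "implicit safety signal" problem described above is easy to reproduce in a few lines. The sketch below is purely illustrative; the `Disk` type and both function names are hypothetical and are not taken from the book or from any Google tooling. It contrasts the shortcut (first disk as a proxy for the whole machine) with an explicit check of every disk.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Disk:
    name: str
    bytes_used: int


def looks_unused_shortcut(disks: List[Disk]) -> bool:
    """Risky shortcut: treat the first disk as a proxy for all of them."""
    return disks[0].bytes_used == 0


def looks_unused_explicit(disks: List[Disk]) -> bool:
    """Safer: the machine counts as unused only if every disk is empty."""
    return all(d.bytes_used == 0 for d in disks)


# A 12-disk machine set up the way the story describes:
# disk 0 deliberately left empty, the other 11 holding data.
machine = [Disk("disk0", 0)] + [Disk(f"disk{i}", 10**9) for i in range(1, 12)]

print("shortcut says wipe:", looks_unused_shortcut(machine))
print("explicit says wipe:", looks_unused_explicit(machine))
```

The shortcut works for every machine that uses its disks in the usual way, which is why it can pass testing and still misfire on the one team that left the first disk empty on purpose.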

Starting point is 00:20:22 Also a weird assumption, though, to just make, like, oh, this first one isn't being used, so I guess none of them are. Right. Yeah. Yeah, but I mean, you could see how somebody would do it, right? And that's, you know, kind of what they were getting at: there are some things that are dangerous. And then they even say, um, their automation depended on a bunch of, like, you know, shell scripts, and those ended up being problems over time, which takes us into the next section, which was something that they'd created called ProdTest,

Starting point is 00:20:54 which was their way of detecting inconsistencies with deployments. Now, just a quick clarification. I think ProdTest, if I wasn't mistaken, ProdTest was almost a framework that individual teams would use for their services. And so they would say, like, ProdTest, you're setting up a new service called BigTable or BigQuery or whatever. Like, fill out ProdTest. Okay. But I did have one thing that was like, I had to take my medication when I was reading this

Starting point is 00:21:22 because I kept twitching when I was reading this portion of it, and I don't know if it bothered you guys too, but they would refer to them as unit tests, but then they were, like, pushing files around or setting up DNS, and I'm like, no, it's an integration test. Why are you calling it that? Yeah, it's totally an integration test. All of it was an integration test. The entire string of them. Well, I guess at their point, like, they've gotten to such a scale, you know, like, we consider our computer like that thing on our desk, whereas they consider their computer is like, oh, it's that large building over there. So maybe, like, you know, integration test means something different. Yeah, totally. Yeah, totally.

Starting point is 00:22:03 And I totally interrupted the flow, sorry about that. So you were just describing what that ProdTest was. Um, basically the idea is that ProdTest is a suite of what they called unit tests that would be run on a service, that would do things like, just like what Outlaw said, checking DNS to make sure it's okay. And if that test works, then it would go on to the next test, and maybe it would check, I don't know, connectivity to some other service or something, and keep on going. And if one of those tests fails, then it would bail. And then this was something that each team would create, so that you could run this test on the service and kind of get a health check and say, well, okay, this is where it stopped. So somebody needs to fix that. And then somebody could go in and jump in,

Starting point is 00:22:40 but every team was required to kind of create this ProdTest file or suite of files that would go and check out the health of something. This was the sentence that bothered me the most, though. We extended the Python unit test framework to allow for unit testing of real-world

Starting point is 00:23:00 services. I was like, wait a minute, you've broken out of that thing. But yeah. It's no longer a unit test. So let's back up real quick, though, because let's talk about why they even had this. So what they were saying is in their cluster deployments, they go to set up new clusters, right? And every time that they do it, they'd end up having to make some custom flag changes to various different parts of the system. And when something would go wrong, they'd have no idea what it was because

Starting point is 00:23:27 they modified their typical scripts or whatever, their flags. And so they got into a situation where it would take months to stand up a cluster, right? And basically they got a directive from management where they were like, hey, we want this done in a few weeks or a couple of weeks or whatever it was. And they were like, um, okay. So that made them back up, because they had typically done everything in shell script form, which is great for a lot of things, but it didn't let them check the state or consistency of a bunch of other systems. And so this is kind of what drove them to this ProdTest: to be able to check what is the state of all these systems that are hooked

Starting point is 00:24:11 up in this cluster so that we can know when we're ready to launch the thing. Yeah. It was like, them spinning up, like how fast it went, was a side effect of doing ProdTest, because, you're right, there was a section where they described, like, they were out of the blue told, like, hey, we're going to spin up these five new clusters

Starting point is 00:24:30 and you have to spin them up in a single week. Whereas before that was something that took a long time. But that mandate came because they had ProdTest now, and because of ProdTest, managers or project managers were finally able to, like,

Starting point is 00:24:46 predict when they could go live, and that was the impetus. That was like, okay, well, because we can predict this now, you got a week. Yeah, I think you're right. So, next story, actually, by the way. Yeah, it bleeds into it. So yeah, I think I said it in a way that implied that ProdTest came from them saying that they needed to get something up in two weeks, whereas what Outlaw said is, you know, they were like, hey, well, we can get something stood up in a couple of weeks now because we have ProdTest. But that led into other problems. But again, the whole reason ProdTest even came about was they were having a hard time even getting the clusters stood up, because when they would make these changes to custom flags and scripts and everything, they had no idea why things were failing all over the place,

Starting point is 00:25:29 because it wasn't consistent from one cluster to another. Yeah. Here were the steps that it took, and by the way, this whole thing with the one-week thing, I just want to go, like, like JZ said, that was part of a later chapter. And he's not wrong. That chapter is called No Good Deed Goes Unpunished. But here were the steps for getting the cluster ready. And I was like reading through this. I was like, oh, man, it actually kind of sounds fun. Number one, fit out a data center building for power and cooling. Right away.

Starting point is 00:26:01 Like, again, going back to our definition of computer being that thing on your desk versus them. It's like, no, it's that building over there. Right. Install and configure core switches and connections to the backbone, number two. Number three, install a few initial racks of servers. And then number four through six is where it got complicated. Configure basic services such as DNS and installers, then configure a lock service, storage, and computing. That's number four. Number five, deploy the remaining racks of machines. And number six, assign user-facing services resources so their teams can set up the services. Yeah, I like how step one is create a building with power, and that's not the hard part. Right. Exactly. That's amazing. And this whole story, by the way, I think it's really about evolution.

Starting point is 00:26:49 So it started with them, you know, having a lot of manual work and a lot of kind of scripts and everything was inconsistent. And then they moved to ProdTest. And then when all the tests were green, that's when you know, something's ready.

Starting point is 00:27:00 And so management can see and say, hey, we're at 50% passing tests, which we know, you know, roughly takes about two weeks, so we're on schedule, we're running faster than the last one. You know, it introduced some, uh, predictability. But the problem with ProdTest was that, you know, you're still relying on humans to go in and fix these things when things went wrong. Before you jump ahead, because I know where you're going with this, real quick: the important part about ProdTest that they built in was this chainability, right? So when they would go deploy a cluster, it would go check to see if the configurations

Starting point is 00:27:36 were right. It would check to see if this system was up and running, if the service was up and running. And then after that one succeeded, the ProdTest framework would know that, okay, the next thing to check is this test, right? And it would keep going down this line of tests. And if one failed, then it would abort and be like, hey, something died here. Well, right. So, I mean, let's build on that for a moment, because this was another part of the thing. Like typically when we talk about unit tests, one of the core assumptions about unit testing is that you cannot make an assumption as to, like, the order that your tests are going to

Starting point is 00:28:10 run in. Yeah. You should just assume that they might run in a random order, that the order they ran in last time is not going to be the same order they run in next time. And if they do happen to run in the same order, that's just a coincidence, and you should not try to make any kind of assumptions or infer any kind of state in a later unit test. Whereas what Alan's saying here is they specifically did add in that ordering, that implicit or explicit ordering, to the way their tests were being run. Yep. And so the important thing to know, I guess, is if they had 100 tests, you could think of it as almost like this top-down thing. Like the top one was green, all right, then run the second one. If it's green, then run the third one.

Starting point is 00:28:52 If it's not, then abort, right? Like stop and throw up a thing and let everybody know that everything failed. Oddly, they didn't have a unit test for is the building built yet. I noticed in their testing that they did give examples. That one wasn't there. And also in the show notes, we do have a link to what one of these charts looks like. Again, this book is available online for free. So we have a link to one of the diagrams they have so that you can sort of see this flow. And with that, back to where you were headed, Joe, because you were about to take us to the next step.
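For the show notes, the chained-test idea described above can be sketched in a few lines of Python. This is only an illustration of the "run in order, abort at the first red" behavior, not Google's actual Prodtest code; the check names and the `run_chain` helper are made up for the example.

```python
from typing import Callable, List, Optional, Tuple

# Hypothetical checks. Real production tests would probe live services.
def config_is_valid() -> bool:
    return True

def service_is_up() -> bool:
    return True

def dns_resolves() -> bool:
    return False  # simulate a failure partway down the chain

def monitoring_exists() -> bool:
    return True

Check = Tuple[str, Callable[[], bool]]

def run_chain(chain: List[Check]) -> Tuple[int, Optional[str]]:
    """Run checks in their listed order and stop at the first failure.

    Returns (number of checks that passed, name of the failing check or None).
    """
    passed = 0
    for name, check in chain:
        if not check():
            return passed, name  # abort: later checks never run
        passed += 1
    return passed, None

CHAIN: List[Check] = [
    ("config_is_valid", config_is_valid),
    ("service_is_up", service_is_up),
    ("dns_resolves", dns_resolves),
    ("monitoring_exists", monitoring_exists),
]

passed, failed = run_chain(CHAIN)
percent = 100 * passed // len(CHAIN)
print(f"{percent}% passing; stopped at: {failed}")
```

Because the order is fixed, the percentage works like a progress bar along a flow chart, which is what makes it meaningful to management. That is the opposite of ordinary unit tests, where no ordering should be assumed.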

Starting point is 00:29:27 Yep, so they had these chained unit tests, and if one failed, then they would stop the whole thing, and that's where the percentages were kind of meaningful, because it was almost like steps along a flow chart. And then someone would go in, investigate, do whatever they needed to do, and then move on to the next thing. Well, at some point, someone said, you know what, some of these things,

Starting point is 00:29:43 if we see this needs to be done, we can automate it. So maybe we'll have the shell script for setting up iptables or something. And if that test fails, it'll run, and we'll make it idempotent if we can, as best we can. So that if it ends up getting run more than once, or something bails in the middle, it can kind of pick up and fix itself, which is pretty nice. But the problem with that is that some of these things are hard to make fully idempotent. So sometimes things would be kind of flaky, or maybe the test would be a little bit flaky. Like maybe it wasn't a problem with iptables, maybe DNS was down because somebody tripped over the Ethernet cable or something. So sometimes these things would just end up in weird spots. And so even though this was a huge good thing, it wasn't very easy and it wasn't,

Starting point is 00:30:30 it still required a lot of human intervention to get these things going to fully green. But this one also, I think this is where being an outsider looking in, reading this book, is maybe working against me, because I was trying to understand this portion. Like, why would you even do it that way? Because I was thinking, okay, one of the examples they gave was a test called "DNS monitoring config exists." So in my mind, I'm like, okay, well, you already had some code that pushed out the monitoring config. Now you're going to test if the config exists. It doesn't exist. So now you're going to call a method called "fix DNS monitoring create config."
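For the show notes, here is a minimal sketch of the test-paired-with-a-fix pattern being described. The function names are modeled on the ones mentioned on the show, but the in-memory `cluster` dictionary, the config contents, and the `check_then_fix` helper are all assumptions made up for this example.

```python
from typing import Callable, Dict, Optional

# Hypothetical in-memory stand-in for a cluster's state.
State = Dict[str, Optional[dict]]
cluster: State = {"dns_monitoring_config": None}

def check_dns_monitoring_config_exists(state: State) -> bool:
    """The test half of the pair: is the monitoring config present?"""
    return state["dns_monitoring_config"] is not None

def fix_dns_monitoring_create_config(state: State) -> None:
    """The fix half of the pair.

    Written to be idempotent: it sets the config to one desired value,
    so running it twice leaves the same state as running it once.
    """
    state["dns_monitoring_config"] = {"target": "dns", "interval_s": 60}

def check_then_fix(
    state: State,
    check: Callable[[State], bool],
    fix: Callable[[State], None],
) -> str:
    """Run the check; on failure run the paired fix, then re-check."""
    if check(state):
        return "passed"
    fix(state)
    return "fixed" if check(state) else "failed"

first = check_then_fix(
    cluster, check_dns_monitoring_config_exists, fix_dns_monitoring_create_config
)
second = check_then_fix(
    cluster, check_dns_monitoring_config_exists, fix_dns_monitoring_create_config
)
print(first, second)
```

In this toy version the fix always works. The flakiness discussed on the show comes from real fixes that cannot be made fully idempotent, or from tests that fail for unrelated reasons.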

Starting point is 00:31:07 I'm like, well, but that's the thing you already did, that you're trying to test, that failed. And now you're going to do it again as part of the test. And so that flakiness that you referred to, maybe that's related to it or something. Like having like,

Starting point is 00:31:20 it almost sounded like you might have duplicate code or code paths to do the same thing, right? The initial time, and then once as part of a test fix. But the flakiness that you refer to is because you would have these flaky tests that would fail, but then you could rerun it and, oh, now it works, right? It kind of ruins the reliability of the test. Or maybe not ruins the reliability of it, but more like it encourages a behavior of, oh, it failed, just run it again, it'll probably succeed. Well, if you remember right, they had also talked about here, I think in this particular section, that the retry times or

Starting point is 00:32:07 whatever were set on intervals that were sort of long. I want to say it was like 15 minutes or something like that. Right. And so they said what would happen is things would get out of a good state in that time when it was trying to run the next retry. Because it wasn't like, hey, this thing failed, immediately try and fix it. It was, hey, it failed, it's going to check in 15 minutes again to see if everything was good. And it's going to go through and do all these tests, and then it's going to fail again. If you had a hundred things, and it's having to go through and wait 15 minutes before it can fix the next one, and the next one, it would get into this really long loop.
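For the show notes, a rough back-of-the-envelope simulation of that "really long loop." It assumes each pass repairs only the first failing step and then aborts until the next scheduled pass, and it uses the 15-minute interval as recalled on the show (that number is a recollection, not a confirmed figure). The function names are made up for the example.

```python
from typing import Iterable

def passes_until_green(broken_steps: Iterable[int], total_steps: int) -> int:
    """Count scheduled passes until one pass completes with no failures.

    Assumption: a pass walks the chain in order, repairs the first broken
    step it finds, and then aborts until the next scheduled pass.
    """
    broken = set(broken_steps)
    passes = 0
    while True:
        passes += 1
        for step in range(total_steps):
            if step in broken:
                broken.discard(step)  # the paired fix runs
                break                 # the rest of the chain waits
        else:
            return passes  # no step failed on this pass

def minutes_waited(passes: int, retry_interval_min: int) -> int:
    """Time spent waiting between passes (the first pass starts at t=0)."""
    return (passes - 1) * retry_interval_min

passes = passes_until_green(range(100), total_steps=100)
print(passes, "passes,", minutes_waited(passes, 15), "minutes of waiting")
```

Under these assumptions, a hundred broken steps means a hundred waits of 15 minutes each, which is 1,500 minutes, or about 25 hours. That is a lot of time for the state to drift between passes.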

Starting point is 00:32:47 But then they said during that time, somehow the state could go bad. Right. And we don't know the internals of whatever that might've meant, right, like what happened. But suffice it to say that if the fix had happened immediately, then it could have probably curtailed some of those problems that were happening.

Starting point is 00:33:04 But you know, it's hard to say. Yeah. I'm just more or less getting to the point, though, that because you could have these tests that would fail the first time and then later succeed because some automated fix ran, like it fixed itself. So that's good, but you as a person now don't trust the failures. You know what I'm trying to say, though, because you're going to just think in your mind, okay, well, it'll probably work. You're not going to drop everything, like, oh, the test failed, let me figure out what's going on. Instead, you're going to reinforce this behavior among your team where you're like, oh, just wait and see if it fails twice

Starting point is 00:33:56 in a row, which, to your point, could be like 30 minutes, or 15 minutes later. So yeah. And for anybody new to the show, or people that aren't super familiar with computer science-y terms, when you say idempotent, that basically means you can just do the same thing over and over and over and expect the same exact result, right?

Starting point is 00:34:16 Like, it's like adding two plus two. It's such a weird word. It is. Yeah. So, obviously, like, a bank transaction is not idempotent.
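For the show notes, a minimal sketch of the difference. A "subtract this amount" operation changes the result every time it runs, while a "make the final state exactly this" operation can be re-run safely. All of the function names and the DNS values here are hypothetical, chosen only to illustrate the definition.

```python
from typing import Dict, List

def debit(balance: int, amount: int) -> int:
    """Not idempotent: every call moves the balance further."""
    return balance - amount

def append_dns_servers(config: Dict[str, List[str]], servers: List[str]) -> Dict[str, List[str]]:
    """Not idempotent: re-running keeps adding duplicate entries."""
    return {"dns_servers": config.get("dns_servers", []) + list(servers)}

def set_dns_servers(config: Dict[str, List[str]], servers: List[str]) -> Dict[str, List[str]]:
    """Idempotent: the end state is exactly `servers`, however often it runs."""
    return {"dns_servers": list(servers)}

wanted = ["10.0.0.1", "10.0.0.2"]  # made-up addresses

once = set_dns_servers({}, wanted)
twice = set_dns_servers(once, wanted)

appended_once = append_dns_servers({}, wanted)
appended_twice = append_dns_servers(appended_once, wanted)

print(once == twice)                     # same state after a re-run
print(appended_once == appended_twice)   # state keeps growing
print(debit(debit(100, 15), 15))         # balance after two identical debits
```

The practical takeaway for automation scripts is to write them in the "set" style, so a retry or a half-finished run cannot pile up extra changes.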

Starting point is 00:34:27 So if you run code that says minus $15 and you run it again, you're down $30, and then down $45, and then who knows what's next. Right. That's not idempotent. So what they were talking about in their scripts was, let's say, set DNS, right? Like set DNS settings. The whole thought is, if they run that and it was supposed to put in a certain set of values for DNS, the next time it runs, it shouldn't add more values. It should make sure that the ending state of those DNS settings is exactly what that script wanted, right? So when they say idempotent, what they mean

Starting point is 00:35:03 is being able to run these things with the expected state being there at the end of it. Yeah. I mean, going back to your point, if you're new to it, a calculator should be idempotent, right? If you say two plus two, you should always get five no matter what, right? Totally. You should always get the same answer every time you do it. Yep. Even if it's wrong. Wait, what? I got that answer from Joe, so I'm pretty sure math and the chicken would not let me down. That is correct. All right. So the next thing that they got into was specializing. And basically what they go into here is there's three ways that automation can vary. There's competence, latency, and relevance. And competence is just, can it do it? Latency, how long it takes. And

Starting point is 00:35:52 then the relevance, I actually put the definition on this one because I was like, huh? So they said: the proportion of real-world processes covered by automation. So basically they're going after the stuff that matters, I think, is what they're trying to say here. Yeah. And go ahead. I was just going to go on to the next part. Yeah, go ahead. Basically, they would use turnup teams, and they use this word, turnup, for it. Almost, not like a... Not like the vegetable. Not the vegetable. Not a turnip. Yeah, not a turnip. Turnup.

Starting point is 00:36:28 A turnup team that would just focus on automation tasks. So teams of people in the same room that would all get together and get things done quicker. So you imagine these are a bunch of specialists who come in, and they're used to setting up clusters, and so they can kind of swoop in, get things set up, and then move on to the next thing, which sounds pretty nice. But again, it's kind of just a stepping stone in terms of evolution. They didn't

Starting point is 00:36:50 stick with it for too long, because there could be, they would say, actually over a thousand changes a day to running systems, with just a ton of stuff. And you've got to imagine all these people kind of shouting across the room like, hey, did you set this up? All right, I'm running this now. And with things happening at the same time on these computers, it gets confusing. Like, you imagine someone's restarting in the middle of your process running, or doing whatever needs to happen. Just kind of funky stuff you can imagine happening here. Was this the same section where they were talking about how, coincidentally, they noticed a pattern where they were hiring a new engineer every time there was a new cluster? That was on the previous one.

Starting point is 00:37:26 They were talking about doing the, the cluster, uh, online in the clusters. But yeah, yeah, they, where it was purely,

Starting point is 00:37:32 they, they noticed it coincidentally and then, you know, yeah. All right, whatever. I'll move on. Oh yeah.

Starting point is 00:37:38 It was there. Trust me. Read it. It was. But, uh, yeah. So,

Starting point is 00:37:43 anytime they had automation code that wasn't staying in sync with the code it was covering, they actually said that's when automation code dies. And we've talked about that before. When you've got automation code, it's kind of like this glue stuff that runs around those systems. And sometimes people add flags or change how things work a little bit. If the system that's orchestrating and automating those things isn't aware of those changes, then it gets out of sync and it starts

Starting point is 00:38:09 acting poorly, and people don't use it anymore because, yeah, it just stinks. So you've got to kind of keep on top of this stuff. Okay, so this is where this can get tricky. Now hold on, because this is where the question of whether DevOps is a title or a culture can become a debate, right? Because basically what they're saying here is that you have to have people that are passionate about these certain areas, right? Like in this case, automation. And so imagine, for example, you create a new build system, and then no one else bothers to care about it, and you walk away from it to go work on something else. And maybe you've made it super fast, right?

Starting point is 00:38:49 Like the build times are, I don't know, 30 seconds. But you now go away from it and nobody else is maintaining it. So they're adding in a lot of cruft or whatever, and bloat. And now it's 10 minutes, right? Because nobody else was bothering to care, to maintain it, right? That's the type of example that they're talking about here: you have to have people who, whatever this automation might be, are passionate about maintaining it and the care and feeding of it, who will keep

Starting point is 00:39:23 the performance updated as new changes are made. Otherwise what will happen is it becomes stale and maybe even stops working as things change over time. That could be one, or it could be, so I would imagine these specialized teams that they got together were people that were like cluster delivery people, right? Like, hey, we're going to set up a new cluster. Well, some of the things that they used in the past have changed, right? So not necessarily the automation system itself, but the software they're trying to deliver, right? Like maybe the DNS system changed, or maybe some other thing that they're putting on these changed, and they're just not aware of those changes. So the things that they'd

Starting point is 00:40:02 automated in the past no longer work the same way at all. So it could be a combination of the two, right? It could be the actual automation. It could be the underlying systems that changed that the automation was written around. And now they're kind of in a really bad situation, right? Because they're not aware of the changes. Yeah. You weren't maintaining that automation code as software would change around it. And so therefore you're using old flags for something like Git or something like that, and now those options aren't available, and so your automation starts to fall apart. Yep. Yeah,

Starting point is 00:40:39 they mentioned this created some bad side effects here, and this is one of the arguments people make when talking about DevOps not being a role. What we're talking about here is basically kind of a traditional ops type scenario. We've got one team that's in charge of running this stuff

Starting point is 00:40:53 and another team who's in charge of developing it. And then you've got these incentives where the ops team just wants to get this thing green. They want to get it running and any problems that come up, they're like, well,

Starting point is 00:41:03 that's the product team, or someone else is going to have to fix that. And the product team is developing this stuff, and they don't really have an incentive to make it easier to stand up. So they just kind of want to add features, features, features. They don't want to make it easier to run or all that sort of stuff. And who is more qualified to know what's wrong and set stuff up than the product team that was actually developing it? So they had a bad split here, and it just wasn't really working out. So that's why I say it was a stepping stone, part of the evolution, but ultimately it wasn't a good thing and they ended up getting away from it, because turnups ended up being inaccurate and taking forever

Starting point is 00:41:38 and, or one of those things: inaccurate, high latency, and incompetent. So basically the three kind of anti-patterns when it comes to automation. This section was kind of interesting. I don't know if you guys got this takeaway from it. So keep in mind the timeframe of the SRE movement within Google. I think the book called it out as the mid-2000s, right? And they were writing about this after the fact. So this is mid-2000s. This was the section where they were talking about their use of SSH, because they were using SSH to automate a lot of things, but that would require root access on these machines in order to install and make

Starting point is 00:42:24 configuration changes, which they admitted was clumsy from a security point of view. But they referred to it as, this is the quote, "however, an unrelated security mandate allowed us out of this trap." Because those high-latency, inaccurate, and incompetent turnup teams that you were just referring to, Joe, were the quote trap. So an unrelated security mandate allowed us out of this trap. And I immediately was like, Snowden. This had to be related to that,

Starting point is 00:42:59 right? Like with all the leakage of the documents. Yeah. That's what I was thinking of from a timeframe point of view. Did you guys make that connection? Because that was around the time where everything across the internet seemed to get more secure. We started caring about, is it HTTP versus HTTPS?

Starting point is 00:43:20 And everything was putting tight controls around security. I don't know. I just thought it was kind of interesting. Maybe that wasn't related. I could be wrong. And it was, like, later, Heartbleed or something like that that made them decide to change it. But just trying to relate the real-world time frames

Starting point is 00:43:39 of what was happening. The point is, there were other things happening in the world, and they had this process that they were using, and they admitted that it had high latency and it was inaccurate and had some incompetencies about it, but they were using it. And they might not have ever changed had these outside factors not influenced their thinking on it. Maybe they wouldn't be where they are today without it. Snowden was 2013. I had to look it up. I couldn't remember.

Starting point is 00:44:07 It was 2013? 2013. So they mentioned it was in response to advanced persistent threats. So who knows? Maybe this involved governments or maybe there was some sort of hacking incident that we didn't know about. But either way, basically they said they had a security mandate that said, hey, no more shelling into individual computers. And so what they ended up doing about it was pretty cool and specific. But basically, to kind of boil it down, they ended up creating this kind of, I forget what they call it, like admin manager, admin service.

Starting point is 00:44:34 We'll get into it here in a minute. Okay. But yeah, it was basically just a service that would run somewhere, and it was in charge of making the changes. And you would tell it what to do, but there's a full audit trail and all sorts of good stuff, and that way nobody had to have root on these machines. In fact, they weren't allowed to have root. So it was just better all around. But what it did is it made all those shell scripts and stuff that they had written kind of, not necessarily moot, but it was a good time to reform. They had to change,

Starting point is 00:45:03 right? Yeah. Well, when did you say this was, Snowden? 2013. I think it said June 2013. Okay, so maybe it was Heartbleed then, because I thought Heartbleed came after, and according to Wikipedia, Heartbleed was February of '12. So maybe that's what the advanced persistent security threats were that made them move it.

Starting point is 00:45:22 But I was thinking, yeah, definitely the Snowden stuff would have been more external and not internal, but it still made me think that, as an industry, people started becoming more security-minded first, you know. Yep. So, all right. Well, Joe, I don't think you can do this anymore. Yeah, I don't give up. Yeah, you failed us. So if you haven't already left us a review, we would greatly appreciate it if you would leave us a review. No one-stars, though, like Joe asked for last time. Please don't do that. But I mean, if that's how you feel, that's how you feel.

Starting point is 00:46:00 I can't tell you how to feel. So you can find some helpful links at www.codingblocks.net slash review. And, yeah, it really does put a smile on our face when we read that. If you're like, man, I'd really like to buy these guys a beer as a way to thank them for everything, hey, just leave us a review. It's cheap for you, puts a smile on our face, and works for everybody, right? Because depending on what city you live in, a beer could be expensive. It can be expensive, and it could be SweetWater 420, which just kind of isn't good. So, I mean, you know. Why did you

Starting point is 00:46:34 go there? Like, why do IPAs exist? That's right. I was being so nice. There was nothing mean about anything I said, and then all of a sudden you had to take it to this dark place. I'll take a Bud Light or Coors Light over a SweetWater, I'm just saying. Well, I ate a clock yesterday. I mean, if we're talking about things that we eat and drink. Ate a clock? Yeah, I ate a clock. It was so time consuming. All right. So a few episodes back, we were talking about this SRE book, and there's a lot of bleed-over with DevOps and whatnot. And so we asked, how do we feel about DevOps? So your choices were: love it.

Starting point is 00:47:17 It's the greatest. Or, it's okay when things work. Or, no, I'm sorry, it's great when things work. It's okay, but overrated, is the third choice. Or, I wish we had a good DevOps pipeline. Or lastly, it's a dream. Nobody really does that.

Starting point is 00:47:37 This whole book is a lie. All right. That part wasn't in there, but it should have been. It was, you know. What is this? 188. So according to Tateko's trademark rules of engagement, Jay-Z, you are first. Okay.

Starting point is 00:47:56 Well, I think that people wish they had a good DevOps pipeline. And I'm going to say 30 to 33% said that. Wait, what? You're giving me a range? Yep. That's not how this game works, sir. It's to be lazily evaluated upon reveal of the answer. He's kind of lazy a bit, Var.

Starting point is 00:48:17 Apparently. Lazy of T? There's no side effects, so you can just run this later. It's an idempotent type of thing here. Yes. All right. So I'm going to go, it's great when things work, and I'll go 30 percent. Oh, single number. Yeah, right. It's a daring move, and you both are in the same range, right? See what I did there? Yeah. So Joe says, I wish we had a good DevOps pipeline, 30 to 33 percent. And Alan says, it's great when things work, 30 percent. Ladies and gentlemen, we have a winner. Oh, not only do we have a winner of who picked the right

Starting point is 00:49:12 option, but they also did not go over. Okay, that's like a double whammy win. But I thought whammies were supposed to be a bad thing. Remember that old game? No whammies. Press Your Luck. But this one, we're going to flip the script. Whammies are a good thing here. So you got a double whammy. Alan is the winner.

Starting point is 00:49:32 Yeah. Okay. Yeah. It was 50%. It's great when things work. That's pretty high. That's pretty good. Yeah.

Starting point is 00:49:38 Yeah. Do you know the unfortunate thing about that, though? It means things are not as smooth as they should be, right? Either that or, you know, that's the glass-half-full version of it. You could also just say, well, maybe 50 percent don't have it. There's that. Although there were some that were in the love it category, so definitely there were others that had it. Cool. Oh, did we want to mention, so we recently found that you can see all the polls that we've ever done, and you can still vote on them and see the results. It's pretty interesting. I don't know if you've

Starting point is 00:50:18 got a day or two available. Yeah, I didn't create a short link for that, though. That was from the plugin. We could create a short link for it if you wanted to. Yeah, what's a good name for the short link? I'll do it right now. Is polls an option? No, it looks like that already exists, actually. Can we repurpose that one? Yeah, polls is the one that I set up, so I'll change that one. Yeah, so it's going to be codingblocks.net slash polls. Yeah, let's just do that.

Starting point is 00:50:52 On the fly, you know, code review and edit and whatever. Yeah. Yeah. So how about for this episode survey, we ask, for your day job, are you primarily working dot, dot, working... I had a typo here. Michael, your grammar. I said in cloud, but I meant in the cloud. That's where my head's at.

Starting point is 00:51:14 And you can tell that that's where my head was at when I wrote that answer when I just said in cloud. Or on-prem, we like to think we control our servers. Or a hybrid, we can't make up our minds. Or local desktop application, keeping it old school. Or it's all about mobile. I would have totally forgotten about mobile, which is crazy because it's probably one of the biggest ones out there now. You know why? Okay, so total tangent alert.

Starting point is 00:51:47 All right, so one thing that's been on my mind lately: I want just stupid, kind of dumb, brainless games to play on the phone, right? But what totally bothers me, I can't stand some of the games with just the inundation of ads, right? I can't play them. I know. Some games are just awful, like every single time you restart a level. Let's take something as silly as a Minesweeper or Solitaire, where you know you're going to redo it over and over and over, right? And after each time that you play the game, you're going to have to watch some ad or something. And in some games that can be really annoying, because based on the type of game that it is, there might be a high restart frequency.

Starting point is 00:52:40 And so now, yeah, I mean, the developer, they're probably making buckets of money from it, which is why they do it. But I've kind of had this urge to just create ad-free games, like open source, ad-free games, open source so that you can inspect the code and see, hey, there's nothing tracking in here. Because that's the other thing that bothers me: nowadays I don't know what you can and can't trust. How do I know that that game isn't accessing some library that it shouldn't be, and tracking something that it shouldn't be, or whatever? And did you guys see, I guess maybe we've picked on it too much, but related to TikTok, this week there was an article that I read on Engadget, I'll put it in the show notes, where I think it was the FCC, or at least a member of the FCC, I think it

Starting point is 00:53:38 was a more accurate way to describe that, was strongly urging Apple and Google to take TikTok off of their app stores because of various security concerns that they had for American citizens. So yeah, that's why mobile has been on my mind. It's not necessarily related to TikTok, but just the desire to create some kind of stupid game that I don't care about. I mean, for what it's worth, those games, I would gladly pay $2 just to get rid of the ads. But then you don't know that they're not tracking you.

Starting point is 00:54:20 Yeah, you don't know that. That's true, too. But I cannot stand the games that do exactly what you're saying, where it's constant. I uninstall them almost immediately if it turns out to be one of those. There's a very small part of me that's like, I kind of want to just create my own. Because, I don't know about you, anything that I'm going to play on the phone is just because I have 30 seconds to kill and my ADD can't, like, I must have something to do.

Starting point is 00:55:00 Right. You know what I'm saying? So, uh, man, Battle Royale. Yeah. Whatever it is, you know,

Starting point is 00:55:07 I don't care. That's fine. Yeah. This and other crazy things are the types of things that go through my mind. And, like, you know, what I'm going to eat,

Starting point is 00:55:17 that's another one I'm always like focused on. Like, you know, I was going to go on an all almond diet. And then I thought that's just nuts. You know, aren't almonds not nuts? Don't try to take away from my joke, sir. Let's see here.

Starting point is 00:55:34 I just heard this the other day. "Despite their common label..." Nope. Almonds are not true nuts because they're not a type of dry fruit, but are rather seeds enclosed in a hard fruit covering. They're fruit. Well, they're seeds. They're classified as drupes. Would that make peanuts also the same?

Starting point is 00:55:55 Because they have a hard covering. I don't know. Let's see. Mmm. Mmm. The fruits of cashew, almond, and pistachio plants are not true nuts, but are rather classified as drupes, and peanuts are legumes. So they are nuts.

Starting point is 00:56:11 They're edible seeds. So similar, but they grow in pods. So they're legumes. What about pecans? Aren't pecans similar to almonds? Oh wait, was that the one you just named off Alan?

Starting point is 00:56:21 Uh, the cashew, almond, and pistachio plants. Oh, pecans are nuts. Wow. Okay. Well, this and more things, this is what you can learn from listening to Coding Blocks. So, like I said, if you haven't already subscribed, there's some helpful links there. Maybe a friend is giving you a link or something, like, you should listen to what this crazy guy said about almonds. That's just insane. How did he not know that? That's nuts. Yeah, that's nuts. Or it's not, apparently. We're learning. So yeah, but let's get back into Google and service-oriented cluster turnup, because that's where this ended up going,

Starting point is 00:57:07 Oh, sorry. But yeah, this is kind of what we already alluded to before, though,

Starting point is 00:57:14 which was that because of their use of shell scripts and everything, and what they were doing with SSH, and how they were maybe abusing it and what you can get away with, right? They ended up deciding that as a part of this security threat, they needed to move to a different architecture, and that architecture ended up becoming a service-oriented architecture where they could have one kind of control server that could run those tasks from there in an RPC kind of fashion.
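For the show notes, a minimal sketch of that idea: instead of people holding root and shelling in, callers ask a service to perform a named action, and the service checks permissions and records every request. The `AdminService` class, its fields, and the action names are all hypothetical. This is not Google's implementation, and the "RPC" here is just a method call standing in for a real network call.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Set

@dataclass
class AdminService:
    """Hypothetical stand-in for a service that performs admin actions."""
    actions: Dict[str, Callable[[str], str]]  # action name -> handler(target)
    acl: Dict[str, Set[str]]                  # user -> action names they may call
    audit_log: List[str] = field(default_factory=list)

    def call(self, user: str, action: str, target: str) -> str:
        """Perform `action` on `target` for `user` if the ACL allows it.

        Every request is written to the audit log, allowed or not.
        """
        allowed = action in self.actions and action in self.acl.get(user, set())
        self.audit_log.append(
            f"user={user} action={action} target={target} allowed={allowed}"
        )
        if not allowed:
            raise PermissionError(f"{user} may not run {action}")
        return self.actions[action](target)

service = AdminService(
    actions={"restart_dns": lambda host: f"restarted dns on {host}"},
    acl={"alice": {"restart_dns"}},
)

result = service.call("alice", "restart_dns", "host-1")

try:
    service.call("bob", "restart_dns", "host-1")
    bob_denied = False
except PermissionError:
    bob_denied = True

print(result)
print(service.audit_log)
```

Nobody in this sketch has a shell on `host-1`. The only way to change it is through a named, permission-checked, logged action, which is the property the security mandate was after.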

Starting point is 00:57:49 Yep. And this is where this chapter started clicking for me. There were things that we talked about before, and it was good, but then I started to kind of understand. And what I mean by that is that we're talking about the product team writing a service, a series of services, so service-oriented, whatever the A stands for. Architecture? Architecture. That would be in charge of creating these other services and making sure they were stood up right, and cycling them, doing

Starting point is 00:58:18 failovers, doing whatever they need to do for the services. So keep that in mind. I'm going to describe this flow real quick, and then I'll tie up the point here. So the flow went from operators triggering manual actions with no automation, to operators writing system-specific automation, to externally maintained generic automation, moving to internally maintained system-specific automation, and finally ending up at autonomous systems that need no human intervention. Now, the reason I want to call that out, and why I blasted through it, is because of the word operator. Now, if you're familiar with Kubernetes, there is this concept of

Starting point is 00:58:58 an operator, which conceptually acts almost like a person on your team that's in charge of keeping your services and pods and all your various Kubernetes resources in shape. And so when you need a change to your services, you ask the operator to do it by changing the definition of the operator's resources. And so when I was reading about this stuff, it was like, oh, this is where operators came from. This is where the people who write the service, like Postgres or something, obviously in this chapter it's going to be internal Google tools, but they're responsible for publishing this API that is in charge of running things and making changes, and it hides all the various details of that stuff. And I was like, oh, that's exactly what a Kubernetes operator is, essentially. And in this chapter, we're kind of

Starting point is 00:59:49 talking about the evolution of borg going from you know step one is just kind of people like shelling into individual machines until you get to the other opposite end of things where you're talking about kind of like a kubernetes type borg type thing we've got this like massive kind of global computer that's doing all these things and self-healing and keeping things running and so i can see how this is like suddenly like about the big uh important piece of the automation story but also an important uh theoretical concept in that allows kubernetes to be what it is i mean if we hadn't called this out before but just to back up for a moment because this was what we had this flow that you just described was previously

Starting point is 01:00:32 called out in the previous episode, where we had referred to it as the maturity model in the show notes, but they had this, like, hierarchy of classes of automation. And so, just to give some examples right here: no automation was, again, where the database master is failed over manually between locations. The second one, externally maintained system-specific automation, is where you might have a failover script in your home directory, for example. And then there's the externally maintained generic automation, where now you've added that script to some repository where everybody can use it, and that script has support for a generic failover where, like, I could

Starting point is 01:01:20 specify the database name and host name as parameters maybe. And then the fourth one, the internally maintained system-specific automation was the example where the database itself, in this case, we're talking about MySQL in their example here, the database itself might ship with a failover script. And then the fifth example was that we refer to here as autonomous systems that need no human. We had previously called that out as systems that don't need any automation, and that was because the database itself noticed the problem and automatically failed over without human intervention.
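
For the show notes, here is a rough sketch of the five levels just listed, plus what a "generic" failover that takes the database name and host as parameters might look like. The names `AutomationLevel`, `failover_plan`, and `needs_human` are made up for illustration and are not from the book, and the failover steps are placeholders rather than a real MySQL procedure.

```python
from enum import IntEnum


class AutomationLevel(IntEnum):
    """The five steps described above, ordered from least to most automated."""
    NO_AUTOMATION = 1             # the database master is failed over by hand
    EXTERNAL_SYSTEM_SPECIFIC = 2  # a failover script in someone's home directory
    EXTERNAL_GENERIC = 3          # a shared script, parameterized by database and host
    INTERNAL_SYSTEM_SPECIFIC = 4  # the database ships with its own failover script
    AUTONOMOUS = 5                # the database notices the problem and fails over itself


def failover_plan(database: str, new_primary_host: str) -> list[str]:
    """Hypothetical level-3 failover: same steps for any database, only the parameters change."""
    return [
        f"stop writes to {database}",
        f"promote replica on {new_primary_host}",
        f"repoint clients of {database} to {new_primary_host}",
    ]


def needs_human(level: AutomationLevel) -> bool:
    """Everything below the autonomous level still needs a person to kick it off."""
    return level < AutomationLevel.AUTONOMOUS
```

Calling `failover_plan("orders", "db2")` returns the three steps with those two values filled in, which is the whole point of the "generic" level: nothing in the script is tied to one system.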

Starting point is 01:01:55 Yep, and think about prod tests. We just talked about prod tests and talked about, you know, it was a series of unit tests and they could fail and they could try to fix themselves. We said, ultimately, that was kind of not so great because it was kind of flaky and there was weird stuff going on. And we moved to a centralized system now where we had these kind of admin servers that were responsible for kind of making sure things were right.

Starting point is 01:02:13 And it's a big conceptual change to go from like agent-based things that are fixing themselves to these operators, which can fix other computers, but also orchestrate changes across computers. So it may not just be computers. Now it's not just these virtual machines or whatever they are containers, but now it's requisitioning or I forget the word I'm looking for,

Starting point is 01:02:36 but basically standing up load balancers and setting up a cloud DNS or, you know, whatever, like actually doing provisioning. Yeah. Provisioning multiple servers. So it's not so much about the individual computers in my cluster as it is uh kind of everything it's like these operators

Starting point is 01:02:50 provide an interface to be used and can use other interfaces to do other things and affect the system in other ways that are outside just its individual components. Hey, I don't know if we called it out. So we talked about these admin servers and things and how they replaced shell scripts. But did we mention that basically the group that was in charge of their service would set up a remote procedure call that could be called, because they knew when their service should be in a good state and the kinds of things it needed to do? So each team would sort of manage their own service, and then these admin servers would call those RPC methods whenever it knew that it needed to do the next thing.

Starting point is 01:03:48 So by getting away from the shell scripts, now you got rid of the root access. Plus, now you've also got something that is called in a standard way and has an audit trail. They can put ACLs around it. What is that? Something control access control? Something control list. Access control list. Access control list. So basically they could make sure that only users or systems that have the right privileges could call these things.
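
A minimal sketch of that idea, assuming a toy in-process "admin server" rather than a real RPC framework: every call is checked against an access control list and recorded in an audit trail, which is what you don't get from a root shell script. `AdminServer`, its fields, and the method and caller names in the usage note are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class AdminServer:
    # Method name -> set of callers allowed to invoke it (the ACL).
    acl: dict[str, set[str]]
    # Every attempt, allowed or denied, is appended here (the audit trail).
    audit_log: list[tuple[str, str, bool]] = field(default_factory=list)

    def call(self, caller: str, method: str) -> str:
        allowed = caller in self.acl.get(method, set())
        self.audit_log.append((caller, method, allowed))
        if not allowed:
            raise PermissionError(f"{caller} may not call {method}")
        # A real system would dispatch to the service team's RPC handler here.
        return f"{method} executed for {caller}"
```

For example, with `AdminServer(acl={"drain_traffic": {"oncall-bot"}})`, a call from `oncall-bot` goes through, a call from anyone else raises `PermissionError`, and both attempts end up in `audit_log`.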

Starting point is 01:04:13 So that was the SOA thing that Jay-Z and Outlaw mentioned a minute ago. And they basically turned it away from shell scripts into regular software is more or less what they ended up doing right do you ever think that like so so i guess it's long been you know kind of i don't know rumored or at least suggested that like if you hear something that google is doing that or google did you know like if you're reading about in a book then that is something that they've already moved away from and that was something they were doing like 10 years ago. Right.

Starting point is 01:04:46 Like you guys have heard that kind of thing too. So like, as it relates to this SRE thing, they probably have something greater that they're doing now. And like, you know, this was a practice that they did 10 years ago and they're sharing it to

Starting point is 01:04:56 the world, you know, and they might have evolved onto something much better. Now, if you, if you take that at face value and think like, okay, that's how Google operates, right?

Starting point is 01:05:05 They don't give the secret sauce away until they no longer need it and they found something better. Do you think, how amazing must it be then that they're using something better than Kubernetes now, if they are, internally? That's kind of mind-boggling, right? Yeah, I mean, I'm pretty much in love with Kubernetes at this point. Yeah. Right. Yeah.

Starting point is 01:05:30 Same. But you know, I think it's interesting. I think what you're talking about is like the software they use, but I think like this book, a lot of it is the patterns of how they ended up getting to a point to where things didn't suck. Kind of. Right. I mean, that's's that's sort of

Starting point is 01:05:45 like this is why we don't put alan in charge of the episode titles things that don't suck right episode 189 i think it's real but but i think that's the notion and and the steps the pains they went through to get to a point to where things actually worked the way they wanted them to right like they're little specialized teams they They admitted that failed. Like it just, it didn't work well as time went on. And so, so yeah, definitely they're, they're probably using something like, you know, that's gone past Kubernetes or something now, but at least, at least getting to the point to where they felt like they were being successful and not and not chasing their tails on issues it seems like they've been open and honest about that kind of stuff you know

Starting point is 01:06:30 the one thing though that i thought was like kind of going back to part this part of the chapter from the previous episode that we talked about where the i think it was a previous episode where where they were talking about like they would uh you know build their own system so that they could like write APIs around it or anything like that. And I was thinking about it from a maturity kind of point of view from the company, right? Like, you know, that takes a rather large company where like everyone in the company has the same kind of drive, but also similar skill sets across the company, you know, across that large group of

Starting point is 01:07:06 individuals. Right. And by that, what I mean is it's not like, you know, a large company where some portion of the crew might be like office workers managing money. And then you have like line workers that are, you know, and so there's like a large, you know, skills gap there between those people. Because, you know, if you you think about at Google, if you just assume that a large, it's a software company, so a large portion of the company you would assume is, or the majority of the company is software developers. To have the time and focus to say, oh, I don't like this tool that's freely available, because while it does work and

Starting point is 01:07:44 solves the job, I can't write an API around it easily. So I'm going to write my own is, is like that. That's a, that's definitely a maturity level kind of thing. And like one of the things that super hit home with me this week was that we have some software that we use to maintain schedules for like, you know, who's primary and secondary on calls, right? And it is, the software that we're using is not as, you know, we talked about the Grafana on call last time. And I can't speak to like how easy or not that is. But what we're using, I'm not going to throw it under the bus by name, but it is also just a train smash of awfulness.

Starting point is 01:08:28 Like it is so unnecessarily, the interface is just ridiculously, you know, confusing and whatnot. But then like, you know, not having this API around it, you know, that we could interact with made me appreciate Google's point that they made in that, I think it was this chapter, right? Or, you know, earlier in the chapter, or maybe it was last one where they were saying like, we would favor just writing our own version of that thing so that we can control it. Right. And then, and then now you're like, well, that's silly. And you're on call example, you want an API and like, yeah, because what if like, I'm the on-call person and I schedule some time off. Now I have this calendaring system that's completely unrelated to this on-call system, but one could talk to the other and like, you know,

Starting point is 01:09:19 oh, he, he scheduled this time off. Well, then I need to rearrange the on-call calendar to substitute somebody else in automatically who's next in the rotation, right? So, yeah, it made me totally appreciate their take on, like, we'll just write it ourselves. Well, I think it's a combination of maturity and just resources, right? Like, I mean, you could have a killer development team that's

Starting point is 01:09:46 mature and can do the stuff, but if you, I mean, if you don't have the money or the number of developers to be like, oh yeah, we're just gonna write our own scheduling app. That's what I meant by, like, the number, when I was using the example of like a line worker and then the office worker, like that skills gap. Because like I assume, and this is, you know, probably not a fair assumption, but, uh, you know, because of Google being the type of company they are, I assume that like the largest portion of their workforce is probably like, you know, 85% software developers in some kind of way, whether they're classified as SREs or whatever, you know, um, versus, you know, you might take a company like, I don't know, Ford or, or, or, you know, uh, yeah, where, where it's, there's a large group, you know, like the, the, the guys that are designing the cars are probably a small fraction.

Starting point is 01:10:41 Like, you know, they might be 10% of Ford or less, right? Whereas the people who were actually like putting together the thing that you designed might make up a larger portion of the company. Yeah. Yeah. I don't know. That wasn't a tangent. So I'm not going to do a tangent alert for that one.

Starting point is 01:10:58 I think that one was related. So, uh, the next section was the one that like, you know, really sealed the deal. I was like, okay. And I actually went back and re-listened to the chapter after reading this one because it kind of put everything else in perspective to me. So it's kind of like, uh, I like when things begin with

Starting point is 01:11:13 like tell me where you're going and then let's go back so uh this this was my favorite one and the deal was uh and there's some carryover too but uh this is about the kind of the birth of borg he said in the early days google's clusters were racks of machines with specific purposes. And this is where they talked about having developers that would start every three months or so. And that's about how long it would take to stand up a cluster. And so a new employee would come and they'd be like, hey, tell you what, you're in charge of this turn up. It'll help you learn the ropes and stuff. And then you'll be able to help, you know, new other people when they come on.

Starting point is 01:11:48 And so, you know, these people would start up. They'd have a bunch of readmes. They'd have scripts. They'd have things checked into repositories or kind of be shared around. And we're talking about back in the days when, like, devs would, like, log into these machines and have things like golden binaries. Like, this is the version that we're installing it was uh delivered to us last month and this is the version that we're gonna be installing for the next three months until the next version rolls out whatever like that kind

Starting point is 01:12:13 of thing so this is like super early days but as google grew the number of clusters and machines started getting out of hand so the scripts had to get better just kind of by the definition you know they couldn't like hire somebody new every time they needed a new cluster and to run that stuff it just doesn't really scale very well so uh these things had to get better and this is all the stuff we've already kind of talked about um i also mentioned like shelling in machines to look at logs and doing regex parses parsing that's not something you could do when you have a million computers you know it just doesn't make sense and that's where google's heading you know the google we know now like who knows how many actual servers they operate right

Starting point is 01:12:52 but i'm pretty sure they're they have more uh servers than employees for sure well i think this was the chapter two where they originally talked about like it was in a single building or something like that was that where they were talking about like the clusters were deployed in a single building and then like as it grew in scale to where it was where you know data centers were around the world or whatever like that's when it became they did more of an issue they did because they had even mentioned that originally they named their machines a particular way and then they were like oh wait a second yeah we have too many now we need patterns yeah, you could assume like data center and domain names and whatnot. Yeah, so that's where they were like, yeah, things just got out of control, right?

Starting point is 01:13:33 Like they got too big. And then this is where Jay-Z was going with all this. Yeah, I remember back in the day going to meetups. So you'd be able to meet someone new. I'm like, oh, you're at a WebVille bar? Well, what's your naming convention for your servers? You Greek gods or Roman gods? Oh, I was gonna say, I remember one of them being like, uh, the seven dwarfs from Snow White. Remember that? Yeah, yeah. Or Transformers. Like, I remember all sorts of like creative names, like people would name

Starting point is 01:13:58 servers and they were like pets right that's very different from kind of how we talk about and think about these things but this is the world they were coming from and and this is the this is like in particular is the the point in the notes that you can't see that uh kind of flipped the switch in my brain where they said automation eventually evolved to storing the state of machines in a proper database with sophisticated monitoring tools and this is something where i was like duh why like why have i not thought about this problem and like until then like i i've always thought about things like even cloud resources and stuff is like things in an environment that i would go out and i would have my shell script go check and go look for these things and apply the actions

Starting point is 01:14:34 whatever. I would shell into these machines, I've got the, you know, bookmarks in my browser. But things got so big that Google started storing this information in a database. And to me, like, that seems kind of like such an obvious evolution. I just never got to that step. I never thought about having a database for our servers, and you can go and see, uh, you know, what the version of the operating system is, what's the status, what's the last time we heard from it, how long has it been running, what's the uptime, what rack is it in, what's the location. Like, that's all information that I just, you know, like my kind of small-time thinking, like, you went out and got when you needed it. Uh, but when you flip the switch and say, no, like, let's keep this stuff in

Starting point is 01:15:14 the database, and we'll keep that database up to date with various agents or different polling or whatever. But, uh, once you start getting data into a database, then my programmer brain's like, oh, I know what to do with data, right? Like I'm a programmer. I know what to do with software and getting stuff out of databases. And I know how to make things go affect the real world based on changes in the database. So you could say like, you know, have a little web app where I can go and say, restart these 10 servers. I do a couple little checkboxes, I hit apply, right? And I know I've got some process out there that's going to be watching this database and say, oh, okay, I need to go restart these servers. Once you've turned your servers, your infrastructure, all this stuff into data,

Starting point is 01:15:53 suddenly you've turned this from like a hardware operations problem into a software problem. And like programmers are good with that stuff. So I can imagine like this is a big leap in like Google's kind of productivity and scalability. Well, I mean, also security to like, think about it, going back to that Heartbleed example, right? Like if you have thousands upon thousands of servers, and you don't have all this data centralized, and then an issue like Heartbleed comes out, you're like, okay, how many servers are impacted by this vulnerability? How many have I already fixed and how many I have left, right? Like you're going to go run a script that's going to run some SSH command on every one of those thousands of servers. No, that'd take forever,

Starting point is 01:16:35 right? Versus if you had it all centralized, then you can just, you know, it's a simple, you know, SQL query, right? So to put what Jay-Z just said into the words that they had in the chapter, that also was kind of like a turning point in my brain about how Google handled this stuff. It's like, he just said, they turn it into data. They stopped looking at hardware as hardware and they looked at it as just resources, right? And basically what Jay-Z was saying is they catalog those resources, right? And so once you get to that point, there's so much that you can do with it. And they started doing more with it. Can you imagine it totally separates the teams? So you're like, okay, hey, we need some hardware

Starting point is 01:17:15 HVAC power people, and we'll tell you how many buildings to go build and plug this stuff in. And when you're done, plug it all in and maybe you'll have some sort of process that will kind of uh investigate that stuff and do it into a database and then you're done you move on to the next building and then the software kind of takes over and can say like hey these machines are for this for that you get into like software defined networking and all sorts of cool stuff that is kind of evolved in like the last you know 20 years or so uh for dealing with that stuff but like all of that comes from having like a centralized database with all your resources in it which i it's just so

Starting point is 01:17:50 dumb that i never thought about like having having that in any places i've worked and i'm talking you know like i'm not talking about nowadays where like you've got like a a lot of times you'll have a cloud provider and if you want to know all your vms like you go to the vm screen and look at them there um you know so obviously there's some sort of database they're not going out and looking for all your stuff at that point so like and you know i obviously i knew that was there but i'm talking about like back in like 90s and aughts and whatever where i'd be working in places i never thought to have like my machines in a database like i i would have a list of them sometimes if i needed to do um there are various tools for like kind of showing the multiple computers at a time and doing things and i

Starting point is 01:18:29 would just have a script that had like the list of all the servers and if somebody had a new server you'd have to update your list it's just so dumb i was dealing with databases and doing all this stuff i never thought to put it in there it's like physician heal myself that's why i'm not good if i if i just made that one little leap you don't be zacking things right now that's right okay let me let me sack that for you let me sack into that machine that's amazing uh would that make us zackers then i guess absolutely i mean yeah it's like way better google right i guess we'll have to change rename the you know coding zacks that's right there we go so so what he just said right is is exactly what led into

Starting point is 01:19:22 the things that they've been able to do over time, which was now they know of all the resources in their entire infrastructure, right? So they could start doing other things and allocate those resources differently, right? So we all remember the days where you had one computer and it did one thing, right? It had a database on it or it had your application on it. And they started thinking about, Hey, well, we can kind of sort of separate out these resources, right? So we have CPU resources, we have Ram resources, we have all this stuff. And then they made it to where they could start running different types of tasks on the same machines because it was just a pool of resources,

Starting point is 01:20:01 right? And that's, I think that's what they kind of named this particular section was the birth of the warehouse scale computer right so you don't think of it as as 5 000 machines in this in this data warehouse you think of it as these are the compute resources we have available and that's how they started treating it yeah it's uh you know i i mentioned that kind of flip there too like there was a part where they mentioned having file descriptors or computer descriptors on computers so you could shell into a machine and look at its info.txt and see what the machine's used for and what it's good for. And this flips the script and says, no, the authority, the program of record, is actually the database. And if the machines don't match the database, it's the machines that are wrong, not the other way around. So something needs to go fix those machines or decommission or just wipe them and start all over again those are the

Starting point is 01:20:48 things that are in trouble. So it's just a really cool kind of, uh, you know, switch there from those servers being pets to the servers being treated like kind of cattle, or being treated like these kind of, like, um, flexible, kind of reusable, um, components. You took it to a dark place there. Yeah, that pet cattle comment did not go unnoticed, sir. Yeah, that was like, I think we've gotten away from that analogy because of that, because it is kind of like a weird, like, but, uh, yeah, we don't say that so much anymore. Yikes. But, um, I forgot what I was gonna say there. Oh, sorry. Yeah, you ruined what was going to be brilliant. It was going to be the next great

Starting point is 01:21:28 Zachification of the world. So what this ended up doing, right, when they started treating all this stuff as resources, now they could scale things without having people do stuff, right? It was all automated by their software, by their controllers.
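
To make the machines-in-a-database idea from a few minutes back concrete, here is a small sketch using Python's built-in `sqlite3` module. The table layout, hostnames, and version strings are invented for illustration; the point is only that "which servers are still exposed?" becomes one query instead of an SSH loop over thousands of hosts.

```python
import sqlite3

# In-memory inventory; a real one would be fed by agents or polling.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE machines ("
    "hostname TEXT PRIMARY KEY, rack TEXT, openssl_version TEXT, patched INTEGER)"
)
conn.executemany(
    "INSERT INTO machines VALUES (?, ?, ?, ?)",
    [
        ("web-1", "r1", "1.0.1f", 1),  # vulnerable version, already patched
        ("web-2", "r1", "1.0.1f", 0),  # vulnerable version, still unpatched
        ("db-1", "r2", "1.0.1g", 0),   # not running a vulnerable version
    ],
)


def unpatched(conn: sqlite3.Connection, bad_versions: list[str]) -> list[str]:
    """Hostnames running one of bad_versions that have not been fixed yet."""
    placeholders = ",".join("?" for _ in bad_versions)
    rows = conn.execute(
        "SELECT hostname FROM machines "
        f"WHERE patched = 0 AND openssl_version IN ({placeholders}) "
        "ORDER BY hostname",
        bad_versions,
    )
    return [row[0] for row in rows]
```

With the sample rows above, `unpatched(conn, ["1.0.1f"])` returns just `web-2`, which answers the "how many do I have left" question directly.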

Starting point is 01:21:44 And this this nowadays, you don't even realize it. You don't notice it, but there's, there's tons of machines that go up and down all day and nobody cares. Nobody knows because it's all being managed behind the scenes, right? If something dies, whatever, it gets picked up on another node, another machine, another cluster, whatever. It just keeps running. I remember what I was going to say now. So this really became super important as we were moving more into virtualization. So we stopped even talking about computers at some point and talking about servers. We started talking about virtual servers, and now we're talking about containers.
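
Here is a toy version of "if something dies, it gets picked up on another node". This is a sketch under simple assumptions (tasks are just names, and orphaned work goes to the least-loaded healthy node), not how Borg or Kubernetes actually decide placement; `reschedule` and its arguments are hypothetical.

```python
def reschedule(assignments: dict[str, str], healthy_nodes: list[str]) -> dict[str, str]:
    """Return new task -> node assignments, moving tasks off nodes that are no longer healthy."""
    if not healthy_nodes:
        raise RuntimeError("no healthy nodes left")
    load = {node: 0 for node in healthy_nodes}
    result: dict[str, str] = {}
    # Tasks whose node is still healthy stay where they are.
    for task, node in assignments.items():
        if node in load:
            result[task] = node
            load[node] += 1
    # Orphaned tasks go to the least-loaded healthy node (ties broken by name).
    for task, node in assignments.items():
        if node not in load:
            target = min(load, key=lambda n: (load[n], n))
            result[task] = target
            load[target] += 1
    return result
```

Nobody has to be paged for this: a controller can run something like it every time the set of healthy nodes changes.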

Starting point is 01:22:18 So you might have one computer, one node running, who knows, hundreds of pods, hundreds of containers, um, maybe thousands, I don't know. But, um, like, once you start kind of looking at these things globally, you stop thinking even about the computers. You just think about the resources in the units that are comfortable to you, and, uh, that, you know, again led into kind of the birth of Kubernetes. But what's also super important here is that this starts looking a lot like scheduling processes on, like, a CPU, or scheduling resources like memory and, uh, disk space allocation. These are all things that programmers have been doing since programming was invented, you know, allocating space, um, scheduling

Starting point is 01:22:56 processes. So again, it's kind of tying in this analogy of, like, Borg slash Kubernetes is really kind of like a distributed computer. And once you've kind of managed to shape your problem in such a way that it lines up with that metaphor, then you can start using the techniques of things like hub-and-spoke architecture and like all the things that people have been studying now for 50, 60, however long, you can suddenly apply these things to your kind of your hardware. I think it was a novel idea at the time. I think we've established that all the good ideas came out in the 70s and we're only just now beginning to understand and implement them. It took us 50 years to be able to shape our problem into something that fit with those.
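
The "it starts looking like scheduling" point can be sketched as a first-fit placement of tasks onto a pool of CPU and RAM. First-fit is a textbook heuristic swapped in here for illustration; the transcript does not say which algorithm Borg uses, and `first_fit` and the resource keys are assumptions.

```python
def first_fit(
    tasks: dict[str, dict[str, int]], nodes: dict[str, dict[str, int]]
) -> dict[str, "str | None"]:
    """Place each task on the first node with enough free CPU and RAM, or None if nothing fits."""
    # Copy capacities so the caller's node definitions are not modified.
    free = {name: dict(capacity) for name, capacity in nodes.items()}
    placement: dict[str, "str | None"] = {}
    for task, need in tasks.items():
        placement[task] = None
        for name, avail in free.items():
            if avail["cpu"] >= need["cpu"] and avail["ram_gb"] >= need["ram_gb"]:
                avail["cpu"] -= need["cpu"]
                avail["ram_gb"] -= need["ram_gb"]
                placement[task] = name
                break
    return placement
```

The machines never show up as individuals here, only as amounts of CPU and memory, which is the warehouse-scale computer way of looking at it.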

Starting point is 01:23:38 But yeah. Yeah. I mean, I was going to add on to what you're saying about like the, you know, how you would rethink about these problems. Like, I think that once you do get into this you know distributed computing kind of like borg kubernetes model that you no longer think of it in terms of the hardware at all like who cares about the memory ram cpu you know disk and stuff like that and instead you're just like is the service up and available or not and if it's not just restart it like it's so you know or or

Starting point is 01:24:02 let it scale itself, you know, if it's falling behind due to latencies or whatnot. Yep, and think about, like, auto scaling. Like, all this stuff can't work, like the rise of cloud computing, like none of that can happen if the systems weren't self-healing and you can't have like a team of people working on like swapping out hard drives or swapping out computers or putting new racks in and stuff like that, totally separate from what those computers are being used for. If you think about Google, Amazon, AWS, Azure for Microsoft, that whole thing is basically them selling you resources.
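
As a rough illustration of "let it scale itself if it's falling behind", here is a proportional scaling rule of the same shape as the one documented for the Kubernetes Horizontal Pod Autoscaler: desired replicas equals current replicas times observed metric over target metric, rounded up. Using latency as the metric, the `max_replicas` cap, and the function name are assumptions for the sketch.

```python
import math


def desired_replicas(
    current_replicas: int,
    observed_latency_ms: float,
    target_latency_ms: float,
    max_replicas: int = 50,
) -> int:
    """Scale the replica count in proportion to how far latency is from its target."""
    wanted = math.ceil(current_replicas * observed_latency_ms / target_latency_ms)
    # Never drop to zero, and never exceed the configured ceiling.
    return max(1, min(max_replicas, wanted))
```

So four replicas running at 300 ms against a 200 ms target would be asked to grow to six, and the same four at 100 ms would shrink to two.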

Starting point is 01:24:37 And it's totally divorced from the idea of the computer. So it's kind of like what we're reading about here. In a way, it's like the birth of cloud computing. But it's also like here's another another analogy of way to think about it like right now in you know 2022 it's a big deal to think about like cloud computing and kubernetes and stuff like that right but we're talking about this is like it's almost more infrastructure-y as a service kind of stuff that we're talking about right like you're providing the you know so i'm paying you to provide somebody to manage that there is a computer that can host all these pods and all these containers and if there's a drive

Starting point is 01:25:13 you know a disk drive that needs to be uh swapped out like you're going to replace that disk or you know if a new ethernet cable needs to be ran or whatever like you're gonna you're gonna do all that for me right i mean what about like in a hundred years maybe we just treat this as like how we consider the electric company today right we don't really think about electricity in there in the terms of like just how amazing it is that we even have this thing right it's just no it's just part of that we take it for granted well you know it's funny funny, though. I think you're onto something with that is right now we use it as infrastructure as a service, right? Very much so, like what you were talking about. AWS, GCP, Azure, all of them are pushing towards software as a service, which is using their own infrastructure, right?

Starting point is 01:26:04 Like, basically, you'll just be able to use stuff that you don't have to think about, right? Like, like, I mean, I know Azure has its machine learning, and AI type stuff out there that you can just, you can just use, right? Like, you can put it in your own software, at some point, it won't even be, hey, you can use this software, it'll be like, hey, just use the service, and you're done, right? And I think a lot of them are pushing that way so that you're not thinking about the ram you're not thinking about the cpus you're not thinking about any of that you just use the thing and you're done um i think that's the push for everything right now yeah i'm just i'm

Starting point is 01:26:38 just thinking like you know after our lifetime well after our lifetimes like this won't even be some like it'll be, it'll just be such a building block that you'll just assume is there. You'll completely take for granted. You won't, you know, like the problems will be so much more grand. Yeah.

Starting point is 01:26:56 At that point. So the next section is actually kind of interesting here. So they say reliability is the fundamental feature. And when they say the fundamental feature, they're talking about automation, right? So this gets into something that's a little bit tricky. They said the internal operations that automation relies on, they need to be exposed to the people as well.

Starting point is 01:27:19 and more complicated, the ability for just regular people to reason about what was going on, it starts deteriorating over time. Because think about it, right? Like, if your systems run basically hands-free for a month solid and then something goes wrong, you haven't had to think about that thing for a month. So now if something goes wrong, where do you go? Where do you start? Like, where do you get into there and figure out what's going on? So they basically said that one of the biggest issues that you run into is if you've automated these things, but you don't expose what that

Starting point is 01:27:59 automation is doing and what the internal states of the systems are, then people are going to have a really hard time getting in there when something does fail. Yeah. We talked a little bit about that last episode. Like if your phone breaks, you don't have the tool, like you literally cannot go in and fix it. You can replace components. Sometimes, sometimes you just have to replace the whole phone because you're so far divorced from like the actual underpinnings that even if you knew what to do, you couldn't do it. Well, that's not true anymore because of the right to repair laws. Yeah, have you not seen this, where like now Apple will ship you a, uh, set of tools and instructions on how to do the repair yourself? Oh, no. You really haven't? Okay, I'm gonna send you this link. That's got to

Starting point is 01:28:38 be right to repair laws kind of coming into effect. But I mean, like if your battery in your laptop dies, you're not going to crack that thing open, get a screwdriver and like fix the lithium cells, right? Like you can't. And even like your chip goes bad because some transistors got kind of burned or something like, not transistors, but whatever they are. You can't go in there and like straighten that out

Starting point is 01:28:58 with a toothpick, you know? I'll have a link to it in the show notes for this episode. But yeah, as of April of this year, Apple's self-repair service is now available, where they will send you genuine Apple parts and tools to do whatever the repair is. And I'm pretty sure, if I remember right, it included the instructions on how to do it. So, pretty awesome. No, you're not going to like open up the cells on a battery. Cause I mean, even if it was like a, you know, double A battery, you wouldn't do that. But in the tools that they send you, it's basically like a $50 rental of the toolkit. So, you

Starting point is 01:29:36 know, yeah, it's pretty neat. That's interesting. So one of the other things that they say here is when things get automated, they, they called it, there's a difference when something is non-autonomous. Basically there are manual actions that were automated that you assume could still be done manually, but that's not necessarily the case, right? So whatever your automation is doing might've been based off what you did manually previously, but sometimes that changes and you don't have access to the same underlying things in a manual process that the automated stuff does. So that's where they say, excuse me, that, you know, there, there can be problems over time as you automate things.

Starting point is 01:30:21 If you don't make sure there are ways for people to interact with those same systems. It's kind of like your processes make assumptions that, you know, you doing it manually, you're not aware of, right? Or maybe you don't even have the rights. You may not even have the rights anymore to do some of the stuff manually, right? Like they could have stripped all that away. Now they do go on to say, right, like we've talked about all this stuff, and Outlaw even hit on it with Google has the resources, the maturity to do a bunch of stuff. And so is this even, does this even matter for your company or for your business or your software that you're writing or whatever? And the answer to that is still yes, right? Because the main benefit you get out of automation is reliability.

Starting point is 01:31:07 That's consistency, right? Like if you have something automated, if you have a person go do the same thing on 20 different servers, they might mess up on one of them. Why? Because maybe somebody came by and said something to them in the middle of while they were doing their 15th, right? Who knows? But when you automate that stuff, you now take that, that accidental thing out of play. And now you've, you've set up these systematic processes to, to go do these things.

Starting point is 01:31:37 And you make it to where like anybody can do it. Anybody can do it. It's faster and it's reliable. So, so what they said is don't focus necessarily on, I would like to call it consistency though, more than reliability. It could still be, it could still, the automated version could still produce a bad result, but then you're like, Oh, I can find that bad result, fix it. And now, you know, but it's just consistency and whatever the process is going to be. Yep. One thing they called out is a lot of people

Starting point is 01:32:05 want to look at the return on investment when they're doing something like this. Like, okay, well, it's going to take me one person week to do this, right? And it's going to cost me X amount. Am I going to get that much return from doing this? And they called out that that's not necessarily the best way to look at this, right? Because that consistency you get from it pays off over time in different ways, right? Like you may not be getting a monetary return on exactly what you did, but what you are doing is setting yourself up for better successes over a longer period of time. And then how do you, how do you put a dollar amount on centralizing logic? Right. It's hard.

Starting point is 01:32:45 Yeah. It can be almost impossible to put a dollar amount on that, but you can actually see real benefits of it over time. I mean, if I gave you some source code and said, hey, I need you to compile and sign it to deliver an executable out to the real world, right? And the three of us were each tasked with doing that manually, we might come up with three different things, right? Versus if you like, if you consolidated that logic into one centralized place, then it's, it's consistent and reproducible. And, oh, now I've decided to change the keys or

Starting point is 01:33:19 whatever that are required for it. Like, you know, you can, you can, you only have the one place to do it. And I don't have to like go to each of us and say like, Hey, here's the new signing certificate. Right. Yep. Imagine like if you're an ops person,

Starting point is 01:33:32 you're totally separated from the people who write the products and all three of us wrote a different way of starting up our services and managing it. It's like, okay, let me open up three. Alan's: lowercase setup, and the flags are this, that, and the other. Outlaw's:

Starting point is 01:33:44 Uh, you have to run, um, some sort of pre-initialization script. It's going to go do everything for you, but you got to check back in an hour to make sure it worked. And Joe's just doesn't work at all. You know, it's like, he's just got a couple of paragraphs written. Like, what am I supposed to do with that? You know, it's just, it's not scalable. You need these things to be consistent.

Starting point is 01:34:01 And the way you do that is by building a centralized platform. Yeah. And they also say to kind of wrap up this section was think about automation in your design phase. And the reason is, is because it's a lot harder to retrofit that stuff after the fact. We've talked about that with things like security in the past, too, right? Like there are certain things that you want to try and do up front because they're important and they can actually save you a lot over time. And then there's one last bit here that was interesting. They, they kind of threw this in earlier, but it didn't make sense where they had put it because it was going to sort of take you out of the flow of one of the other stories. So what they said is you also have to be careful about automating

Starting point is 01:34:45 failure at scale. And this one was kind of funny because the short of it is they had a thing called disk erase, I think is what it was called. And more or less what this thing was supposed to do was if they pointed it at a machine, it would kind of securely wipe a drive, right? Like get rid of everything on it. Well, they had screwed up. It had failed at some point. And then they were trying to figure out what was going on. And so they put it into an area to where they're going to kick it off manually and just, you know, see, Hey, what, what happened? Where was the failure thing? Well, the problem is, um, there was a, an assumption. I won't call it a bug. There was an assumption in the code that said, Hey, if I don't have a list of machines to

Starting point is 01:35:40 wipe, so basically you give me an empty list, then I'm going to assume that means you want to wipe everything. So they turned this thing on and it went and found all the machines that were on a particular CDN and wiped all the drives on them. Now they said it didn't end up killing them because fortunately they had enough, you know, capacity planning set up to where the main machines that were serving whatever that data was could handle the load. But they nuked every single CDN machine in that particular area with this disk erase thing when they didn't know about this "hey, if I don't have anything, wipe it all." And so you got to be careful, right? Yeah, it reminded me of examples in our day job where we've tried to define, like, okay, do I want to give a

Starting point is 01:36:35 different meaning to null versus an empty list versus a list of values? Like those are three possibilities. Do they mean different things? Does the null and the empty list mean the same thing or different things? And it's dangerous, right? Like especially when you do something like this built around it. And one of the interesting things that they said that came out of this is they built in more sanity checks so that if they ever did go to run this thing, they could make sure that something nasty wasn't going to happen. But they also built in rate limiting, right? Because this thing went crazy. It just wiped every machine efficiently, quickly. I'm still kind of baffled about the idea of just writing some automation to go and erase disks. Like that part already is like a scary premise to start with.
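To make the null versus empty list versus list of values question concrete, here is a minimal Python sketch of the kind of sanity check and rate limiting described above. It is not Google's actual tooling: the names `plan_wipe`, `batches`, `UnsafeTargetList`, and the `max_fraction` threshold are all hypothetical, chosen only to illustrate the idea that "nothing specified" should never be read as "everything".

```python
from typing import Iterator, List, Optional


class UnsafeTargetList(ValueError):
    """Raised when a wipe request looks too dangerous to run as given."""


def plan_wipe(
    targets: Optional[List[str]],
    fleet_size: int,
    max_fraction: float = 0.1,
) -> List[str]:
    """Return the machines to wipe, giving each input shape its own meaning.

    None        -> the caller never supplied a list: refuse, do not guess.
    []          -> the caller asked for nothing: do nothing.
    [a, b, ...] -> wipe exactly these, if the request passes the sanity check.
    """
    if targets is None:
        raise UnsafeTargetList("no target list given; refusing to guess")
    if not targets:
        return []  # an empty list means "wipe nothing", never "wipe everything"
    # Sanity check (hypothetical threshold): reject requests that would touch
    # more than max_fraction of the fleet in a single run.
    if len(targets) > fleet_size * max_fraction:
        raise UnsafeTargetList(
            f"{len(targets)} targets exceeds {max_fraction:.0%} of the fleet"
        )
    return list(targets)


def batches(machines: List[str], batch_size: int) -> Iterator[List[str]]:
    """Yield machines in small groups so the caller can pause between groups."""
    for start in range(0, len(machines), batch_size):
        yield machines[start:start + batch_size]
```

A caller would loop over `batches(plan_wipe(...), batch_size)` and wait or check health between groups, so a bad run gets a chance to be stopped before it reaches every machine.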

Starting point is 01:37:17 And you're like, yeah, okay, fine, sure, let's just do it. Let's run it. Have at it. Yeah. Yeah. That's the Leeroy Jenkins approach. Let's do it. So, yeah, it was interesting. I mean, I guess enabling failure at scale is pretty scary. Yeah. So we'll have plenty of links to resources we like, including some of the stories that I've mentioned in this episode. And with that, well, first, let me ask you this question. What did the Zen Buddhist say to the hot dog vendor? I don't know. I don't know. Make me one with everything. So with that, we head into Alan's favorite portion of the show. It's the tip of the week. All right.

Starting point is 01:38:11 And I stole first spot here. And so I'm going to dip out early in a second. But have you guys heard of kubectl debug? No, I have not. Okay. This is kubectl, space, debug. There used to be a tool called kubectl-debug, with a dash, in older versions, but I couldn't find the exact version that the debug command came out in. But

Starting point is 01:38:31 you can actually use it for several different things. And some of those things are really cool, like adding an ephemeral container to your pods. So have you ever been working in a Kubernetes cluster and something's going kind of funky? And so you like, maybe you create a job or maybe you kind of do a custom deployment and you kubectl apply and throw something out there and then you kind of shell in. Like an example here is a lot of times if I've got a service that's going wrong, I would maybe create a deployment and I would change the command to be tail -f /dev/null. So this container is going to go up. It's not going to try and run the thing that it usually runs, so that I can go in there, shell in, and kind of look around a little bit. Well, it turns out there's this command that's specifically designed for doing that sort of thing. And what's nice about it is that in my example, I created a deployment so I could get a pod, or there's other ways to do it,

Starting point is 01:39:17 that's just one example. But there's these things that are lingering that you've got to go in and delete, which is just kind of messy and manual. And what this lets you do is add and kind of make changes to pods or various other things that go away, right? When that thing restarts, you're not actually changing the permanent state of your cluster. You're setting up something temporarily, which is really nice. There's a bunch of different flags and actually several different things you can do. One of the things that you can kind of tweak the flags in order to do is actually create a copy of a running pod that's ephemeral, so that once that pod shuts down, there's no lingering resources. I don't know if you've ever seen issues where you set up a deployment, like I gave an example of, and you delete the deployment,

Starting point is 01:40:06 the pod goes away, and you think you're done, but you may not realize that the replica set didn't get deleted. And so you've got these resources that are just kind of hanging out there and it's just not tidy. And kubectl debug lets you do a lot of those different kinds of things, which is really cool. And they actually have examples in the docs, and we'll have a link here, for handling situations like if you've ever had a pod that just immediately failed, it crashed. They're like, okay, well, here's how you can deal with that situation with kubectl debug. And you run this command, it doesn't change the permanent state of your cluster, it gives you a way to kind

Starting point is 01:40:38 of shell in and deploy that stuff. So obviously it's not the kind of thing you want people doing in prod, but it's nice for dev environments. That's really cool. And last thing I want to mention is that I found this resource and then I went up a level in the docs. I was like, oh, they've got a whole section here on debugging. And a lot of it, if you've been using Kubernetes for a while, it doesn't really add a whole lot. It's like, hey, is your service down? Try describing it, which is kind of like something that you're going to learn very early on.
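As a rough illustration of the two kubectl debug uses mentioned here (attaching an ephemeral container, and making a debuggable copy of a pod), the Python sketch below only assembles the command lines. The helper names and the pod, image, and container names are hypothetical; the flags themselves (`--image`, `--target`, `--copy-to`, `--container`) are ones the kubectl debug documentation describes, but check them against your kubectl version before relying on them.

```python
from typing import List, Optional


def ephemeral_container_cmd(
    pod: str, image: str, target: Optional[str] = None
) -> List[str]:
    """Build a command that attaches an ephemeral debug container to a pod."""
    cmd = ["kubectl", "debug", "-it", pod, f"--image={image}"]
    if target:
        # --target asks for the process namespace of an existing container.
        cmd.append(f"--target={target}")
    return cmd


def debug_copy_cmd(
    pod: str, copy_name: str, container: str, shell: str = "sh"
) -> List[str]:
    """Build a command that copies a pod and overrides one container's command."""
    return [
        "kubectl", "debug", pod, "-it",
        f"--copy-to={copy_name}",
        f"--container={container}",
        "--", shell,
    ]


if __name__ == "__main__":
    # Hypothetical names, printed rather than executed.
    print(" ".join(ephemeral_container_cmd("my-pod", "busybox", target="app")))
    print(" ".join(debug_copy_cmd("my-pod", "my-pod-debug", "app")))
```

The second form is the closest match to the tail /dev/null deployment trick: the copy starts with a shell instead of the usual command. As far as I know the copy is a plain pod with no deployment or replica set behind it, so cleaning up should be a single pod delete once you are done.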

Starting point is 01:41:10 But some of these things actually get pretty big, like the debugging pods section, huge, and had all sorts of information about kubectl debug that I'd never heard. So I never thought to look here because I thought it was just going to be all basic, but there was some surprisingly good nuggets in there. So it's good stuff. Most excellent. Most excellent. All right. So mine is actually pretty simple.

Starting point is 01:41:25 And so maybe my first tip should have been when the tip pops up in whatever IDE of choice, read through some of them. You don't just turn those off forever? You know, I never have. And I'm usually annoyed when it pops up and I'm like, ah, whatever, close. Well, so for whatever reason, the other day I opened up IntelliJ and one of the tips came up on the screen. I was like, oh, I'm gonna read this. And it was actually really good. So when you're debugging an application, a lot of times you'll put a break

Starting point is 01:41:58 point in somewhere and you'll put a watch in so that you can see what the value of a variable was, right? Like that's pretty common stuff that we all do. Well, there are times that you're like, man, I really don't want it to stop at these breakpoints. I just want to know what the value of that thing was and just move on, right? You can do that in IntelliJ. If you were to highlight the section of code that you want it to output the value of, right? So let's say it's like application dot my value, right? You could actually highlight application dot get my value with its open, closed parentheses or whatever. And then if you shift click in the gutter on, on a point over there after that, it will actually write it out to the logs that are happening in the application, and it won't stop on the breakpoint.

Starting point is 01:42:47 So you can see the values of the things that you care about as the thing's running without actually stopping. And if you see anything nasty, then sure, you can go and put in a breakpoint and stop there. But it's almost more like a sanity check. So I'll have an image that we can put up on the post as well for this. It sounds like watch values on steroids. It is. Yeah, it's very much that, right? So instead of it having to stop your application and look in your watch values, it'll just put it in the same output as the rest of your application logging. So I thought that was really cool and something I'd never really thought about. So pretty nifty. Yeah.

Starting point is 01:43:26 I'm with you. Like those, those tips come up and my, my OCD won't let me like close, like permanently close it. Cause I'm afraid like I'll miss something, but inevitably like 90% of the time I'm just like, no,

Starting point is 01:43:40 not today. Yeah, exactly. Exactly. But it seems like every time I actually do take the time to read them, I'm like, oh, yeah. Why didn't I look at this before? What other goodies have you not been letting me know that you've probably been letting

Starting point is 01:43:52 me know that I missed? Exactly. Yeah. It's all your fault. So I'll ask you. Well, first, let me tell you a little story here before I get into my tip of the week. So here locally, there was a man that was caught,

Starting point is 01:44:07 uh, stealing in a supermarket while balanced on the shoulders of a couple of vampires. He was charged with shoplifting on two counts. So my first, my first tip of the week is specifically for Alan. So yeah, this was to you.

Starting point is 01:44:24 This one's to, yeah, I'm helping you out. Helping you out. First, read the tips that, you know, the application comes up with. Also, test your UPS battery regularly. So, you know, me personally, I have like a little, you know, reminder

Starting point is 01:44:38 every, you know, few months to like, just see, you know, can I still run the computer for like, you know, 10 minutes on UPS, or does it immediately just die? Whatever your method is, the point is you should test those things regularly because those UPS batteries will die just, you know, from sitting there. Hey, the better tip you should have given me was plug yours up, dude. You haven't done that yet? No, I told you.

Starting point is 01:45:09 It's still sitting down there. What? Okay, so a little background information here. This is why this came up. We were supposed to record this episode a few days ago, and a bad storm came through, and we were all joking at the start about, like, well, we're all on UPS, you know, and we all specifically got each other UPSs to make sure that like we wouldn't have this problem with this.

Starting point is 01:45:30 But Alan said, well, mine's sitting there next to my computer on the floor, not plugged in. And we're like, what? And then guess what happened? Storm rolls through, knocks Alan off. And yeah, so now we're recording a few days later. So this is why this tip is for Alan. So yeah, I guess you're right. I should have started with tip one, plug it in.

Starting point is 01:45:53 So yeah, I assume that was already a given though. All right. So here's for another tip of the week though. We've talked a lot about container you know, container type things today, Borg, Kubernetes, whatever. So let's talk about Docker. So have you ever found yourself in a situation where you have an image built and you want to get something out of that image, but you don't necessarily need or want or care to run that, to spin up a container to run that thing, right? So, and you're like, well, what would possibly be a use case for that? Let's take our build

Starting point is 01:46:33 pipeline, for example. I have this preference, you know, I don't know, affinity. I strongly want like everything to be Dockerized, including the build chain so that that way, however something is compiled, it is consistent across every developer's machine because versions of code and whatnot or versions of I'm sorry, not code, but versions of the tools can be strongly maintained and enforced through that Dockerfile, for example. But now, if you're using Docker to build your code, how do you get test results out of the code as an example? So that's an example where you might want to do what I'm about to say. So rather than doing a docker run to start that image as a container and copy the file out with a later docker cp command, instead you can just do a docker create

Starting point is 01:47:33 to create the container without actually running it and using those resources. But now that you have it created, you can then do a copy command out of that container. So I'll have the exact, you know, examples of like what this flow might look like in the show notes. But the one big call out that I want to make in this example of the docker create is that it will be extremely helpful for you if you name the container a specific name that you know of, right? So that you can use that same container name later in your docker cp command. And later, you probably want to remove the container that you've created. So you'll need the container name again to remove it. So it's highly advisable to name it something that you know ahead of time. Okay, so I've got to piggyback on this because I'm actually super excited about this. I didn't know it existed.
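The create, copy, remove flow described here can be sketched in Python as below. This is only an illustration, not the exact flow from the show notes: the function names, the image name, the container name, and the file paths are all made up, while `docker create --name`, `docker cp`, and `docker rm` are the standard Docker CLI commands being discussed.

```python
import subprocess
from typing import List


def extract_commands(
    image: str, container_name: str, src_path: str, dest_path: str
) -> List[List[str]]:
    """Build the three docker commands: create, copy out, remove."""
    return [
        # Create a stopped container with a known name; nothing is run.
        ["docker", "create", "--name", container_name, image],
        # Copy the file out of the stopped container's filesystem.
        ["docker", "cp", f"{container_name}:{src_path}", dest_path],
        # Remove the container, using the same known name.
        ["docker", "rm", container_name],
    ]


def extract_file(
    image: str, container_name: str, src_path: str, dest_path: str
) -> None:
    """Run the flow, removing the container even if the copy fails."""
    create, copy, remove = extract_commands(
        image, container_name, src_path, dest_path
    )
    subprocess.run(create, check=True)
    try:
        subprocess.run(copy, check=True)
    finally:
        subprocess.run(remove, check=True)


if __name__ == "__main__":
    # Hypothetical image and paths, printed rather than executed.
    for cmd in extract_commands(
        "my-build:latest", "extract-tmp", "/app/test-results.xml", "./results.xml"
    ):
        print(" ".join(cmd))
```

Choosing the container name up front is what makes the second and third commands possible without parsing the ID that docker create prints.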

Starting point is 01:48:32 So what he's talking about, the reason you want to do this is if there is a file in the image that was created itself, you can get that file out, right, without having to run the thing. And why does that matter? If you've ever tried to docker run something that requires like 80 environment variables or a bunch of mapped paths or whatever, it's a pain in the butt just to try and get a file out of it. So I didn't know this existed. So this docker create allows you to copy the file out of the image without having to get it actually up and running. Because you'll know, if you ever do a docker run and you don't give it everything it needs, it'll typically die, and then you can't do anything with it. Then you're trying to figure

Starting point is 01:49:13 out what you need to do to make it run. Or also there might be a default entry point already specified in the Dockerfile for that particular image. And so if you do docker run, it's going to go and run whatever that entry point is, and that entry point might not be something you want done at that given point. Yep. No, this is killer. That's exciting. I had no idea this existed. Two quick examples I wanted to mention: one is, a lot of times you'll run like unit tests and get coverage files out of it. This is a great way to do that in Docker and then use it and export that coverage file. Also, jars.

Starting point is 01:49:48 So if you have like a Java build or .NET build, anything that generates like DLLs or jars, artifacts that you want in another place, then you might do your build in Docker and copy that out and load it into like a repository or whatever. Yep. Yeah. That's beautiful. So, yeah, all of this will be in the show notes. Uh, if you haven't already, you can find those on the website, uh, www.codingblocks.net. Uh, and like I said earlier, you know, maybe a friend like said, Hey man, you got to see what these

Starting point is 01:50:18 guys are talking about. Like almonds and pecans. This is crazy talk. And so, you know, you were just listening randomly through some link, but hey, did you know you can subscribe to us? You go to iTunes, Spotify, wherever you like to find your podcasts. I certainly hope we're there. And hey, if you find a spot where we're not, let Alan know. He'll fix that. I just got tasked out. That's good. Yeah, there you go. And as I said earlier, I can't emphasize how much we really do appreciate the reviews. They really are meaningful. I think Alan's even commented on this in the past.

Starting point is 01:50:57 Like sometimes we get some truly heartfelt ones that, I mean, you can't help but be a little bit emotional when you read some of the things, like the positive impact that some of the silly things that we've said have had on other people's lives and everything. So we really do appreciate reading those and it means a lot to us. So you can find some helpful links at www.codingblocks.net/review. Yep. Hey, and while you're up there at codingblocks.net, make sure you do check out

Starting point is 01:51:31 our show notes. We have examples, discussion, and a whole lot more. And you can send your feedback, questions, and rants to the Slack channel at codingblocks.net/slack. Yeah, and like I mentioned, you can follow us on Twitter at CodingBlocks, where we send you those really good gifs. Or you can head up to CodingBlocks.net and find all our social dillies at the top of the page.

Coding Blocks - Site Reliability Engineering – More Evolution of Automation

We're going back in time, or is it forward?, as we continue learning about Google's automation evolution, while Allen doesn't like certain beers, Joe is a Zacker™, and Michael poorly assumes that UP...Ses work best when plugged in.
