Coding Blocks - Site Reliability Engineering – More Evolution of Automation
Episode Date: July 5, 2022
We're going back in time, or is it forward?, as we continue learning about Google's automation evolution, while Allen doesn't like certain beers, Joe is a Zacker™, and Michael poorly assumes that UP...Ses work best when plugged in.
Transcript
You're listening to Coding Blocks, episode 188!
We're going 88 miles per hour.
Going back in time.
Alright, so, but if we're going to 188 though, does that mean we go forward?
Then 88 is to go backwards and 188 is to go forward?
That's the old 188 paradox.
Yeah, okay.
Paradox?
Really?
Okay.
So, all right.
As you can see, we are definitely starting the show on the rails.
Like we are definitely, you know, following the right path here.
If you haven't already subscribed to us, we would greatly appreciate it.
If you did, you can find helpful links, uh, at the top of the page. No, uh, yeah, whatever. You know the routine by now. There's 188 episodes, you don't know what I'm going to say at the start of this thing by now?
Right, yeah. So you can visit us at codingblocks.net, where you can find our show notes, examples, discussions, and more. Send your feedback, questions, and rants to comments at codingblocks.net. It's almost like we have that written down. Somewhere.
Somewhere.
And also follow us on Twitter at CodingBlocks.
And if you tweet out like,
hey, love the CodingBlocks show.
Y'all are awesome.
Then I will respond with a gif of somebody dancing.
What if they respond back in the negative there?
Oh, then somebody crying.
Plus he meant gif anyways.
I'm good.
I'm a child of the social media revolution, you know. I know how to, I know how to work this stuff.
I thought you were going to say you were a child of Giphy, because I was like, my Giphy game is tight. It's really good.
Yeah, I also have a website, we found out. That's at the top of the page. With that, I'm Joe Zack.
I'm Michael Outlaw, who might be on time or late to that announcement.
And I'm Allen Underwood. And can you truly call it Giphy if you just called it a gif? I mean, the company is Giphy.
I can't, I can't help it.
That's fired. Wow.
All right. So with that, we are on part deux of Evolution of Automation at Google.
And this particular show was going to be a little bit more difficult because this is all like stories from inside Google about how things played out and how automation helped and how it hurt and everything.
So there are excellent stories, but it was a little bit harder to put notes together.
So hopefully this will be pretty good and you guys will all get something out of it.
Guys and gals, everybody.
Indeed.
So first section is on automating yourself out of a job, which, you know, might scare some people, but not programmers.
We're into that sort of thing.
They had an interesting story about automating MySQL, which is a database they use for the Google Ads.
At some point, I'm sure they have multiple databases in there for that.
Were you not already just a little bit mind-blown when they read that?
And you're like, oh, man, something the size of Google relied on MySQL?
They, too, use this stuff?
I mean, this is decades ago.
Yeah, I seriously had the same reaction when I first saw it.
Their ads was running on MySQL.
That's pretty awesome.
Yeah, I'm totally naive here,
but is there some sort of advantage to using MySQL over Postgres at this time?
Just the time frame?
Because this was like early, mid-2000s?
What you're talking about today?
Probably still.
I think it's still one of the most used open source databases out there.
Like probably even way more than Postgres would be my guess.
And some of the tooling around MySQL is amazing.
Like MySQL Workbench and some of the stuff that you get for free
destroys some of the open source stuff for Postgres.
So maybe that's good enough reason
Plus, you know, everything that's on WordPress basically uses it, so that's, yeah, 30% of the internet.
So a lot of the things that I didn't like about MySQL, like just kind of having to deal with their storage engine stuff and transactions, like, all that stuff is fixed now. I just don't know much about it anymore.
I was going to say there's no way that MySQL is still more heavily used than something like Postgres, but I was wrong. Like, yeah, way wrong.
Right. Yeah, according to DB-Engines, which we've mentioned before, db-engines.com, their ranking of it, MySQL is the number two database, Postgres is number four, with Oracle and SQL Server filling in the blanks in between for one and three.
Yeah, and it's a pretty big drop.
It's like half, right, on the score down to Postgres.
I mean, Postgres is awesome, though.
It's just MySQL is pretty ubiquitous.
That's crazy.
I wouldn't have guessed that.
I just assumed Postgres was number one.
I would have thought Postgres had overtaken it too.
And so this tale starts out with basically failovers.
So if you kind of think about like early days of Google,
they had far less computers and failovers were manual processes.
And this is going to happen sometimes if you just need to kind of move one server to another,
like you need to do a kernel upgrade or something.
Also, it could happen if something goes wrong
and, you know, a server gets in bad shape.
There's a couple different reasons why you might want to failover.
And this would take, I don't think we wrote it down,
but I think they said it was like 30 minutes, 30 to 90 minutes.
It could take some time.
Yeah, that was for the manual intervention, right?
Like, if somebody had to go in and do it.
They said that the actual.
For the master node, right?
Right.
Yeah.
Not for the replicas.
Right.
And so one of the first things that they automated was this replica replacement thing, right?
And that was their start into this world.
But then they talked about they were going to migrate over to their Borg.
Do you call it an engine?
I don't really know.
It's like their cluster scheduling thing. But Borg, they were migrating to Borg, right? And Borg, like, it was kind of the predecessor to Kubernetes. Which, um, I think Borg and Kubernetes are still kind of different things, but Kubernetes kind of, like, that's what it kind of came out of.
Or at least that was my reading of it.
Yeah.
And we actually have a link to what Borg is in the show notes here.
So if you were interested in what exactly that was at the time and still, I imagine still exists, right?
They probably still use it because I doubt they pushed everything over to Kubernetes
but isn't the Borg supposed to be a bad thing
like didn't they like can we
address like that would be saying that you like
created the Death Star for your company
and now it's a good thing and
like everybody's happy that you have the Death Star
but wait a minute it didn't work out so well for
Alderaan but
it's a good thing if you're the Borg
exactly
Right. Yes, yeah, exactly.
It's good to be the Borg.
Okay. Uh, yeah, so they wanted to be able to get this faster, obviously. And as Google is scaling up, you can't have people spending 30 to 90 minutes, or whatever it was, failing these machines over. So they're trying to figure out how to get faster. Um, and I am struggling with how to even relate this story. This is to say, sorry, the whole back half of this chapter was basically stories that emphasize the first part of the chapter about the evolution of automation and how things grow from manual processes into autonomous systems. Um, and I do feel like there's really good nuggets here that make some of the weirder points from last episode make more sense, but it's just kind of hard to figure out how to talk about it in a way that's sensical.
So this one for me, this first one with the Borg and the migration and all that, the automation, kind of what they were
getting at is it wasn't like painless. They, they first started doing it and, and they saw benefits,
right? Like they started automating out the, the human interaction, the person interaction,
they were getting rid of that stuff and things were great. But then when they tried to take it
to the next level, the problem with MySQL is they said,
you can't do the same thing with the master nodes, with the master MySQL nodes that you can do with the replicas.
And so they ran into problems with Borg there, right?
So during their automation, they ran into a bunch of brick walls that they had to overcome.
And so while they were trying to automate stuff, it all sounded great,
but they actually caused themselves a lot of pain in the process, right? But ultimately, after they got
over those things, they saw the fruits of all the efforts come to life. And then in the end,
basically now they've got a system that is sort of hands-free, right? Like it, it, it kind of takes care of itself now.
Yeah. And, well, they commented on that, like, they saw a large reduction in, like, mundane tasks, because they had, you know, figured out how to solve the difficult problem, which was, you know, how to handle failovers for the master node, the master database nodes. Once they figured that out, I think it was like 95% of their, you know, mundane tasks were gone. Like their toil, right? The crap that they just had to do. And then they also said, after they finally got to the end of the tunnel with all their efforts to make this happen, not only did their mundane tasks reduce by 95%, their total operational costs for managing those MySQL clusters also dropped by 95%.
But I want to address another elephant in the room kind of thing.
Because we hear this portion of the chapter was titled automating yourself out of a job.
And that kind of has a negative connotation to it.
But number one is, well, is it negative? Because, like, did you really want to spend the rest of your career failing over MySQL master database nodes? No. Like, you want to get out of that job, number one. But number two is, like, I kind of think we shouldn't think of these things in such a negative context, like, I'm automating myself out of a job, because that's never going to be the case. And I think we described this before with this type of automation: you're always going to be like, well, what's the next thing that I can automate? You're going to just iterate onto the next thing. And even in some of their story here, like, you know, that's what ended up happening, right?
Yeah, totally.
And, I mean, we kind of just blasted through that whole section, but a couple of points worth talking about. One of the main things that they ran into when they automated all of this, right? Because typically, for anybody that's been doing this stuff for a long time, if you're in your standard three-tier architecture, right, where you've got a front end, a middle tier, and your backend database, we'll call it, you almost always expect that backend database to be, like, the thing that's strong in your system, right? Like, that thing's always going to be there. They called it out: developers always assume it's the strongest part of the stack, right?
Yeah, yeah.
And so the problem is, when they started building all this automated failover, these things would migrate nodes, and the masters would change over, and the replicas would change over. The problem is the code wasn't built to be fault tolerant. Right. And so they had to go in. So they automated the MySQL stuff to do the failovers,
but then they had to go in and touch all the code as well. And they had to do that because they
needed to make sure that this thing would, would be able to come back up in a state to where it
could operate again. So, so this whole notion of automating yourself out of a job, you've,
you've never finished, right?
Like there's always more stuff that you've got to do.
And ultimately what you're trying to do is make your life better as a
developer, as a person who has to support these systems.
So, you know, like Outlaw said, we've said in the past, right? Like, even if you automate a job, I know there's definitely been arguments over, like, people in fast food industries, right? Like, they've already started
putting kiosks in the stores, right? Like if you go into a Taco Bell or something, you might be
ordering at a kiosk. Well, I can guarantee you. Okay. So now there's not people standing in front
of a register that are going to be taking your order necessarily in every place.
But now you have created jobs for people that are going to be maintaining these systems
and taking care of these things when they go wrong, right?
So you're always going to have new things to do.
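As a side note, to picture what making the calling code tolerant of a master failover can look like in practice, here's a minimal, hypothetical sketch of a client wrapper that reconnects and retries when the connection drops mid-failover. The class and helper names are made up for illustration; this is not the actual change Google made to their JDBC driver.

```python
# A minimal, hypothetical sketch of a client wrapper that tolerates a master
# failover by reconnecting and retrying instead of crashing. Names are made
# up for illustration; this is not the actual JDBC driver change.
import time

class FailoverAwareClient:
    def __init__(self, connect, max_retries=3, backoff_seconds=1.0):
        self._connect = connect              # callable returning a fresh connection
        self._conn = connect()
        self.max_retries = max_retries
        self.backoff_seconds = backoff_seconds

    def execute(self, query):
        for attempt in range(self.max_retries + 1):
            try:
                return self._conn.run(query)
            except ConnectionError:
                if attempt == self.max_retries:
                    raise
                # The master may have just failed over; reconnect and try again.
                time.sleep(self.backoff_seconds)
                self._conn = self._connect()

if __name__ == "__main__":
    class FakeConnection:
        failures_left = 1                    # simulate one failover across reconnects
        def run(self, query):
            if FakeConnection.failures_left > 0:
                FakeConnection.failures_left -= 1
                raise ConnectionError("master failed over")
            return f"result of {query!r}"

    client = FailoverAwareClient(connect=FakeConnection, backoff_seconds=0.0)
    print(client.execute("SELECT 1"))        # survives the simulated failover
```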
There were two thoughts I had on that.
One was the software that they mentioned.
They specifically called out they had to make changes to JDBC in order to support this
idea that the master nodes might not be
as reliable
or as you previously once thought of them
because of this. Because the Borg,
like we said, was an early predecessor to
what maybe later
spun out Kubernetes, but it was basically doing
container management, right? So a big part
of the reason why they were able to reduce their operational costs is because, once they were able to have all of the MySQL infrastructure maintained by the Borg, they were able to better stack MySQL instances onto fewer machines.
And I think this was the portion of the chapter where they were talking about how, suddenly, all their SREs had a huge abundance of time, because their toil was reduced by 95%. But they also had a huge abundance of hardware, because I think they said they reduced the hardware by, like, you know, 60% or something. It was massive, right? All because of then being able to run MySQL as containers that were managed by the Borg.
but so that was that was thought one. But then, uh, you know,
you were talking about like this, this building upon thing, like, you know,
I made the joke about the death star a moment ago, but you know,
if you think about like how you would build the death star, right?
Like at some point it just started out as the international space station,
right? Something, something silly, you know,
something like a weird stick-looking thing floating in space, right? You know, and then they just kept building upon it. So it was like, hey, we have the International Space Station, woohoo, we're done. No, you're not. You've got to keep building, man, right? So, yeah, you're never automating yourself out of a job.
No. Yeah, I wanted to mention too, like, if you think about, um, when we got rid of the toil required in failovers, that time went into improving software, like the JDBC drivers, which benefited, like, the whole organization. And if they pushed that stuff into the, um, the main JDBC, you know, repositories, they made it better for the world, which is pretty cool. So, like, would you rather be doing repetitive, boring work, or making the world a better place for everybody? It's a tremendous trade-off in value there.
Well, now that you put it like that, I guess I'll just do the toil.
Right.
Also, I don't think we mentioned it, but they eventually got this process down to 30 seconds or less downtime, which I thought was really cool.
And they mentioned that it was actually a requirement.
So I was kind of curious.
They didn't say whether that was a business requirement.
No, they did.
I remember.
Oh, it was a business.
Okay.
Yeah.
It was the Borg. That was part of the problem with the 90 minute time for the previous versions of the failover for the master node, was that it took too long. And so once they put it onto the Borg, then, because the Borg would reschedule these containers, you know, a couple of times a week, they were blowing out their error budget.
Okay, cool. So they had to reduce that time in order to fit within it.
And so for anybody that wants some behind the scenes on this show: like, typically we mostly go in order. We totally did that entire section out of order. So if you're looking at the show notes, trying to line it up with what we said, it's not.
And again, it's partly because of how these stories are done, right?
Like it's – we don't want to read the stories out to you, right?
Like this isn't a bedtime type thing.
So, you know, we're talking about it in the way that seems to make the most sense.
It's neither a forward nor a reverse index.
Yes.
If you leave a comment, let us know why you like MySQL over Postgres, or what you think about any of these things.
We'll send you a copy of the book.
You have a chance to win a copy of the book.
You can read some of these stories yourself.
Don't forget, the book is also free online.
You can be reading all this right now.
That's true.
Second story, automating cluster delivery. So they had a story about a particular setup of Bigtable, where there was some sort of efficiency kind of change put in place where they didn't use the first disk of a 12-disk cluster.
And then some automation came around at a later date, and it saw that if the first disk wasn't being used, it assumed that none of the disks were being used, and it would wipe it. And so this had, I think these were cache servers, so it wasn't, like, catastrophic, but it ended up causing this kind of cascading deletion of data that chained onto other systems and took down, I think, one of the data centers briefly. So they mentioned that they had, like, real-time replicas of it, so it wasn't, like, yeah, a huge problem, but it caused some panic.
It does make you wonder, like, was that the guy's first day on the job, that was like, well, I guess I can assume that if the first disk isn't being used, then none of them are being used, so just rm -rf star. And then he commits the code, and, like, a week later he's like, hey, uh, tapped on the shoulder. Did you know that you just deleted the entire data center?
This is dangerous because this is, I mean, this has happened to all of us, right?
So basically you had people that were, so this automating cluster delivery,
this was actually talking about the infrastructure for clusters, right?
This had nothing to do with Bigtable itself.
And then you had the Bigtable group that did an optimization, because apparently not using the first disk in the 12 disk cluster made it way faster for some odd reason, right? Who knows what it is. But so you basically had two teams doing things that made sense for themselves. And so this automating of the cluster delivery, they're like, oh, well, they're not using the first disk, it's gone. So you had this sort of hidden optimization in the Bigtable delivery that the teams automating the cluster delivery didn't know about. And so that's, they call out, like, you have to be careful about these sorts of hidden safety signals. Because, like, how is my team supposed to know that your team just doesn't use the first disk because it was slower with the cache retrieval or whatever? Like, it's just a bizarre thing, but it's easy to see how that could blow up on you.
Yeah, it was for latency reasons that they would have a 12 disk system and they would not use the first disk. So right away, my mind was just kind of blown, like, wait, why did the first disk matter or have such a huge impact on it? That's just bizarre.
Yeah. But also, yeah, to your point, like, hey, I gave you 12 disks and you're only using 11 of them. Like, you know, why would I even, how would I even know that? Why would I even think that? You know, if I saw the first disk is completely unused, why wouldn't you think that the others aren't?
And you imagine, like, a lot of times you don't write a script like this from top to bottom, with every line doing something. You combine other tools that people have. So someone might have said, hey, here's a tool called disk checker that checks if the disk array is uninitialized, and you use it, and you don't realize that the way it checks that is by this kind of implicit safety signal of figuring out whether or not the first disk is empty. And so you don't really think about how it works necessarily, because you're at a higher level of abstraction, and it works great in test, and then you roll it out, and whoops, you didn't realize that one of these other tools you're using kind of took a shortcut on something, right?
I was questioning how this worked. Because, like, you just mentioned the 12 disk array, but I'm thinking, like, no, I don't know that it was. Because, like, if it was a 12 disk array, then all 12 disks would be used, right? Like, you wouldn't have, whatever the array controller is would be controlling the usage. So in my mind, I was thinking, like, okay, it's a rack of, like, you know, maybe, like, a storage controller that houses 12 disks, and they made an array that is the last 11.
So if you were to go look at the storage rack,
you see the first drive light in every storage controller
is solid, never being used,
but the others are blinking from all the usage, right?
I don't know.
Weird.
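To make the "implicit safety signal" idea concrete, here's a tiny hypothetical sketch of the kind of check that can burn you: a helper that decides a whole disk set is unused by only looking at the first disk. Everything here (the names, the fake disk contents) is invented for illustration; it's not the actual Google automation.

```python
# Hypothetical sketch of an "implicit safety signal" gone wrong: the helper
# decides the whole machine is safe to wipe by inspecting only the first disk.
# All names and the fake disk data are invented for illustration.

def first_disk_is_empty(disks):
    # The hidden shortcut: "uninitialized" really means "disk 0 looks unused".
    return len(disks[0]["data"]) == 0

def reclaim_machine(disks):
    """Wipes every disk if the machine 'looks' uninitialized."""
    if first_disk_is_empty(disks):
        for disk in disks:
            disk["data"].clear()        # rm -rf, effectively
        return "wiped"
    return "in use, left alone"

if __name__ == "__main__":
    # A Bigtable-style setup that deliberately skips disk 0 for latency reasons:
    machine = [{"data": []}] + [{"data": ["sstable-%d" % i]} for i in range(1, 12)]
    # The automation sees an empty first disk and happily destroys the other 11.
    print(reclaim_machine(machine))
```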
Also a weird assumption, though, to just make, like, oh, this first one isn't being used, so I guess none of them are, right?
Yeah, yeah. But I mean, you could see how somebody would do it, right? And that's, you know, kind of what they were getting at, is there are some things that are dangerous. And then they even say their automation depended on a bunch of, like, you know, shell scripts, and those ended up being problems over time,
which takes us into the next section,
which was something that they'd created called ProdTest,
which was their way of detecting inconsistencies with deployments.
Now, just a quick clarification.
I think ProdTest, if I wasn't mistaken,
ProdTest was almost a framework that individual teams would use for their services.
And so they would say, like, ProdTest, you're setting up a new service called BigTable or BigQuery or whatever.
Like, fill out ProdTest.
Okay.
But I did have one thing that was like, I had to take my medication when I was reading this
because I kept twitching when I was reading this portion of it. And I don't know if it bothered you guys too, but they would refer to them as unit tests, but then they were, like, pushing files around or setting up DNS, and I'm like, no, it's an integration test. Why are you calling it that?
Yeah, but it's totally an integration test. All of it was an integration test, the entire string of them.
Well, I guess at their point, like, they've gotten to such a scale, you know, we consider our computer like that thing on our desk, whereas they consider their computer as, like, oh, it's that large building over there.
So maybe like, you know, integration test means something different.
Yeah, totally.
Yeah, totally.
And I totally interrupted the flow, sorry about that. So you were just describing what ProdTest was. But, um, basically the idea is that ProdTest is a suite of what they called unit tests that would be run on a service, that would do things like checking, just like what Outlaw said, checking DNS to make sure it's okay. And if that test works, then it'll go on to the next test, and maybe it would check, I don't know, connectivity to some other service or something, and keep on going. And if one of those tests fails, then it would bail. And then this was something that each team would create, so that you could run this test on the service and kind of get a health check and say, well, okay, this is where it stopped, so somebody needs to fix that. And then somebody could go in and jump in. But every team was required to kind of create this ProdTest file, or suite of files, that would go and check out the health of something.
This was the sentence that bothered me the most, though: we extended the Python unit test framework to allow for unit testing of real-world services. I was like, well, wait a minute, you've broken out of that thing.
But yeah.
It's no longer a unit test.
So let's back up real quick, though, because let's talk about why they even had this.
So what they were saying is in their cluster deployments, they go to set up new clusters, right?
And every time that they do it, they'd end up having to make some custom flag changes to various different parts of the system. And when something would go wrong, they'd have no idea what it was because
they modified their typical scripts or whatever, their flags.
And so they got into a situation where it would take months
to stand up a cluster, right? And basically they got a directive from management where they were like, hey, we want this done in a few weeks or a couple of weeks or whatever it was. And they were like, um, okay. So that made them back up, because they had
typically done everything in shell script form, which is great for a lot of things, but it didn't
let them check the state or consistency of a bunch of other systems. And so this is kind of what drove them to this prod testing was,
was to be able to check what is the state of all these systems that are hooked
up in this cluster so that we can know when we're ready to launch the thing.
Yeah.
It was like the,
the them spinning up,
like how fast it went was a side effect of doing prod test because they,
you're right. There was a section where they described
like they were out of the blue told like,
hey, we're going to spin up these five new clusters
and you have to spin them up
in a single week. Whereas before
that was something that took a long time.
But that mandate
came because they had prod test
now and they were, because
of prod test, managers or project managers
were finally able to like
predict when they could go live. And that's what, that was the impetus. That was like, okay, well, because we can predict this, boom, you've got a week.
Yeah, I think you're right. So, next story, actually.
By the way, yeah, it bleeds into it. So, yeah, I think I said it in a way that implied that ProdTest came from them saying that they needed to get something up in two weeks, whereas what Outlaw said is, you know, they were like, hey, well, we can get something stood up in a couple of weeks now because we have ProdTest. But that led into other problems. But again, the whole reason ProdTest even came about was they were having a hard time even getting the cluster stood up, because when they would make these changes to custom flags and scripts and everything, they had no idea why things were failing all over the place, because it wasn't consistent from one cluster to another.
Yeah, here were the steps that it took. And by the way, this whole thing with the one week thing, I just want to point out, like Jay-Z said, that was part of a later chapter. And he's not wrong. That chapter is called No Good Deed Goes Unpunished.
But here were the steps for getting the cluster ready.
And I was like reading through this.
I was like, oh, man, it actually kind of sounds fun.
Number one, fit out a data center building for power and cooling.
Right away.
Like, again, going back to our definition of computer being that thing on your desk versus them, it's like, no, it's that building over there. Right. Number two, install and configure core switches and connections to the backbone. Number three, install a few initial racks of servers. And then number four through six is where it got complicated. Number four, configure basic services such as DNS and installers, then configure a lock service, storage, and computing. Number five, deploy the remaining racks of machines. And number six, assign user-facing services resources so their teams can set up the services.
Yeah, I like how step one is create a building with power.
And that's not the hard part.
Right. Exactly.
That's amazing.
And this whole story,
by the way, I think it's really about evolution.
So it started with them,
you know,
having a lot of manual work and a lot of kind of scripts and everything was
inconsistent.
And then they moved to protest.
And then when all the tests were green,
that's when you know,
something's ready.
And so management can see and say,
Hey,
we're a 50% passing tests, which we know, you know, roughly takes about two weeks say hey we're 50 uh passing tests which we know you know roughly
takes about two weeks so we're on schedule we're running faster than the last one you know it
introduced some uh predictability uh but the problem with uh prod test was that uh you know
you're still relying on humans to go in and fix these things when things went wrong
before you jump up because i know where you're going with this real quick. The important part about prod tests that, that they built in was this chain ability,
right? So, so when they would go deploy a cluster, it would go check to see if the configurations
were right. It would check to see if this system was up and running, if the service was up and
running. And then after that one succeeded, then it would know the prod test
framework would know that, okay, the next thing to check is this test, right? And it would keep
going down these line of tests. And if one failed, then it would abort and be like, Hey, something
died here.
Well, right. So, I mean, let's build on that for a moment, because this was another part of the thing. Like, typically when we talk about unit tests, one of the core assumptions about unit testing is that you cannot make an assumption as to, like, the order your tests are going to run in. Yeah, you should just assume that they might run in a random order, that the order they ran in last time is not going to be the same order they run in next time, and if they do happen to run in the same order, that's just a coincidence, and you should not try to make any kind of assumptions or infer any kind of state in a later unit test. Whereas what Alan's saying here is they specifically did add in that ordering, that implicit or explicit ordering, to the way their tests were being run.
Yep. And so the important thing to know, I guess, is if they had 100 tests, you could think of it as almost like this top-down thing.
Like the top one was green, all right, then run the second one.
If it's green, then run the third one.
If it's not, then abort, right?
Like stop and throw up a thing and let everybody know that everything failed.
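For anyone trying to picture the chaining, here's a rough sketch of what an ordered, abort-on-first-failure check suite could look like. The class, the check names, and the fake cluster state are all made up for illustration; this is not Google's actual ProdTest code.

```python
# A minimal, hypothetical sketch of a chained, ordered check suite in the
# spirit of ProdTest. Names and the fake cluster state are invented.
import sys

# Pretend snapshot of a cluster's state that the checks inspect.
FAKE_CLUSTER_STATE = {
    "dns_configured": True,
    "machines_registered": True,
    "lock_service_up": False,   # this one will fail and abort the chain
}

class ClusterChecks:
    """Ordered checks; each later check only runs if the earlier ones passed."""

    def __init__(self, state):
        self.state = state

    def check_dns_configured(self):
        return self.state["dns_configured"]

    def check_machines_registered(self):
        return self.state["machines_registered"]

    def check_lock_service_up(self):
        return self.state["lock_service_up"]

    # Unlike ordinary unit tests, the order here is the whole point:
    # a later check is only meaningful once the earlier ones are green.
    ORDERED_CHECKS = [
        "check_dns_configured",
        "check_machines_registered",
        "check_lock_service_up",
    ]

    def run(self):
        for name in self.ORDERED_CHECKS:
            passed = getattr(self, name)()
            print(f"{name}: {'PASS' if passed else 'FAIL'}")
            if not passed:
                # Abort the chain so a human (or an automated fix) can step in.
                return False
        return True

if __name__ == "__main__":
    ok = ClusterChecks(FAKE_CLUSTER_STATE).run()
    sys.exit(0 if ok else 1)
```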
Oddly, they didn't have a unit test for is the building built yet.
I noticed in their testing that they did give examples.
That one wasn't there.
And also in the show notes, we do have a link to what one of these charts looks like. Again,
this book is available online for free. So we have a link to one of the diagrams they have so that you can sort of see this flow. And with that, back to where you were headed, Joe, because you were
about to take us to the next step.
Yep, so they had these chained unit tests, and if one failed,
and then they would stop the whole thing,
and that's where the percentages kind of were meaningful because it was almost like steps along a flow chart,
and then someone would go in, investigate,
do whatever they needed to do,
and then move on to the next thing.
Well, at some point, someone said,
you know what, some of these things,
if we see this needs to be done, we can automate it.
So maybe we'll have the shell script for setting up the IP tables or something.
And if that test fails, it'll run and we'll make it idempotent if we can, as best we can.
And so that, you know, if it ends up getting run more than once, or, you know, something bails in the middle, it can kind of pick up and just fix itself, which is pretty nice. But the problem with that is that some of these things are hard to make fully idempotent. So sometimes things would be kind of flaky, or maybe the test would be a little bit flaky. Like, maybe it wasn't a problem with iptables, maybe DNS was down because somebody tripped over the Ethernet cable or something, you know. So sometimes these things would just kind of end up in weird spots. And so even though this was a huge good thing, it wasn't very easy, and it still required a lot of human intervention to get these things to fully green.
But this one also, like, I think that, um, you know, this is where like being an outsider,
looking in, like reading this book, maybe like is working against me because like,
I was trying to understand this portion. Like, why would you even do it that way? Cause I was like
thinking that, okay. One of the examples they gave was a test DNS monitoring config exists.
So in my mind, I'm like, okay, well you already had some code that pushed out the monitoring
config. Now you're going to test if the config exists, it doesn't exist. So now you're going
to call a method called fix monitoring create config.
I'm like, well,
but that's the thing you already did
that you're trying to test that failed.
And now you're going to do it again
as part of the test.
And so that flakiness that you referred to,
maybe that's related to it or something.
Like, it almost sounded like you might have duplicate code, or code paths to do the same thing, right? The initial time, and then, like, once as part of a test fix. But, you know, the flakiness that you refer to is, uh, sometimes, because you would have these flaky tests that would fail, but then you could rerun it and, oh, now it works, right? It kind of ruins the reliability of the test. Or maybe not ruins the reliability of it, but more like it encourages a behavior of, oh, it failed, just run it again, it'll probably succeed.
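The check-plus-fix pattern being described might look something like the sketch below: a "test" that verifies a piece of production config exists, and a paired "fix" that just re-runs the provisioning step when it doesn't. The file path, contents, and function bodies are invented; only the naming loosely follows the book's DNS monitoring example.

```python
# A hypothetical sketch of the check-plus-fix pattern: a "test" that verifies
# a piece of production config exists, and a paired "fix" that re-runs the
# provisioning step. Path, contents, and names are invented for illustration.
import os

MONITORING_CONFIG = "dns_monitoring.cfg"     # stand-in for a real config path

def create_dns_monitoring_config(path=MONITORING_CONFIG):
    """The original provisioning step: write out the monitoring config."""
    with open(path, "w") as f:
        f.write("target: dns.cluster.local\ninterval: 30s\n")

def test_dns_monitoring_config_exists(path=MONITORING_CONFIG):
    """The ProdTest-style check: is the config where we expect it?"""
    return os.path.exists(path)

def fix_dns_monitoring_create_config(path=MONITORING_CONFIG):
    """The paired fix -- effectively a second code path doing the same work."""
    create_dns_monitoring_config(path)

if __name__ == "__main__":
    if not test_dns_monitoring_config_exists():
        print("config missing, applying fix...")
        fix_dns_monitoring_create_config()
    # Re-run the check; note the smell Outlaw points out: the fix duplicates
    # the work the original provisioning code already did.
    print("config exists:", test_dns_monitoring_config_exists())
```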
Well, if you remember right, they had also talked about here, I think in this particular section, that the retry times or
whatever were set on intervals that were sort of long. I want to say it was like 15 minutes or
something like that. Right. And so that they said, what would happen is things would get out of a
good state in that time when it was trying to run the next retry. And so because, because it wasn't
like, Hey, this thing failed immediately, try and
fix it. It was, Hey, it failed. It's going to check in 15 minutes again to see if everything
was good. And it's going to go through and do all these tests. And then it's going to fail again.
If you had a hundred things and, and it's having to go through and wait 15 minutes before it can
fix the next one, the next one, it would get into this like really long loop.
But then they said during that time, somehow the state could get in bad.
Right.
And we don't know what the internals of whatever that might've meant.
Right.
Like what happened,
but suffice it to say that if the fix might've happened immediately,
then it could have probably curtailed some of that,
that those problems that were happening.
But you know, it's hard to say.
Yeah. I'm just more or less getting to the point, though, that because you could have these tests that would fail the first time and then later succeed because some automated fix ran, like, it fixed itself. So that's good, but you as a person now don't trust the failures. And so that, like, makes you, I don't know, like, you know what I'm trying to say, though? Because you're going to just, in your mind, think, like, okay, well, it'll probably fix itself. You're not going to drop everything, like, oh, the test failed, let me figure out what's going on. Instead, you're going to reinforce this behavior among your team to where you're like, oh, just wait and see if it fails twice in a row, which, like, to your point, could be, like, 30 minutes, you know, or 15 minutes later, so.
Yeah, and for anybody new to the show, or people that aren't, like, super familiar with computer science-y terms: when you say idempotent, that basically means you can just do the same thing over and over and over and expect the same exact result, right? Like, it's like adding two plus two.
So, um, it's such a weird word.
It is.
Yeah.
So they're like, obviously like a bank transaction is not idempotent.
So if you run a code that says minus $15 and you run it again, you're down $30 and down $45 and then who knows what's next.
Right.
That's not idempotent.
So what they were talking about in their scripts was, let's say set DNS, right?
Like set DNS settings. You know, the whole thought is if they run that
and, you know, it was supposed to put in a certain set of values for DNS, the next time it runs,
it shouldn't add more values. It should make sure that the ending state of those DNS settings is
exactly what that script wanted, right? So that's when they say idempotent, that's what they mean, is being able to run these things with the expected state being done at the end of it.
Yeah. I mean, going back to your point, like, if you're new to it, like, a calculator should be idempotent, right? If you say two plus two, you should always get five, no matter what.
Right, totally. You should always get the same answer every time you do it.
Yep. Even if it's wrong.
Wait, what?
I got that answer from Joe, so I'm pretty sure math and the chicken would not let me down.
That is correct.
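Joking aside, the "set DNS" example from a minute ago is a handy way to see the difference: an idempotent script declares the end state and converges to it, so running it twice leaves the same result, while a naive append-style script piles up duplicates. A small hypothetical sketch:

```python
# Hypothetical sketch contrasting a non-idempotent "append" approach with an
# idempotent "declare the end state" approach for DNS server settings.

def add_dns_servers(config, servers):
    # Non-idempotent: run it twice and you get duplicate entries.
    config["dns_servers"] = config.get("dns_servers", []) + list(servers)
    return config

def set_dns_servers(config, servers):
    # Idempotent: the ending state is exactly what was asked for, every time.
    config["dns_servers"] = list(servers)
    return config

if __name__ == "__main__":
    wanted = ["10.0.0.53", "10.0.1.53"]

    cfg = {}
    add_dns_servers(cfg, wanted)
    add_dns_servers(cfg, wanted)
    print("append twice:", cfg["dns_servers"])   # four entries -- drifted state

    cfg = {}
    set_dns_servers(cfg, wanted)
    set_dns_servers(cfg, wanted)
    print("set twice:   ", cfg["dns_servers"])   # same two entries either way
```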
All right, so the next thing that they got into was specializing. And what they basically go into here is there's, like, three ways that automation can vary: there's competence, latency, and relevance. Competence is just, can it do it? Latency, how long it takes. And then relevance, I actually put the definition on this one, because I was like, huh? So they said: the proportion of real-world processes covered by automation. So basically they're going after the stuff that matters, I think, is what they're basically trying to say here.
Um, yeah. Go ahead.
I was just going to go on to the next part.
Yeah, go ahead.
Basically, the ability that they would use turn-up teams, and they use this word, turn-up, for it. Almost not like a...
Not like the vegetable.
Not a turnip.
Yeah, not a turnip.
Turnip.
A turn-up team that would just focus on automation tasks.
So teams of people in the same room that we all get together
and get things done quicker.
So you imagine these are a bunch of specialists who come in
and they're used to setting up clusters
and so they can kind of swoop in, get things set up,
and then move on to the next thing,
which sounds pretty nice. But again, it's kind of just a stepping stone in terms of evolution. They didn't stick with it for too long, because there could be, like they would say, actually over a thousand changes a day to running systems. Just a ton of stuff. And you've got to imagine all these people kind of shouting across the room, like, hey, did you set this up? All right, I'm running this now. And, you know, things happening at the same time on these computers, it gets confusing. Like, sometimes, you know, you imagine someone's restarting in the middle of your process running, or, you know, doing whatever needs to happen. Just kind of funky stuff you can imagine happening here.
Was this the same section where they were talking about how, coincidentally, they noticed a pattern where they were hiring a new engineer every time there was a new cluster?
That was in the previous one.
They were talking about bringing the clusters online.
But yeah, yeah, it was purely, they noticed it coincidentally, and then, you know, yeah.
All right, whatever. I'll move on.
Oh yeah. It was there. Trust me. Read it. It was.
But, uh, yeah. So, um,
anytime they had the automation code that wasn't staying in sync with the code it was covering,
that's when they actually said, like, that's when automation code dies.
And we've talked about that before.
When you've got automation codes, it's kind of like this glue stuff that runs around into those systems.
And sometimes people, like, add flags or change how things work a little bit.
If the system that's orchestrating and automating those things isn't aware of those changes, then it gets out of sync and it starts acting poorly, and people don't use it anymore, because, yeah, it just stinks. So you've got to kind of keep on top of this stuff.
Okay, so this is where this can get tricky now, hold on. Because this is where the debate of, is DevOps a title or a culture, can come up, right? Because basically what
they're saying here is that you have to have people that are passionate about these certain
areas, right? Like in this case, automation. And so if you like, imagine, for example,
you create a new build system and then no one else bothers to care about it and you walk away from it
to go work on something else now.
And now other people, like maybe you've like made it like super fast, right?
Like the build times are like, I don't know, 30 seconds.
But you now go away from it, and nobody else is maintaining it.
So they're adding in, you know, a lot of cruft or whatever and bloat.
And now it's 10 minutes, right?
Because nobody else was bothering to care,
you know, to maintain it, right? That's the type of example that they're talking about here is that
you have to have people who, whatever this automation might be, you know, that are passionate
about maintaining it and caring for it and the feeding of it that, you know, will keep, you know,
the performance updated or like as new changes are made.
Otherwise what will happen is it becomes stale and maybe even stops working because as things
change over time, that could be one, or it could be. So I would imagine like these specialized
teams that they got together were people that were like cluster delivery people, right? Like,
Hey, we're going to set up a new cluster. Well, some of the things that they used in the past have changed, right? Like,
so not necessarily like the automation system itself, but the software they're trying to
deliver, right? Like maybe the DNS system changed or maybe, you know, some other thing that they're
putting on these change and they're just not aware of those changes. So the things that they'd
automated in the past no longer work the same way at all.
So it could be a combination of the two, right? It could be the actual automation. It could be
the underlying systems that changed that the automation was written around. And now, and now
they're kind of in a really bad situation, right? Because they're not aware of the changes. Yeah.
You weren't, you weren't maintaining that automation code as, as new things would,
as software would change around it. And so therefore, like, you know, you're using old flags, you know, for, like, something like Git or something like that, and now those options aren't available, and so your automation starts to fall apart.
Yep.
Yeah, they mentioned they created some bad side effects here. And this is kind of the argument
that people make or one of the arguments people make when talking about
DevOps not being a role.
Like what we're talking about here
is basically kind of
a traditional ops type scenario.
We've got one team
that's in charge of running this stuff
and another team
who's in charge of developing it.
And then you've got these
incentives where the ops team
just wants to get this thing green.
They want to get it running
and any problems that come up,
they're like, well,
that's the product team
where someone else is going to have to fix that. And the product team is delivering, is developing this stuff, and they don't really have an incentive to make it easier to stand up, so they just kind of want to add features, features, features. They don't want to make it easier to run, or, you know, all that sort of stuff. And who is more qualified to kind of know what's wrong and set stuff up than the product team that was actually developing it? So they had a bad split here, and it just wasn't really working out. So that's why I say it was a stepping stone, part of the evolution, but ultimately it wasn't a good thing, and they ended up getting away from it, because turn-ups ended up being inaccurate and taking forever. Or, one of those things: inaccurate, high latency, and incompetence. So basically the three kind of anti-patterns when it comes to automation.
This section, that was kind of interesting. I don't
know if you guys got this takeaway from it. So keep in mind the timeframe of when, um, you know, the SRE movement within Google, I think the book called it out, was, like, in the mid-2000s, right? And they were writing about this, like, after the fact, right? So this is mid-2000s. This was the section where they were talking about how, specific to their use of SSH, because they were using SSH to automate a lot of things, but that would require root access on these machines in order to install and make configuration changes, which they admitted was, like, you know, clumsy from a security point of view. But they referred to it as, however, this is the quote: however, an unrelated security mandate allowed us out of this trap. Because those high-latency, inaccurate, incompetent turn-up teams that you were just referring to, Joe, was the quote trap.
So an unrelated security mandate allowed us out of this trap.
And I immediately was like Snowden.
This was,
this had to be related to that,
right?
Like when,
when all the leakage of the documents.
Yeah. That, that's what I was thinking of from a timeframe point of view.
Did you guys make that connection?
Because that was around the time where everything started to get...
Everything across the internet seemed to get more secure.
We started caring about, is it HTTP versus HTTPS?
And everything was putting tight controls around security.
I don't know.
I just thought it was kind of interesting.
Maybe that wasn't related.
Maybe I could be wrong.
And it was like later the heartbleed or something like that
that made them decide to change it.
But just trying to relate the real world kind of time frames
of like what was happening.
The point is like there were other things happening in the world
and they had this process that they were using, and they admitted that it had high latency and it was inaccurate and had some incompetencies about it. But they were using it, and they might not have ever changed had these other outside factors not influenced their thought on it, you know? Maybe they wouldn't be where they are today because of it.
Snowden was 2013.
I had to look it up.
I couldn't remember.
It was 2013?
2013.
So they mentioned it was in response to advanced persistent threats.
So who knows?
Maybe this involved governments or maybe there was some sort of hacking incident that we didn't know about.
But either way, basically they said they had a security mandate that said, hey, no more shelling into individual computers.
And so what they ended up doing about it was pretty cool and specific.
But basically, to kind of boil it down, they ended up creating this kind of, I forget what they call it, like admin manager, admin service.
We'll get into it here in a minute.
Okay.
But yeah, it was basically just a service that would run somewhere.
And it was in charge of making the changes.
And you would tell it what to do, but there's a full audit trail and all sorts of good stuff and then that way nobody had to have root
on these machines like they in fact they weren't allowed to have root uh so it was just better all
around but what it did is it made like all those shell scripts and stuff that they had written
uh kind of not necessarily moot you know but um it was a good time to reform they had to change
right yeah well when when did you say this, Snowden?
2013.
I think it said June 2013.
Okay, so maybe it was Heartbleed then,
because I thought Heartbleed came after,
and according to Wikipedia, Heartbleed was February of 12.
So maybe that's what the advanced security,
persistent security threats that made them move it.
But I was thinking like, yeah.
Definitely like the Snowden stuff would have
been more external and not internal, but still, it made me think that, like, you know, as an industry, people started becoming more, like, security-minded first, you know.
Yep.
So, all right. Well, um, Joe, I don't think you can do this anymore.
Yeah, I don't give up.
Yeah, you failed us.
So if you haven't already left us a review, we would greatly appreciate it
if you would leave us a review. No one stars though, like Joe asked for last time.
Please don't do that. But I mean, if that's how you feel, that's how you feel.
I can't tell you how to feel. So you can find some
helpful links at www.codingblocks.net slash review.
And, yeah, it really does put a smile on our face when we read that.
If you're like, man, I'd really like to buy these guys a beer as a way to thank them for everything,
hey, just leave us a review.
It's cheap for you, puts a smile on our face, and works for everybody, right?
Because depending on what city you live in, like, a beer could be expensive.
It can be expensive, and it could be a SweetWater 420, which just kind of isn't good. So, I mean, you know.
Why, why did you go there? Like, why do IPAs exist?
That's right. I was being so nice, there was nothing mean about anything I said, and then all of a sudden you had to take it to this dark place.
I'll take a Bud Light or Coors Light over a SweetWater, I'm just saying.
Well, I ate a clock yesterday. I mean, if we're talking about, like, you know, things that we eat and drink.
Eight o'clock?
Yeah, I ate a clock. It was so time consuming.
All right. So a few episodes back, you know, we're talking about this SRE book, and there's
So a few episodes back, you know, we're talking about this SRE book and there's a lot of bleed over with DevOps and whatnot.
And so we asked, how do we feel about DevOps?
So your choices were love it.
It's the greatest.
Or, it's OK when things work. Or, no, I'm sorry. It's great when things work.
It's okay, but overrated is the third choice.
Or, I wish we had a good DevOps pipeline.
Or lastly, it's a dream.
Nobody really does that.
This whole book is a lie.
All right.
That part wasn't in there, but it should have been.
It was, you know.
What is this?
188.
So according to Tateko's trademark rules of engagement, Jay-Z, you are first.
Okay.
Well, I think that people wish they had a good DevOps pipeline.
And I'm going to say 30 to 33% said that.
Wait, what?
You're giving me a range?
Yep.
That's not how this game works, sir.
It's to be lazily evaluated upon reveal of the answer.
He's kind of lazy a bit, Var.
Apparently.
Lazy of T?
There's no side effects, so you can just run this later.
It's an idempotent type of thing here.
Yes.
All right, so I'm going to go, it's great when things work, and I'll go 30 percent.
Oh, a single number.
Yeah, right.
Yes, yes, it's a daring move, and you both are, like, in the same range. Right? See what I did there?
Yeah. So Joe says, I wish we had a good DevOps pipeline, 30 to 33 percent, and Alan says, it's great when things work, 30 percent. Ladies and gentlemen, we have a winner. Oh, not only do we have a winner of who picked the right option, but they also did not go over.
Okay, that's like a double whammy win.
But I thought whammies were supposed to be, like, a bad thing. Remember that old game? No whammies.
Press your luck.
But this one, we're going to flip the script.
Whammies are a good thing here.
So you got a double whammy.
Alan is the winner.
Yeah.
Okay.
Yeah.
It was 50%.
It's great when things work.
That's pretty high.
That's pretty good.
Yeah.
Yeah.
Do you know the unfortunate thing about that, then, though, is that things are not as smooth as what they should be, right?
Either that, or, you know, that's the glass half full version of it. You could also just say, like, well, maybe 50% don't have it. There's that.
Although there were some that were, like, in the love it category, but, you know.
Okay, so definitely there were others that had it. Cool. But yeah.
Yeah. Oh, did we want to mention, so we recently found that you can see all the polls that we've ever done, and you can still vote on them and see the results. It's pretty interesting, uh, I don't know if you've got a day or two available.
Yeah, I didn't create a short link for that, though. Um, that was from the plugin. We could create a short link for it if you wanted to.
Yeah, what's a good name for the short link? I'll do it right now. What was the, is polls, is that an option?
No, it looks like that already exists, actually. Can we repurpose that one?
Yeah, I don't think, can you?
Yeah, polls is the one that I set up, so I'll change that one and make it.
Yeah, so it's going to be codingblocks.net slash polls.
Yeah, let's just do that.
On the fly, you know, code review and edit and whatever.
Yeah.
Yeah.
So how about for this episode survey, we ask,
for your day job, are you primarily working dot, dot, working... I had a typo here.
Michael, your grammar.
I said in cloud, but I meant in the cloud.
That's where my head's at.
And you can tell that that's where my head was at when I wrote that answer when I just said in cloud.
Or on-prem, we like to think we control our servers.
Or a hybrid, we can't make up our minds.
Or local desktop application, keeping it old school.
Or it's all about mobile.
I would have totally forgotten about mobile, which is crazy because it's probably one of the biggest ones out there now.
You know why?
Okay, so total tangent alert.
All right, so one thing that's been on my mind lately, like, I want just stupid, kind of dumb, brainless games to play on the phone, right? But what totally bothers me, like, I can't stand some of the games with, like, the just inundation of ads, right? I can't play them.
I know, it's like, some games are just awful, like, every single time you restart a level. Because some games are, aren't, like, some games are just, like, take, for example, let's take something as silly as, like, a Minesweeper or Solitaire, where it's like, you know, you're going to redo it over and over and over, right? And after each time that you play the game, you're going to, you know, have to, like,
watch some ad or something, you know, and, and some games that can be really annoying because
based on the type of game that it is, you know, there might be like a high restart frequency.
And so now, yeah, I mean, you know, the developer, they're making, you know, buckets
of money from it probably, but which is why they do it. But I've kind of had this urge to just
create like ad free games, like open source, ad free games and open source so that you can like
inspect the code and see like, Hey, there's nothing tracking in here. Cause that's the
other thing that bothers me is like nowadays it's like, I don't know what you can and can't trust. Like, how do I know that that game isn't like, you know, accessing some
library that it shouldn't be in tracking something that it shouldn't be or whatever. And did you guys
see, like, uh, we've, I guess maybe picked on it too much, but, uh, like, you know, related to TikTok. And this week there was an article that I read on Engadget, I'll put it in the show notes, but, um, where the, I think it was the FCC, or at least a member of the FCC, I think that was more accurately the way to describe it, that was strongly urging Apple and Google to take TikTok off of their app stores because of
various security concerns that they had for the concerns of the American citizens.
So yeah, that's why mobile has been on my mind. It's just not necessarily related to TikTok,
but just the desire to create just some kind of stupid game
that I don't care about.
I mean, for what it's worth, those games,
I would gladly pay $2 just to get rid of the ads.
But then you don't know that they're not tracking you.
Yeah, you don't know that.
That's true, too.
But I cannot stand the games that do exactly what you're saying, where it's, you know, constant. I uninstall them almost immediately if it turns out to be one of those. There's a very small part of me that's like, I kind of want to just create my own. Because, I don't know about you, but anything I play on the phone is just because I have 30 seconds to kill, and, you know, my ADD can't... I must have something to do, right? You know what I'm saying?
So, uh, man, battle royale.
Yeah, whatever it is, you know, I don't care. That's fine.
Yeah. These and other crazy things are the types of things that go through my mind. And, like, you know, what am I going to eat? That's another one I'm always focused on. Like, you know, I was going to go on an all-almond diet. And then I thought, that's just nuts.
You know, aren't almonds not nuts?
Don't try to take away from my joke, sir.
Let's see here.
I just heard this the other day. Despite their common label... Nope. Almonds are not true nuts; they're not a type of dry fruit, but rather seeds enclosed in a hard fruit covering.
They're fruit.
Well, they're seeds. They're classified as drupes.
Would that make peanuts also the same? Because they have a hard covering.
I don't know. Let's see. Mmm. Mmm. The fruits of cashew, almond, and pistachio plants are not true nuts, but are rather classified as drupes, and peanuts are legumes. So they aren't nuts. They're edible seeds, so similar, but they grow in pods, so they're legumes.
What about pecans? Aren't pecans similar to almonds? Oh wait, was that the one you just named off, Alan?
Uh, the cashew, almond, and pistachio plants. Oh, pecans are nuts. Wow. Okay.
Well, this and more things, this is what you can learn from listening to Coding Blocks. So, uh, like I said, if you haven't already subscribed, you know, there's some helpful links there. Maybe a friend gave you a link or something, like, you should listen to what this crazy guy said about almonds, that's just insane. How did he not know that? That's nuts.
Yeah, that's nuts. Or it's not, apparently. We're learning.
So, yeah, but let's get back into, uh, Google and service-oriented cluster turnup, because that's where this ended up going.
Oh, sorry. But yeah, this is kind of what we already alluded to before, though: because of their use of shell scripts and everything, and what they were doing with SSH, and how they were maybe abusing it and what you can get away with, right?
They ended up deciding that as a part of this security threat,
they needed to move to a different architecture,
and that architecture ended up becoming a service-oriented architecture
where they could have one kind of control server
that could run those tasks from there in an RPC kind of fashion.
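To make that a little more concrete, here's a minimal sketch of the idea, not anything from the book and not Google's actual stack: each product team exposes a couple of admin RPC methods for its own service, and a central control process calls those instead of SSHing into boxes. The service names and methods are invented, and Python's standard-library XML-RPC is just standing in for whatever RPC framework you'd really use.

```python
# Toy sketch (invented names): a product team exposes admin RPCs for its own
# service, and a central control server calls them instead of SSHing around.
from xmlrpc.server import SimpleXMLRPCServer

def health_check():
    """Report whether this team's service thinks it is healthy."""
    return "OK"

def failover(target_replica):
    """The service owns the details of failing itself over."""
    # ... promote target_replica, repoint clients, etc. (stubbed out) ...
    return f"failed over to {target_replica}"

if __name__ == "__main__":
    server = SimpleXMLRPCServer(("0.0.0.0", 8000), allow_none=True)
    server.register_function(health_check)
    server.register_function(failover)
    # The central admin server would then call something like:
    #   ServerProxy("http://ads-db-admin.internal:8000").failover("replica-2")
    server.serve_forever()
```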
Yep.
And this is where this chapter started, like, clicking to me,
where there were things that we talked about and stuff.
It was good, but then I started to kind of understand.
And what I mean by that is that we're talking about the product team writing a service,
you know, a series of services, so a service-oriented, whatever the A stands for, architecture, that would be in charge of creating these other services and making sure they were stood up right, and cycling them, doing failovers, doing whatever they need to do for the services. So keep that in mind. And with that said, um, I'm going to describe
this flow real quick and then i'll tie up the point here so the flow went from operators triggering
manual actions with no automation to operators writing system-specific automations to externally
maintained generic automation moving to internally maintained system-specific automation, and finally ending
up at autonomous systems that need no human intervention.
Now, the reason I want to call that and why I blasted that out is because of the word
operator.
Now, if you think about or if you're familiar with Kubernetes, there is this concept of
an operator, which conceptually kind of acts almost like a person on your team that's in charge of kind of keeping these things or keeping your services and pods and all your various Kubernetes resources in shape.
And so when you need a change to your services, you ask the operator to do it by changing the definition of the operator's resources.
And so when I was kind of reading about this stuff, it was like, oh, this is where operators came from. This is where the people who write the service, like
Postgres or something, obviously in this chapter it's going to be internal Google
tools, but they're responsible for publishing this API
that is in charge of kind of running things and making changes
and hides all the various details of that stuff. And I was like, oh, that's exactly what a
Kubernetes operator is, essentially. And as you know, in this chapter we're kind of talking about the evolution of Borg, going from, you know, step one is just people shelling into individual machines, until you get to the opposite end of things where you're talking about a Kubernetes-type, Borg-type thing: we've got this massive kind of global computer that's doing all these things and self-healing and keeping things running. And so I can see how this is not only an important piece of the automation story, but also an important theoretical concept that allows Kubernetes to be what it is.
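As a thumbnail of that operator idea, here's a bare-bones reconcile loop sketch. This is not the Kubernetes API and not anything from the chapter; the names and the in-memory "cluster" are made up just to show the shape: desired state is declared as data, and a loop keeps nudging what's actually running toward it.

```python
# Bare-bones sketch of the operator/reconcile idea (invented names, not the
# real Kubernetes API): desired state is data, and a loop closes the gap.
import time

desired = {"web": 3, "worker": 2}   # what we asked the "operator" for
running = {"web": 1}                # what is actually running right now

def start_instance(service):
    running[service] = running.get(service, 0) + 1
    print(f"started one {service} instance")

def stop_instance(service):
    running[service] -= 1
    print(f"stopped one {service} instance")

def reconcile():
    """One pass: compare desired vs. observed and act on the difference."""
    for service, want in desired.items():
        have = running.get(service, 0)
        for _ in range(want - have):   # too few? start more
            start_instance(service)
        for _ in range(have - want):   # too many? stop some
            stop_instance(service)

if __name__ == "__main__":
    for _ in range(3):                 # a real operator runs this loop forever
        reconcile()
        time.sleep(1)
```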
I mean, if we hadn't called this out before, but just to back up for a moment, because this flow that you just described was previously called out in the previous episode, where we had referred to it as the maturity model in the show notes. But they had this hierarchy of classes of automation, and they give some examples right here. No automation was, again, where the database master is failed over manually between locations. The second one, externally maintained system-specific automation, is where you might have a failover script in your home directory, for example. And then there's the externally maintained generic automation, where now you have added support for that database as you've added that script to some repository where everybody can use it, and that script has support for a generic failover where, like, I could specify the database name and host name as parameters, maybe. And then the fourth one, the internally maintained system-specific automation, was the example where the database itself, in this case we're talking about MySQL in their example, might ship with its own failover script. And then the fifth example, that we refer to here as autonomous systems that need no human intervention, we had previously called out as systems that don't need any automation, and that was because the database itself noticed the problem and automatically failed over without human intervention.
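To picture that middle step, the externally maintained generic automation, here's a sketch of what the shared, parameterized failover script might look like. Everything in it is invented for illustration; the point is just that the database name and hosts come in as parameters instead of being hard-coded in somebody's home directory.

```python
# Sketch of the "generic automation in a shared repo" step (all names invented):
# the failover helper takes the database and hosts as parameters so any team
# can reuse it, instead of a one-off script sitting in someone's home directory.
import argparse

def failover(db_name, old_primary, new_primary):
    """Promote new_primary and repoint clients for db_name (details stubbed out)."""
    print(f"[{db_name}] demoting {old_primary}")
    print(f"[{db_name}] promoting {new_primary}")
    print(f"[{db_name}] updating client config to point at {new_primary}")

if __name__ == "__main__":
    parser = argparse.ArgumentParser(description="Generic database failover")
    parser.add_argument("--db", required=True, help="logical database name")
    parser.add_argument("--from-host", required=True, dest="old_primary")
    parser.add_argument("--to-host", required=True, dest="new_primary")
    args = parser.parse_args()
    failover(args.db, args.old_primary, args.new_primary)
```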
Yep, and think about prod tests.
We just talked about prod tests and talked about, you know,
it was a series of unit tests and they could fail and they could try to fix themselves.
We said, ultimately, that was kind of not so great because it was kind of flaky and there was weird stuff going on.
And we moved to a centralized system now
where we had these kind of admin servers
that were responsible for kind of making sure things were right.
And it's a big conceptual change
to go from like agent-based things
that are fixing themselves to these operators,
which can fix other computers,
but also orchestrate changes across computers.
So it may not just be computers. Now it's not just these virtual machines or whatever they are, containers, but now it's requisitioning, or I forget the word I'm looking for, but basically standing up load balancers and setting up a cloud DNS or, you know, whatever, like actually doing provisioning.
Yeah.
Provisioning multiple servers.
So it's not so much about the individual computers in my cluster as it is, uh, kind of everything. It's like these operators provide an interface to be used, and can use other interfaces to do other things and affect the system in ways that are outside just its individual components.
Hey, I don't know if we called it out, but we talked about these admin servers and things and how it replaced shell. Did we mention that it was basically the group that was in charge of their service
would set up a remote procedure call that could be called
because they knew when their service should be in a good state
and the kinds of things it needed to do.
So each team would sort of manage their own service,
and then these admin servers would call those RPC methods whenever it knew that it needed to do the next thing.
So by getting away from the shell scripts, now you got rid of the root access.
Plus, now you've also got something that is called in a standard way and has an audit trail.
They can put ACLs around it.
What is that?
Something control access control? Something control list. Access control list.
Access control list.
So basically they could make sure that only users or systems
that have the right privileges could call these things.
So that was the SOA thing that Jay-Z and Outlaw mentioned a minute ago. And they basically turned it away from shell scripts into regular software, is more or less what they ended up doing, right?
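Here's a toy sketch of what that buys you compared to raw SSH: one code path that every admin action funnels through, so an access control list and an audit trail are easy to bolt on. The caller names, actions, and log format are all made up for the example.

```python
# Toy sketch (invented names) of what the admin-server approach buys you over
# raw SSH: every action goes through one code path that checks an ACL and
# leaves an audit trail.
import logging

logging.basicConfig(filename="admin_audit.log", level=logging.INFO,
                    format="%(asctime)s %(message)s")

# Which callers may invoke which admin actions.
ACL = {
    "restart_service": {"sre-oncall", "release-bot"},
    "wipe_disk": {"decommission-bot"},
}

def call_admin_action(caller, action, target):
    allowed = ACL.get(action, set())
    if caller not in allowed:
        logging.info("DENIED  caller=%s action=%s target=%s", caller, action, target)
        raise PermissionError(f"{caller} may not call {action}")
    logging.info("ALLOWED caller=%s action=%s target=%s", caller, action, target)
    # ... dispatch the actual RPC to the service that owns `target` ...
    return f"{action} issued for {target}"

print(call_admin_action("sre-oncall", "restart_service", "ads-db-7"))
```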
know rumored or at least suggested that like if you hear something that google is doing that or
google did you know like if you're reading about in a book then that is something that they've
already moved away from and that was something they were doing like 10 years ago.
Right.
Like you guys have heard that kind of thing too.
So like,
as it relates to this,
sorry thing that like they probably have something greater that they're
doing now.
And like,
you know,
this was a practice that they did 10 years ago and they're sharing it to
the world,
you know,
and they might have evolved onto something much better.
Now,
if you,
if you take that at face value and think like,
okay,
that's how Google operates, right?
They don't give the secret sauce away until they no longer need it and they found something better.
Do you think, how amazing must it be then that they're using something better than Kubernetes now, if they are, internally?
That's kind of mind-boggling, right?
Yeah.
I mean, I'm pretty much in love with Kubernetes at this point.
Yeah.
Right.
Yeah.
Same.
But you know, I think it's interesting.
I think what you're talking about is like the software they use, but I think like this
book, a lot of it is the patterns of how they ended up getting to a point to where things
didn't suck.
Kind of.
Right.
I mean, that's sort of... like, this is why we don't put Alan in charge of the episode titles. Things That Don't Suck, right? Episode 189, I think it's real. But I think that's the notion, and the steps, the pains they went through to get to a point to where things actually worked the way they wanted them to, right? Like, their little specialized teams, they admitted that failed. Like, it just didn't work well as time went on. And so, yeah, definitely they're probably using something, you know, that's gone past Kubernetes or something now, but at least getting to the point to where they felt like they were being successful and not chasing
their tails on issues it seems like they've been open and honest about that kind of stuff you know
The one thing, though, that I thought was, like, kind of going back to this part of the chapter from the previous episode that we talked about, I think it was a previous episode, where they were talking about how they would, you know, build their own system so that they could write APIs around it or anything like that.
And I was thinking about it from a maturity kind of point of view from the company, right?
Like, you know, that takes a rather large company where like everyone in the company
has the same kind of drive, but also similar skill sets across the company, you know, across that large group of
individuals. Right. And by that, what I mean is it's not like, you know, a large company where
some portion of the crew might be like office workers managing money. And then you have like
line workers that are, you know, and so there's like a large, you know, skills gap there between
those people. Because, you know, if you you think about at Google, if you just assume
that a large, it's a software company, so a large
portion of the company you would assume is, or the majority of the company is software developers.
To have the time and focus to say, oh, I don't
like this tool that's freely available, because while it does work and
solves the job, I can't write an API around it easily.
So I'm going to write my own... like, that's definitely a maturity-level kind of thing.
And like one of the things that super hit home with me this week was that we
have some software that we use to maintain schedules for like, you know, who's
primary and secondary on calls, right? And it is, the software that we're using is not as,
you know, we talked about the Grafana on call last time. And I can't speak to like how easy or not
that is. But what we're using, I'm not going to throw it under the bus by name, but it is also just a train smash of awfulness.
Like it is so unnecessarily, the interface is just ridiculously, you know, confusing and whatnot. But then like, you know, not having this API around it, you know, that we
could interact with made me appreciate Google's point that they made in that, I think it was this
chapter, right? Or, you know, earlier in the chapter, or maybe it was last one where they
were saying like, we would favor just writing our own version of that thing so that we can control
it, right? And then now you're like, well, that's silly. And in your on-call example, you want an API, and like, yeah, because what if, like,
I'm the on-call person and I schedule some time off. Now I have this calendaring system that's
completely unrelated to this on-call system, but one could talk to the other and like, you know,
oh, he, he scheduled this time off. Well, then I need to rearrange the on-call calendar
to substitute somebody else in automatically
who's next in the rotation, right?
So, yeah, it made me totally appreciate their take on,
like, we'll just write it ourselves.
Well, I think it's a combination of maturity
and just resources, right?
Like, I mean, you could have a killer development team that's
mature and can do the stuff but if you i mean if if you don't have the money or the number of
developers to be like oh yeah we're just gonna write our own scheduling app that's what i meant
by like the number when i was using the example of like a line worker and then the the office
worker like that that skills gap because like i assume and this is you know probably not a fair assumption but uh you know because of google being the type of company
they are, I assume that, like, the largest portion of their workforce is probably, like, you know, 85% software developers in some kind of way, whether they're classified as SREs or whatever, versus, you know, you might take a company like,
I don't know, Ford, you know, where there's a large group... like, the guys that are designing the cars are probably a small fraction.
Like, you know, they might be 10% of Ford or less, right?
Whereas the people who were actually like putting together the thing that you
designed might make up a larger portion of the company.
Yeah.
Yeah.
I don't know.
That wasn't a tangent.
So I'm not going to do a tangent alert for that one.
I think that one was related.
So, uh, the next section was the one that really sealed the deal. I was like, okay. And I actually went back and re-read it; we listened to the chapter after reading this one, because it kind of put everything else in perspective for me. So it's kind of like, uh, I like when things begin with, tell me where you're going, and then let's go back. So this was my favorite one, and there's some carryover too, but this is about kind of the birth of Borg.
They said in the early days, Google's clusters were racks of machines with specific purposes.
And this is where they talked about having developers that would start every three months or so.
And that's about how long it would take to stand up a cluster.
And so a new employee would come and they'd be like, hey, tell you what, you're in charge of this turn up.
It'll help you learn the ropes and stuff.
And then you'll be able to help, you know, new other people when they come on.
And so, you know, these people would start up.
They'd have a bunch of readmes.
They'd have scripts.
They'd have things checked into repositories or kind of be shared around.
And we're talking about back in the days when, like, devs would, like, log into these machines and have things like golden binaries.
Like, this is the version that we're
installing it was uh delivered to us last month and this is the version that we're gonna be
installing for the next three months until the next version rolls out whatever like that kind
of thing so this is like super early days but as google grew the number of clusters and machines
started getting out of hand so the scripts had to get better just kind of by the definition you
know they couldn't like hire somebody new every time they needed a new cluster
and to run that stuff it just doesn't really scale very well so uh these things had to get
better and this is all the stuff we've already kind of talked about um i also mentioned like
shelling into machines to look at logs and doing regex parsing. That's not something you could do when you have a million computers, you know, it just doesn't make sense, and that's where Google's heading. You know, the Google we know now, like, who knows how many actual servers they operate, right? But I'm pretty sure they have more servers than employees, for sure.
Well, I think
this was the chapter, too, where they originally talked about, like, it was in a single building or something like that. Was that where they were talking about how the clusters were deployed in a single building, and then as it grew in scale to where, you know, data centers were around the world or whatever, that's when it became more of an issue? They did, because they had even mentioned that originally they named their machines a particular way, and then they were like, oh wait a second, we have too many now, we need patterns. Yeah, so you could assume, like, data center and domain names and whatnot.
Yeah, so that's where they were like, yeah, things just got out of control, right?
Like they got too big.
And then this is where Jay-Z was going with all this.
Yeah, I remember back in the day going to meetups, so you'd be able to meet someone new. I'm like, oh, you run web servers? Well, what's your naming convention for your servers? You Greek gods or Roman gods?
Oh, I was gonna say, like...
I remember one of them being, like, uh, the Seven Dwarfs from Snow White. Remember that? Yeah. Or Transformers. Like, I remember all sorts of creative names people would give servers, and they were like pets, right? That's very different from how we talk about and think about these things, but this is the world they were coming from. And this, in particular, is the point
in the notes that you can't see that uh kind of flipped the switch in my brain where they said
automation eventually evolved to storing the state of machines in a proper database with
sophisticated monitoring tools and this is something where i was like duh why like why
have i not thought about this problem and like until then like i i've always thought
about things like even cloud resources and stuff is like things in an environment that i would go
out and i would have my shell script go check and go look for these things and apply the actions
whatever i would shell into these machines i've got the you know bookmarks in my browser but
things got so big that google started storing this information in a database and to me like
that seems kind of like such an obvious evolution i just never got to that step i never thought
about having a database for our servers and you can go and see uh you know what the version of
the operating system is, what's the status, what's the last time we heard from it, how long has it been running, what's its uptime, what rack is it in, what's its location. Like, that's all information that I just, you know, like my kind of small-time thinking, you went out
and got when you needed uh but when you flip the switch and say no like let's keep this stuff in
the database and we'll keep that database up to date with various agents or different polling or
whatever but uh once you start getting data into a database then my programmer brain's like oh i
know what to do with data right like i'm oh, I know what to do with data, right? Like I'm a programmer. I know what to do with software and getting stuff out
of databases. And I know how to make things go affect the real world based on changes in the
database. So you could say like, you know, have a little web app where I can go and say,
restart these 10 servers. I do a couple little checkboxes, I hit apply, right? And I know I've
got some process out there that's going to be watching this database and say, oh, okay, I need to go restart these servers.
Once you've turned your servers, your infrastructure, all this stuff into data,
suddenly you've turned this from like a hardware operations problem into a software problem.
And like programmers are good with that stuff. So I can imagine like this is a big leap in like
Google's kind of productivity and scalability. Well, I mean, also security, too. Like, think about it,
going back to that Heartbleed example, right? Like if you have thousands upon thousands of servers,
and you don't have all this data centralized, and then an issue like Heartbleed comes out,
you're like, okay, how many servers are impacted by this vulnerability? How many have
I already fixed and how many I have left, right? Like you're going to go run a script that's going
to run some SSH command on every one of those thousands of servers. No, that'd take forever,
right? Versus if you had it all centralized, then you can just, you know, it's a simple,
you know, SQL query, right? So to put what Jay-Z just said into the words that they had in the chapter, that also was kind of a turning point in my brain about how Google handled this stuff. Like he just said, they turned it into data.
They stopped looking at hardware as hardware and they looked at it as just resources,
right? And, and basically what Jay-Z was saying is they catalog those resources, right? And basically what Jay-Z was saying is they catalog those resources, right? And so once you get to that
point, there's so much that you can do with it. And they started doing more with it.
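Here's a tiny sketch of that "machines are just rows in a database" idea, with an invented schema and SQLite standing in for whatever Google actually used: the Heartbleed question becomes a query, and "restart these servers" becomes data that some agent watches for and acts on.

```python
# Toy sketch (schema and values invented): once machine state lives in a table,
# fleet questions become queries and fleet actions become rows an agent acts on.
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("""CREATE TABLE machines (
    hostname TEXT, rack TEXT, os_version TEXT,
    openssl_version TEXT, needs_restart INTEGER DEFAULT 0)""")
db.executemany(
    "INSERT INTO machines (hostname, rack, os_version, openssl_version) VALUES (?, ?, ?, ?)",
    [("web-01", "r1", "9.1", "1.0.1f"),   # still on a Heartbleed-vulnerable OpenSSL
     ("web-02", "r1", "9.2", "1.0.1g"),
     ("db-01",  "r2", "9.1", "1.0.1f")])

# The Heartbleed question as a query instead of SSHing into every box:
vulnerable = db.execute(
    "SELECT hostname FROM machines WHERE openssl_version = '1.0.1f'").fetchall()
print("still vulnerable:", [h for (h,) in vulnerable])

# "Restart these servers" becomes data, too; an agent watching the table acts on it.
db.execute("UPDATE machines SET needs_restart = 1 WHERE rack = 'r1'")
for (host,) in db.execute("SELECT hostname FROM machines WHERE needs_restart = 1"):
    print(f"agent would restart {host} and then clear the flag")
```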
Can you imagine it totally separates the teams? So you're like, okay, hey, we need some hardware
HVAC power people, and we'll tell you how many buildings to go
build and plug this stuff in. And when you're done, plug it all in
and maybe you'll have some sort of process that will kind of, uh, investigate that stuff and put it into a database,
and then you're done you move on to the next building and then the software kind of takes
over and can say like hey these machines are for this for that you get into like software defined
networking and all sorts of cool stuff that is kind of evolved in like the last you know 20 years
or so uh for dealing with that stuff but like all of
that comes from having like a centralized database with all your resources in it which i it's just so
dumb that i never thought about like having having that in any places i've worked and i'm talking
you know like i'm not talking about nowadays where like you've got like a a lot of times you'll have
a cloud provider and if you want to know all your vms like you go to the vm screen and look at them
there um you know so obviously there's some sort of database they're not going out and looking
for all your stuff at that point so like and you know i obviously i knew that was there but i'm
talking about like back in like 90s and aughts and whatever where i'd be working in places i never
thought to have like my machines in a database like i i would have a list of them sometimes if
i needed to do um there are various tools for like kind of showing the multiple computers at a time and doing things and i
would just have a script that had like the list of all the servers and if somebody had a new server
you'd have to update your list it's just so dumb i was dealing with databases and doing all this
stuff, I never thought to put it in there. It's like, physician, heal thyself. That's why I'm not Google. If I had just made that one little leap...
You'd be Zacking things right now. That's right.
Okay, let me Zack that for you. Let me Zack into that machine.
That's amazing. Uh, would that make us Zackers then, I guess? Absolutely. I mean, yeah, it's like way better than Google, right? I guess we'll have to rename it, you know, Coding Zacks. That's right. There we go. So what he just said, right, is exactly what led into
the things that they've been able to do over time, which was
now they know of all the resources in their entire infrastructure, right? So they could start doing
other things and allocate those resources differently, right? So we all remember the
days where you had one computer and it did one thing, right? It had a database on it or it had
your application on it. And they started thinking
about, Hey, well, we can kind of sort of separate out these resources, right? So we have CPU
resources, we have Ram resources, we have all this stuff. And then they made it to where they could
start running different types of tasks on the same machines because it was just a pool of resources,
right? And that's, I think that's what they kind of named this particular section
was the birth of the warehouse-scale computer, right? So you don't think of it as 5,000 machines in this data warehouse; you think of it as, these are the compute resources we have available,
and that's how they started treating it yeah it's uh you know i i mentioned that kind of flip there
too like there was a part where they mentioned having file descriptors or computer descriptors on computers so you could shell into a machine and look at its info.txt and see what the machine's used for and what it's good for.
And this flips the script and says, no, the authority, the program of record, is actually the database.
And if the machines don't match the database, it's the machines that are wrong, not the other way around.
So something needs to go fix those machines, or decommission them, or just wipe them and start all over again. Those are the things that are in trouble. So it's just a really cool kind of switch there, from those servers being pets to the servers being treated kind of like cattle, or being treated like these flexible, reusable components.
You took it to a dark place there.
Yeah, that pets-versus-cattle comment did not go unnoticed, sir.
Yeah, I think we've gotten away from that analogy because of that, because it is kind of weird. But, uh, yeah, we don't say that so much anymore.
Yikes. But, um, I forgot what I was gonna say there.
Oh, sorry. Yeah, you ruined it. It was going to be brilliant.
It was going to be the next great
Zachification of the world.
So
what this ended up doing, right, when they started
treating all this stuff as resources,
now they could scale things without
having people do stuff, right?
It was all automated by their software,
by their controllers.
And this this nowadays,
you don't even realize it. You don't notice it, but there's, there's tons of machines that go up
and down all day and nobody cares. Nobody knows because it's all being managed behind the scenes,
right? If something dies, whatever, it gets picked up on another node, another machine,
another cluster, whatever. It just keeps running. I remember what I was going to say now.
So this really became super important as we were moving more into virtualization.
So we stopped even talking about computers at some point and talking about servers.
We started talking about virtual servers, and now we're talking about containers.
So you might have one computer, one node running, who knows, hundreds of pods,
hundreds of containers um maybe thousands i
don't know but um like once you start kind of looking at these things globally you stop thinking
even about the computers you just think about the resources in the way that in the units that are
comfortable to you and uh that you know again led into the kind of the birth of kubernetes but what's
also super important here is that this starts looking a lot like scheduling processes on like a cpu
or scheduling resources like memory and, uh, disk space and allocation. These are all things that programmers have been doing since programming was invented, you know, allocating space, scheduling processes. So again, it's kind of tying in this analogy where Borg slash Kubernetes is really kind of like a distributed computer. And once you've managed to shape your problem in such a way that it lines up with that metaphor, then you can start using the techniques of things like hub-and-spoke architectures and all the things that people have been studying now for 50, 60, however long years. You can suddenly apply these things to your hardware. I think it was a novel idea at the time.
I think we've established that all the good ideas came out in the 70s and we're only just
now beginning to understand and implement them.
It took us 50 years to be able to shape our problem into something that fit with those.
But yeah.
Yeah.
I mean, I was going to add on to what you're saying about like the, you know, how you would
rethink about these problems.
Like, I think that once you do get into this you know distributed computing kind of like
borg kubernetes model that you no longer think of it in terms of the hardware at all like who
cares about the memory ram cpu you know disk and stuff like that and instead you're just like is
the service up and available or not and if it's not just restart it like it's so you know or or
let it scale itself, you know, if it's falling behind due to latencies or whatnot.
Yep. And think about, like, auto-scaling. Like, all this stuff can't work. Like, the rise of cloud computing, none of that can happen if the systems weren't self-healing and you couldn't have a team of people working on swapping out hard drives or swapping out computers or putting new racks in and stuff like that, totally separate from what those computers are being used for.
If you think about Google, Amazon, AWS, Azure for Microsoft.
That whole thing is basically them selling you resources.
And it's totally divorced from the idea of the computer.
So it's kind of like what we're reading about here.
In a way, it's like the birth of cloud computing.
But it's also like here's another another analogy of way to think about it like right now in you know 2022 it's a
big deal to think about like cloud computing and kubernetes and stuff like that right but we're
talking about this is like it's almost more infrastructure-y as a service kind of stuff
that we're talking about right like you're providing the you know so i'm paying you to provide somebody to manage that
there is a computer that can host all these pods and all these containers and if there's a drive
you know a disk drive that needs to be uh swapped out like you're going to replace that disk or you
know if a new ethernet cable needs to be ran or whatever like you're gonna you're gonna do all
that for me right i mean what about like in a hundred years maybe we just treat this as like how we consider
the electric company today, right? We don't really think about electricity in terms of just how amazing it is that we even have this thing, right? It's just part of it, we take it for granted.
Well, you know, it's funny, though. I think you're onto something with that. Right now we use it as infrastructure as a service, right?
Very much so, like what you were talking about.
AWS, GCP, Azure, all of them are pushing towards software as a service, which is using their own infrastructure, right?
Like, basically, you'll just be able to use stuff that you don't have to think about, right? Like, I mean,
I know Azure has its machine learning, and AI type stuff out there that you can just,
you can just use, right? Like, you can put it in your own software, at some point,
it won't even be, hey, you can use this software, it'll be like, hey, just use the service,
and you're done, right? And I think a lot of them are pushing that way so that you're not thinking
about the ram you're not thinking about the cpus you're not thinking about any of that you just use
the thing and you're done um i think that's the push for everything right now yeah i'm just i'm
just thinking like you know after our lifetime well after our lifetimes like this won't even be
some like it'll be,
it'll just be such a building block that you'll just assume is there.
You'll completely take for granted.
You won't,
you know,
like the problems will be so much more grand.
Yeah.
At that point.
So the next section is actually kind of interesting here.
So they say reliability is the fundamental feature.
And when they say the fundamental feature,
they're talking about automation, right?
So this gets into something that's a little bit tricky.
They said the internal operations that automation relies on,
they need to be exposed to the people as well.
And the reason they say this is because as these systems got more automated
and more complicated the
ability for just regular people to reason about what was going on it starts deteriorating over
time because think about it right like if your systems run basically hands-free for a month
solid and then something goes wrong you haven't had to think about that thing for a month so now
if something goes wrong, where do you go? Where do you start? Like, where do you get into
there and figure out what's going on? So that's basically said that that is one of the biggest
issues that you run into is if you've automated these things, but you don't expose what that
automation is doing and what the internal states of the systems are, then people are going to have
a really hard time getting in there when something does fail. Yeah. We talked a little bit about that
last episode. Like if your phone breaks, you don't have the tool, like you literally cannot go in and
fix it. You can replace components. Sometimes, sometimes you just have to replace the whole
phone because you're so far divorced from, like, the actual underpinnings that even if you knew
what to do, you couldn't do it. Well, that's not true anymore because of the right to repair laws yeah are you not have you not seen this
where like now apple will ship you a uh set of tools and instructions on how to do the repair
yourself oh no have you really you really haven't okay i'm gonna send you this link that's got to
be right to repair laws kind of coming into effect but i mean like if your battery in your laptop
dies you're not going to crack that thing open, get a screwdriver and like fix the lithium cells, right?
Like you can't.
And even like your chip goes bad
because some transistors got kind of burned
or something like, not transistors,
but whatever they are.
You can't go in there and like straighten that out
with a toothpick, you know?
I'll have a link to it in the show notes for this episode.
But yeah, there was a, in april of this year uh apple's self-repair service is now available
where they will send you genuine apple parts and tools to do whatever the repair is and i'm pretty
sure if i remember right it included the instructions on how to do it so no pretty awesome
no you're not going to like open up the cells on a battery. Cause I mean,
even if it was like a, you know, double a battery, you wouldn't do that. But, uh, in the, in like the
tools that they send you, it's basically like a $50 rental, uh, of the, of the toolkit. So, you
know, yeah, it's pretty neat. That's interesting. So one of the other things that they say here is
when things get automated, they, they called it, there's a difference
when something is non-autonomous. Basically there are manual actions that were automated
that you assume could still be done manually, but that's not necessarily the case, right? So
whatever your automation is doing might've been based off what you did manually previously,
but sometimes that changes and you don't have access to the same underlying things
in a manual process that the automated stuff does. So that's where they say,
excuse me, that, you know, there, there can be problems over time as you automate things.
If you don't make sure there are ways for people to interact with those same systems um it's kind of like like your ears your processes make assumptions that you know
when you're doing it manually, you're not aware of, right? Or maybe you don't even have the rights... you may not even have the rights anymore to do some of the stuff manually, right? Like, they could have stripped all that away. Um, now they do go on to say, right, like we've talked about
all this stuff and outlaw even hit on it with Google has the resources, the maturity to do a
bunch of stuff. And so is this even, does this even matter for your company or for your business
or your software that you're writing or whatever? And the answer to that is still yes, right?
Because the main benefit you get out of automation is reliability.
That's consistency, right?
Like if you have something automated, if you have a person go do the same thing on 20 different
servers, they might mess up on one of them.
Why?
Because maybe somebody came by and said something to them in the middle of while they were doing
their 15th, right?
Who knows?
But when you automate that stuff, you now take that, that accidental thing out of play. And now you've, you've set up these systematic processes to, to go do these things.
And you make it to where like anybody can do it.
Anybody can do it.
It's faster and it's reliable.
So what they said is don't focus necessarily on...
I would like to call it consistency, though, more than reliability. The automated version could still produce a bad result, but then you're like, oh, I can find that bad result and fix it. And now, you know... but it's just consistency in whatever the process is going to be.
Yep. One thing they called out is a lot of people
want to look at the return on investment when they're doing something like this. Like, okay,
well, it's going to take me one person week to do this, right? And it's going to cost me X amount.
Am I going to get that much return from doing this? And they called out that that's not necessarily
the best way to look at this, right? Because that consistency you get
from it pays off over time in different ways, right? Like you may not be getting a monetary
return on exactly what you did, but what you are doing is setting yourself up for better successes
over a longer period of time. And then how do you, how do you put a dollar amount on centralizing
logic? Right. It's hard.
Yeah.
It can be almost impossible to put a dollar amount on that,
but you can actually see real benefits of it over time.
I mean, if I gave you some source code and said,
hey, I need you to compile and sign it to deliver an executable out to the real world, right?
And the three of us were each tasked with doing that manually,
we might come up with three different things, right? Versus if you consolidated that logic into one centralized place, then it's consistent and reproducible. And, oh, now I've decided to change the keys or whatever that are required for it. Like, you know, you only have the one place to do it, and I don't have to go to each of us and say, hey, here's the new signing certificate. Right.
Yep.
Imagine like if you're an ops person,
you're totally separated from the people who write the products and all three
of us wrote a different way of starting up our services and managing it.
It's like, okay, let me open up, uh, Alan's: lowercase setup, and the flags are this, that, and the other. Outlaw's, uh, you have to run, um, some sort of pre-initialization script, it's going to go do everything for you, but you've got to check back in an hour to make sure it worked. And Joe's just doesn't work at all.
You know, it's like, he's just got a couple of paragraphs written.
Like, what am I supposed to do with that?
You know, it's just, it's not scalable.
You need these things to be consistent.
And the way you do that is by building a centralized platform.
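As a sketch of that centralization point, imagine the compile-and-sign logic living in exactly one place, so rotating the signing key is a change in one spot. The paths and the sign-tool command below are invented for illustration; it's just showing the shape of a single blessed entry point.

```python
# Sketch of "one blessed way to build and sign" (paths and the signing tool are
# invented): if this is the only code path, rotating the key is a one-place change.
import subprocess

SIGNING_KEY = "/secrets/release-signing.key"   # the single place this is configured

def build_and_sign(source_dir, output_path):
    """Compile the source and sign the resulting artifact (details stubbed)."""
    subprocess.run(["make", "-C", source_dir, "release"], check=True)
    subprocess.run(["sign-tool", "--key", SIGNING_KEY,
                    "--in", f"{source_dir}/build/app", "--out", output_path],
                   check=True)
    return output_path

if __name__ == "__main__":
    build_and_sign("./service", "./dist/app-signed")
```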
Yeah.
And they also say to kind of wrap up this section was think about automation in your design phase.
And the reason is, is because it's a lot harder to retrofit that stuff after the fact.
We've talked about that with things like security in the past, too, right?
Like there are certain things that you want to try and do up front because they're important and they can actually save you a lot over time. And then there's one last bit here that was interesting. They, they kind of threw this in earlier, but it didn't make sense where
they had put it because it was going to sort of take you out of the flow of one of the other
stories. So what they said is you also have to be careful about automating
failure at scale. And this one was kind of funny because the short of it is they had a thing called
disk erase, I think is what it was called. And more or less what this thing was supposed to do
was if they pointed it at a machine, it would kind of securely wipe a drive, right? Like get
rid of everything on it. Well, they had screwed up. It had failed at some point. And then they
were trying to figure out what was going on. And so they put it into an area to where they're going
to kick it off manually and just, you know, see, Hey, what, what happened? Where was the failure thing? Well, the problem is, um, there was a, an assumption.
I won't call it a bug.
There was an assumption in the code that said, Hey, if I don't have a list of machines to
wipe it, so basically you give me an empty list that I'm going to assume that means you want to
wipe everything. So they turned this thing on and it went and found all the machines that were on a
particular CDN and wiped all the drives on them. Now they said it didn't end up killing them
because fortunately they had enough, you know, capacity planning set up to where the main machines that were serving
whatever that data was could handle the load but they nuked every single cdn machine in that
particular area with this disk erase thing, when they didn't know about this, hey, if I don't have anything, wipe it all. And so you've got to be careful, right? Like, uh, yeah, it reminded me of examples
in our day job where we've like tried to to define like okay how do i want to do i want to give a
different meaning to null versus an empty list versus a list of values like those are three
possibilities do they mean different things does the null and the empty list mean the same thing
or different things and it's dangerous right like especially when you do something like
this built around it and one of the interesting things that they said that came out of this is
they built in more sanity checks so that if they ever did go to run this thing they could make sure
that something nasty wasn't going to happen but they also built in rate limiting right because
this thing went crazy it just wiped every machine efficiently quickly i'm still i'm still kind of baffled about the idea of just writing some
automation to go and erase disks like that part already is like a scary premise to start with
and you're like yeah okay fine sure let's just do it let's run it. Have at it. Yeah. Yeah. That's the Leroy Jenkins approach. Let's do it.
So, yeah, it was interesting. I mean, I guess, uh, enabling failure at scale is pretty scary.
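The disk erase story boils down to a couple of guard rails you can sketch in a few lines (machine names and limits here are invented): never let an empty target list quietly mean "everything," and put a sanity check and a rate limit in front of anything destructive.

```python
# Toy illustration (invented names) of the disk-erase lesson: "no targets given"
# should never silently mean "all targets," and destructive work gets guard rails.
import time

ALL_MACHINES = [f"cdn-{i:02d}" for i in range(1, 21)]
MAX_WIPES_PER_RUN = 5   # crude rate limit / blast-radius cap

def wipe(targets, wipe_everything=False):
    if not targets and not wipe_everything:
        # The dangerous version treated an empty list as "wipe the world."
        raise ValueError("refusing to run with no targets; pass wipe_everything=True explicitly")
    machines = ALL_MACHINES if wipe_everything else targets
    if len(machines) > MAX_WIPES_PER_RUN:
        raise RuntimeError(f"{len(machines)} machines exceeds the per-run limit of {MAX_WIPES_PER_RUN}")
    for m in machines:
        print(f"securely wiping {m} ...")
        time.sleep(0.1)   # stand-in for pacing the destructive work

wipe(["cdn-03", "cdn-04"])   # fine
# wipe([])                   # raises instead of quietly wiping everything
```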
Yeah. So, uh, we'll have plenty of resources, uh, links to resources we like, including some of the stories that I've mentioned here in this episode. And with that, well, first, let me ask you this, or let
me tell you this, or ask you this question. What did the Zen Buddhist say to the hot dog vendor?
I don't know. I don't know. Make me one with everything. So with that, we head into Alan's favorite portion of the show.
It's the tip of the week.
All right.
And I stole first spot here.
And so I'm going to dip out early in a second.
But have you guys heard of cube cuddle debug?
No, I have not.
Okay.
This cube cuddle space debug.
There used to be a tool called kubectl dash debug in
older versions, but I couldn't find the exact version that the debug command came out in. But
you can actually use it for several different things. And some of those things are really cool,
like adding an ephemeral container to your pods. So have you ever been working in a Kubernetes
cluster and something's going kind of funky? And so you like, maybe you create a job or maybe you kind of do a custom deployment and you keep kind of apply and throw something out there and then you kind of shell in.
Like an example here is a lot of times if I've got a service that's going wrong, I would maybe create a deployment and I would change the command to be tail dev null.
So this container is going to go up.
It's not going to try and run the thing that it usually runs, so that I can go in there, shell in, and kind of look around a little bit. Well, it turns out there's this command that's specifically designed for doing that sort of thing. And what's nice about it is that, in my example, I created a deployment so I could get a pod, or there's other ways to do it, that's just one example, but there's these things that are lingering that you've got to go in and delete, which is just kind of messy and manual. And, uh, what this lets you do is add and kind of make changes to pods or various other things that go away right when that thing restarts. You're not actually changing the permanent state
bunch of different flags and actually several different things you can do uh one of the things that the uh couple if you can kind of tweak the
flags in order to do is actually create a copy of a running pod that's ephemeral so that once that
pod shuts down it's no lingering resources i don't know if you've ever seen issues where you have like
you set up like a deployment like i imagine i i gave an example of and you delete the deployment
the pod goes when you think you're done but you may not realize that the replica set didn't get
deleted and so you've got these resources that are just kind of hanging out there and it's just
it's not tidy and cube cuddle what's the cube cuddle debug lets you do a lot of those different
kinds of things um which is really cool and they actually have examples in the docs and we'll have
a link here
for handling situations like if you've ever had a pod that just immediately failed it crashed
they're like okay well here's how you can deal with that situation with kubectl debug and you
run this command it doesn't change the permanent state of your cluster it gives you a way to kind
of shell in and deploy that stuff so obviously not kind of things you're you don't want people
doing in prod but it's nice for dev environments.
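For reference, here's roughly what those kubectl debug invocations look like, wrapped in Python only to keep the examples in one language. The pod and container names are made up, and exact flags can vary by kubectl version, so check kubectl debug --help on yours.

```python
# Rough shape of the kubectl debug commands described above (names invented;
# flags may differ slightly by kubectl version).
import subprocess

def run(cmd):
    print("$", " ".join(cmd))
    subprocess.run(cmd, check=True)

# Attach an ephemeral debug container to a running pod instead of hand-rolling
# a throwaway deployment just to shell in:
run(["kubectl", "debug", "-it", "my-broken-pod",
     "--image=busybox", "--target=app-container"])

# Or make a disposable copy of the pod to poke at, leaving the original alone:
run(["kubectl", "debug", "my-broken-pod", "-it",
     "--image=busybox", "--copy-to=my-broken-pod-debug", "--", "sh"])
```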
That's really cool.
And last thing I want to mention is that I found this resource and then I went up a level in the docs.
I was like, oh, they've got a whole section here on debugging.
And a lot of it, if you're like you've been using Kubernetes for a while, it doesn't really add a whole lot.
It's like, hey, is your service down?
Try describing it, which is kind of like something that you're going to learn very early on.
But some of these things actually get pretty big, like the debugging pods section, huge,
and had all sorts of information about kubectl debug that I'd never heard.
So I never thought to look here because I thought it was just going to be all basic,
but there was some surprisingly good nuggets in there.
So it's good stuff.
Most excellent.
Most excellent.
All right. So mine is actually pretty simple.
And so maybe my first tip should have been when the tip pops up in whatever IDE of choice, read through some of them.
You don't just turn those off forever?
You know, I never have.
And I'm usually annoyed when it pops up.
And I'm like, know, I never have. And I'm usually annoyed when it pops up and I'm like,
ah, whatever, close. Well, so for whatever reason, the other day I opened up IntelliJ and,
and one of the tips came up on the screen. I was like, oh, I'm gonna read this. And it was
actually really good. So when you're debugging an application, a lot of times you'll put a break
point in somewhere and you'll put a watch in so that you can see what the value of a variable was,
right? Like that's pretty common stuff that we all do. Well, there are times that you're like, man, I really don't want it to stop
at these breakpoints. I just want to know what the value of that thing was. And if it was just
move on, right? You can do that in IntelliJ. If you were to highlight the section of code that
you want it to output the value of, right? So let's say it's like application dot
my value, right? You could actually highlight application dot get my value with its open,
closed parentheses or whatever. And then if you shift click in the gutter on, on a point over
there after that, it will actually write it out to the logs that are happening in the application, and it won't stop on the breakpoint.
So you can see the values of the things that you care about as the thing's running without actually stopping.
And if you see anything nasty, then sure, you can go and put it in breakpoint and stop there.
But it's almost more like a sanity check.
So I'll have an image that we can put up on the post as well for this. It sounds like
watch values on steroids. It is. Yeah, it's very much that, right? So instead of it having to stop
your application and look in your watch values, it'll just put it in the same output as the rest
of your application logging. So I thought that was really cool and something I'd never really thought about. So pretty nifty.
Yeah, I'm with you. Like, those tips come up and my OCD won't let me, like, permanently close it, because I'm afraid I'll miss something. But inevitably, like 90% of the time, I'm just like, no, not today.
Yeah, exactly. Exactly.
But it seems like every time I actually do take the time to read them, I'm like, oh,
yeah.
Why didn't I look at this before?
What other goodies have you not been letting me know that you've probably been letting
me know that I missed?
Exactly.
Yeah.
It's all your fault.
So I'll ask you.
Well, first, let me tell you a little story here before I get into my tip of the week.
So here locally, there was a man that was caught, uh, stealing in a supermarket while balanced on the shoulders of a couple of vampires. He was charged with shoplifting on two counts.
So my first tip of the week is specifically for Alan. So yeah, this one's to you. I'm helping you out.
First, read the tips that, you know, the application comes up with.
Also, test your UPS battery regularly.
So, you know, me personally, I have a little reminder every few months to just see, you know, can I still run the computer for, like, 10 minutes on UPS, or does it immediately just die? Um, whatever your method is, the point is, you should test those things regularly, because those UPS batteries will die just from sitting there.
Hey, the better tip you should have given me was plug yours up, dude.
You haven't done that yet?
No, I told you.
It's still sitting down there.
What?
Okay, so a little background information here.
This is why this came up.
We were supposed to record this episode a few days ago,
and a bad storm came through,
and we were all joking at the start about, like,
well, we're all on UPS, you know, and we all specifically got each other UPSs to make sure that like we wouldn't have this problem with this.
But Alan's Alan said, well, mine's sitting there next to my computer on the floor, not plugged in.
And we're like, what?
And then guess what happened?
Storm rolls through, knocks Alan off.
And yeah, so now we're recording a few days later.
So this is why this tip is for Alan.
So yeah, I guess you're right.
I should have started with tip one, plug it in.
So yeah, I assume that was already a given though.
All right.
So here's for another tip of the week though.
We've talked a lot about container you know, container type things today,
Borg, Kubernetes, whatever. So let's talk about Docker. So have you ever found yourself in a
situation where you have an image built and you want to get something out of that image,
but you don't necessarily need or want or care to run that, to spin up a container to run that thing, right?
So, and you're like, well, what would possibly be a use case for that? Let's take our build
pipeline, for example. I have this preference, you know, I don't know, affinity. I strongly want like everything to be Dockerized, including the build
chain so that that way, however, something is compiled, it is consistent across every
developer's machine because versions of code and whatnot or versions of I'm sorry, not code,
but versions of the tools can be strongly maintained and enforced through that Docker file, for example.
But now when you do, if you're using that to, if you're using Docker to build your code,
how do you get test results out of the code as an example? So that's an example where you might
want to do what I'm about to say. So rather than doing a Docker run to start that image as a container
and copy the file out with a later Docker CP command, instead you can just do a Docker create
to create the container without actually running it and using those resources. But now that you
have it created, you can then do a copy command out of that container.
So I'll have exact examples of what this flow might look like in the show notes. But the one big callout that I want to make in this example of the docker create is that it will be extremely helpful for you if you name the container a specific name that you know of, right? So that you can use that same container name later in your docker copy command. And later you'll probably want to remove the container that you've created, so you'll need the container name again to remove it. So it's highly advisable to name it something that you know ahead of time.
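Here's a sketch of that create, copy, remove flow, with the image name, container name, and artifact path all made up for the example; the show notes have the real commands, but this is the shape of it.

```python
# Sketch of the docker create / cp / rm flow (image, container, and paths are
# invented): pull a build artifact out of an image without ever starting it.
import subprocess

IMAGE = "myapp-build:latest"        # image produced by your Dockerized build
CONTAINER = "myapp-extract"         # name it so the later commands can find it
ARTIFACT_IN_IMAGE = "/app/test-results.xml"

def run(cmd):
    print("$", " ".join(cmd))
    subprocess.run(cmd, check=True)

run(["docker", "create", "--name", CONTAINER, IMAGE])            # created, never started
run(["docker", "cp", f"{CONTAINER}:{ARTIFACT_IN_IMAGE}", "./test-results.xml"])
run(["docker", "rm", CONTAINER])                                  # clean up by that same name
```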
Okay, so I've got to piggyback on this because I'm actually super excited about this.
I didn't know it existed.
So what he's talking about, the reason you want to do this is if there is a file in the image that was created itself, you can get that file out, right, without having to run the thing.
And why does that matter if you've
ever tried to docker run something that requires like 80 environment variables or a bunch of map
paths or whatever it's a pain in the butt just to try and get a file out of it so this i didn't i
didn't know this existed. So this docker create allows you to copy the file out of the image
without having to get it actually up and running
because you'll know if you ever do a docker run and you don't give it everything it needs
it'll typically die and then you can't do anything with it then you're trying to figure
out what you need to do to make it run or also there might be a default entry point already
specified in the Dockerfile for that particular image, and so if you do docker run, it's going
to go and run whatever that entry point is and that entry point might not be something you want
done at that given point yep no this is this is killer that's exciting i had no idea this
existed two quick examples i wanted to mention is like one is like um a lot of times you'll run
like unit tests and get coverage files out of it this This is a great way to do that in Docker and then use it and export that coverage file.
Also, jars.
So if you have like a Java build or.NET build, anything that generates like DLLs or jars, artifacts that you want in another place,
then you might do your build in Docker and copy that out and load it into like a repository or whatever.
Yep.
Yeah.
That's beautiful.
So, yeah, all of this will be in the show notes. Uh,
if you haven't already, you can find those on the website, uh, www.codingblocks.net. Uh,
and like I said earlier, you know, maybe a friend like said, Hey man, you got to see what these
guys are talking about. Like almonds and pecans. This is crazy talk. And so, you know, you know,
you're, you were just listening randomly through some link, but you know, Hey and so you know you know you you were just listening randomly through some link but you know hey did you know you can subscribe to us uh you go to itunes spotify
wherever you like to find your podcasts uh i certainly hope we're there um and hey if you
find a spot where we're not, um, let Alan know. He'll fix that. I just got tasked out. That's good. Yeah, there you go. And,
and as I said earlier, I can't emphasize how much we really do appreciate the reviews. They really are meaningful. I think Alan's even commented on this in the past. Like, sometimes we get some truly heartfelt ones, and I mean, you can't help but be a little bit emotional when you read some of the things, like the way that some of the silly things that we've said have had a positive impact on other people's lives and everything. So we really do appreciate reading those, and it means a lot to us. So you can find some helpful links at www.codingblocks.net slash review.
Yep. Hey, and while you're up there at codingblocks.net, make sure you do check out
our show notes we have examples discussion and a whole lot more and you can send your feedback
questions and rants to the slack channel at codingblocks.net slash slack yeah and uh like
I mentioned, uh, you can follow us on Twitter at CodingBlocks, where we send you those really good gifs, or you can head up to CodingBlocks.net and find all our social dailies at the top of the page.