Coding Blocks - Software Reliability Engineering – Hope is not a strategy
Episode Date: March 28, 2022
It's finally time to learn what Site Reliability Engineering is all about, while Jer can't speak nor type, Merkle got one (!!!), and Mr. Wunderwood is wrong....
Transcript
You're listening to Coding Blocks, episode 181.
Subscribe to us on iTunes, Spotify, Stitcher, and more using your favorite podcast apps.
If you can, leave us a review. We would greatly appreciate it.
We do love to hear those new reviews.
Yep, visit us at codingblocks.net where you can find our show notes, examples, discussions, and more.
And send your feedback, questions, and rants to comments at codingblocks.net.
And if you're on the Bird site,
you can follow us at CodingBlocks,
and we don't tweet often, but when we do, it's super good.
And if you're looking for other places to interact with us,
CodingBlocks.net has all our social links at the top of the page.
With that, I'm Jerzak.
Jerzak?
Did you put an R in your name?
I did.
I'm changing. I'm trying things.
Okay.
Well,
I'm,
I'm going to try to say I'm miracle outlaw.
Would that be right?
That sounds too much like America.
What it said,
I was trying to say Michael with an R like to follow suit,
but I guess it didn't work.
Yeah.
I don't think that'll work.
That's good.
Yeah.
So I'm Alan Underwood.
I don't know.
That's so much funnier than it should be.
That's good, right?
It's because it's late, man.
We never start too late.
Well, it kind of almost has like a German kind of take on it.
Would you say it?
Say it again?
Alan Underwood.
Or Underbar. That's what it makes me think of.
I guess it's not German.
Wait, wait. Or is it? I don't know. Whatever.
Whoever we offended, we apologize.
No offense meant.
No offense meant.
So tonight, we're going to start in on a book that Outlaw was actually really interested in,
and it's Site Reliability Engineering: How Google Runs Production Systems.
So that's where we're going to kick off this evening. But first, first we want to get to our news section.
All right.
And so as we, like I said before, we like to say thanks to those who left us a review.
We love to get those reviews.
And so we got a new review from Audible from Amazon customer.
That's excellent.
Yeah.
I mean, everyone knows immediately who I'm talking about.
You're like, oh, wait, really?
You left a review?
Like, yeah, you know exactly who that is.
And it was actually a very nice review.
So thank you for taking the time to leave it.
Um,
but that's it for our news.
So I guess we'll go ahead and jump into this thing.
Yeah.
And,
uh,
the first thing you're going to see down there in the notes is a to do for
me.
Uh,
unfortunately.
So I'm scrambling a little bit here. Hang on.
So we're talking about a book. It's called Site Reliability Engineering.
You just said something. I don't know that you actually said those words properly.
I thought that was what we were doing.
You can't say his name right. You expect him to say these words right?
The sir.
It's like the Swedish Chef of talk show hosts.
This show is right off the rails.
We're just getting started.
Yeah, man. Oh boy. Okay, well.
Uh, about that show. So what we're talking about is a book called Site Reliability Engineering.
It's almost 10 o'clock, yeah.
Site Reliability.
Oh my gosh.
Oh my gosh.
No, you'll get it.
I swear you'll get it the third time.
Oh, never again. Can you see how purple I am in the camera?
Yes. It's gonna scare me.
Uh, we're talking about the SRE book.
You've never heard of SRE? Nobody says the words out loud because nobody can say them out loud.
Obviously, if I can't say it, who can?
I mean, we're not laughing at you, Joe.
All right, so where are you going with this?
So this book is interesting. This is an O'Reilly book, right? But this book is, uh, written by a bunch of Google engineers about Google.
Hey, real quick though, before you get to that next bullet point,
we're going to give away a free copy of this one.
Absolutely.
Okay.
Yeah, we do that.
Yeah.
Okay.
So we're doing that.
The book was published in 2016.
And at the time, the term SRE, whatever it stands for, was pretty new.
Maybe even literally new at the time of publishing the book.
So no one had really heard of it very much.
And it was written by a bunch of people who work at Google about what they do at Google.
And even though the book is published by O'Reilly, you can actually go to sre.google and get the book
for free. You can just download it.
It's just a website hosted by
Google. They have a couple other things too, including
workbooks. It's meant to kind of go along with it.
Another
book called Building Secure and Reliable Systems,
also free.
The easy
link to just the books portion
though is sre.google/books.
And then there's three books there that you can read online for free.
Yeah, it's pretty cool, right?
And so anyway, each kind of chapter of the book is essentially an essay that deals with kind of one aspect of what they're calling SREs, which we're going to be diving into.
And so I just want to kind of get that out there.
And one other thing before we kind of dive in, I just pulled some stats.
I figured we'd throw them here at the beginning because we just talked about kind of salaries
and stuff last episode.
So I went and tried to look at like basically the career trajectory of SREs.
You know, we said it's kind of a new term, new field.
And if you just do any sort of Googling on SREs, you'll see that people are pretty bullish on it, I guess you could say.
I found some stats that kind of summed it up well from a site called global.com.
I have no clue who that is, but they seem to agree with the other sites,
and I liked how they kind of put it.
Wait, global.com.
.com, yes, global.
Terrible name.
GlobalDot.com.
We'll have a link in the show notes.
And so they give you a median base salary for this position.
$200,000.
Hold up.
That link does not take me to a site.
Just so you know.
It's down in the Resources We Like, too, so I'll grab it from there.
And then we go down there. Yeah, yeah.
And the subject of this article is "Why Is SRE Becoming 2021's Hottest Hire?"
It's globaldots.com, d-o-t-s dot com.
Still a terrible name.
All right, yes. All right, moving on.
There we go. Keep going.
So, median salary of $200,000.
Remember, median, that's a good number.
That means the person in the middle is making $200,000.
It's not the average, so it's not totally skewed by the numbers at one of these companies.
Although I will say, if a company has SREs, they're probably pretty big and mature.
And we'll get into that.
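To make the median-versus-average point concrete, here is a minimal sketch. The salary figures are invented purely for illustration and do not come from the article; the point is only that one very large number drags the mean upward while the median stays with the person in the middle.

```python
from statistics import mean, median

# Hypothetical base salaries in USD, invented for illustration only.
# One outlier stands in for a single company that pays far above the rest.
salaries = [150_000, 180_000, 200_000, 220_000, 1_250_000]


def summarize(values):
    """Return (median, mean) for a list of numbers."""
    return median(values), mean(values)


mid, avg = summarize(salaries)
print(f"median: ${mid:,.0f}")  # the person in the middle
print(f"mean:   ${avg:,.0f}")  # pulled upward by the outlier
```

With these made-up numbers the median stays at $200,000 while the mean doubles to $400,000, which is why a median is the more trustworthy summary here.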
And they're also pretty current.
Pretty current.
Yeah, I mean, this didn't even exist
six years ago.
It's pretty new. This podcast is older than
this job title. That's true.
We've never heard of it before.
Career Advancement Score. They give it a score of
9 out of 10, which is pretty hot.
Job Openings.
Year-over-year growth.
Up 72%. They got 1,400 new job postings.
So yeah, pretty good. This is a hot field. This is something that you might be interested in looking more into if you like this sort of thing.
That's pretty good info right there.
So let's go ahead and jump in. In this particular episode, we're going to cover the preface and the first chapter, just so that we can sort of set the baseline of what this is.
So the thing that makes this one unique and Joe already mentioned it is this is written by people
at Google and it's only one company, right? So you don't have this mixture of pie in the sky
type stuff. These are things that they actually did and their experiences doing it. And so it's nice to get that from a company that deals with the kind of scale that they do, right?
Um, and I can't wait for you to get to it. Like, later in the book, because this is just about Google, they get into some interesting weeds as to how Google runs behind the scenes.
Now, with that said, you know, as we go through this book, there might be terminology or things like, hey, this is how Google did it, blah, blah, blah.
And there might be current Googlers that would say like, oh, well,
that's not how we do it anymore or whatever.
So we're coming at it from the perspective of this book, though,
at least at the time of the writing.
That's how it was said, like, hey, this is how we do things.
So I'm sure that they would have advanced their own practices
in the last five years, seven years, whatever.
Totally.
And just because Google does it doesn't mean it's appropriate for your company to do. You're not Google,
but maybe it could be.
Maybe there's some things in here that are good for you. And there's a lot of other companies that now have SRE teams.
If nothing else, you get some ideas of things that can help improve your business, right?
And one of the things that they said about this is they were interested
in scaling the business process, not just the machinery, right?
So, okay, yeah, that's huge, because honestly it's the business processes that seem to get in the way a lot. Um, the communication around those processes, right? And then, just what Joe said a second ago about, you know, hey, not everybody's the size of Google.
They actually called out, hey, this tale should be for emulating and not copying.
Right. Like tweak it to whatever suits what you have in your business.
This is not necessarily a blueprint that everybody should have to follow.
You want the results, not the process.
Did you say, uh, 40 to 90 percent of your effort is, uh, what's the term I'm looking for? Um, cost is the word I was looking for. 40 to 90 percent of your cost for delivering software
happens after you've deployed a system which is funny because we usually talk about kind of the
first launch and how much effort and how long things took to develop the first time.
And we so rarely talk about how long it takes after it gets launched,
how long it takes to maintain what new features.
And so we've got this kind of like industry focus on this kind of first period,
even though the second period, the second half goes on much longer.
Well, they literally refer to it as the labor involved. And they make
the analogy that software
development has one thing
in common with childbirth, and that is that
the labor and delivery
up front is painful and difficult, but it's the labor afterwards that is where you as a parent spend the majority of your time.
Right. And that's what it's like with with software development, too, is that, you know, we have this like industry practice of, you know, that's where we put all of our focus on.
Like, hey, let's develop this greenfield app and we're going to put all of these best practices in from the front, blah,
blah, blah, blah. And like, um, and then, and then magically like we're going to deploy it
and we'll never look at it again. But the reality is, is that once you do put that thing out there
in the world, you know, from that point forward is where you're going to spend the bulk of your
time. Yeah. I mean, look at Google for example, right? Like they, they started out in the nineties, right? But they're still
constantly, um, you know, iterating on and advancing their search engine that started them.
Right. Yeah. And, and one of the key takeaways here for me was just when you call a system stable,
that doesn't mean that you're not dealing
with it. Right. And that was one of their key, their key call outs is stable. Doesn't mean you
don't touch it. You're still putting time and effort into the thing. It just means that it
mostly runs the way it's supposed to at that point. Right. All right. So I guess the next thing is what is site reliability engineering? And they actually put together a definition that I really liked. And it's engineers who apply the principles of computer science and engineering to design and the development of computing systems, usually large distributed ones. So, so software engineers doing what they do,
except you're applying it to the operation side of things. You write software for those systems
and you're building all the additional pieces those systems need. So like backups, load balancers,
all the other things that come into play that, that you need to run your system. And then also how to apply solutions to new problems.
So, you know, taking the software engineering approach
to doing those things.
And I mentioned that they consider reliability
to be the most fundamental feature of any product.
And that I think kind of flies in the face of what you see
on like Hacker News or Reddit or whatever, where you hear a lot about kind of innovation.
So it's kind of refreshing in a way to hear a company talking about reliability being more of a priority.
And that's something you're going to see echoed in this book throughout as they keep coming back to reliability.
Because software doesn't matter much if it can't be used, and it needs to be reliable enough.
That doesn't mean perfect, but once you've achieved reliable enough, and we'll get into defining that, then you can spend more time developing new features or new products.
Yeah, they actually, um, later in the book they get super into defining enough, you know, what constitutes enough. Because I think there's this perfectionist part in us
where we want to develop something that's, quote, perfect and bug-free
and not have to worry about it, right?
That's always our goal.
Talk about all the unit testing that we've ever talked about, and trying to abstract things perfectly and put interfaces to things so that it is low-maintenance kind of things.
But the reality is that
if you ever... Let's suppose that
you did create a system that was
perfect.
Well, for one, it couldn't have been very complex, right?
I mean, what's the chances that it was complex and you made it perfect?
But also too, how useful would it really be?
And like, how much cost did it take you in time and effort in order to make it 100% perfect? And they go into numbers later in the book. Like, you've heard terms like three nines of reliability or four nines of reliability. And if you haven't, if you're new to the software field, three nines of reliability would be 99.9% and four nines would be 99.99%. And they talk later in the book about the cost of adding that additional nine, you know, a decimal place, to it, and how you decide whether or not it's worthwhile to go after that additional nine, you know, things like that.
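As a rough illustration of what each extra nine buys, the sketch below converts a number of nines into the downtime it allows per year. It assumes a 365-day year and treats availability as simple uptime over wall-clock time; the function name is made up for this example and is not from the book.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes, ignoring leap years


def allowed_downtime_minutes(nines: int) -> float:
    """Yearly downtime allowed at `nines` nines of availability (3 means 99.9%)."""
    unavailability = 10 ** -nines  # e.g. 3 nines leaves 0.001 unavailable
    return MINUTES_PER_YEAR * unavailability


for n in range(1, 7):
    availability = 100 * (1 - 10 ** -n)
    downtime = allowed_downtime_minutes(n)
    print(f"{n} nines ({availability:.4f}%): {downtime:,.2f} minutes of downtime per year")
```

Each added nine cuts the allowance by a factor of ten: three nines leaves a bit under nine hours a year, and five nines leaves only about five minutes.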
It gets a lot more expensive with every added nine. Going from 90 percent to 99 percent, you get a lot of value there for not as much effort. Going from five nines to six nines, it's gonna cost you a lot to get there, and is it really gonna be worth it? Especially when your customers are, you know, communicating with you over the internet and they've got their own outages. You know, they're not even going to receive the advantages of that. And so you've got to figure out where to draw that line, and that's one of the things that SREs do.
Yeah, it's actually a repeated thought, too, though. That was in Designing Data-Intensive Applications too, because they talked about it over there, about how just increasing that nine was like, there's a point of not even diminishing returns, but just negative returns, right? Because you're going to spend a ton of development money
on trying to get that, and it may not even matter.
So there's going to be some crossing of the streams
between designing data-intensive applications maybe,
but I was thinking more about the DevOps handbook in this book.
But yeah, they actually do later in the book,
like, hey, here's how you can quantify
some of this stuff to see if it's even worth going after that additional nine. But, you know,
to Joe's point, like, yeah, they, they actually call out in the book that like, you know, your
users may not even recognize the need or that, that you added that extra nine, for example,
or, or let's, let's just say that you were able to get to 100%
reliability, right? But it took you an exorbitant amount of time and effort and money to get to
there, right? They said the reality is that the way your customers are even getting to use your
service, they likely wouldn't even notice the difference between 99.9 and 100% reliability because
the phone that they're using might be slow, the connectivity of their phone service to their
cellular service at that time, or whatever, because they're traveling and swapping cells
as they're trying to browse it, they might not even notice it. So you put all that effort in for this, you know,
fictional character that could use the system
that already has the perfect, you know, access to it
that just doesn't exist.
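One way to see why users may not notice that last bit of reliability is to multiply availabilities along the path between the user and the service. This is a minimal sketch that assumes failures are independent and uses an invented 99% figure for the user's own phone and network; neither the number nor the function name comes from the book.

```python
def end_to_end(*availabilities: float) -> float:
    """Availability of components in series, assuming independent failures."""
    result = 1.0
    for a in availabilities:
        result *= a
    return result


# Hypothetical figure: the user's phone and network work 99% of the time.
USER_PATH = 0.99

for service in (0.999, 0.9999, 1.0):
    seen = end_to_end(service, USER_PATH)
    print(f"service at {service:.2%} -> user experiences about {seen:.3%}")
```

Under these assumptions, even a perfect service is capped at 99% from the user's side, and the gap between a 99.9% service and a 100% service shrinks to less than a tenth of a percentage point of what the user actually experiences.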
Yep.
All right.
So one of the other things that they tacked on,
like some of the other focuses of these SREs are like managing storage or an email service or a search service.
Right. Like they they also try and keep these things alive.
And this was specifically Google, obviously.
Right. Like they were talking about their Gmail and their and, you know, their storage platform and all that. Um, so reliability is
regarded as the primary focus of the SRE. And it, this was really cool to hear them call this out.
They said that they wrote the book largely to help the community as a whole by exposing what
Google did to solve their problems, right? Their post deploy problems. And they also did this to help them define what they believe the role should
be. Right. So it was like this,
this multifaceted thing that they were doing here that was helping everybody
else and themselves internally to,
to fix some of their processes and things that they had going on internally.
That was one of the things that I thought was super cool about it was that they were trying to help the industry by solving,
there's this need for this role, but we don't know what to call it.
But here's what we think are the responsibilities,
and we're going to throw it out there to the rest of the world.
And the rest of the world can buy into it or add on to it
or tailor it to their needs. But, you know, just Google's way of trying to help.
And the guy who actually coined the term at Google, who is one of the Google VPs of like
24-7 operations, he's one of the contributors to the book. His name was, uh,
Ben, uh, you know how I am with names. So get ready. Hear me out. Ben Treynor Sloss. Is that,
I don't remember. That was it. That was it. Hey, I won one. I got one. I feel like I'm in Ghostbusters. We got one.
All right.
So, but, yeah, so, you know, I also found that kind of interesting about this book, though, too, is that, like, you know, a VP took time out of his busy day to help write this book.
No doubt.
So it's kind of open source, almost. Like, they kind of released something out in the wild,
they got some feedback, and they incorporated it, and other people are using it.
Yeah, that's awesome.
And they called out, too, like we mentioned, that if you're a small business, you're not
going to necessarily be able to do everything here, but you should be able to take away
some of these concepts and, and, and they may be able to help you in your business.
Um, this next part kills me because I feel like with everything I read, this is the case:
the earlier you care about reliability, the better, right? And what they mean by that is it's way less
costly to implement some things up front, right? Like even if it's just a lightweight support
capability, than to do it after you're way further along,
you have nothing in place and it's going to be really expensive to try and get
those things in.
But the reason why this is crazy to me is don't we hear this about security and
just about everything else in software development?
Oh, you've got to do it up front. Otherwise...
I was just going to say the same thing. I was going to say, it's literally like pick a topic in computer science or in software development, and it's like, yeah, it's much easier if you care about that up front. Like unit testing: it's so much easier to test your application up front if you've coded for it. Otherwise you'll have leaked-in dependencies and whatnot, and it'll be harder to abstract those and mock those out. So you've got to do it up front. Or like, oh man, if you don't start with DevOps up front, if you don't have an automated pipeline, then it's so much harder to come back behind the scenes and add it. So yeah, that was the one rub here that I had. It was like, yeah, okay, I get it, it's easier to care about the reliability up front. Yeah, okay, but also that's the same truth for everything.
Yeah.
It's easier to care about pointer math up front as you're iterating through your array than it is to later find out, oh, I went too far.
I mean, also MVP, right? You should get your thing out to the customers and start selling before you even build it.
But yeah, they're all competing. I mean, they're not wrong, right? What they're saying is absolutely correct, but how much stuff can you build up front for your MVP? It's frustrating.
Yeah. So, uh, who was the first SRE? Cool story here. So, um, they kind of picked Margaret Hamilton here, who was an Apollo program director from MIT.
And there's a story that they told in the book about basically how this woman's kid came into the office one day and ended up pushing some buttons while they were running a simulation.
And the Apollo rocket ended up crashing during simulation.
And it wasn't supposed to happen, so they looked into it and found out that the kid had ended up triggering a sort of sequence that
wasn't supposed to happen at that point. It lost the navigation data. And, uh, so, you know, Margaret
Hamilton ended up trying to write up a defect and documented it, and, uh, they kind of pushed back
on her and said, well, this isn't going to happen. That wasn't supposed to happen. The
astronaut would never do this.
Right.
This is not the thing.
This is something that would never happen in production, essentially.
And so they ended up not fixing the bug.
They didn't prevent it.
And sure enough, guess what happened?
But, you know, the cool part, though, is she knew, like exactly what you said.
She was like, okay, so maybe it's not supposed to happen, but I'm gonna
write up some steps to recover from it anyways and I'm gonna put it in some documentation,
so that, you know, hey, even though it's not supposed to happen, if it ever does, at least somebody will
have something to go back to. And that was super important because, guess what, yeah, like Joe Zack said,
somebody screwed up.
Yeah. And did it.
So,
so what's the takeaway?
The takeaway is it's easier to write your README up front than it is to do it after the fact.
That's right.
That's right.
You have a good README up front, then you're okay.
How many times have you heard that?
Like,
this is something that would never happen in production.
So we're just going to manually do this,
whatever.
And we're not going to try to code around it.
Nobody's ever going to try to drop the table,
so we don't need to worry about those permissions.
It's fine.
So that's why you need a big if statement that says,
if you're trying to drop the table, don't do it, right?
Too specific.
It's hard.
Software is too hard.
It really is.
Let's all just agree to stop.
Yeah, we should.
I mean, computers are smart enough to write it all, right?
No. Yeah, pretty much. They've never gotten the design requirements from product
management. Yes. All right, so we're not better. Not at all. No, no. I mean, it's just the way it is.
Um, so the SRE way, right? Thorough dedication, belief in the value of preparation and documentation.
So what we just talked about.
I told you that README would get you.
The README is important.
And awareness of what could go wrong and the strong desire to prevent it.
So that's the SRE way.
And that is the end of the preface, the opening to the book.
Now, I will say chapter one, I loved the opening quote.
You want to tell us?
Yeah.
Hope is not a strategy.
And they later go on to explain,
because you read that at first,
and you're like, wow, that sounds really dark.
Like, okay, I guess.
But I mean, rebellions were built on hope,
but okay, fine.
Hope is not a strategy.
But they later go on to say that
the point that they explain is that
from an SRE perspective,
you don't rely on hope as your answer to anything.
Like, well, that will never happen in production.
I hope that will never happen in production, so we're not going to deal with it.
Instead, just ensure that it can't happen in production.
Yeah.
Another expression you've probably heard is basically saying, we don't even know if this is a problem in production yet.
So let's not put any effort into solving it yet. And there's times and places, like, absolutely, those are totally fair comments, and you can make decisions, you know, it's totally fair. But those are kind of at odds with the SRE's kind of mindset about things.
So I think an SRE, in a situation where somebody said that's never going to happen in prod or, um, you know, we don't know if this is a problem yet, would say, okay, well, let's document it somewhere. Let's get some thoughts together, so if it does happen we're prepared, just like the Margaret, uh, Hamilton example.
But so, you know, what's really cool about what you just said there is when you call that out specifically from two different
perspectives right like the developer of the product, the person who's doing the product features
might say that like,
hey, this is never going to happen there.
They say it all the time.
But they have different,
we've got different concerns
than an SRE does, right?
Like somebody who's tasked
with keeping the systems running
is going to look at that and be like,
oh, well, you say that's not going to happen,
but I can clearly see a case where this could be a problem and it might cause X, Y, and Z. And so I'm going to put some attention
on that. Right. So you have two different, two different perspectives, you know, the,
the product developers, they're trying to get something out the door, um, for the customers
and the SREs are trying to keep it running for their customers, right? So you have two different perspectives, and so it's really good to have those two different visions on it, I think.
Well, and the SRE is not a blocker. The SRE, uh, we'll get into setting budgets for disruption and all sorts of stuff. But, um, the idea isn't to block changes or to not take risks. It's okay for systems to go down.
It's about how fast to recover and managing that downtime and managing your reliability.
I do want to be careful, though, because the one thing that you said, Alan, with the SRE, it almost made it sound like they were just purely in charge of operations, just purely in charge of making it run.
And that's not,
that's not their role.
And so like in this specific chapter,
like heavily,
like I was thinking of this book compared to the DevOps handbook and,
you know,
the,
the,
um,
the clash that the dev,
the DevOps handbook and specifically the Phoenix project book,
you know,
the companion book that went along with it,
you know,
illustrated a good story,
you know,
clash of like developers versus the operations team.
Right.
Right.
And this book,
you know,
as you're reading this first chapter and they're like laying the groundwork
for the introduction into SREs, right? You know, there were some heavy comparisons running through
my mind there. Yeah. They are not purely the ops team, right? So I think that's what you're
drilling out there. Well, yeah. Cause, cause, okay. So let's, so let's step into this. So
the old way is that you would have this sysadmin.
You'd have the sysadmin approach to systems management, right?
So you'd have a system administrator to run services and respond to events
and update to those systems as they would happen, right?
So that person was, you know, like, let's go back into the 90s,
for example, you know, or early 2000s. And, you know, that would be the person that's like, hey,
they manage this rack of machines. And if you want anything installed on it,
they do it. If there's ever an update that comes out from a vendor from those, um, for any of the software running on those machines,
they take care of it.
And like how they ever determine like which patches they were going to apply,
they're willing to apply and which ones they weren't like was always like magic
to me.
Cause it was like,
how do you really,
do you know,
you're just going to like,
you're just guessing,
right?
Like you're just going to do it,
but you're guessing, right? But at any rate, uh, yeah, that was the old school way. Right. And,
and, you know, you would have teams of these people, depending on the number of machines
and different software packages that needed to be maintained and the skill sets that were required
for those, then, you know, those teams would grow as that capacity was needed.
So you might have one group of sysadmins that's just in charge of one particular database technology.
Another group of sysadmins is just in charge of one particular operating system.
These are my Windows sysadmins.
These are my Linux sysadmins. Right. And, and, um, you know, you could have like multiple sysadmins that are
responsible for the same physical machine, right? Like one guy who's, who's the OS guy and another
guy that's, you know, whatever software package happens to be installed on that thing. Right. Um, and, and, uh, yeah, the, the, the, what did we have here?
The, usually the skills for a product developer and a sysadmin are different. Right. So
I think I've even said this before on the show that like, I always had this kind of like mindset
that a, a good software developer was a decent, uh, admin and a good admin was a decent
software developer.
Right.
But,
but that's two ends of the spectrum.
And,
and I just kind of like have always had this vision that they kind of crossed
somewhere,
you know,
but therefore that's why they would end up on different teams though,
because there's different strengths and weaknesses among those two skill sets.
And there's conflicting interests, too.
Like the sysadmin wants to keep things stable.
The developers want to introduce new features.
And so there's always been this kind of pushback with sysadmins or operations teams.
And that's one of the downsides.
And you think about it, well, it doesn't sound great,
right? You know, to keep something stable sounds ideal in one mindset. But another point
that they make in this book, though, is that if something is truly stable and you're, you know,
because the the ops team or the sysadmins aren't letting you put new things on it, then that also means that it's just growing stale and it's going to be boring and it'll eventually not be used.
Well, that's what they said, right?
The two are constantly at odds with each other because what makes the system unstable is changes to the system, right?
And the whole role of a product developer is to make changes to the system, right? And the whole role of a product developer is to make changes to the
system. So obviously your sysadmins don't want to change anything and your developers are wanting
to shove something in there every other minute, right? So they are at conflict. And what sucks
about it is they actually talk about some of the disadvantages to splitting up these teams.
And one of them is the direct cost. They said,
it's actually really easy to see these costs because, um, you know, as outlaw said, when the
systems grow, so does your need for more sysadmins. And so you actually are growing your team and you
can see these costs as they happen. Right. Um, and it doesn't scale well because you have to
have more manpower.
So if you're going to add a hundred more machines, um, with, you know, however many operating
systems are going to be on those in today's virtualization environment, it's not necessarily
a hundred.
Now you've got the need for well more manpower.
So it doesn't scale well.
Um, but then they talked about the indirect costs, which was interesting.
They said, this is subtle and it's not quite as easy to see, right?
But it usually costs more than the direct cost.
So like the manpower and all that, you don't realize how much money you end up spending here.
So it's like communication, people building the wrong things, processes.
Yeah, the communication is key because they said that developers and sysadmins often use different language, right?
So they're not even communicating on the same level, right? Like what
the developers say may not commute, may not translate exactly to what
the sysadmin needs to hear. And so you're going to have miscommunication
there. And that's big. That can cause a lot more problems.
And because communication is hard, that's why we need a site where jaya betty is the engineer
and everyone's done
oh and here's another thing too this goes back to the whole conflict and the two being sort of
at odds with each other is you have different assessments about risk and the possibilities
for technical solutions, right? Like the developers, like I just put it in, it's not gonna be a big
deal. And the, and the sysadmin is like, no way, dude, I'm going to be here all weekend dealing
with this thing if, if you mess with that. Right. So that's a thing. I mean, this is exactly
everything that the DevOps handbook was talking about,
right?
It was like how these two groups are,
you know,
they're at odds with one another.
They have different metrics for success and those metrics are,
you know,
opposing,
you know,
are opposed to one another.
So how,
how do you win in these environments?
So it was interesting that, like, because the DevOps
movement had already started by this point, by the time that this term came to be a thing,
or at least, uh, from the book, uh, came to be a thing. Um, did they even say, like, when SRE,
the term and job title, was created? They just said the book.
They Googled it.
It's funny.
But yeah, in the book, the 2016.
Well, you're right.
No, you're right.
Okay, so the book was published in 2016, which is I think the first time most people had heard it.
I couldn't find a reference sooner.
DevOps was first coined in 2007.
It's almost 10 years prior.
So, okay, so then we suspect that maybe DevOps had existed before Google had this term.
We don't know.
Google might have had this as like an internal thing, you know, for some time before, but we suspect that it was DevOps first. But the point is, is that it's still interesting that even after the advent of DevOps, they still felt the need to like take it to this extra step.
Yep.
And so what's interesting, after all that,
like the being at odds with each other,
they did call out some of the things that happened from this,
and this is where some of the other costs come in
and the inefficiencies in the company is,
hey, because you don't want things broken,
operations introduces launch and change gates, right?
So now it's going to be harder for you to get your stuff into production.
They want to check for every problem that's ever happened
before they approve something to go into production, right?
So you have a new button that you're going to put on the page,
and all of a sudden they're like, well,
is this going to break the 500 other problems that we've seen before?
Okay, that's going to be rough.
And then that causes the dev teams to introduce fewer changes
because they're like, wait a second,
every time I go to push something in production,
I got to go through this change gate process?
No, I ain't doing that.
And so they end up putting in more feature flags.
And this is interesting.
I hadn't actually heard of this,
but I could totally see it happening is dev teams will start sharding their features into separate
branches. So they don't even have to talk about them when they go to release this change, right?
Because it wasn't part of the code base. So they don't want to have to go through this review
process. So all of this is a lot of added cost, both time, money, everything,
just to try and get things released.
Okay, so one, I love that they actually referred to it as trench warfare.
Right.
You know, it was the way the two teams would operate against each other.
But I don't know about you two.
As I was reading this, I thought back to a
shared experience that the three of us had in a previous life, right? Where we used to work in a
situation where we would do deployments as needed. We might do three a day, you know, if that was the thing,
or,
you know,
we might not do one at all,
but you know,
depending on what the situation was,
we would do what was needed.
And then we got a new director that came in and one of the first change
gates that he put in place was,
he was like,
we're going to only do deployments twice a week.
And on these days.
And that's it.
And you're going to have 12 miles of documentation and,
and everything else behind it before you ever like it,
the process of releasing took a day.
So if you were,
if you were doing two releases a week,
you spent two of those days literally just documenting things to be able to get things out the door.
And it made it to where people just didn't even want to release anymore.
And it really made me think back to that experience. Maybe, with his bosses, he had some kind of incentive
to ensure that from a reliability point of view that the site didn't go down and whatnot
and they kept making money.
And every time that we would do a deployment, then that technically meant risk of us introducing
something that might bring the site down.
And so from he,
he might have had unbeknownst to us a very,
you know,
in,
you know,
a cost incentive that would affect his own wallet,
you know,
to keep us from wanting to do that.
Maybe.
Yeah.
I don't know if you guys thought if that came to your mind when you guys were reading that part of the book.
No.
Try not to think about the past.
Oh, yeah.
That's probably a healthy outlook on life.
Is that your tip of the week?
That should be,
you should have used that one.
That was good, yeah.
All right, well,
oh, boy. I guess I will do this
review thing because, like, the last time you guys made it weird. So listen, we've never...
I think, like, didn't you last time go into like a super,
like, subtle voice? Like it was like a super NPR kind of
voice or something.
Or was that Jay-Z that did it?
No, that's why we got a review though.
Dang it. When you're right, you're right.
That's right.
Alright.
Blame Amazon customer
because here it comes.
Hi, listener.
If you haven't already left us a review,
we would greatly appreciate it.
You can find some helpful links at
www.codingblocks.net
slash review.
And if you are a Spotify listener,
you can also leave a rating
within the Spotify app.
We greatly appreciate all of those reviews and everything that you can do
to help spread the word about Coding Blocks.
This one going out across the line to Delilah
in Kansas. Thank you.
Shouldn't it have been Delilah in New York?
Yeah, it might have been.
Isn't that the song?
Maybe.
Hey there, Delilah.
What's it like in New York City?
Okay.
All right.
A few episodes back,
we asked...
It was definitely a while back.
We're behind on some of these surveys because this is from New Year's, or no, maybe this was from, like, uh, no, this is probably from, like,
directly after New Year's. Do you stick with your New Year's resolution, is the question.
Your choices were: for the first couple of weeks; or, I'm pretty good until
spring ish,
which I guess for like,
this is about the time that everybody would stop being good about it.
Right.
Or I'm like a machine.
Resolutions are rules that are not meant to be broken. Or, wait,
those things are to be taken seriously?
They're broken by New Year's Day. Or, what are resolutions?
All right.
So this is what?
181 Alan,
you're up first.
Yeah.
This one's clearly wait.
Those things are to be taken seriously.
I'm going to go with 50% here.
Like I'm fairly certain.
Most people are like,
I ain't messing with these.
50%, okay.
Okay, and just for fun, I'm going to go with, I'm pretty good until spring-ish with 22%.
Almost got me there.
Almost got me.
Okay, Mathemachicken.
Strikes again.
All right.
Well, you're both wrong.
What are resolutions is the far and away winner with 48% of the vote?
Oh, I was close on the percentage.
All right, good.
And I had the right notion.
Yeah.
Yeah.
Yeah.
Those were two.
Uh,
wait,
those things are gonna be taken serious.
All right.
Yeah.
Like,
um,
I guess nobody really follows through on,
on their resolution.
I've done terrible last couple of years.
So yeah,
that's why I don't even try.
All right.
So here's the survey for this episode.
You ready?
So since we're talking about SRE
and we've already given some throwbacks
to the DevOps handbook,
the question is,
so DevOps is a culture,
but SRE is a job title?
And your choices are, wait, what?
Or, yeah, I get it.
Or, meh.
Nice.
Hey, a reminder, if you drop us a comment, we'll send you the book.
I should mention to you like the –
Whoa, whoa, whoa, whoa, whoa.
Leave a comment and you have a chance to get the book. What did I say? What is going on here? We're about to go broke.
Oh yeah, there's a chance, a chance to win. Yeah. Well, uh, I was gonna say, the digital copy is free,
so I'm assuming if you leave a comment that you're going to want the physical.
So yeah.
Yeah.
Well,
that's why I was joking at the start when I asked like,
Hey,
are we going to give a copy away for free?
Because the very next line of the notes is,
Hey,
this book is available for free.
Nice.
All right.
That's funny.
Yeah.
That's true.
I don't know.
Yeah.
We can give away a physical version too.
So,
you know,
it's fine. Yeah. If you're dying to have a copy, let us know. There is an audiobook version. Maybe it would be nice to give that away. Yeah, totally. It's awful. I've been listening to it.
Yeah, so if you leave a comment on this episode, you can let us know if you want
a physical copy, or if you just want to leave a comment, or if you'd like an audiobook, you know,
there you go.
Before we start this next section though,
I just have like one small rant.
I need to get off my chest.
If you,
if you would let me,
I think circles are pointless.
All right,
we can go on.
I got that out.
Oh boy.
Okay. Let like it.
Let's talk about Google's approach to this problem. So, site reliability engineer.
The idea is to focus on hiring software engineers to run their products, not sysadmins.
They create systems to accomplish the work that would have historically been done
by sysadmins. And I remember I used to work with a guy that was like a sysadmin slash DBA.
This person had an amazing superpower: they could sleep sitting up, just in their chair.
And then you'd be like, hey, George, can you put this in production? George? George? I emailed, can you
run this query?
And then he would eventually do it.
And they're trying to get away from that by building systems
rather than having kind of people sitting in those spots,
often sitting idle, or vacillating between sitting idle
and freaking out all the time because something's wrong.
So a nice quote they had here is that
SRE is what happens
when you ask a software engineer to design an operations team.
Isn't that awesome?
Yeah.
That is pretty cool.
I like that.
And they've got a nice breakdown coming up right now.
They say the SRE role, the responsibilities can be broken down into two main categories.
50-60% are Google software engineers, or people that were hired during the
standard hiring procedure, and 40-50% were candidates who were very close to the Google
software engineer qualifications but didn't quite make the original cut. This isn't the breakdown I
thought it was. All right, so hold up, hold up. Right, it's a hard breakdown. Yeah, it's very hard.
This is not what I thought it was. Gosh. Like, yo. Um, so this is, so,
Outlaw, you did pretty well on the interview, but you're not quite software engineer material for
us. But they say this: but I think you'd be great at being an SRE. Like, you want to come work for us? Like, that sounds pretty good, right? I guess. I mean, I'm sure they don't frame it like that when they call you back, right? But,
but that is kind of brutal that they say it. Although Joe did mention at the top of the show
what their compensation is, and it's not terrible. So being second best and still making the cut is not
terrible. But what a weird thing to say. And I do have a feeling they're talking about, like, the
initial kind of wave of SREs here. Uh, and they mention additionally that there are skills that would be
very valuable for SREs but that were not as common for typical software engineers, like
being good with, uh, you know, knowing Unix or whatever, networking knowledge.
And so Google has tracked the progress of these two kind of career paths.
And they say there was very little difference in the performance over time.
So whether they were more kind of traditional software engineers or candidates that were a little lighter on software engineering but had those other skills, like Linux or networking, that kind of bolster there. Yeah, so the key takeaway here is, if you were in
that 40% that didn't quite make the original cut, like the top tier cut, it didn't seem to really
matter. And so, total tangent here, we've talked about this in the past, right? Like just some of the interviews at some of these larger companies,
right? Like the FAANG companies, they can be really difficult, right? And some people aren't going to
make that cut, but does that mean that they wouldn't have been a great fit and a great
developer for that company? Not necessarily. It doesn't mean that. Like, I mean, we've laughed
about the fact that you have to do the traveling salesman algorithm in an interview,
and then you get in there and you're shifting pixels around on the page because you're doing UI stuff.
Yeah, or you get questions like how many golf balls can fit into a 747.
Here's one.
I don't know if you've heard this one.
Why can't your nose be 12 inches long?
And be a foot.
There you go.
Joe gets the job.
Yes.
But,
but yeah,
so it is good to know that they did call this out.
Like right after they talked about the fact that,
you know,
the first part are the top tier hires and the second ones are like the,
the ones that didn't quite make that cut.
They both tracked almost identically in terms of career growth and their path and all that. So that's,
that's really good to hear. Yeah, and this isn't what they're doing today. This is what, like, their
kind of initial, you know, launch. Who knows what they're doing today? But when they first started
hiring SREs, they took, you know, half that passed and half they took a chance on, and figured out that they worked out the same.
Yep.
So one of the things that is cool about this is they looked at these software engineers and these ones that are going to be automating these old sysadmin tasks.
And there were some things that stood out to them in this hiring process.
Software engineers get bored doing repetitive tasks.
That's so spot on.
I know, like, give me a boring task and I have a hard time staying focused on it.
Like, I seriously struggle with it.
It is terrible.
But software engineers, when they get handed these things, they think, well, how can I get rid of this repetitive task?
Right. Like, how can I make this go away?
And that's a really interesting hiring perspective.
I don't even, I never even thought of it as more of like, you know, our nature to get bored by repetitive tasks.
It's just more like, OK, how can I just make my life a little bit easier?
And I don't want to ever have to do that thing again.
So I'm just going to like write something so I can write it the one time.
And then the next time you asked me to do it, I'd be like, oh yeah,
I'm going to spend the next three hours working on it.
And then instead I'm going to click this button and then I'm going to go
ride my mountain bike.
And then I'm back and be like, well, still slaving away on it.
Boss man.
Yep.
Just finished.
So yours is inconvenience.
Mine is straight up boredom.
I cannot do repetitive tasks.
No, see, my deal with that is I automate the first time.
So the second time when I'm asked to do it,
I can quickly push the button and run it
because I'm actually like three weeks behind
on the other stuff I was supposed to have done.
So I need to be automating some of this stuff.
Otherwise, I'll really never get done.
And the sad reality, though, is, like, as we get older in life, well, I mean, we will get older. I'm 21 at the moment, so I don't
have to worry about this yet. But, you know, eventually, like, you know, the memory cells aren't
what they used to be. So you're like, you want to, like, not necessarily automate it, but it's
automated as a form of documentation, so you could just ensure that, like, you don't have any
typos anymore in your
execution of that thing.
So yeah, I got to get this written down
because I don't got much thinking left.
The site room loves
the engineering stuff.
Is that a Boomhauer?
Yeah, Boomhauer.
There you go.
You know that show's coming back.
Oh, for real?
Yeah, it's coming back.
That's awesome.
Anyway, so SRE teams must be focused on engineering.
Traditional ops groups scale linearly by service size.
The bigger the service, the more people you hire.
By contrast, SRE teams...
don't. More efficient way to cap... Yeah, I had to cut that off there.
Hey, this is too late for me, y'all. It's late. You know what's so good is I think that was actually a great presentation tact that you did right there.
That was – leave him hanging for a minute.
Make sure he's still awake.
I'm sorry.
I'm so sorry. So in order to encourage the kinds of behavior that Google wants to see, they put a 50% utilization cap on SREs doing traditional ops work.
They say they do not want you doing more than 50% of your job doing things like backups, deploys, monitoring, learning production support type stuff.
They want to cap it at 50%.
This ensures that the
SRE team has the time to automate and stabilize
the software through automation.
Because if they don't give you time to automate,
you're just going to be fighting fires
all day. This was my favorite
part of this chapter
far and away, was the fact
that they just automatically time box it to say like,
you can only work on this,
but so much of the time we need you to like actually focus on making things
better.
Isn't that cool?
I mean like imagine they say,
all right,
Monday,
Tuesday you are on ops calls,
right?
Um,
Wednesday through Friday we want you to automate the things that you dealt with Monday and
Tuesday.
And then basically what they say is, hey, after a little bit of time, you're no longer
spending any time on those ops calls because you automated all that stuff.
Yeah, this requires strong management when you think about it, though.
What you're kind of saying is like, hey, you know what?
We've already spent 50% of our week on deploys this week, so we're not doing this deploy
for you on Thursday, and we're not going to check on your backups for you on Friday. We're
not going to look in your logs. We're going to work on automating. And that only works
if your management is willing to say, you know what, they're right. They spent the time they were
supposed to, and you've got to leave them alone to automate this or else we'll never catch up. Well, my guess is they'd probably have a rotation, right?
Like, yeah, you know, there'd be a few guys doing it or gals doing it Monday, Tuesday, and then a
couple doing it Wednesday, Thursday, whatever. Kind of, kind of. Oh, you read an asterisk on that? Yeah,
we'll get there, because later in the book there's a whole section on, like, error budget. And one of
the things is that like you could decide that like, hey, I can only afford to have, we'll get
in this more detail when we get to that chapter, but you might decide like, hey, I can only afford
15 minutes of downtime, you know, for whatever given time interval you have, right? However you spend that 15 minutes, you know, if you spend it all, then,
you know, you might not be able to do any more new releases because those new releases could
potentially add more downtime. And so, therefore, like once your downtime budget is used up,
it's used up, period. And they actually made a point of calling out that it might not be you or any member of your team that's responsible for it.
It might not even be anyone in charge of the physical machine.
It could be something like the power distribution to the rack that died, but it spent your entire error budget.
And so now for the quarter,
you might not get to do new deployments, right?
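The error budget idea described here can be sketched in a few lines of Python. This is only an illustration under assumptions: the class name, the field names, and the 15 minute figure are taken from the conversation or made up for the example, not from any real Google tooling.

```python
from dataclasses import dataclass


@dataclass
class ErrorBudget:
    """Downtime allowance for one time window (hypothetical names)."""

    budget_minutes: float        # e.g. the 15 minutes used as an example above
    spent_minutes: float = 0.0

    def record_outage(self, minutes: float) -> None:
        # Every outage counts against the budget, no matter who caused it
        # (a bad deploy or the power distribution to the rack).
        self.spent_minutes += minutes

    @property
    def remaining_minutes(self) -> float:
        return max(self.budget_minutes - self.spent_minutes, 0.0)

    def can_release(self) -> bool:
        # Once the budget is used up, new releases wait for the next window.
        return self.remaining_minutes > 0


budget = ErrorBudget(budget_minutes=15)
budget.record_outage(5)       # a bad deploy
print(budget.can_release())   # still budget left, releases allowed
budget.record_outage(10)      # rack power failure uses up the rest
print(budget.can_release())   # budget spent, releases blocked
```

The point of the sketch is that the release decision depends only on the numbers, which is why everyone has to agree on the budget up front.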
Interesting.
Yeah, it sounds totally crazy to me,
but it's interesting.
And so I'm keeping my mind open.
Well, so that's why his point about the strong management
is so relevant though,
because it really does require that everybody agree.
Everybody buys in to this and they agree to this right in there.
So,
yep,
we,
we've spent our budget and we have to wait until whatever our time interval
is before we can do it again.
Yep.
Next quote here too, is they said they want systems to
be automatic, not just automated. The only thing I'd mention here is that SREs tend to build up, like,
a playbook of basically, you know, common things: scripts, um, tasks, you know, things that need to
happen on a regular basis. And that's a good stepping stone to fully automating that stuff.
But the ultimate goal is to have that stuff be built into the system
itself.
So whatever,
you know,
human decisions are being made there,
get kind of built in.
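One way to picture the step from playbook to "automatic" is a table that maps each known alert to the fix a human would otherwise run by hand. This is a minimal sketch, not anything from the book: the alert names and fix functions are hypothetical placeholders.

```python
from typing import Callable, Dict


# Hypothetical fixes that used to live in a playbook as manual steps.
def clear_temp_files() -> str:
    return "temp files cleared"


def restart_worker() -> str:
    return "worker restarted"


# Alert name -> automated fix. Anything not listed still goes to a human.
REMEDIATIONS: Dict[str, Callable[[], str]] = {
    "disk_almost_full": clear_temp_files,
    "worker_stuck": restart_worker,
}


def handle_alert(alert_name: str) -> str:
    fix = REMEDIATIONS.get(alert_name)
    if fix is None:
        # No known fix yet, so this one remains a real alert.
        return f"page a human: no automated fix for {alert_name}"
    return f"auto-remediated {alert_name}: {fix()}"


print(handle_alert("worker_stuck"))
print(handle_alert("replication_lag"))
```

Each playbook entry that moves into the table is one less thing that pages a person.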
Yeah.
I mean,
I love this because think about like how many times do you have an alert
get triggered because some,
some condition happened and then you have a playbook of like,
Hey,
if there's ever this alert,
here's how you go and fix it.
And their point was,
well, then just automate the fix, right?
Yeah, exactly. That shouldn't be an alert anymore. Um, so one of the key callouts here too is, hey,
the only way to make sure that they are spending 50% on development instead of, you
know, pure ops work is you have to measure it. Right. And that again goes back to good management and making sure that that's
actually happening.
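Measuring the 50% cap could be as simple as adding up tracked hours per category. The sketch below assumes a made-up week of hours and made-up category names; how time actually gets tracked is not covered in the discussion.

```python
from typing import Dict, Set

OPS_CAP = 0.50  # the 50% cap on traditional ops work discussed above


def ops_share(hours_by_category: Dict[str, float], ops_categories: Set[str]) -> float:
    """Fraction of total tracked hours that went to ops work."""
    total = sum(hours_by_category.values())
    if total == 0:
        return 0.0
    ops = sum(h for cat, h in hours_by_category.items() if cat in ops_categories)
    return ops / total


def over_cap(hours_by_category: Dict[str, float], ops_categories: Set[str]) -> bool:
    return ops_share(hours_by_category, ops_categories) > OPS_CAP


# Hypothetical week for one SRE, in hours.
week = {"deploys": 10, "on_call": 8, "backups": 6, "automation": 16}
ops = {"deploys", "on_call", "backups"}

print(f"ops share: {ops_share(week, ops):.0%}")
print("over the cap" if over_cap(week, ops) else "within the cap")
```

In this made-up week, 24 of 40 hours are ops work, so the team would be over the cap and management would need to push ops work back.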
And this is my least favorite part of the book.
I don't want,
I don't want to be checking time.
Yeah.
Yeah.
Same,
same.
Um,
but they also called out that they,
Google has looked at this and SRE teams are cheaper than ops teams because
the SRE,
the SRE teams know the product well and they find out ways to prevent the problems that come up.
And so you don't have people doing the same thing over and over and over, over months and years.
And there are a couple of challenges with SREs that they brought up here.
So I mentioned hiring being hard. First of all, the kinds of people you're trying to hire, the kinds of people that everyone
wants to hire, the people who know how to do stuff, whether it's product work or even
networking stuff, it's kind of the middle position.
So it's competitive even with your own org, let alone like other work.
And at the time, this was a new title.
So how do you tell somebody, hey, I'm trying to hire you to be a flibbity flobbity flibbity.
They're like, wait, what is that job?
You're like, site, site, bid, and shibbity shibbity.
They're like, what?
Yeah.
And the book doesn't mention it, but also pretty sure SREs all wear pagers.
So that to me is a big downside.
So it's like, you're telling me you're going to pay me this or pay me that.
Both the same.
Great.
One has a pager.
One doesn't.
Hmm.
I didn't see.
Now, come on now.
They didn't say anything about pagers.
No, they didn't. That's because they're biased.
That's the dirty secret truth.
I mean, they honestly don't, but they don't talk about any kind of pager duty or anything,
at least not in the portions that I've read so far.
They talk about on-call schedules and how many defects you should be looking into per on-call rotation.
Yeah.
So, yeah.
On-call usually implies that.
And then, like he has here, it requires dev skills as well as
system engineering, right? Like, that was that thing that Outlaw was talking about, you know,
good software engineer with decent, um, system skills or vice versa. You need that mix.
And I didn't mention it, uh, here either, but, um, also I think good communication is of course important
for everyone, but especially for SREs, because of the postmortems. I think it's really tricky to kind of get those right.
And we'll be talking about postmortems later as well.
Also, one last thing we already touched on, this is requiring strong management in order to kind of protect those boundaries and being able to kind of have the back of the team in order to say no.
So, all right.
So this is just another DevOps title, right?
Is that?
Yeah.
Yeah, I mean, the book goes there. They say that DevOps is kind of a generalization of several core SRE principles
applied to kind of a wider range of orgs and stuff. But basically they say that an SRE
is a specific implementation of DevOps
with some idiosyncratic extensions.
And they are saying that SRE is a role
and that SRE is DevOps.
It's part of DevOps.
So in a way they are making the argument that
DevOps is a role.
Is a role.
Is a culture. This reminds me of
Monty Python, right? Yeah. What else floats? Ducks. Does she weigh as much as a duck? Yes. She's a
witch. And it's something we struggle with all the time, because we do know, you know, the reason why
we fight about this is we say that it's very important that the DevOps people be, like, intimately aware of the product
and how it works and how it needs to work and, you know, the things that are important.
They need to know the product, and that's why it's important that your people working on the
product know how to run their own stuff. And the SRE, I think, accomplishes that by having that 50%
kind of budget where they work on the product, we're going to automate the work. And so I think
that's how they, uh, reconcile that, at least in my mind.
Well,
I was thinking of it as like,
uh,
if you,
if you adopt DevOps as your culture,
then SRE is the position that comes out of it.
I like it.
Yeah.
That's,
that sounds about right.
So we agree.
DevOps is a culture.
I win!
With positions.
With DevOps, everybody wins.
Yes, I believe so.
So here are some tenets of SRE.
We're going to just kind of blast through the list here,
and then we're going to focus on each one a little bit.
Roughly, roughly.
So, availability.
And you got this first one.
Latency.
Performance.
Efficiency.
Change management.
Ooh.
Yeah, I beat you to that one.
Monitoring.
Emergency response.
This one is definitely Joe's.
I think the vowels on my keyboard aren't working very well.
Or in my mouth, I guess.
Emergency.
And finally, a capacity plan.
What's the first letter of that word? I don't know what's going on
So first let's talk a little bit about
availability, which they refer to
here as basically a durable focus on
engineering. So in order to keep the time
for project work, we said
SREs should receive a
maximum of two
events per 8 to 12
hour on-call shift. That's a very specific and small number.
I'd be good with that number. Yeah, it's pretty funny. They say if you have less than two,
then you've got too many SREs or you're not doing enough publishing, right? You're not taking enough
chances. It's kind of like gold plating, right? It's kind of funny. But two is still a pretty small number. I mean, I've had days where I would have loved for it
to only have been two. Yeah. And there's days when, you know, one outage can basically take
more than one day too. So, sure, they don't really say how big it is, but, uh, the idea is just that
the low volume allows the engineer to really get in there, spend the adequate amount of time in order to fix the problem, and then write up a good post-mortem about what happened.
That's the really time-consuming part, though.
That post-mortem can be consuming.
Yeah, if you do it well with timestamps and everything
and exactly what happened, how you found the problem,
that's not fun.
That's not fun.
So sometimes I think I want to be an SRE
and then I think about postmortems and pagers
and I think maybe no.
Yeah, but I don't know.
I think I still like it.
I like production support stuff
So, all right, so, uh, we'll just give all the production support stuff to
Joe and let him be the on-call person. Yeah, but I, uh, well, I only do like
half the work as anyone else anyway, so...
A big part of those postmortems, uh, is that they have to be blame free, which is something
we've talked about on the show before, which is just the idea that you're not attaching
really people to it. It's just more about the processes and where things went wrong, and you're
not looking to try and blame someone, which is very easy to do. Another concept from the
DevOps Handbook, yep, was the blame free. Yeah, and they also say that the postmortem should be written for all significant
incidents, whether paged or not. So even if you saw
an issue and it didn't alert anything, you should
still write the same postmortem for it.
Yep, which is pretty disciplined.
Yeah. I guess where I struggle with some of that too,
and this is where the limitation of the two events could come from
because I could see spending a pretty good amount of time
just on some of the write-ups for those things,
depending on the level of event, obviously,
but in the severity of it.
And so I kind of wonder like, you know,
if they have any kind of guidance, maybe they'll get to it later in the book,
but if they have any kind of guidance on that postmortem,
like you shouldn't be spending more than like 30 minutes max writing that
thing. If you can't write it in that amount of time, then it's,
you're either putting too much detail into it or it's like a much,
much larger problem,
you know?
Yeah.
And I will say too,
so they mentioned 8 to 12 hour on-call shifts and doing these
postmortems.
Are you talking about hours of work?
These on-call shifts are happening during your kind of work hours.
Basically,
this isn't your on-call from,
you know, Saturday night, whatever type stuff, where you're writing these postmortems and doing,
you know, these hours of work. So, you know, I've been joking about the pager thing, but I think a
big part of it is that these are kind of your normal work hours. You're just the person with
the bat phone, you know, that needs to kind of take point on it.
Max change velocity.
So this section is referred to as latency.
And normally when we talk about latency, you talk about like the amount of time,
idle in a browser or something,
waiting for input and output.
But in this case,
we're actually talking about limiting the amount of change.
And this is where we talk about things like an error budget, which is an interesting way to kind of balance innovation and reliability. Because if you are pushing out new features, breaking things, doing things that require maybe scheduled maintenance, right? And let's face it, most changes can be done, even to databases, a lot of times these days without any sort of downtime.
But you might choose to take the database down because it's a lot faster and easier than trying to migrate in such a way where you, I don't know, bring up a replica, spin it up, sync it over while you make changes.
You know, it's just expensive and time consuming. A couple of factors we're going to talk about in a second: how much tolerance your users have for disruption.
Yeah, this goes back to where that 100% uptime is generally considered not worth it.
And it gets more expensive as you get closer to that 100% mark. And your customers might not even realize it.
And so therefore, trying to get to that 100% is just wasteful.
Yeah.
And what is the right number?
It's a business decision.
It depends on a lot of factors like what the users expect, how important your –
I shouldn't say important, how critical your service is.
Like, do they have a workaround?
How well does the experience degrade if part of it is working but another isn't?
I was terrible with the typos today.
And have you ever had a manager push back when you're talking about technical debt? You're talking about what you need to do, and, you know, it's important that we refactor this and that because it's taking too long to actually make changes here. And the manager pushes back and says it's fine for now, we've got things that need to go out the door, because it's, "We're waiting, and the business is going to die if you don't have this done by Friday night."
I thought one thing that's kind of nice about just having the measurements here, to kind of show what your actual disruption is and your disruption budget, is it's kind of a nice way to say, like, hey, this is what we're losing by pushing stuff out faster, or this is what we're losing by not pushing things out faster.
The problem is, though, when you're a large enterprise company, it's easier to have these types of tenets in your company and for management to buy in.
But when you're a small business and maybe there's 20 developers total in the company, it's much harder for the management to buy into some of these things.
Yeah. Tech debt is especially hard to measure because you're talking about measuring something that is an idea to a certain degree.
Right. Unless unless you take measurements early, like it's much easier to do at the beginning if you do this.
But if you measure, hey, how long did it take to get a feature out and deploy?
You know, when we first started this versus, you know, a year in now, these releases are taking two weeks longer.
You know, if you measure it all the way through, then you can do that. But when you're just talking
about things that people have short-term memory loss on, it's hard to throw those metrics at a
manager and be like, look, if we don't spend time on this tech debt, this is just going to get worse
and worse. And that's just an idea. They're like, work harder, you know, and that's hard to fight.
Yeah. There was like going back to like, what's the amount of uptime, you know, and that being a business decision.
Later in the book, they talk about an example where, you know, even for the same product, the same service, you might decide to have two versions of that thing that run at different levels of uptime.
And, you know, you charge differently for them, right?
Because maybe they're actually targeting different customers.
Like maybe different customers want to use it for batch type jobs, where, you know, they don't need it to be super reliable, you know, 24-7.
But when they do, they want to slam it and they want like fast as possible versus another customer who might want, you know, five nines of reliability 24-7 because of whatever their need might be.
And so that ultra high reliability use case, you know, we're talking about the same software, but we're going to configure it in two different ways for different purposes. Right. So that's where that,
that business decision comes into play that,
you know,
we, as the software developers, won't care.
We won't decide that.
Doesn't S3 have two levels of durability?
Like, I think it's like six nines or 11 nines or something.
It doesn't matter.
But yeah, good call.
So fun question.
What could a team do if there's no more room in the budget for downtime?
Yep.
As far as like they don't get to do anything, you mean in terms of like new deployments, right?
Or did you mean, like, what can they do to solve that for next time?
I just meant, like, what do you do? It's basically nothing. You don't do anything risky, right? You calm it down. Just, you know, a hard pill to swallow sometimes.
And that was Google's answer, by the way. You know, they talked about it on, like, quarterly budgets, and they're like, well, you know, unfortunately that networking switch or that power distribution block, whatever the case might be, you know, it just happened to die in that rack, and so it spent the entire budget. All right, well, you don't get to do any more deployments for the quarter, you know?
Yep. That seems so unlikely to me, but, I mean, that's the premise. I'm trying to keep that in mind. Just imagine, like, S3: someone spills champagne on January 1st, a little too much to drink in the server room, and they knock a rack out, and, you know, there's an outage for one day, and that's the entire budget for the quarter. Seems crazy to think that, you know, they would accept that.
But that's the premise we're going after.
So I guess, you know, in an extreme case, they would maybe make an exception.
But that is kind of the assumption we're operating under.
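To put rough numbers on the quarterly budget idea discussed above, here is a minimal sketch. The 99.9% target, the 90-day quarter, and the names `ErrorBudget` and `can_deploy` are assumptions made for illustration; the hosts do not give specific figures or describe Google's tooling.

```python
from dataclasses import dataclass

SECONDS_PER_DAY = 24 * 60 * 60


@dataclass
class ErrorBudget:
    """Hypothetical helper: downtime allowance implied by an availability target."""
    slo: float        # availability target, e.g. 0.999 (assumed value)
    window_days: int  # budget window, e.g. a 90-day quarter (assumed value)

    def allowed_downtime_seconds(self) -> float:
        # Everything the target does not promise is budget you may spend.
        return (1.0 - self.slo) * self.window_days * SECONDS_PER_DAY

    def remaining_seconds(self, downtime_seconds: float) -> float:
        return self.allowed_downtime_seconds() - downtime_seconds

    def can_deploy(self, downtime_seconds: float) -> bool:
        # Once the budget is spent, risky changes stop until the window resets.
        return self.remaining_seconds(downtime_seconds) > 0


budget = ErrorBudget(slo=0.999, window_days=90)
# 0.1% of a 90-day quarter is roughly 7776 seconds, about 130 minutes.
print(round(budget.allowed_downtime_seconds()))
# A 30-minute incident leaves room for more releases...
print(budget.can_deploy(downtime_seconds=30 * 60))
# ...while a full-day outage spends the whole quarter's budget many times over.
print(budget.can_deploy(downtime_seconds=SECONDS_PER_DAY))
```

Under these assumed numbers, the one-day outage from the champagne example would freeze deployments for the rest of the quarter, which is the trade-off the hosts are describing.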
Here's a more fun question.
What do you do if you're at the end of the quarter and you still got budget?
You just release everything.
You just turn it off.
Just take a break.
No,
be back on Monday.
That's awesome.
Yeah.
I don't think it works like that.
I think,
I mean,
the real question is,
yeah,
I mentioned, like, database migrations and stuff. There's lots of times where a rolling upgrade is the right answer, and then you apply a migration, and then once everybody's transitioned, you do another rolling upgrade, and then you do whatever. Or you just take the site down for three hours and have it all done with everyone there in one room, and, you know, you could do something in three hours that would have taken three weeks otherwise. So I think, you know, that's the kind of stuff you can do, but you've got to be careful with that budget.
It's good.
Yeah. But even in that scenario, though, it's going to depend on what the service is that you're
doing, because like if the thing that you're trying to do is like Google dot com, then
you're probably not going to want to take any like you're not going to want to like
purposely introduce that downtime.
But some of these examples are hard to discuss, too, as it relates specifically to Google, because we're talking about such a massively distributed worldwide system. And they even call this out in a part of the book where they're talking about, you know, like, I've at least twice now given an example about losing a power distribution block or a network switch within the rack, right?
But even in Google, the way Google is set up, everything is distributed across so many systems and across so many different regions and across so many different areas around the globe that even if they were to lose an entire continent, there's a possibility that some portion of that service might be serviced by
another country or, you know, something like that, right?
Right. So here's a question I didn't ask but should have: as, I don't know, an owner of a company or a director of an organization, how do you enforce your budget?
You tie it to bonuses. That's how. You tie it to raises. You tie it to whatever you say is your error budget, and if you stay within this budget, you get this much more bonus. That's a way to kind of encourage everyone to take it very seriously.
Yeah, I could see that working.
All right, so next we're on to monitoring, and this one's pretty important. So monitoring is how you track
the system's health,
right? Like that's, everybody's probably pretty familiar with that. Well, the classic approach to this, and this is probably what most companies out there do, because this is how people have
operated forever is when there's a problem, an alert gets sent out. And, and that happens when
there's like some sort of event that happens in the process or some threshold was crossed, whatever, but you know, the, the gist, right? Some alert was sent out and, and typically when
that alert gets sent out, then somebody goes and handles it somehow, right? Like going to go look
at production or whatever the case may be. Well, they say that this is a flawed approach because
anything that requires human intervention is by its very definition,
not automated. So they're saying that software should be interpreting whatever is happening
and people should only be involved if software can't take care of the problem. And honestly,
this, this makes a lot of sense, but this isn't how most people have
thought about it over time. Right. Um, and so they say that there are three types of valid
monitoring that you have. You have an alert and that means that a person needs to take immediate
action, right? That they have to get involved. There's a ticket. So this is when the system can't
automatically handle whatever happens, but it doesn't need to be looked at immediately,
right? Create a ticket. Somebody can get it done within the next couple of days.
And then logging. Logging, nobody needs to do anything. Probably never even look at them unless
there's some sort of event that says, hey, you need to go look at these logs. So that's what they call out, and they really want to minimize the amount of human interaction that happens in the system at all.
And then the next piece that we got here is the emergency or emergency response.
It's actually the same thing, just pronounced differently.
It is the same thing, that's right, emergency.
So I have a quote in here because you just couldn't have said it any better.
Reliability is a function of mean time to failure,
which you've probably seen as MTTF before,
and mean time to repair. So MTTR.
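One common way to make that quote concrete is the steady-state availability formula, MTTF / (MTTF + MTTR). The hosts do not state this formula, so treat it as an assumption about what "a function of" means here; the function name and the hour figures are made up for illustration.

```python
def availability(mttf_hours: float, mttr_hours: float) -> float:
    """Fraction of time a system is up, assuming it alternates between
    running for MTTF hours and being repaired for MTTR hours."""
    if mttf_hours <= 0 or mttr_hours < 0:
        raise ValueError("MTTF must be positive and MTTR non-negative")
    return mttf_hours / (mttf_hours + mttr_hours)


# Same failure rate (one failure every 720 hours, an assumed figure),
# but repairs that take 3 hours versus 1 hour.
slow_repair = availability(mttf_hours=720, mttr_hours=3)
fast_repair = availability(mttf_hours=720, mttr_hours=1)

print(f"3h repairs: {slow_repair:.4%}")
print(f"1h repairs: {fast_repair:.4%}")
```

The point of the sketch is only that shrinking repair time raises availability even when failures are just as frequent, which is why repair time gets so much attention in this section.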
So the best metric for determining the effectiveness of the emergency response is the mean time to repair how quickly you got things back into a healthy state. Um, people add latency.
This is really good. This is why they want systems to handle everything because people are
slower.
Yeah, you're away at dinner when you get the alert, or you're driving your car, so you can't possibly, you know, get to a computer within the next 30 minutes to even look at it. That's latency that the people are adding.
Yeah, and usually there's even communication on top of that, right? Like we've talked about in the past, an alert goes out. All right, well, now I need to coordinate with this person, that person, whatever. Um, but what they're saying here,
and this is pretty cool is if you can avoid having a person be involved at all, even if there was a
problem and it required, and it takes a little bit of downtime, if the
system can handle it, ultimately it'll probably be more available than it was if a person had
to touch it in the first place. And that's pretty cool. Now here's another part, and this is pretty
good. Thinking through problems before they happen and creating the playbooks, they said, resulted in a three times improvement in the mean time to repair as opposed to winging it. So like what Margaret Hamilton did, right, where she had written that thing up,
the fact that she had it there meant that the people could
go look at the instructions on how to get the thing back into a healthy state instead of people
going, oh, I think you do this. And maybe if you do that, it just goes a lot smoother.
And then they also said that their on-call SREs, they always have playbooks when they're doing
things. And then they also go through these exercises they call wheel of misfortune
that allows them to prepare for these events.
So my guess is they probably simulate some sort of failure and say,
all right,
go fix it.
Right.
And,
and that's probably what happens.
I really liked the idea of that,
but the wheel of misfortune.
But then I was also thinking like, okay, wait a minute.
Is this just like part of the hiring process?
You give a new candidate the wheel of misfortune, see what they do to fix the environment?
Or do you like really play this like, hey, on Fridays, we're going to play wheel of misfortune.
The chaos monkey, but you're part of it.
Yeah.
So change management.
Interesting stat here.
70% of outages are due to changes in a live system, which I guess makes sense.
You wouldn't expect outages to happen in dead systems, but it was still kind of surprising to me to see that.
A couple of best practices we got here.
Progressive rollouts.
So we've talked about canary deployments before, just rolling upgrades where nodes or instances or whatever kind of go out one at a time.
And if things are not looking good, you can stop and roll back, or use something like a blue-green environment.
There's all sorts of different ways of doing this.
But that's the gist of it.
In order to do that, you need to be able to actually detect problems accurately and quickly and then actually be able to do that rollback if possible.
So that's the kind of stuff that SRE is going to be
building into systems because that stuff is not
trivial, especially once you start talking about data migrations.
And the idea here
is just to remove people from the loop.
So automate, automate, automate.
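The progressive rollout idea above can be sketched in a few lines. This is a toy illustration, not how any particular deployment tool works: `progressive_rollout` and the `deploy`, `healthy`, and `rollback` callbacks are hypothetical names, and a real system would check health over time rather than once.

```python
from typing import Callable, Iterable, List


def progressive_rollout(
    nodes: Iterable[str],
    deploy: Callable[[str], None],
    healthy: Callable[[str], bool],
    rollback: Callable[[str], None],
) -> bool:
    """Upgrade nodes one at a time; on the first unhealthy node, undo every
    upgrade made so far and stop. Returns True if the rollout completed."""
    updated: List[str] = []
    for node in nodes:
        deploy(node)
        updated.append(node)
        if not healthy(node):
            # No person in the loop: detection and rollback are automatic.
            for done in reversed(updated):
                rollback(done)
            return False
    return True


# Tiny in-memory demo: node "b" fails its health check after the upgrade.
versions = {"a": "v1", "b": "v1", "c": "v1"}
completed = progressive_rollout(
    nodes=list(versions),
    deploy=lambda n: versions.__setitem__(n, "v2"),
    healthy=lambda n: n != "b",
    rollback=lambda n: versions.__setitem__(n, "v1"),
)
print(completed, versions)
```

In the demo, node "c" is never touched and "a" is put back on the old version, which is the "stop and roll back" behavior described in the discussion.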
Another fun one.
This is one that I especially enjoy: demand forecasting and capacity planning. So the idea here is that forecasting helps you ensure service availability while keeping your costs, you know, in check and kind of within budget.
And the idea here is to account for both organic growth, which is like your normal usage patterns, but also to try and account for inorganic growth, which is things like major launches or marketing events, or maybe, you know, some celebrity tweets about your product or something, getting some sort of unnatural spike. And that's, you know, really hard to do, but you can kind of imagine what that would look like with like 10x or 100x growth on any sorts of numbers that you come up with.
Three steps here. So I like this one: you need to have an accurate organic forecast, and the important bit here is that you need to have your forecast extend beyond the lead time for adding capacity.
So if it takes you three months to order a new server, then you better be forecasting out more than three months. If it takes you an hour to add a new node or to, I don't know, add a new load balancer or whatever it is, then you need to forecast at least that much. So I thought it was kind of an interesting way of saying, like, figure out what your lead time is for capacity,
whether it's disk, instances, nodes, like all that stuff, because you may need it.
Also, you need to try and incorporate inorganic demand.
They didn't go into this.
Maybe they do in a later portion, but I imagine this is just kind of like trying to say like, well, this is what a spike would look like if you try to imagine what would happen if you sign on a big tenant or a big client or Taylor Swift tweets out your product or something.
Imagine there's some just multiplication of numbers there.
And then the final piece here of the Triforce, regular load testing.
This is not something you do just once when you launch
a product. This is something that needs to be ongoing because
you're making changes to the system all the
time, so you need to have that be a part
of it. That stuff
takes time.
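The lead-time rule from the three steps above can be sketched as a small calculation. Everything specific here is an assumption for illustration: compound monthly growth as the organic model, a flat multiplier for inorganic spikes, and the names `forecast_demand` and `must_order_now`. The book's actual forecasting method is not described in this episode.

```python
def forecast_demand(
    current: float,
    monthly_growth: float,
    months_ahead: float,
    inorganic_multiplier: float = 1.0,
) -> float:
    """Project demand forward, assuming compound organic growth and an
    optional multiplier for a launch or other unnatural spike."""
    organic = current * (1.0 + monthly_growth) ** months_ahead
    return organic * inorganic_multiplier


def must_order_now(
    current: float,
    capacity: float,
    monthly_growth: float,
    lead_time_months: float,
    inorganic_multiplier: float = 1.0,
) -> bool:
    """True if demand is forecast to exceed capacity before new capacity,
    ordered today, could arrive."""
    projected = forecast_demand(
        current, monthly_growth, lead_time_months, inorganic_multiplier
    )
    return projected > capacity


# Assumed numbers: 1000 requests/s today, 1500 requests/s of capacity,
# 10% organic growth per month.
print(must_order_now(1000, 1500, 0.10, lead_time_months=3))  # about 1331 at arrival
print(must_order_now(1000, 1500, 0.10, lead_time_months=5))  # about 1611 at arrival
print(must_order_now(1000, 1500, 0.10, lead_time_months=3, inorganic_multiplier=10))
```

With a three-month lead time the assumed system still fits, but with a five-month lead time, or a 10x spike, the order has to go in now. A shorter lead time pushes that decision later, which is the provisioning point the hosts make next.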
But also
consistent with the DevOps handbook.
Yeah, totally.
Totally.
At least in terms of like the regular load testing.
Yeah, absolutely.
This is very similar.
So provisioning, and just kind of like we said earlier,
the faster your provisioning is, the later you can do it.
So if something takes three months,
you need to figure out three months, you know,
ahead of time in order to order it.
If something only takes five seconds to provision, like maybe adding more RAM or something or
adding another node, then you can do that like five seconds before you need it.
And what that means is you can be much more efficient with it, right?
So if you're ordering servers three months ahead of time, there's a good chance you're
going to get that server before you need it.
It's going to sit around idle.
It's going to take a while to get plugged in. You may not even need it at all. It's just less efficient overall to bite things off in bigger chunks like that. So another way of saying this is the later you can do it, the less expensive it's going to be. And they also mentioned that not all scaling is created equally. So, like, adding a new instance to a stateless workload? Trivial, right? You can set up an autoscaler and you don't even have to think about it anymore. Adding another partition to a Kafka topic? Oh, you know, there's a lot of implications there that have effects on producers and consumers and all sorts of stuff.
Maybe even replication.
And it's going to take a little while to roll out.
It's going to be a process.
And so it's just about kind of figuring out everything you need to care about and what it takes to maintain it.
Looking at the next section.
So the last one here is efficiency and performance. So basically the SREs are in charge of provisioning and usage, so they're close to the money. They're responsible for it. People are going to ask them, how much is this going to cost? How much do we spend? How much can you save us next quarter? And so it's important that you know how to maximize your resources, which fundamentally affects the success of a project, which is pretty cool when you think about it.
So you were joking earlier when you said that we could tie it to the bonuses, but maybe not so much.
No, really. Say we want you to save 9% in operational costs.
And if you do, you get 100% of your bonus.
If not, it's going to start scaling down or whatever.
And it sounds a little cold to say that, but I think the idea is that by associating it with a
financial incentive, you're really sending a strong message to the whole organization
that this is important.
And people are going to really get upset if your organization or your developers aren't
taking these goals seriously.
So people are going to come after you.
And it just keeps the org on track.
Well, I mean, it might sound cold,
but at least it puts it into,
uh,
something that's within your control,
you know,
totally.
Like, how many jobs have you ever had where, you know, anything about bonuses or pay raises or anything like that were completely outside of your control?
It's like, well, it depends on how well sales did, but I'm not in sales.
Yeah. But sales drives everything, and so, you know, clearly if you did a good job making the product, then they would have no trouble at all selling it, right? And so, therefore, you know, that's why your bonus is tied to how well they do.
Yep. Or you've got your bonuses tied to net profits, but we're having a good year this year, so we're going to do a stock buyback, we're going to slow down on infrastructure, or do whatever we can in order to make sure that profit doesn't show up on
the books.
Yeah, totally. So, you know, kind of a cool little balance here they mention is that systems get slower as load is added. It's never going to speed up as you add load to a system, right? And slowness can also be viewed as a loss of capacity. So your system starts as a blank slate: zero users is full capacity. Every user, every, you know, system that you bring on, all your traffic is reducing that capacity. So you're trading off how much money you spend to set that stuff up and have it available against the speed at which your system runs.
So it's just kind of a cool way to think about systems basically being a balance between capacity and usage and what that means to your cost and how much things cost to run.
So you're excited to be an SRE, right?
Like this is next in the career path?
I think I would be happy.
I tend to kind of like production support stuff too.
If you want me to focus on a problem, just tell me that there's something wrong with it.
I want to know.
So I think I would want, I would, I would want to
be a SRE. I actually liked the, um, I don't know the feeling of accomplishment that comes with
automating a task, right? Like there's, there's almost, I don't want to say instant gratification
here, but, but fairly quick gratification. if you were to automate things that people had to
be involved in previously. And seeing that happen is rewarding, right? Like that's part of what I
like about doing software in general. So yeah, I think that that would be kind of interesting,
right? Like this self-healing, self-reliable system, that's pretty cool.
I think it'd be cool to have goals like, can I make builds fail 10% less often, or can I save, you know, 10% time, or, you know, whatever, like increase uptime. That all sounds like cool stuff to come up with and go after to me.
Yeah, yeah, I was just
gonna say like you know automating this stuff is also fun too, right? But yeah, so Alan said it better though. So did it.
So, uh, yeah. So we'll have a lot of links in this episode. Uh, you know,
namely we'll have links to the book itself, which you can find for free at sre.google/books.
But we'll have that and other links in the resources we like section of the episode.
And with that,
we head into Alan's favorite portion of the show.
It's the tip of the week.
All right.
So this one's kind of apropos for this particular episode because we're
talking about Google and,
and their SRE stuff.
Well,
I had something, an interesting thing come up the other day
that had to do with caching and evictions at certain timeouts,
like seeing if something had come in before.
And if it had, not doing it again, right?
Like trying to de-dupe stuff in sort of a smart way.
And so in my mind, I was thinking, okay, well, if I had some sort of
cache or some sort of hash table, and if I could evict those members that were put in 15 minutes
ago, right? Like anything that's older than 15 minutes, kick it out because I don't care about
it anymore. Then that would be interesting. It'd be a nice way to handle this particular
caching type thing that I wanted to do. And so I got thinking about, I was like, man,
there's got to be something out there. Right. And I, and I'm working in Java
or more specifically, I'm working in Kotlin, but I can use some Java, um, to make this happen.
And Google, I've seen this library come up in a number of projects that I've looked at. I want to say maybe even Kafka stuff, Flink possibly.
I don't even know.
But Guava.
So Google has a library called Guava.
I've put a link directly to the wiki.
I didn't put a link straight to the project because there's not a bunch of information about what it offers in here, but this is a whole set of utilities, collections,
graph capabilities, like all kinds of stuff that can help you in your regular Java application.
So the one that I'm talking about that would have solved the problem that I was just mentioning
is they have a caching thing. And in this cache, they
have the ability to populate and evict from the cache on an automated type basis. And that's
fantastic. They have immutable collections. So like if, you know, typically when you're looking
in Java, you have, you know, your hash map or your map or whatever, and those are mutable. Well,
they have immutable collections. They have these graph libraries,
they have all kinds of things. So it's worth looking at this library because it solved a lot
of problems that, that Google uses in their own distributed systems to solve a lot of issues that
you may encounter. So I'd say, you know, I say it all the time with folks I work with and in general, maybe even on the podcast is I'm not opposed to writing something myself.
Right. And I'm not opposed to other people writing something themselves.
But a lot of times, if there is something out there that is already battle tested and it does what you need it to do, it's probably well-tested and been proven,
so maybe it's worth looking at that.
So definitely check this out if you are in the Java world
because it may help you out in a number of different ways.
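To illustrate the de-dupe idea from this tip (remember what came in recently, forget anything older than 15 minutes), here is a minimal sketch. It is not Guava and does not use its API; it is a plain standard-library illustration of the same time-based eviction idea, and the class name `ExpiringSeenSet` and its methods are hypothetical.

```python
import time
from collections import OrderedDict
from typing import Callable, Hashable


class ExpiringSeenSet:
    """Remembers keys for ttl_seconds so repeated work can be skipped."""

    def __init__(self, ttl_seconds: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._ttl = ttl_seconds
        self._clock = clock
        # key -> time first seen; insertion order is also age order because
        # existing entries are never refreshed.
        self._seen: "OrderedDict[Hashable, float]" = OrderedDict()

    def _evict_expired(self, now: float) -> None:
        # Drop entries from the old end until one is still within the TTL.
        while self._seen:
            _, added = next(iter(self._seen.items()))
            if now - added < self._ttl:
                break
            self._seen.popitem(last=False)

    def seen_before(self, key: Hashable) -> bool:
        """Return True if key arrived within the TTL; otherwise record it."""
        now = self._clock()
        self._evict_expired(now)
        if key in self._seen:
            return True
        self._seen[key] = now
        return False


recent = ExpiringSeenSet(ttl_seconds=15 * 60)
for event_id in ["order-1", "order-2", "order-1"]:
    if recent.seen_before(event_id):
        print("skipping duplicate", event_id)
    else:
        print("processing", event_id)
```

A battle-tested library cache, as recommended above, would also handle concerns this sketch ignores, such as thread safety and size limits.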
Well, I got a tip from my go-out-law today.
Oh, bring it.
Have you ever heard of git cherry-pick? What it lets you do is pick a commit over from one branch into another, and this is great if you have like multiple current releases, or sometimes if you just goof something up, or you need to bring something over from another branch into your branch. You just do git cherry-pick and pass the commit hash.
You can look it up any variety of ways and you'll get it.
Well, today I did something on accident.
I accidentally pasted the branch name that I was going to cherry pick from
instead of the commit.
I copied the wrong thing somehow.
And it worked in the way that I expected.
It grabbed the commit that wasn't in the branch I was cherry-picking it to, which is very surprising to me.
And so I went and I looked at what all you could pass to cherry-pick, and surprisingly, the docs, uh, not great.
Like, it doesn't actually mention that you can pass a branch name.
It's the very first example they give.
Whatever the current branch head points to.
Oh, okay. Yeah, so that was going to be my question, Jay-Z, is if you'd had
multiple commits in your other branch, would it have grabbed those multiple commits?
Yeah, so I did a little bit of reading on that, and it would have grabbed the top commit.
The latest?
Yeah, the latest. So just the one. But you can cherry-pick multiple commits at one time, which I did not know. You can actually pass a range. You can say, like, this commit hash, dot dot, it's just two dots, to that one, and it's
going to grab a range which is something
I didn't know about many times I've gone and cherry
picked a bunch in a row and so that was
kind of nice to know there's actually a bunch of other
different flags that are pretty cool too
like there's even one for no
commit which is pretty interesting it applies
the changes from the cherry pick
blah blah blah without actually making
the commit. So it gives you a chance to kind of
slice it up and do it a little bit differently.
But I just thought it was pretty cool to see that there were other things there.
There's also a sign off flag in case you want to kind of
update a commit message.
Yeah, the takeaway though is that
if you're going to use that git cherry-pick and then the branch name pattern, you have to be careful because you have to know that you only want the one single commit that is the tip of that other branch.
Because if you only specify the branch name and nothing else, that's all you're going to get is that single tip commit.
And I'm fairly certain that this would not work if that commit... well, no, I guess what I'm trying to say is if the tip of that branch was a merge commit, I still think you would have to apply the dash M to specify the mainline.
I believe. The point is, your mileage may vary, and you may run into trouble if the tip of that branch is a merge commit.
You can also do a dot dot in your branch name.
It'll get all of them.
But so I don't think you should do any of these though.
These are all
terrible it's all really confusing like you really have to know this command and all its various
flags in order to do this correctly so this is a terrible tip and you should just not do this if
you can avoid it i wasn't trying to go there at all please don't take that away from it no no no
i think i seriously think that you should not be it's such like a weird behavior that you're
relying on so i don't think anyone should actually use this.
This is an anti-tip.
Don't do this.
The real tip is you should read the documentation for commands.
Even the ones that you've run a million times before.
And just see.
Because sometimes there's stuff in there that might help you out.
That you've kind of overlooked.
And there's good stuff in there.
Not in this case though.
These are all terrible.
But sometimes there are.
You said RTFM is your tip.
Yeah.
Okay.
Just sometimes.
They're all swell.
You were shot.
Yeah.
Okay.
Okay.
RTFM.
I'll have to make a note of that.
What does that mean?
Yeah.
I can't say it.
Yeah.
I can't.
Okay.
Well, so for my tip of the week, then I had this one because like the three of us, we do a lot of stuff just using the keyboard, right?
And so like I know like iTerm, for example, we're users of iTerm, right?
And right?
Yes.
Wait, whoa, right? I love it. Yeah, definitely. So,
so like, you know, you can create multiple tabs in it. Like I have this habit of like,
I'll create tabs for different things. And, and it's in like, I have the habit like on a Mac of
going like, uh, well, I guess iTerm would only be on the Mac, but like, uh, command, you know,
one or two or whatever the number of the tab is that I want to navigate to.
And I was going back to Chrome where I have like 18 billion tabs open in Chrome.
And also Chrome introduced a feature a few releases back where you can group your tabs.
I don't know if you've been doing that.
Maybe that should have been my tip of the week.
But yeah, you can like...
Was it? Okay.
Because you could like right click on a tab and create a group and you can give it a number so you can have like different things
that you're working on in different groups of tabs together
that are all color coded together and everything.
And it just makes, like, if you are like me and you have to
context switch a bunch, you know, sometimes it's easier to have just a big group of tabs already
together that, you know, or whatever that context switching is. But I also tend to have like some
tabs pinned. And when you pin a tab in Chrome, they automatically go to the front, right?
So like for my Gmail, right?
I'll typically have Gmail as one of my first tabs.
I'll have Slack as another tab.
I'll have my calendar as another tab, things like that, right?
And with all these 18 billion tabs open, Sometimes it's handy. Like I might be on, you know, tab 123 and, but yet I
want to quickly go back and check my email because I'll see like the notification or I'll see the
Slack. I'll see that one of those two things has a thing. And, you know, you can just, like, control your way through, like control, function, I think, and arrow keys on the Mac keyboard, to scroll one direction or the other through all your 18 billion tabs that are open until you get to that one. But I found out you could also just, on the Mac, press control and the number and, boom, navigate to that tab. Now, that's great for your first 10 or nine tabs at least.
But in my case, that was first eight. Yeah. Sorry. That was good enough because, uh, you know,
like I said, like Gmail and Slack were like the two big ones and those are, those are pinned.
So they're always in that one and two position. Right? So at any rate, I'm going a long way around saying like control plus a number,
and then you can navigate to that tab.
But so I'll also have a link there to just the Chrome shortcuts in general
that apply to both Windows, Mac, and Linux.
But hold up, hold up.
You said you have Slack open in a tab.
Why do you not have the app installed, sir?
Because I'm in Chrome more than I am anything else, so why would I?
Because then you just, then you can command tab to get to your, to get to your Slack. I mean, you...
No, but then, but then actually that's more annoying. So, uh, I, I will advocate for this. We will fight. So, um,
no, because now, like if, if you're in, especially like if you're working on your laptop and like,
when I work on my laptop, I tend to like take things into full screen mode on, on the Mac.
So like if I have Chrome in full screen mode, that's all I see. And,
um, but you can see all the tabs, you know, the headers for the tabs. And if there was a
notification in Slack or if I had emails that came in, like I can see that notification while I'm in
some other, you know, like while I'm reviewing a pull request or I'm, you know, reading a build log or whatever,
you know, I can see that and be like, oh, what is that? Control one or control two. And I can
automatically go and check that thing, you know, as time allows for it. Right. Whereas if it was
in another window, well, then until I happened to go click to that window, I might not even notice
it.
Yeah, that's full screen on Mac.
I almost never do full screen on Mac for that reason.
It annoys me.
So that would be the difference.
Even when you're just working solely on the laptop, you don't do full screen in the Mac?
Nah, I hate full screen on Mac.
Okay, we found our survey.
Okay, so forget that other survey I asked.
What?
That is insanity.
I straight up hate it for that very reason that it hides so many things from me.
And there's one other reason.
There's one other reason.
And this one's fully fair is because I use the Kinesis Advantage and trying to freaking contort your fingers to do something to switch between screens is like I mentally have to jump through hoops to do that.
But you wouldn't be using the Kinesis Advantage when you're on your laptop.
And I'm saying when you're on the laptop,
you said you don't have any other peripherals connected to it or else you're
not on the laptop.
So,
so,
uh,
I mean,
there's two things.
One is you reclaim some real estate,
which is valuable if all the monitor you have is the laptop.
But number two reason is actually exactly opposed to your number one reason.
I want, I go full screen because I want the focus of whatever that app is.
I don't want the distractions of the other thing.
Yeah.
Right?
I can't have that.
That's the downside of having,
of seeing that there is like an email
or a Slack notification
when I'm in Chrome full screen
because then I'm like,
oh, well, all right.
Control one.
What was it?
Okay.
Nothing big.
Go back to my other window.
And I can't like control 18 billion, so I gotta,
like, fine, I'll use the mouse to click on that tab.
It's like dialing an old school phone number for you. Control one, two, four, five.
Right, right. I need a rotary dial to get through all my Chrome tabs.
That's right. All right, well, uh, we'll argue about why Alan is wrong later.
But in the meantime, you can subscribe to us on iTunes, Spotify, Stitcher, and more using your favorite podcast app.
And if you haven't already left us a review, you can find some helpful links at www.codingblocks.net slash review, where you can also let Alan know why he's wrong.
I'm not wrong. Nobody likes full screen on Mac.
So while you're
up there at codingblocks.net, check out our
show notes, examples, discussions, and more.
And leave a comment on this episode at
codingblocks.net slash
episode 181.
And send your feedback, questions, and rants
to the Slack channel at codingblocks.net slash slack.
And hey,
make sure to follow us on the bird site at CodingBlocks or head over to
codingblocks.net and you find all the deals at the top of the page.