Coding Blocks - Site Reliability Engineering – Embracing Risk
Episode Date: April 11, 2022
We learn how to embrace risk as we continue our learning about Site Reliability Engineering while Johnny Underwood talked too much, Joe shares a (scary) journey through his mind, and Michael, Reader of Names, ends the show on a dark note.
Transcript
You're listening to Coding Blocks, episode 182.
Subscribe to us on iTunes, Spotify, Stitcher, wherever you like to find your podcasts.
I certainly hope we're there by now.
And hey, if you can, leave us a review.
Yep.
Visit us at CodingBlocks.net where you can find our show notes, examples, discussions,
and more.
And send your feedback, questions, and rants to comments at CodingBlocks.net.
And see what tweets we got for you. We got lots of tweets over at @CodingBlocks, and if you go to codingblocks.net you can find all our other dillies at the top of the page.
With that, I'm Joe Zack.
I'm Michael Outlaw.
And I am Alan Underwood.
This episode is sponsored by Retool. Stop wrestling with UI libraries, hacking together data sources, and figuring out access controls, and instead start shipping apps that move your business forward.
Lost you guys.
Okay.
All right.
So I guess this episode,
we are going to continue on with the site reliability engineering book that we
started with from Google.
And this one is going to be all about embracing risk.
But before we do that, we want to talk about our news and some of the reviews that we got. So, Outlaw, Reader of Names.
Is that my new job title?
That's right. That's like the Lord of the Rings style.
I wasn't prepared for that. But yeah, I would be awful at that. That would be like the worst job title ever.
So fortunately, this one works out to where it's not so bad.
So thank you to Richard Hopkins and JR for their new reviews.
Both very good.
Thank you very much. All right. So before I get into this next one I have in there, I do have to mention our Slack channel. I love our... it's not a Slack channel, it's our Slack group, right? With lots of channels. Well, there's one channel in there... there's a few of us that are a part of the whiskey channel. And I have to mention this one that my wife got me. I think MicroG, you will like this. Devin, you'll probably like it. Sean. Garrison Brothers. My wife bought me one called Honeydew. And we all had a conversation in the past where people like Four Roses, and I was like, you've got to take a sugar cube, melt it up, and put it in there, and it's amazing. That's what this tastes like. It is so good. So I need to post a picture up in there, but again, my wife got it for me, so I thought I would bring it up. It's a good reason to go up to Slack anyways and get involved in just an amazing group of people up there. So if you haven't already joined the group, go to codingblocks.net/slack, and that's it for that. Now, hey, before the next thing, while we're
talking about Slack, go ahead and mention we have an episode discussion channel, and an SRE who worked at Google during some of the period that we talked about was in the channel, sharing a lot of really great notes and tips and corrections, and just really great perspective on everything. And we had just really great discussion in general after the last episode. So you should pop in there and see what that SRE has to say.
That's awesome. All right, now for a bit of sad news that we got.
Jim Humelsine has been just amazing and has shared with us over the past probably three years, I want to say.
Yeah, maybe. Three years only gets you to the beginning of COVID. Good God, it's crazy.
So he shared with us a long time ago that if you joined the ACM, then you got the entire O'Reilly library of books and audiobooks and all kinds of stuff that came along with it for 99 bucks a year. Sadly, both Outlaw and I received an email this past week saying that basically O'Reilly didn't want to renew their partnership with the ACM.
And so that feature is going away, sadly.
So if you signed up for the ACM like we did so that you could partake of all of it, you're going to lose a big chunk of it at the end of June of this year.
So very sad.
Don't know what to do about it.
That's the thing. I get that they couldn't get whatever contracts set up between them and O'Reilly. I just hated that, like, midway through my membership year, I'm losing benefits. I would have felt a little bit differently about it if at least for the current year you got all that benefit, and then they were saying, hey, next year you won't get that if you decide to renew, right? But midway through, after you've already paid whatever you paid for the subscription, they're like, hey, yeah, it goes away. It stinks.
I mean, there's still a lot of good content out there. It's hard to say whether or not I'll keep mine. I don't know. Like the deal, right? Yeah, maybe not a big one, but I don't know. It does make me sad. But at any rate, just be aware of that. So if you have listened to a bunch of the past episodes where we said, hey, go get an ACM membership, and this was the reason why you were going to do it, you know, you might want to look elsewhere. So moving on from there, I guess let's go ahead and dive into this. Yeah.
Let's talk about embracing risk as it relates to site reliability.
So, man, I actually love the opening of this thing, right?
Like they were talking about, you know, we all use Google.
Everybody probably on the planet at this point,
if you're in a country that has Google, probably uses Google, right?
And you probably assume that they aim for 100% uptime reliability on everything, right?
No, not at all.
Because this is what's really interesting. You'd think, well, increasing reliability has always got to be better for the service, right? Like it has to be. No, it's super expensive, super cost prohibitive to add one more nine of reliability, right? I think Outlaw explained it in the past. When we're talking about nines of reliability, you usually start from 99% uptime, and then every nine after the decimal is another nine of reliability. So one nine after the decimal is 99.9; two nines and you're at 99.99, et cetera. Well, every one of those decimal points that you add increases the cost. And they say it not only increases the cost, sometimes it can go up more than a hundred times the cost just for getting one more nine of reliability.
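To put rough numbers to that point, here's a small illustrative sketch (not from the episode; the targets are just example values) showing how each additional nine cuts the allowed failure fraction by a factor of ten, which is part of why every extra nine gets so much more expensive to engineer:

```python
# Illustrative only: each additional nine divides the allowed failure
# fraction by 10, so the engineering bar rises sharply with every nine.
targets = [0.99, 0.999, 0.9999, 0.99999]

for availability in targets:
    allowed_failure_fraction = 1 - availability
    print(f"{availability:.3%} uptime -> {allowed_failure_fraction:.5%} may fail")
```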
Not only that, but the hoops that you might have to go through to get that additional bit of reliability
might actually impede the user experience.
So, you know, you have that.
And then even on the upside, even if it doesn't impede the user experience,
let's assume it doesn't,
like, are the users even going to notice it?
Right?
Yeah, I think, I don't know if you have this in the notes anywhere,
but they actually said that internet service in general has between a 0.1 and 1 percent failure rate.
So just from that alone, you may get blamed for it, you may not, but that's the reliability the average internet user already lives with.
Yeah. So one of the things that they were basically saying is when you are focused solely on increasing reliability, that means that you're not able to iterate on the features that you want to
add to the product, right? Like, customers might want... take Gmail. If you were on that way back in the day, there's way more features in Gmail now than there were when it first started. And part of that is because they spend time developing features and not just trying to keep the thing up 100% of the time.
Oh, yeah. You know what? We didn't say at the start.
We are skipping Chapter 2. We're not talking about Chapter 2.
We've moved on to Chapter 3.
A way to talk about what we weren't going to talk about.
What we're not going to talk about?
Oh. Yeah, apparently not.
Nothing we're going to talk about. It's just that Chapter 2
is a lot about Google services.
It kind of sets some stages for some stuff.
And it talked about how they did things internally, which was interesting and good.
And you should read it, but it's just not really podcast material.
It was a super interesting chapter.
It's just not necessarily a lot of takeaway lessons that you can learn from, because it's more about what they do and how they do it. The biggest takeaway that I had from it was from a terminology perspective, as it relates to the word server, for example. Like, how many times have you walked into... well, I mean, I'm about to say server room, right? Or you see some computers on racks and people are like, oh, that server over there does blah, blah, blah. In Google terminology, they would never call it that. They call those machines. And a server is the software that might serve up something like Gmail or a webpage or a database query or whatever.
Yep. So server was, yeah, it was software.
All right. So now we're done not talking about Chapter 2.
Totally not going to talk about it.
Not going to talk about it.
So, Jay-Z hit on the fact that the reliability of the systems that are actually using these other systems is usually lower. And so their whole point with that is, because those systems aren't 100% reliable, the chances of you even noticing something on Google services not being reliable are really low, because you might chalk it up to your cell phone service or your internet connection being slow or whatever. So it's not even that important, because nothing's perfect is what it boils down to.
And so what they say is, really, what SREs are trying to do is balance the risk of unavailability with innovation and features, right? You don't want to stop innovating, you don't want to stop releasing features, but you also need to make sure the stuff runs. And so it's a fine line, right? And that's really what they're trying to do in this. When they talk about embracing risk, it's just accepting that it's a reality. It's a fact of life. It's going to happen. And you need to accept that and learn how to manage it, which is a great segue for the next section, called Managing Risk.
Yep.
And they make the most obvious statement here, right?
When you have an unstable system, it diminishes user confidence, right?
Like that's, as a software developer, you know that, right?
Like if you have to keep going back to your customer and being like, yeah, I know this isn't working.
We'll get this fixed.
Next day, something else isn't working.
We'll get it fixed.
Then eventually they're going to be like, you know, do we just need to use another piece of software?
Because like this is this is getting ridiculous.
And you don't want to be in that in that mode.
It doesn't even have to be your thing that you're offering to your customers. It could be even internal things. How many times have you had something as simple as a set of unit tests where you're like, oh, those always return an error, so we just ignore those? And then over time you'll see people just ignore all the unit tests and won't even run them, because, oh, I always got some errors on it. And it's like, yeah, that's why it's important to fix these things as they go. Otherwise people are going to lose confidence in it and not rely on it.
Yep.
It doesn't take a lot to lose confidence, you know. Something could fail a small percentage of the time and it feels like it happens all the time, especially if it fails when you need it.
Yeah.
That's a super important point, right? Like it doesn't have to be all the time. Yeah, it's true.
So one of the things that they point out here is the cost does not scale with improvements to
reliability. So basically what they mean is as you're increasing the reliability of your system,
the costs go way up. Like, it's not like, hey, I made this a little bit more reliable.
It just costs a little bit more money.
No, I made it a little bit more reliable, and it cost me 10 times the amount of money.
Go ahead.
Yeah, well, I mean, we'll get into it later, but I mean, they actually get into the math
of like, yeah, I mean, we just saw stuff we kind of hinted at, too, in the last episode.
You know, we're like quantifying that and measuring it.
You know, like, hey, is it even worth your time to deal with that?
Right.
And so they have what they call their two dimensions of costs in here.
There's the cost of redundancy and compute resources.
Right.
So the computer, the CPU that you're using, and then the opportunity cost, which we already talked about a little bit, which is you're basically trading features
that you could be developing for reliability, right?
So you only got so many developers,
you only got so much money,
you got to choose where you're going to put that.
Even if you try to say like,
oh, we'll just hire somebody else to do that thing,
that's still money that you're paying somebody else
to do that instead of paying somebody
to develop new features.
Yep. Yeah, I was really happy with how they phrased things and thought about the structure, because they really aligned the SREs with the business interests and the business costs. And I think that's a really good thing, especially for such an infrastructure and kind of DevOps-oriented role. It's kind of odd to pair those together, things you don't usually think about in business, you know, goals and expenses and profit, aligned with infrastructure. But here we are.
Yeah, I think this chapter specifically got into knowing what the value was going to be, right? Was it this chapter? Yeah. Okay, so they actually hit on this, right? What he was talking about is that the SRE's goals sort of align with the business's goals. So if the business goal is 99.9% uptime, it's not like they go for another nine of reliability; they've got to hit that 99.9, we'll say, right? They try to treat that business goal as their minimum and their maximum, because that's how they can maximize their effectiveness and what they're going to do. Right. If they spend too much time trying to go past that 99.9%, then they're wasting time, right? They just need to meet the business goals. Another word for that, you can call it gold plating. That's true. Yeah.
If you're going too far. Yeah, absolutely. All right. So the next thing that we have is service risk, measuring service risk. Is there a chapter on this? I haven't read ahead yet, but I kept thinking, well, what if you work on internal tools? This whole section is basically about identifying an objective metric for a property of the system to optimize. Google, for example, they talk about how they actually focus on measuring availability as a success rate: the total number of successful requests over the total number of requests. Because something like counting downtime in terms of minutes and hours doesn't really make sense when you're this global service that, you know, never really goes down totally, right?
So this is a better metric for them. And I just kept thinking, they were tying these services to dollars earned and dollars cost, and coming up with ways to really measure that. I just keep thinking, if you work on internal tools or something, it's really hard to map those to dollars sometimes. Or if you have a business where, you know, if your site goes down, you don't necessarily lose any money because your customers are all, let's say, subscription-based or something.
Whereas Amazon.com, if Amazon.com is not available or you can't check out or something, it's very easy to say, well, this is how much we made yesterday at this time.
So that's how much we lost.
So they did talk about that a little bit in this chapter because they went through exactly what you're talking about.
When you have a customer facing type thing or something that is revenue tied, then it's
real easy to measure what you're doing, right?
Like X amount of uptime equals X amount of dollars, right?
There's that type of thing.
But then they said that they do have internal systems that maybe a bunch of different teams
rely on, but that you can't tie directly to a target revenue or something. But they do have SLOs for that as well, because it does impact so many other ones. So there are owners of that, and then they have to negotiate what their SLOs will be within the organization, and they have to measure it, right? Because it does impact everybody.
It's hard to come up with numbers. If we say, well, it's all internal customers, and so if we go down we just annoy people, then, well, okay, let's just take it down all the time if the cost is zero, right? If the benefit is zero dollars, why do you even have this at all? And so obviously that's not the number, but it's hard to find a number. So maybe you take people's salaries and kind of do some math, like how much time does it save, you know? So I don't know what the right answer is.
It's just tough.
I mean, I don't recall that to the level of what you're getting at. I don't think I recall seeing that so far in what I've read, but they do call out that that's where the business owner for whatever that thing is comes in; part of that person's responsibility would be to know what that value is. And they didn't distinguish between an external versus an internal type of service. Yeah. Because even an internal service has some external value. Sorry to interrupt you, Alan, but let's say, for example, you were running the help desk for Google employees, just an internal help desk for Google, right? Where they can say, hey, there's something wrong with my machine, I need service or whatever, blah blah blah. That developer, whoever has that downtime because of that need, that's costing the company money if they're not productive.
So just because it's internal doesn't mean it doesn't have a value.
Yeah, it's just harder to calculate.
Let's say you're working on an HR system.
An HR system goes down, and because of that,
people can't change their tax withholdings, marital status, um, you know, new employees, hiring stuff, stuff goes
down, but the business as a whole keeps going. Customers don't know at all. So, you know,
presumably you need less nines on an HR system than you do for like a 24 seven storefront or
something. Um, but trying to come up with a dollar amount for that and trying to figure out how many
nines you should have, like how long you can go without it becoming a big problem.
It's just kind of hard to,
to come up with a number.
I think.
Yeah, I would agree. Internal money is always harder to measure, but there's got to be some way to reflect the business value, right? It's almost like, hey, how many nines can you give me for the HR system? It's like, well, how much money do you have to spend? How much time do you have for me to spend on this? That's how many nines you get.
Right.
So, I mean, going to all of this, this actually hits on the next point, which is what you have to do when you're trying to measure the service risk is you have to identify an objective metric to use.
Because only by identifying that metric, can you start to
measure and optimize for that thing? Right? So like you guys, I know this has always driven me
crazy. Like somebody will come to you and be like, Hey, the system's broken. And it's like,
well, what do you mean? Like, what do you mean the system's broken? What's broken?
Can you not log in? Can you not click on a list? Can you like, please explain because there's a vast difference between,
between not knowing exactly what you're looking for and some vague thing.
Right.
You know what you just reminded me of?
We've talked about this before,
but you remember that website,
the website is down.
Yeah.
I don't remember that one.
Really?
Come on,
Alan,
man,
it's been a long time,
but that's what it reminds me of though
it was like when the guy calls in and the website's down and he has the guy the sysadmin reboot it
but it turns out like there wasn't anything wrong with the website it was you know his computer was
the issue that's awesome yeah i mean that's exactly what it is. Maybe it's wrong. It's working for me. I keep typing in the wrong password and I can't log in.
Right?
Yeah.
So by having this metric, this one metric that you're going to focus on, you can measure the improvements and any degradations that happen over time, right?
That's important because just because you're measuring doesn't mean things are all going to be glorious, right? Yeah, that's a good point too. The HR system, we've said it's
hard to tie back to a real cost. Well, just start with anything. And then if you find out it's inaccurate or a problem... it's not worth spending days or weeks coming up with a cost center there. And it's also not worth not doing it because you can't get it perfect.
You can't say, like, let's just not worry about the HR system because it's too hard
to come up with a number.
Just start somewhere and adapt.
Yeah, so to that point, it may not be a dollar amount, right?
It might just be, hey, what is the uptime of the page where somebody can change their
surname, right?
Like that might be the metric that they use to report on. And then that way they can find out if they need to do anything to fix
things later. So one of the things that is really interesting that they called out here is Google
focuses on unplanned downtime. So it's really important to know the difference, right? Like
if you've, if you've worked for a software company for long enough, there's going to be planned maintenance windows, right? Where, Hey, we need
to upgrade the OS on these servers. We plan on having them down between one and 2 AM on these
days, right? That's planned. That's okay. You did it because you knew you're going to do it.
What's bad is when the system goes down in the middle of the day, because just something went sideways and you don't know what it is. So that's unplanned downtime,
right? And Jay-Z already hit on the fact that Google doesn't use time as a thing because
you mentioned it briefly. They're distributed, right? They have servers all around the world. So while Google search might be up here in Atlanta, it may be down there in Florida where Jay-Z is. So it's hard to measure uptime when you have sporadic services all over the world. So instead of focusing on that, they do what he said earlier, which is the number of requests.
Have you read the article, "Your nines are not my nines"?
No.
Rachel by the Bay. Oh man. She has a wonderful blog. It's amazing to me how often this blog comes up on Hacker News. This is an article from 2019, basically talking about how just because the cloud service has a green check mark doesn't mean that your business is operating well, because different companies count things different ways. Just because they're what they're calling functional doesn't mean that something's not broken for you, because that number usually reflects the service as a whole. So just because S3 isn't down doesn't mean that the rack that your stuff is on isn't down.
Right. That's a really good
point. Hey, I'm assuming you guys have looked at the notes here. If you haven't, do either one of you know, for 99.99% uptime, how much downtime you're allowed to have in a year, without looking?
Well, I mean, I have it right in front of my face.
So you looked. All right. It was a shocking number to me. I never really thought about it. It's 52.56 minutes a year. So four nines of reliability means you're allowed to be down less than an hour in an entire year. That's really good uptime, man. Yeah.
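For reference, that 52.56 figure falls straight out of the arithmetic; here's a quick sketch (illustrative only, not from the show notes) of allowed downtime per year for a few availability targets:

```python
# Illustrative only: allowed downtime per year for a few availability
# targets. 99.99% (four nines) works out to roughly 52.56 minutes.
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes

for availability in (0.99, 0.999, 0.9999, 0.99999):
    allowed_minutes = MINUTES_PER_YEAR * (1 - availability)
    print(f"{availability:.3%} uptime -> {allowed_minutes:,.2f} minutes of downtime per year")
```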
I was curious, so I went back and looked for that video, The Website is Down. We talked about it in episode 122, Designing Data-Intensive Applications; coincidentally, the subtopic was maintainability.
Oh, nice. Hey, in fairness, that was about two years ago. How am I supposed to remember that?
It was December of '19.
Yeah, you know.
Yeah, it's been a minute.
Remember how good life was back then?
December of 19?
We didn't know anything about the pandemic or anything.
We would go out all the time to restaurants.
No way we thought about it.
Life was grand.
I was heading to London in two months.
You wanted to travel? You're like, let's get on a plane. It was a big deal, man. I miss it. You would stand in line right up against people, you know, while you're waiting to get on the roller coaster.
Yeah, you're bumping into each other in line for the roller coaster and everything, and that's great.
It's great. Yeah, I love having people just breathing on you. It was just such a nice feeling.
Yeah. Anywhere, you know, drugstore or whatever. We didn't know how good we had it.
Well, you know, you've noticed that Joe Zack no longer invites people to kick his shins. He doesn't like people to touch him anymore.
Yeah, that's right. Yeah. It's done.
Oh, I'm going to... hey, you know, I just got an Elastic meetup.
Yeah, I just... I have some... I'm owed some kicks.
Wait, wait.
You said something about Orlando?
What's going on?
Orlando has an Elastic meetup now.
Oh, really?
Are you actually meeting?
Maybe I'll be there.
I don't know.
There's nothing scheduled yet.
It just started up.
I'm excited about it.
It's fictional.
It's a fictional meetup is what you're saying.
We'll see.
We'll see.
I guess one of the dev rels just moved here.
Okay.
So it may just end up being virtual or who knows how it's going to go.
It's cool.
I joined.
All right.
Well, let us know.
So back to availability.
So, less than 52, let's just call it 52 and a half minutes, if you're going to measure it based on time. So if your four nines of availability is based on time, you can afford to be down roughly 52 and a half minutes in an entire year before it becomes a problem and you're no longer meeting your objective.
Yeah, that's a crazy number to me.
So, about 13 minutes a quarter.
Right.
Another way to look at that. It's, again, insane. Well, so Jay-Z hit on it. Google does it by request, right? So if they have a service like an S3 service, which Google doesn't have, but like GCS, right? If you hit their GCS API, they may be aiming for 99.99% reliability on that. And here's where things get interesting when you're doing it at a rate level. If they have 2.5 million requests a day, let's say, for GCS, then 250 of those can fail.
When you look at it like that, that's a little bit more palatable, right? That's sort of easier to swallow. And for them, it makes a lot more sense because, again, they have this service hosted across the world in multiple data centers and all that kind of stuff. So it's easier for them to measure that type of reliability versus the uptime thing.
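A quick sketch of that request-based view (the request count and failure count here are hypothetical, echoing the 2.5 million example above):

```python
# Hypothetical numbers: availability measured as a success rate over
# requests rather than as uptime minutes.
def availability(successful: int, total: int) -> float:
    return successful / total

requests_per_day = 2_500_000
target = 0.9999  # four nines

# Error budget expressed in requests: how many can fail and still hit the target.
allowed_failures = requests_per_day * (1 - target)
print(f"allowed failed requests per day: {allowed_failures:.0f}")  # ~250

# Measured availability for a hypothetical day with 240 failed requests.
print(f"measured availability: {availability(requests_per_day - 240, requests_per_day):.6f}")
```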
That's still not a lot of failures, though.
It's not. It's really not. And that's only four nines of availability so far, so, you know.
Yeah. I mean, I don't even know that I've seen anybody say that they go beyond three nines on most things. Oh no, there's an Amazon storage that goes beyond that, isn't there? Or is it one of the... maybe it's a Backblaze or somebody like that? Or, no, Wasabi. I think it's Wasabi. They have a... oh gosh, what's their uptime? I'm pretty sure there was one that's like six nines or something crazy.
Sound effects are dinging.
And well, oh, here it is.
Durability is what they go after.
Eleven nines.
That's crazy.
Durability.
Now, that's durability, not necessarily reliability.
So, you know, you're talking about storage.
So you're just saying, hey, we guarantee your stuff's going to be on that disk.
But yeah, I mean, depending on what your metric is, it's really insane to think about. Actually, I was going to see what the next nine of reliability, 99.999%, would be for 2.5 million requests. It looks like you can have 25 failures. That's insane to me. 25, if you're going for that many nines of reliability. So, I mean, again, my God, to get to that point, the amount of engineering effort and redundancy and failover and everything else you have to put into play for that is hyper expensive. You know what the nine after that would be, right? I mean, if anybody's following along, it's 2.5, right? You're just moving the decimal point.
Yeah, you're just moving it one more.
Yeah, man, it's it's insane.
So this is what's interesting. We're talking about this right here with these crazy numbers, but the reality is not all services should be judged the same. And they give a perfect example. So in their thing, I don't want to give away too much of the book because we want people to read it, but there's a big difference. If you're not already reading this book, you need to go to sre.google and you can get the book for free. I think it was slash books, right? I don't know.
Yeah, I think it was something like that: sre.google, and then you have slash books. Yeah. Although that's not all they have up there. Google has other books. This is just the SRE books.
Yeah.
So one of the things... yeah, I guess we could give it away and we could talk about every sentence in it, but we're not going to. Speaking of it, we are giving away a physical copy of the book if you drop a comment here.
Yeah. On, what is it, codingblocks.net slash episode 182.
That's right. All right, so what I was saying is, there's a big difference between having a sign-up form for a new customer, right? Like, you probably want that thing to work; you don't want to push away new customers. However, if we take Gmail for instance, Gmail all the time behind the scenes is checking for new messages, right?
Like if you've ever been sitting in your Gmail inbox, you'll see it pop up a new message at the top.
It's because it's, you know, polling for info.
If some of those back-end service calls fail, that's not quite as big of a deal as that new user that's trying to sign up
for the site.
Right.
Or for some subservice.
Really? Did that throw you off? He's got a pick stuck on the middle of his forehead. I didn't mean a guitar pick. Yeah, I didn't mean to completely derail the conversation.
No, Jay-Z. He's got on his Saint Patty's Day version.
So, yeah, all right. See, I actually look at you guys while we're talking.
It's funny how quickly that derailed. I totally didn't intend for it. The point being, the uptime of this conversation was really important, and I totally blew out our SLO on that. You know, so, yeah. Jay-Z's got an alien pick now. That's amazing.
All right, so the other thing that they say is Google checks this success rate for non-customer-facing systems as well. This goes back to what Jay-Z was saying earlier about, hey, what about systems that are only internal, right, that don't matter to the customer? Google sets quarterly availability targets and may track these weekly or daily.
I'm so sorry. We are jerks.
Yeah, you guys are. You're just sticking stuff to our heads now at this point. This is what happens when it gets super late at night and you've already had long days, right? So, at any rate, by tracking these things daily and weekly, that means you could
address these issues quicker, right?
Like you don't have to wait a whole quarter to be like, yeah, things kind of went south this time.
We should focus on it next go around, right?
But there's an implicit requirement there that requires that you have strong observability.
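As a rough sketch of what that daily tracking against a quarterly target might look like (the target and per-day counts below are made up for illustration):

```python
# Illustrative only: roll up daily success/total request counts and compare
# quarter-to-date availability against the quarterly target, so a bad trend
# surfaces early instead of at the end of the quarter.
QUARTER_TARGET = 0.999

daily_counts = [  # (successful, total) requests per day, made-up numbers
    (998_900, 1_000_000),
    (999_950, 1_000_000),
    (999_400, 1_000_000),
]

successes = sum(s for s, _ in daily_counts)
total = sum(t for _, t in daily_counts)
quarter_to_date = successes / total

if quarter_to_date < QUARTER_TARGET:
    print(f"Quarter-to-date availability {quarter_to_date:.5f} is below target {QUARTER_TARGET}")
else:
    print(f"On track: {quarter_to_date:.5f} vs target {QUARTER_TARGET}")
```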
Yeah, there's got to be some metrics.
And that's always the problem with it, right? The problems that you're looking for, that you're watching for, they tend to not happen, because you're watching for them and you take care of them, right? It's always those problems where you're like, I didn't even know that was a problem, we weren't looking for that; that's the thing that always bites you. Right.
I mean, you know what's funny is, when we covered those DevOps books that we did back in the
day that tie really nicely into this, that's probably the one takeaway that I've truly
gotten on board with is having metrics for things. Measurability within your system is so key to actually knowing what the heck's
going on that I want to put metrics on everything, which isn't quite possible, but it really does help you overall when you're developing software.
It comes in handy for a lot of debugging too. Just times when you're like, well, let me go see, maybe I have something. Or you don't even have a question; you just go look at a dashboard and you're like, huh, that's weird, this number is way higher than it was yesterday. And it just goes on from there. It's great.
I mean, it's funny.
Just an example, one of the things that I got shot at with this week was, hey, something failed over here. And because we didn't have metrics around certain things, I had to go digging through logs, right? And I had to start aligning timestamps with when things happened in the system over here and over here. If we'd had metrics to say, hey, what was the latency between when this thing hit this particular service and when it hit this service, I could have looked at it and said, oh, this thing went backwards in time; that's not even possible. There's a problem in the pipeline here, right?
And it's that kind of stuff that once you start getting it in place,
you really miss it when it's not there.
So, all right.
So the next thing, risk tolerance services.
Somebody else want to take this?
I'm tired of talking.
My throat hurts.
The idea here is, you know, we talked about that. SRE should work directly with the business to define goals that can be engineered.
And sometimes it can be difficult because measuring consumer services is clearly definable.
So this is kind of the example we talked about.
And the idea of the SRE working with the business is really cool, as we kind of mentioned.
And I was kind of thinking I'm still obsessed with the HR example.
So I kind of thought if I go to the HR director and said, hey, would it bother you if the system was down one hour a day?
You just take a long lunch or whatever.
And, you know, in my head, that sounded like a reasonable thing, you know, like seven out of eight hours.
And but then the HR director might say, well, are you kidding me?
Here's another way to think about it.
Imagine if one percent of our employees couldn't do something they need to and sent me an email. There's no way I
could deal with that traffic. It needs to be far less. And that's a much more reasonable number,
I think. And I think that's the kind of conversation that might happen between
two people. And that's the kind of conversations I think you'd be doing.
Yeah. I mean, there was one extra part though that you skipped over, which was that it can be difficult to do the measuring because consumer services can be clearly definable, but infrastructure services may not have a direct owner.
So going back to your HR example, those internal ones can sometimes be a little bit more difficult to deal with.
Yep, absolutely.
Also, just identifying the risk tolerance of consumer services.
Sometimes a service will have its own dedicated team, which is really nice.
And I mean, this is basically what you just said.
Sometimes there is no owning team.
And I was thinking like Jenkins might be an example here.
Like if a build isn't working, is it, you know, who's responsible for it?
Maybe it's a problem with the stage of the build.
Maybe it's, you know, a problem with one of the services that's, you know, you're pulling from an artifact or something.
Maybe it's a problem provisioning a build agent.
But someone has to take the initiative to go look and start.
And, you know, I would have mentioned it might be three different teams there.
Like, who starts with it?
Yeah, I mean, here's an example.
Imagine you live in a world where multiple teams are contributing to an ultimate site.
And let's pretend this was in, like, a Kubernetes or some kind of, like, you know, shared space like that, right?
And maybe one of the services that is being deployed in this environment is a Redis. Now, all the teams are using this thing, right? Who's responsible for it when it goes down, right? And an even better example that came up, if we speak strictly to Kubernetes: what happens when etcd gets full, right? Yeah, like everybody in the cluster is relying on that thing, right?
And now nobody can write like secrets or config maps or stuff, you know, because like, you know, there's something like more overarching.
So that's the point is that sometimes like deciding who is that owner can be a little bit more difficult.
But what they say is a lot of times, if there is no clear-cut
owner, the engineers will end up taking ownership of it and then defining the reliability requirements
themselves. Because, I mean, think about it, right? I know Jay-Z, I know Outlaw, we're all
sort of like this. If you encounter the same problem enough times, like I'm not manually
handling it the third time, right?
I'm, I'm going and finding a way to write something to take care of it. So I don't have to deal with
it anymore. And I think that's ultimately what's, what ends up happening here, right? Like if,
if developers look at it and say, man, it's costing me two days a week to keep Redis going,
we need to figure out how to fix that. Then they're going to start figuring out how can we make it to where we're not having to even touch this thing.
Right. And so they start defining what they want the reliability of Redis to be so that they never have to go look at it again.
This also goes back to the strong culture, though, within Google, where the people are kind of empowered to do that and to make those decisions. So you have to have buy-in from the management, which was something that we had mentioned in the previous episode as it relates to this topic. This can't just happen because you alone as a developer on your team want it to happen, you know? And so because they have that kind of culture where this is a thing, then they can do that.
Yep.
And so, some vectors in assessing the risk tolerance: what level of availability is needed?
We've talked about that.
Do different failures have different effects on the service?
Redis is a good example because maybe it just slows things down or maybe it leads to a database failure.
So you got to figure out what's going to happen there.
Use the service cost to help identify where on the risk continuum it belongs.
Okay, so hold on.
We need to pause here because this was in the book and I didn't really go into the notes and describe this very well. They call this risk continuum. It's sort of like this line, like where does this thing fall?
Right. And, and you're trying to align it with, with the objectives versus the reliability stuff.
Right. And so this risk continuum, they're basically saying, Hey, if you can align this
thing on the line using cost, then you could sort of help figure out where it should belong.
Yeah, I think we talked a little bit about DREAD scores in terms of security many years ago. DREAD was an acronym, with things like discoverability and reproducibility. It basically had to do with classifying, coming up with a single number for a vulnerability. So you could say, well, it's a five on discoverability because someone would have to have a lot of knowledge, but the reproducibility once you know about it is really high, it's really easy to reproduce, so that would factor in. In the end, you kind of average those things up and get a single score. So that's what I kind of imagine with there being a continuum here.
Well, I mean, it was kind of a weird one, though, because are they literally saying,
hey, it cost us, let's say, a million dollars to build this service, so on the continuum, if it's completely down, then that's a million dollars that we lost to build that service, right? Like, that's what the service cost us.
So what I was imagining is, you know,
the HR example I keep coming back to
is saying that that's hard to come up
with some good metrics and good numbers around.
But one way to look at it is to say,
well, how much does HR system cost us?
It costs us $3,000 a day.
So if it's down for an hour,
it costs us this much money.
So it's kind of a way of saying the service,
this is how much money we just kind of burned on having stuff running that wasn't
helping anybody. Whereas compare that to like Google search, right? Where it's like, well,
that costs us a million dollars a day to run. And so an hour there is much more valuable.
So it costs more. So it's just kind of saying like a way of saying like,
this service costs this much money to run.
So if it goes down, it's that much more important.
Just because of how much it costs us.
It's just one factor.
It's not the only one by far.
But if you're having a hard time coming up with things, you know, it's something to consider.
Yeah.
So in fairness, this one is identifying the risk tolerance of consumer services.
So I think it might be the cost to the consumer.
So, you know, if it's a super high expensive thing, like, I don't know, big table, then maybe you want to make sure that it's more reliable depending on what it is.
So I think that's where they were going with this one.
It makes sense.
So if I'm paying for Kubernetes and a whole bunch of nodes and it's down, and I just paid you a bunch of money for this thing, then that stinks. Whereas Netflix, if Netflix goes down for an hour, it's like, okay, I paid $8 a month for it or whatever, so it's kind of okay.
So said another way, you can create a Gmail account for free and check your Gmail. Yeah. So if it's down, eh, not such a big deal, right? But if I'm a big customer of, let's say specifically, Google Cloud, and now Google Cloud goes down, then that's a much bigger service cost. And so on the risk continuum, they are on an opposite extreme from the free Gmail account.
Totally. Yes, I think they use the example of Google Apps, business apps. They do, later on. Yep.
So, yeah, next up we've got the target level of availability. This one's kind of interesting. What do the users expect? Is the service directly linked to any revenue? So this would be, are you paying for a Gmail account or something, right? Like if you're a business customer? Is it a free or a paid service? Is there a competing service, and what is their level of service, right? So somebody like Google may take a look at their GCS product and say, hey, what is Azure Blob Storage's reliability SLO, or what is AWS's S3 SLO, right? That's probably something that they all shop around to make sure that they're being competitive there. What's the target market? Is it consumers, is it enterprises, whatever. And then this is where, Jay-Z, if you want to jump into this one, you just kind of brought it up a second ago: the apps.
Yeah. And so, kind of the idea that if I'm paying for something, I expect a better service. And there's also, even amongst the Google Docs apps... like if you think about maybe a PowerPoint presentation, or, you know, whatever they call it, Slides, it might be more valuable, because the chances of it going down while someone's actually about to do a presentation, the severity might be worse compared to email, where you just refresh. And remember, Google measures stuff in terms of percentage of request failure, so we're not really talking about total outages, we're talking about the number of requests failing. But it can be really scary if you can't bring your presentation up right before a big presentation, whereas email, you know, so what?
And so that's the kind of thing.
Same with Maps.
Google Maps is a free service.
But if I can't get directions when I need it, if I'm running late and I can't get the service I've been relying on,
that's going to sting me more as a customer than maybe something else like email.
It was interesting, though, when they talked about their apps, because you know, when companies buy into the like G suite of products,
their company business is sort of running on Google's infrastructure at that point. Right.
And so they prioritize that kind of stuff super high because they know that if you're, um, you
know, like you said, your slide decks are down or if your internal email or internal calendars are down, like that can actually cost a business a lot of time and money.
Yeah.
And sometimes it's more than just the time, too. Like, you know, you're doing a presentation to the board of directors and you can't bring up your stuff.
Like that's a big problem.
You may not get the opportunity again.
Right. I mean, they've got to be pretty good at it, right? Because would you care to take a guess at how many times Google has been down? Google the... I mean, Google services.
Let's say Google services. How many times do you think Google services have gone down?
What does Google services mean?
Like Gmail or YouTube or Drive, whatever.
Any of the Google products.
Like down, down, like can't reach it down?
Down, down.
I'm going to say single digits.
It's probably ridiculous.
Are we talking all time?
Yeah.
All time. Give me a number, man.
20.
20? Okay. Do you realize Google was formed in the late '90s, right? 20 is an incredibly low number, right? Agreed. Jay-Z went even further extreme, to cut that in half and say somewhere in single digits.
He said nine.
Are you ready for this number?
I don't even, I'm not going to believe it.
It's four.
Wow.
Four.
That's incredible.
There was an outage in October of 2018 that took out YouTube for a period of time.
And then the other three were all in 2020.
Oh, wow.
There was a six hour one that took out Gmail, Google Drive, Google Docs, Google Meet and
Google Voice.
Then that was August of 2020.
In November of 2020, there was an outage that took out YouTube and Google TV.
And then in December of 2020,
there was an outage that took out Gmail,
YouTube,
Google drive,
Google docs,
Google calendar,
and Google play.
And that one... I forget how long that one was. That one was for like 40 minutes or so.
Wow.
Four outages.
They blew those budgets for like years,
right?
That's impressive.
There's a Wikipedia page that lists all four of the outages.
It is only four.
Yeah, that's impressive.
When you are is older than that.
But in fairness, in fairness, they got a little bit of compute power, right?
But the point is, is that like what we're talking about here with the SRE, right?
It is no joke.
I mean, like they obviously know what they're doing.
Not only did they write the book, but, I mean, they have the...
They practice it.
They have... the proof is in the pudding, right?
Like you can see here, they only had the four outages.
Hey, so there was actually a cool thing in here that they're talking about.
Like you mentioned, YouTube went down and all that. When Google first purchased YouTube, they didn't care
as much about the reliability side of things. So if you went to hit a video on occasion and it was
down or whatever, they were okay with that because they were way more focused on adding more features
to the platform.
I mean, you've got to imagine, they wanted to get some ads on there real quick because they knew that was going to be a moneymaker for them.
So they were willing to take the hit on the reliability side. So it's interesting that even within their own company, they look at things and they say, hey, yeah, this is okay.
And we're going to iterate on this as fast as we can,
and then we'll come back to it later.
Yeah.
I just looked it up to see; the one that I remembered was in 2013.
It was down for five minutes and they say that global internet traffic went
down 40%.
Wow.
Wow.
You know, it was like for Google search going down.
That's awesome.
I remember when Amazon went down.
When we worked at Amazon, I remember hitting the page and going, wait, is my internet down?
No?
NBA.com's loading.
Remember that S3?
There was an S3 outage when someone had a misconfigured router or something around the holidays.
Yep. Woo! Fun times. Bad things.
So, when we talk about failures, what kind of failures are we talking about?
There's a little section here on the shapes of errors, and they talk about what's worse: a constant trickle of errors throughout the day, or a full site outage that happens for a short amount of time? The answer, of course, is that it depends.
Some services, you just go to lunch and try back later.
And there's other times when it's really important that people keep trying.
Like the map example, if you're trying to figure out how to meet somebody in 30 minutes and it takes about 30 minutes to get there and you can't pull up the restaurant, that's
something that you want.
A trickle would probably be more desirable.
You don't want that to be gone for two hours.
You'd rather have it just try a few times and have it eventually work
Yeah, but they did share one that was really bad, where you would probably rather have an outage: if there was a potential bug in some service that you had out there that could allow people to get private information that they shouldn't have access to, then they were like, you know what, it'd be worth having the unplanned outage, taking the thing down, so that nobody could get that private data.
Right, right. Yeah, private data was a good example of something where, yeah, it's fine to be down. And they'd always prioritize security over... I think it's like security number one, reliability number two.
Yeah.
Oh, there was another interesting one that they brought up, too, is their ads.
So they said that typically their ads were accessed during working hours, right?
Which is not surprising.
If you have marketers or whatever, they're hitting that stuff between eight and five, nine and six, whatever. And so they were okay with taking the service down later at night, when they knew there weren't going to be that many people on or impacted at that time. So they even take a look at what they're doing and say, hey, you know, we don't necessarily have to have 100% uptime. Depending on the service and depending on the usage patterns, we're totally fine with setting up
planned timed outages, right?
Yep.
And as far as cost goes, it ranks very highly on the deciding factors: how much money you make, how much money you cost. So, just a couple of questions you want to ask to help determine the cost-reliability kind of ratio: if you built in one more nine of reliability, how much would revenue go up? Essentially, this is a recurring theme in this book. It keeps coming up. It could be very expensive. This is where that business owner knowing the value of whatever that business is, you know, for the company, comes into play.
Yep. And does that additional revenue offset the actual cost of the reliability?
And they have an example of basically if, you know, getting that extra nine costs you $900 a year, but it brought in $1,000 a year.
Yeah. You know, in that case, it's a good example. It's a hundred bucks, so, yeah, sure, go ahead. But you can imagine it could definitely be more expensive to get that extra nine than it's ever going to be worth.
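Using the episode's example numbers, the decision boils down to a simple comparison (a sketch only):

```python
# The episode's example: an extra nine is only worth pursuing if the revenue
# it brings in exceeds what it costs to achieve.
cost_of_extra_nine = 900         # dollars per year to engineer the extra nine
revenue_from_extra_nine = 1_000  # dollars per year it is expected to bring in

net = revenue_from_extra_nine - cost_of_extra_nine
print(f"net ${net}/year ->", "worth it" if net > 0 else "not worth it")
```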
I mean, well, it depends. How long is it going to take me to get that nine? Now we're talking about my time, not the company's time. Right.
Yeah. I think I would have gotten failed by that interview question right there. Did not get the job.
So, other service metrics. Knowing which metrics are important and which ones aren't helps you make better decisions, of course. They mentioned the example of AdSense and Search. Search's primary metric is speed to results; we want the lowest latency possible, and of course we want the best search results up top. AdSense's primary metric was making sure that it didn't slow down the page load. So this is kind of an example where they work in tandem, they work together to make for a good user experience, but ultimately we don't care as much about AdSense being late or being slow as long as it doesn't affect the primary search. And so because we have a looser goal on AdSense, it's okay to basically pop those in later and just go ahead and show the search results first. No one has ever visited a webpage and, as they're trying to read whatever the blog or the site is, thought, you know what, I need to leave the site right now because all of the ads targeting the last guitar I looked at are not showing up in the right rail, and I'm fed up with it if I can't see it again.
That's right. Right. Like, that's never happened.
But also, people love when the ads actually pop in after you've started reading the page and they bump the text; it moves your text all over. Oh man, drives me crazy.
Hey, but you know what was really cool about this? What they explained behind the scenes on this particular one is, with AdSense, because they don't care about it loading later into the page, what they said is that reduces the cost of their infrastructure, because they don't need to have that stuff running in as many data centers and regions and whatever else. They're not trying to get sub-second craziness like they are on their search results, right? So their search has got so much compute power and is redundant all over the globe, where with the AdSense stuff they're like, ah, it doesn't matter if it loads from somewhere two states away, it's not going to kill us. And so they can save on costs, because they know the metrics that they're aiming for there, and they're not going for the fastest numbers. So it's really interesting that they use those metrics to drive decisions on how they configure their internal infrastructure and all that.
Yeah, it's funny. If you think about it, if the ad doesn't show, it's a better customer experience, right?
It's a better user experience.
And the person, you know, presumably who bought the ad didn't pay for it because it wasn't
shown and wasn't clicked.
So it's like, hey, it's a win-win.
So just don't show ads, Google.
You fixed it.
Yeah, I'm fixed.
There you go. Now, a million dollars a day... step four, profit. That's right. Yeah. Oh man. So, the last little thing here to note is just the different requirements with consumer services, typically, because they often have... yeah, sorry.
Wait, help me out here. What? Different requirements than consumer services, typically, because they are serving multiple clients.
So infrastructure services are different than consumer services.
Okay.
Because they're serving multiple clients.
Okay.
The header was part of it.
The header was part of it.
Yes.
Yes.
I didn't say that right either.
So I didn't help you
there at all. We cleared that up, we're good. Sorry, it's like, as we know, sometimes I have a hard time talking about things, like, um, Shia LaBeouf. I'm pretty sure that's not how you pronounce his name. I'm the one who's the... you know, what was my new job title?
A sayer of names or something.
What is it?
What is this book? Uh, Shia... Shia... yeah, that one is difficult. It's like, what's the difference between a poorly dressed man on a tricycle and a well
dressed man on a bicycle?
A wheel.
Oh,
Oh no.
A tire.
A tire.
That's good.
Very nice.
That's good.
Told you.
Yes.
One last one real quick.
When does a joke become a dad joke?
When Outlaw tells it?
I don't know.
I like that.
I'll take that.
All right, we'll just end it there.
No.
That's when the punchline becomes apparent.
Oh.
That's good.
Thank you, Gregory.
This episode is sponsored by Retool.
Building internal tools from scratch is slow.
It takes a lot of engineering time and resources.
So most companies are just resigned to prioritizing a select few and settling for inefficient hacks and workarounds for every other internal business process. Hey, but Retool helps developers build internal tools faster so they can focus development
time on the core product. Retool offers a complete UI component library. So building forms,
tables and workflows is as easy as drag and drop. And that's no joke. We saw that in the, you know,
the UI, we were given a demo of it.
And it is silly simple how easy you can construct these pages.
And I like easy, but more importantly, Retool connects to basically any data source, database or API.
They offer app environments, permissions and single sign on out of the box.
And they offer an escape hatch to use custom JavaScript whenever you need it.
With Retool, you can build user dashboards,
database GUIs, CRUD apps,
and any other software to speed up and simplify your work
without Googling for component libraries,
debugging dependencies, or rewriting boilerplate code.
Thousands of teams at companies like Amazon,
DoorDash, Peloton, Brex collaborate around custom-built retool apps to solve internal workflows.
And when Jay-Z was talking about all the integrations, whether it be a database or an API, he's not joking there.
You can go to retool.com slash integrations to see everything they have there.
If you want to hook up to GitHub, for example, or maybe you just have some GraphQL query that you want to connect to
our friends at Datadog. They've got integrations with them, CircleCI, PostgreSQL, my favorite
database. Or, you know, maybe you want a SQL server or, you know, we talked about Redis.
Whatever your integration is, they've got plenty of integrations there to help you out. Learn more, visit retool.com. Oh, all right, here we go. Okay, so seeing as I naturally have the late-night DJ voice right now because of pollen in the Atlanta area... Ah. So good. Pollen. Man. God.
So I am going to do the bag.
So if you have not had a chance to leave us a review and you would really like to give back to us and put a smile on our faces, we have a nice little link set up at coding blocks
dot net slash review where you can go, and we have links to leave a review on either iTunes or Audible or I don't know what else we have on there, but we have stuff. So again, we really do appreciate
it. We, we super love reading that when, when people leave nice little messages for us. So
if you wouldn't mind, please do that. And now you can even see reviews in Spotify,
which Outlaw has graceful, grace... graciously put a link up there for, all that. Gratefully, gracefully, and graciously did he do this, the sayer of names. So yes, I really think that Alan should be talking to us right now about how he fell into a ring of fire. Like I really want to hear him say it right now. I don't have the Johnny Cash... isn't that Johnny Cash? Yeah, come on, just one time, Alan, the ring of fire. There you go. Oh my God, it was better
than I thought it would be. That's not Johnny, because that's Social Distortion. Gosh, well played, sir. All right, well, okay, so let's get into my favorite portion of the show: Survey Says. All right, so a few episodes back we asked, hey, for this year's game jam, you are... and your choices were: super prepared, been practicing all year, I'm ready; yeah, I'll figure something out; or oh my God, I have no idea what I'm doing. So I see you there, Johnny Underwood, sticking something on his forehead, trying to get back at Joe and I to distract us as we now try to talk. But I am a professional and I'm going to stay on topic here. So, right, this is episode 182, so it's an even number, so per the trademark rules of engagement, Jay-Z, you are up first.
all right i'm gonna say first. I'm going to say I'll figure
something out. 10%.
That's cheap.
Is that enough?
He shot the moon.
Is that
chicken strikes again?
He's counting on his fingers and his toes.
You can't see it off screen.
You can't see it.
All right. So this is, oh my God, I have no idea what I'm doing.
We'll go 11%. I mean, we're going crazy high here.
And, of course, it is, oh, my God, I have no idea what I'm doing.
61% of the vote.
I have no idea what I'm doing.
That's awesome.
That's good.
And yet great stuff came out of it. So that's awesome. Absolutely, really good stuff came out of it. Yeah. All right, so for this episode's survey, I thought, you know, last episode we kind of tied DevOps into the topic, and even in this conversation so far we've kind of tied DevOps into the conversation. There's a lot of overlap there, I think we could all agree, right?
So, you know, in general, we've spent several years now talking about DevOps,
gone through several books and whatnot.
So just as a general rule of thumb, like, how do we feel about DevOps?
And your choices are: love it, it's the greatest; or it's great, when things work; or it's okay, it's overrated; or I wish we had a good DevOps pipeline; or it's a dream, nobody does that.
Should we, should we lead the witnesses on this?
no no no
I know the answer
it'll be 10%
I was going to say 10% of the vote
I think he learned his lesson
he's going to go with like 13 or 14% next time
maybe you never know
also
don't forget we're giving away a book so drop a comment
here and we'll hit
you up. Yep, yep. And, you know, those comments really do help, anything that you could do to... well, I was thinking more about the comments, like leaving the review comments, but yeah, Alan was referring to leaving a comment on the episode too. We appreciate that too. But those comments, the reviews that Alan referred to in his not-Johnny-Cash pollen voice, help because they bring in more listeners, and then we can have ads that we get paid more money for.
We have expensive costs, right? We have to wear these headphones
all the time. Now I'm going deaf, so I sent my hearing aids in
for repair two weeks ago.
That's right.
I haven't heard anything since,
but
you know,
I can't even work without headphones on anymore.
Like I don't even have to have anything playing.
I don't have to be in a meeting.
I just,
I'm like,
where,
where they block out sound.
Yeah,
for sure.
No... do we need to go back to the shopping spree episode? Because of these beautiful Calis.
That's right. That's right.
I'm trying to block out the world, man.
No. Thank you, Derek, for that joke, by the way.
All right. So let's talk about target level of availability.
So one approach, I think we kind of just hinted on this, though.
One level of reliability may not be suitable for all needs, right? So the example that we just gave was related to the business apps, the Google Apps versus AdSense. They also gave an example of Bigtable in the book. And it was actually a really cool story because they were
saying that like, depending on what the usage is of the application, you could actually even like
have different tiers of service that you could charge different levels of reliability for.
So they gave this example where they're saying, hey, if you want super high reliability, and I think I hinted at this even in the last episode, if you need super high reliability, then it's going to be hyper expensive, right, due to the additional compute that you're going to get for that. But if that's what your use case is and you're willing to pay for it, then here's a service level for you, here's a cost that you can pay. But if what you want to do is more like offline batch analytical type programming or processing, then you might not need that higher level of reliability. And so therefore you can be in a different tier of service and pay less
for it. And, you know,
but as a result,
you know, you're willing to take that hit of,
uh,
you know,
it might not be quite as fast as the other one,
but it's always running.
Right.
Yep.
Oh,
I should mention too,
that,
actually,
uh,
you, you mispronounced Bigtable there.
Bigtable.
Yeah.
It's not a capital T.
That's always bothered me about Bigtable.
And they actually do pronounce it as Bigtable.
But it's little t, one word.
It really is.
So it looks like Bigtable.
That was your tip of the week.
A little journey through Jay-Z's mind.
Kind of scary, if I'm honest.
Me too.
So, different types of failures. Real-time querying wants request queues to almost always be empty, so it can service requests as soon as possible. Offline analytical processing is more about getting right answers and just throughput in general, so we care less about latency and more about always getting the work done. So it's two different queues with different goals: one queue we want to always be empty, and the other we want to stay full so there's always work to process.
And you know,
what's weird about that is it's literally the same technology stack,
right?
Like it's same exact stuff,
but what would be successful for one would be viewed as failing on the other.
Right.
And so it's,
it's pretty interesting that you can't even look at the same freaking thing
and say,
uh, uh,
Hey,
what's my success criteria,
right?
It's not,
it's the use case that you have to go after.
Yeah,
that's funny.
Um,
so cost,
uh,
can you partition the services such that,
uh,
different clusters can have those different needs.
And we kind of talked about that a little bit.
Um,
you know,
maybe for some Bigtable,
uh,
customers, they care more about, let me fix this, low latency and high availability.
And others may, you know, care more about throughput.
And exposing those cost savings, giving the customers the leverage to make the right decisions for their business is fantastic.
And it kind of takes some of the decisions away from your SREs, although it does complicate things, right?
So you've got to kind of split your – you'll have different levels of objectives for those.
But I don't know that it complicates things too much because what they said is typically it's the same exact stack.
It's just configuration levers that they change, right?
So does it really complicate things?
I guess you have more things you need to test out to make sure they operate
properly in those different environments.
But technically you're spinning up the same software,
just,
you know,
changing variables here and there.
Yeah.
I guess I was expecting to have different SLS based on,
on,
uh,
the kind of the,
the service tier,
but yeah, I don't know. Maybe you wouldn't. Yeah. It is kind of interesting service tier but yeah i don't know maybe you wouldn't yeah it is kind of
interesting. I'm guessing, with what they were talking about, like always wanting empty queues versus always-full queues, or queues that have things to be processed, would they set up separate metrics for those different types of environments? I'm guessing they would. I would assume so, yeah. I mean, because like you just said with the queue thing, for example: the real-time querying, you wanted to always have an empty queue so that it can be real time, versus the offline, where you're trying to do as much as you possibly can, so you always want it doing something. So you,
you'd want to know like, Hey, is the real time queue backing up?
Because if so, I need to address that somehow you,
you'd have to have different observability for it.
I think you'd probably, you might have the same metric, right?
Like the queue size. Um,
but you might have different alerts set up for,
for different sets or something. I don't know. Yeah. It's, it's definitely interesting. Um, but you might have different alerts set up for, for different sets or something.
I don't know.
Yeah.
It's, it's definitely interesting.
Um, yeah.
So, oh, this, this was actually my favorite part of this entire chapter here.
I don't know about you guys, but this, uh, this motivation for error budgets.
So when I read this title, I had no idea what this meant.
I still didn't know what it meant even until I got down to like another couple paragraphs.
But basically what they get at here is there's tension between SRE teams and feature development teams, right?
We've talked about this in the past, too, right?
Like SRE, they want to keep things going.
They want things running.
And development teams want to release features.
And those two things are at odds because every time you release something, you're introducing risk and potential downtime.
But I jumped way too far ahead here. So there's a few things that they need to look at here. So
software fault tolerance. How fault tolerant should the software be, right? Like how well
does it handle unexpected events?
Testing.
If there's too little testing, then it could be a bad user experience, right?
And we're talking about unit testing.
We're talking about end-to-end testing, all kinds of testing, not just one particular type.
If you have too much, you never ship, right?
Like if you're trying to make everything absolutely perfect, hit every edge case, you're not going to ship the software.
Let's go back to that DevOps Handbook, right? Remember, there was this whole pyramid of testing that you might do, like you might have unit testing, integration testing, end-to-end testing, performance testing, user acceptance testing. If you were to say, well, we can't even ship it until we get to that top tier, you might never ship, depending on what your product is, and you've got to be willing to accept some... Totally.
and the worst thing is you know we say you'll never ship which i mean chances of that happening
are slim to none but but what you could do is you could miss your opportunity right like if the
market is positioned in a way to where if you get your software out the door, you can profit on that.
If you're trying to wait to get to perfect, then you may miss that opportunity, right?
So you could have missed the boat.
Push frequency.
Code updates are risky.
We've talked about that.
Anytime you push to production or wherever, you're introducing potential risk, right? Cause if it's been running fine for the past month and then you change
something,
there's a chance that it might not run,
run fine for the next day.
Um,
so should you reduce the number of pushes or should you,
um,
like work on getting more features out there?
Like that's a question you have to ask.
I still want to live in a world where like we've read stories about you know like i remember facebook had a article out like well over a decade ago where the developers could
um you know they weren't done until they saw it in production and they could literally do their own deployment for their thing as part of the effort. Right. So, you know,
you get a ticket, you're going to start like, oh yeah, I need to like move the pixel, you know,
the, the image three pixels to the left, or I need to make the logo on fire and you could like
do it and then deploy it. And that was all within your capability, because they had so much automation in play and so much testing in play; they kind of act as, like, automated gates to protect you.
But because of that,
there was like this huge confidence that they could just get it out there as
soon as possible.
Right.
So I definitely want to live in that world where like,
you know,
don't reduce the pushes to production,
get the feature or the bug fix out as soon as,
as soon as it's ready.
And there's also something to be said for like smaller deployments too.
So totally smaller deployments are safer.
We've talked about it in the past,
but for sure.
This last part that they had here in this section for the motivation for error budgets was canary deployments, right?
The duration and size.
Now, this is interesting.
I hadn't really thought about it in these terms before.
So canary deployments, you're typically trying to see how something will go, but you're usually doing it on a subset of the workload.
So you're sizing it down, right?
Hey, does this thing operate well in this environment? And the questions that they asked
were, how long do you wait on canary testing to see if something does go wrong? And how big do
you make the canary? Meaning, what size of a subset of the data do you do it on, right? We have talked about this in different terminology, most notably in the way of feature flagging, where you could use feature flags so that a portion of your traffic gets directed to that new feature. So specifically, the example that we did talk about, bringing up
Facebook again was how they had like the messaging was already out there. The Facebook messenger was
already out and deployed in the wild before, you know, you know, everybody realized it and they
were able to like slowly add on people and see, you know, how well it was working and get some metrics about it beforehand.
And then over time, you could keep increasing that, in this case, canary size
until you're ready to make it a feature that everybody can have.
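(As a toy illustration, and not something from the book: sizing a canary with a feature flag might look like bucketing users by a stable hash and sending a small percentage of them to the new build. The percentage, names, and hashing choice here are all made up for the sketch.)

```bash
# Toy sketch: send roughly CANARY_PERCENT of users to the canary build by
# bucketing a stable hash of the user id into 0..99.
CANARY_PERCENT=5

route_request() {
  local user_id="$1"
  # cksum gives a stable numeric checksum; mod 100 buckets users into 0..99
  local bucket=$(( $(printf '%s' "$user_id" | cksum | cut -d' ' -f1) % 100 ))
  if (( bucket < CANARY_PERCENT )); then
    echo "user ${user_id} -> canary"
  else
    echo "user ${user_id} -> stable"
  fi
}

route_request "alice"
route_request "bob"
```

Growing the canary over time is then just raising CANARY_PERCENT as confidence builds.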
And of course, I assume they're referring to canarying as the canary in the coal mine
kind of thing where you're just seeing like,
which is kind of a gruesome,
uh,
you know,
way to approach this,
you know,
like,
you know,
you put the canary in the coal mine and see if he dies like that.
Right.
Or it doesn't come back. Yeah. An awful way to talk about this.
So yeah.
Um,
yeah.
What'd you do?
Hey,
we talked about tracer rounds in a previous episode.
So you're shooting your code, right?
Yeah.
So, you know, why not fly a bird into something to get her set?
Yeah, we did.
All right.
So, Hey, there was actually a quote in here and I think it's from somewhere else too.
I don't remember, but we talked about it.
Yeah.
Did we?
Oh, I thought it was literally the title of the last section of chapter one, I believe.
All right.
So we're skipping this.
Um,
We've got to tell them; now people are going to go back and be like, what are they talking about? That's our motto: hope is not a strategy.
Um,
all right.
So this, this is the part that I thought was really good.
So forming your error budget.
Now,
what in the world is this?
So both teams, the SRE
team and your, your feature development team should define a quarterly budget based on the
service's SLO, right? So whether it's 99% uptime or 99.9 or whatever, what's cool about this is
this determines how unreliable a service could be within a quarter. And it removes the politics between the SRE and the product development team.
So you say, hey, we want our service.
If you're Google in this case, right,
the number of requests have to fall within our 99% SLA, right,
or SLO, service level.
Objective.
Objective, yeah.
So going back to what we said earlier, 99.99 percent on 2.5 million is 250 failed requests, right? So that's how many you get for the quarter, if there's only 2.5 million requests made in a quarter. I got 99.99 problems but a failed service request ain't one. Ain't one. That's right. Um,
maybe it is some of the times it depends 52 and a half minutes out of the
year.
It is,
but God,
that's still crazy.
That's so crazy to think about.
Um,
so here's,
here's where this gets interesting,
right?
So this removes politics,
but product management sets the SLO, right? So this removes politics, but product management sets
the SLO, right? So, Hey, I've, I've got my new Gmail thing out there. I'm going to set the SLO.
I want it to be 99% uptime because you know, who cares if some of the background polls for
new emails fail, that's fine for the quarter, the actual uptime then needs to be measured.
But the important part is it needs to be by an uninvolved
third party, right? And Google says basically they have their own monitoring system out there. So
that's like the third party. Now, this is where things get interesting. The difference between
the actual downtime and what the SLO was, is your error budget.
So if you said that you're allowed to have, just for easy numbers, let's say you're allowed to have 10,000 failures for the quarter for your service.
If you've only had 10 failures, then you've got 9,990 failures left in your budget.
That's kind of a cool way of looking at things.
And as long as you have budget left,
then you can do a release.
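(To make the arithmetic concrete, here's a rough sketch of the budget math being described, using the hypothetical numbers from the discussion: a 99.99% SLO and 2.5 million requests in a quarter. The variable names are just illustrative.)

```bash
# A 99.99% availability SLO means 0.01% of requests -- 1 in 10,000 -- may fail.
REQUESTS_PER_QUARTER=2500000
ALLOWED_FAILURES=$(( REQUESTS_PER_QUARTER / 10000 ))   # 250 failed requests allowed
ACTUAL_FAILURES=10                                     # measured by the monitoring system
REMAINING_BUDGET=$(( ALLOWED_FAILURES - ACTUAL_FAILURES ))
echo "Budget: ${ALLOWED_FAILURES}, used: ${ACTUAL_FAILURES}, remaining: ${REMAINING_BUDGET}"

# The time-based view mentioned earlier: 99.99% of a year leaves roughly 52 minutes
# of allowed downtime (525,600 minutes in a year / 10,000).
echo "Allowed downtime per year: about $(( 525600 / 10000 )) minutes"
```

As long as the remaining budget stays above zero, feature releases can keep shipping.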
I love this approach.
We actually did a hint on this last episode.
I kind of like got ahead because there was an example that they gave where
they were talking about the rack where you might have like the switch at the top of the rack,
right? And they were saying that like, hey, you know, sometimes your error budget can be shot,
not because of anything that you did, right? That switch goes out, that networking switch goes out
and takes out everything in it.
And if that's the only place where your, uh, application was, well then for the quarter,
your entire error budget is spent. You don't get to do any deployments.
And that's another one of those examples that we talked about where it requires strong buy-in
from management to be able to say like, Hey, Hey, I know it's only January 2nd,
but we're not going to do another deployment until April 1st.
So I did want to say, I'm sorry, I talked to him about this, and he mentioned that security fixes are still getting out. Bug fixes are still going to go out. It's really about those feature releases that are not going to go out. No one's going to say
like, ooh, log4j
exploit, you know, but...
Sure. Heartbleed comes out.
You're going to go ahead and fix that.
He also mentioned just how important the management
angle was. Yeah, they
all have to have buy-in. But how cool is that, though?
I mean, when you think about that, really,
that's a really nice
approach, right? Like, hey, as long as the development team is doing a good job in making their software reliable and when it deploys, there's not problems when it deploys.
They can keep deploying every day if they want, right?
Like, you're not burning your budget. However, if you did something particularly nasty that took you down for a while and you ate into seven thousand of your ten thousand budget, you just blew through 70 percent of your budget. And you're at the beginning of your quarter. You now have to think about what you're going to do over the next three months because you had a particularly nasty release. I think that is a really good way
for teams to sort of do a good job themselves, making sure they're putting out a quality product.
It's also a super mature way of addressing your service, right? Like, you know, there's a maturity level there. The
management's bought into it. You have the observability and the metrics for it. You know,
you were able to calculate what is a reasonable error budget and you're able to track on that.
And, you know, yeah. So, I mean, the point is that there's a lot of buildup in order to get to this point.
Well, I think the first building block,
and we've talked about it before, is the metrics.
If you don't have metrics,
how are you going to measure anything, right?
Like, how do you know how successful you are
or how bad you are or how you don't, right?
So getting that in place allows
you to start making decisions. Without it, you're flying blind. Well, where I was thinking about this, though, as I was saying that is, okay, let's take our current work life, right? We couldn't just say, hey, here's our error budget. We'd have to sit down and think about that. That's not something you can just decide arbitrarily, like, you know what, a thousand errors feels right, that's what my gut's telling me and I'm going to go with it. No, you can't do that. You really need to take the time to sit down and figure out, okay, what's realistic? What does it matter to my customers? What's the perceived loss, or the actual loss, with any customers for outages and things like that. So there's a level of maturity there to be able to get to this point.
Yeah. It's pretty awesome to think about. So the benefits we'll touch on real quick here. I mean,
we've already talked about some of them, but this approach actually provides a good balance for
both teams to succeed, both the development, the product development team and the SRE team,
right? It's, they can look at it and they can see what their budget is. Um, and if the budget's
nearly empty, then the product developers start spending more time testing and hardening their
product, right? It's, they don't stop working, but they're not going to be releasing more features.
And so they can take that time and make it to where the next time they
release,
it goes a little bit smoother,
right?
With,
with fewer problems.
So that's pretty awesome.
Or write more testing around previous errors to make sure that they don't
happen again,
or,
you know,
they're caught ahead of time.
Yep.
And then,
so outlaw hit on the switch thing,
right?
Like if it goes out,
you know,
it's like,
well,
sorry,
that's just what happened. You know, you guys are gonna have to eat it with the budget.
But what they said that this can actually do is it can bring to light some, some overly aggressive
reliability, um, targets that people have hit or trying to hit, right? Like, let's say that,
you know, something happened that you couldn't control. And all of a sudden your entire budget's eaten. You can look at it and be like,
yeah, you know what? Our reliability goals are way too high. We should back off this a little
bit because otherwise we're never going to be able to release another feature. Right? So again,
I think that goes back to what outlaw said with the level of maturity, you have to be able to
reevaluate that stuff and say, and be honest with yourself and the, and the group and the product and be like,
yeah, I think we were, we were a little too aggressive here.
Yep. So, uh, we'll have links to the resources we like in this episode. Clearly, uh, sre.google will be one of the many in there and uh yeah with that we head
into alan's favorite portion of the show it's the tip of the week yeah so i've actually got a couple
here today. So I'm going to lead off with the one that is a follow-up to the one that I did last episode with the Guava library from Google.
So Michael Warren wrote us on the previous episode's show notes and let us know that there is actually an update to the cache library. So Google has even pointed to this other one, this GitHub library from Ben Manes; it's called Caffeine.
And it's a,
I guess it's more of a standalone
caching implementation that they've done. And Google even recommends it from their own Guava
pages. So if you are looking for some of those caching features, go take a look at that instead
of necessarily pulling in the Guava stuff, which I'm assuming if they're, if they're pointing to
this other library, they don't plan on building this up or maintaining it. Um, they're probably deprecating it eventually. So, um,
excellent. Thank you for the tip there, Michael. And then the other one I wanted to share because
I get into this flow where I'll be working on things and I don't know, as a developer that
gets involved in a lot of stuff, I get pulled off tasks a lot. I don't know about you guys, does that ever happen to you? Do you get off task a lot? You get pulled off your task to do another task? I don't know what you're talking about.
sometimes i get to work on the tasks i'm supposed to oh that's that's kind of it's kind of like it
yeah i think that's kind of like it maybe that's what happens to me maybe sometimes i actually get to work on it
But so here's the gist of it, right? So I check out a branch and I start working on it, and I get pulled off on something. Two or three days later I come back to it and I'm like, oh man, what did I branch this off of? Where is this supposed to go?
Is this supposed to go into my release branch?
Is it supposed to go into the dev branch?
Like, I don't remember where I started or what I was doing.
Well, there's something that I do when I do git checkouts that helps me with that. The actual command is git checkout -b <branch name> -t. The -t tells git to track the branch that you branched off of. So let's say that I created a branch called ABC, right? And I told it to track the branch that I branched off of, which might have been dev. What I can do after that, there's another command in Git where you can type in git branch -vv, and it's a verbose version of the branches you have. Because if you just type in git branch, it'll give you a list of all the branches you have locally, right? If you do git branch -vv, it'll give you a list of those branches, but it'll also show, if you tracked another branch, the branch that you tracked off of. So I can look at it and say, oh, okay, my branch ABC, I branched off dev. Cool. When I do need to do a pull to get new code in, then I'll get it from the dev branch. Right. So, um, and it also
makes it easier when you do the tracking like that.
If you happen to switch back over to your dev branch
and you do a git pull to get in the latest changes
from your origin,
and then you switch back over to your other branch,
it'll tell you, hey, this branch is, you know,
25 changes behind dev, just do a git pull
and it'll pull it in
because it was tracking your local dev branch.
So that's all really nice.
There are tons of little caveats to when you don't have to do this.
I'm not going to go over all those because, I mean, it's just too much information.
But there is a way to set this up to where you never actually have to do the dash T
if you don't want to have to remember it.
You can do a git config --global branch.autoSetupMerge and set it to always,
and it will always track the branch that you branched off of.
So that's a nice way to do it.
I recommend that if you're doing it locally, that's fine,
but I still always do the -t,
because if I get on another system, another computer, another environment,
then I don't have to check to see if that global config is set. It's nice to know that it exists. So all that will be in the show notes. That is, that is my Michael Outlaw tip of this episode. I want to clarify, that was branch.autoSetupMerge? Yes, yeah, totally, I didn't read it verbatim. It's dash dash global branch.autoSetupMerge, set to always.
Yep.
Very nice.
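(For reference, here's a rough sketch of the Git commands described above. The branch names ABC and dev are just the hypothetical examples from the discussion.)

```bash
# Create branch ABC from dev and set dev as its upstream (tracking) branch.
git checkout -b ABC -t dev

# Verbose listing of local branches, including which branch each one tracks.
git branch -vv

# Because ABC tracks the local dev branch, a plain git pull on ABC will bring in
# whatever you last pulled into dev.
git checkout ABC
git pull

# Optional: always set up tracking when branching, so you can skip the -t flag.
git config --global branch.autoSetupMerge always
```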
All right.
So just check my watch and it's been 10 minutes since I mentioned MSUR.
So ding, ding, ding.
So two tips from him.
And the first one, I'm just going to read part of this talk description here.
And y'all can let me know if this rings any bells.
So, your job title says software engineer, but you seem to spend most of your time in meetings.
You'd like to have time to code, but nobody else is onboarding the junior engineers, updating the roadmap, noticing the things that got dropped, asking questions on design documents, and making sure that everyone's going in roughly the same direction. If you stop doing those things, the team won't be as successful.
Now someone's suggesting that maybe you'd be happier in a less technical role. If that describes you, congratulations, you're the glue. If it's not you, have you thought about who is filling this role on your team?
I'm going to skip the next paragraph and just end with this.
Let's talk about how to allocate glue work deliberately, frame it usefully, and make sure that everyone is choosing a career path that they actually want to be on.
Cool.
That's just really cool to talk.
It's from an ex-Googler.
They're a lead at Squarespace now.
And so there's an excellent talk that I listened to that's really nice.
And I think that it makes a lot of really good points, you know, of course about career management and, you know, making sure that you're doing visible work.
But also just about recognizing when you are doing these kinds of tasks and recognizing who is doing it and kind of what that means for the team.
So I thought it was really interesting.
And that could be a source of frustration for senior engineers a lot of times where you feel like you're not getting any work done.
But really, you're doing this really important stuff that isn't always seen or noticeable.
So, great talk.
Yeah, definitely.
I'm going to have to get that one.
I'm going to watch that.
Yeah, for sure.
And I got a second one here for you. So
I mentioned earlier, I hinted that there are other books
from Google. And actually, this one's even listed on
the sre.google site, but it's not under the slash books section for
some reason. There's a book called Anatomy of
an Incident. Google
Site Reliability
Engineering. You know I can't say that.
You can't say that.
You want to give it one more try?
Cyber...
It just can't do it.
You could never have that job title.
No. What's your job title?
I really want to. I really want to.
I really want to as well.
I'm an SRE.
That's pretty good.
SRE.
So it's all about Google's approach to incident management for production services.
So not only managing with the incidents and dealing with them proactively and preventing them essentially and being ready for them, but also doing things like postmortems and whatnot.
This is a free book.
It's really useful if you're getting into that or you just want to get better
at it.
Awesome.
Yeah.
And if you want this free book,
just leave a comment.
I don't know if this one... it's an O'Reilly book. I assume you could buy a hard copy of that, or no? Yeah, they've got PDF, EPUB, and Mobi. No reason. That's nice. So now you want a physical copy? It feels good. It smells good.
Yeah.
You know,
yesterday I had a clown open a door for me.
What?
I thought it was a nice jester.
So, for my tip of the week: one, you know, I mean, we do a lot of Kubernetes stuff in our day-to-day.
And if you are too, then, you know, if you aren't already using Minikube, if you, well,
first of all, if you're using Docker Desktop for your Kubernetes work, I guess this is
like a PSA, like there's better ways out there.
No offense to Docker Desktop, but, you, but I'm a fan of Minikube
because you can very easily specify
the version of Kubernetes that you want to use,
which to me is critical
if you want to be able to test your infrastructure
against a prod-like environment
going back to our DevOps handbook.
And so being able to specify that version is critical.
And, you know, Minikube allows you easily to do that.
But now that you're using Minikube,
I've convinced you of all of its wonders.
Such a salesman.
Thank you.
Sold it in like 30 seconds.
That was really good. Yeah. So, you know, you might want to be able to see, hey, of all my pods, do I have any heavy pods? Which ones are doing the most work and whatnot? So go to your favorite terminal and enter minikube addons enable metrics-server. And then you can do something like a kubectl top pods, or if you have it aliased to k, k top pods, and you can see your pods based on memory and CPU, you can see how they're performing and whatnot. So that's pretty cool. And also,
hey, here's another really cool thing that you can do with Minikube. If I didn't just sway you
already with that 30-second salesman speech that I did before, you can type in Minikube dashboard
and that'll bring up a page that, if you're using something like GKE, Google Kubernetes Engine, will be kind of like that, where you can see the nodes and the pods and all the different Kubernetes resources that are in that cluster of one VM on your machine. You can see some cool stats that are going on. It might be helpful to you to know, hey, am I requesting too much for all of these pods that I want to dev on locally, or maybe I'm well above the limit and that's why my local cluster keeps crashing every time I'm trying to dev on it.
You know, things like that.
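(If it helps, here's a rough sketch of those Minikube commands. The Kubernetes version number is just an example; pick whatever matches your production cluster.)

```bash
# Start a local cluster pinned to a specific Kubernetes version (example version).
minikube start --kubernetes-version=v1.23.3

# Enable the metrics-server add-on so resource usage can be queried.
minikube addons enable metrics-server

# Show per-pod CPU and memory usage (or "k top pods" if kubectl is aliased to k).
kubectl top pods

# Open the Kubernetes dashboard for the local cluster in a browser.
minikube dashboard
```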
I do want to call out one thing on the Minikube thing, only because it confused the ever living
heck out of me when I first started dealing with Kubernetes.
So Docker desktop nowadays, as Michael mentioned, has Kubernetes built in.
You can easily turn it on.
When you, at least back then, when you'd start looking into Kubernetes things, they'd tell you to use Minikube. And what confused me is I thought that Minikube was using Docker for its images and stuff. They're two totally separate products, right? So Minikube is running a Kubernetes cluster in its own little configuration, right? It's its own little world. It doesn't care about Docker Desktop, or it doesn't have to, generally speaking. So what you can do with Minikube is you can run a Kubernetes cluster and do all the kubectl stuff there. What you can't do is a docker run and expect everything to work there.
Even though it has a Docker daemon, it won't let you do it exactly the way that you think you
should. So my whole point in saying this is if you're confused, if you want to get started with
Kubernetes, you could totally just use Docker desktop and turn on Kubernetes there, but you
are stuck with the version that they bundle with Docker desktop, which is what outlaw was saying. You can't specify a version.
And if you allocate resources to that thing, let's say that you give it 10 gigs of Ram and whatever
that's taking that. Now, if you wanted to run Minikube as well, that is going to start another separate VM that is going to require its own resources.
So if you give it 10 gigs of RAM, it's going to be its own cluster.
Docker is going to have its own 10 gigs of RAM, and they don't operate together.
So it was confusing to me when I first started out because a lot of tutorials would say,
hey, use Minikube, but then Docker desktop's like, yo, use Kubernetes here.
And I didn't understand it.
So just know that they are two totally separate things.
Let's be honest.
If you're just starting out,
then everything involving Kubernetes is a bit to take in.
It's a lot.
It's a lot of information.
Just do an episode on it.
You're in good company there.
And you're not the only one. Well, you know what's funny? The reason I got so frustrated is I was following a tutorial and it was like, hey, go to the metrics dashboard, or this metrics add-on thing, right? And I'm like, it's not working. I'm using the Docker Desktop thing and it's like, it's not there. I can't get it to work. I spent hours trying to figure it out, and then I was like, okay, I don't get it. And it was just the fact that the tutorial was from the standpoint of Minikube, where I was using the built-in Kubernetes in Docker Desktop.
So yeah, it is, it's a mind wash. We should totally do an episode on it. I think we have
enough information at this point to, to spend a minute or two on it.
I'm all for it. Yeah.
Sounds good.
Um,
yeah.
Reminds me that some people are like slinkies though.
Yeah.
They're not really good for much,
but they bring a smile to your face when you push them down the stairs.
Uh,
thank you,
Jesse.
Everyone. That's the best one. It's really dark though, so sorry for ending the show on such a... you know, I feel a little bit like Dexter saying that one, you know, like, here's the clean room and we're going to tell you something funny, but... Yeah.
So, yeah, subscribe to us on iTunes, Spotify, or don't.
Maybe you heard that last joke.
You're like, you know what?
No, that's too far.
You went too far.
But I really wish that you would.
And if you would, you can find us on whatever your podcast platform of choice is.
Maybe a friend gave you, like, hey, go check out these crazy guys, and you didn't realize that we had a podcast. But yeah, we are there. And so leave us a review if you can
as Johnny Underwood
said before. We greatly appreciate it. There's some helpful links at
www.codingblocks.net slash review.
In the ring of fire... While you're at codingblocks.net, check out our show notes, examples, discussions, and more, and send your feedback, questions, and rants to our Slack channel. And make sure to follow us on Twitter and send us some, I don't know, Social Distortion-related trivia from Mommy's Little Monster. Head over to codingblocks.net and find all our dillies at the top of the page.
He says dillies.
That's what they are.
That's what cool kids call them now.
You say dillies. You say it like real quiet.
Cool kids.
He talks about Social Distortion, that's been around for like over 40 years. I think, yeah, that's about right. No, I know, because I was at the 40th year. Uh, really? Yeah. Nice.