Coding Blocks - Site Reliability Engineering – Service Level Indicators, Objectives, and Agreements
Episode Date: April 25, 2022. Welcome to the morning edition of Coding Blocks as we dive into what service level indicators, objectives, and agreements are while Michael clearly needs more sleep, Allen doesn't know how web pages work anymore, and Joe isn't allowed to beg.
Transcript
You're listening to Coding Blocks, episode 183.
Good morning!
The morning edition of Coding Blocks.
Subscribe to us on iTunes, Spotify, Stitcher, wherever you like to find your podcasts.
And you're probably saying, like, how do you know it's morning when I'm listening?
And I don't.
But if you can leave us a review, we would greatly appreciate it.
Yep, and you can visit us at codingblocks.net where you can find our show notes, examples, discussions, and more. Send your feedback, questions, and rants to comments at codingblocks.net. And if you want more Twitter, we've got more Twitter: we're @CodingBlocks. Also, if you go to codingblocks.net, you can find all our social links at the top of the page.
With that, I'm Joe Zack.
I'm Michael Outlaw.
I'm Allen Underwood.
This episode is sponsored by Shortcut. You shouldn't have to project manage your project management.
All right. So we are back with another chapter on Site Reliability Engineering. Today we are talking about service level agreements, objectives, and indicators. So before we get into that, I think Outlaw has some things that he wants to talk about, that he's in love with here of late.
Well, I don't know why you gotta start it like that. We're not in love with... maybe. Yeah, well, it was just more the idea that we've talked about the benefits of monolithic repos before, and I was like, yeah, okay, meh, whatever, because I could see some pros and cons to either, right? But the current thing I've been working on has me more and more convinced: even if you go monolithic repo, which can make a lot of sense if your code needs to version together and whatnot, and there are benefits to a monolithic repo, like different teams all using the same repo if the stuff needs to version together, the monolithic build, though? I hate it. Just build the little bits here and there as they change and as they need to get built, and then compose them all together for the final deliverable. I can totally get behind that.
Yeah, as your code base grows bigger, the smaller the percentage your code changes are likely to be, right? So if you've got a 20 gig repo, chances are you're not changing very much of that with every commit or every build. So building all of that every time is pretty nutty.
Yeah, and if you have a 20 gig repo, we should talk, because that's cray cray.
Yeah, well, I mean, we even talked about this in the past. I think Merle was one of the ones who mentioned Bazel, right? There are tools out there to help with this kind of thing. And doing a massive build of everything in your thing every single time seems wrong, just like, in fairness, making it to where you have to deploy everything at the same time is wrong, right?
Preach.
So I guess that's my thing. I still believe that if it versions together, then it should live in the same place, because if you break that apart, now you have to manage version compatibilities and some sort of matrix of how all that works, right?
So that's dependency management. Dependency management is hard.
It is. So I'm okay with the mono repo, but just because it's in the same place doesn't mean you have to build everything every single time. It doesn't make sense.
Yeah. And so that's basically it. I was just curious to throw it out there to see if anybody else has experiences where they would say, no, no, no, here's the reason why monolithic builds rule and you should embrace them, and here's our experience as to how the mono-build solved our problems. So hey, if you do have some stories to share with that, you can throw some comments on this episode. You'll be able to find it at codingblocks.net/episode183. And if you leave a comment, you get a chance to win a physical copy of the book, unless you want the free online version. Or, I mean, hey, maybe you want a Kindle version.
Hey, and also, we didn't get any reviews this last time. I think... I don't know, is anybody out there?
Yeah, no, we got... well, not for this time, but we did get some last episode.
Yeah, we got one last episode, but no comments either, man.
Two, two. Don't take away from them.
So, somebody say hi, something, virtually, something.
All right, okay, so dive in.
Yeah, let's get into this thing. So we're going to talk about SLOs today, right? Service level objectives. Actually, we're going to talk about all kinds of stuff that involves service level objectives, because people, they even talk about it in the book, I'm sure you guys saw this, where they're like, yeah, people just kind of use this as the de facto thing that they say. They might say SLO, but they might've meant something else, right? So we're actually going to talk about three different things. There's the SLI, which is a service level indicator; the SLA, which I'm surprised they put second, but that's the service level agreement; and the SLO, which is the service level objective.
I think we did that in our notes because in the book it's not in that order.
Is it not?
Oops.
I'm surprised I put it in that order.
What you're saying, basically, is that when people say service level objectives, it's like a blanket term. What they usually mean is one of these three, SLI, SLA, or SLO, or maybe a collection of more than one.
Totally.
So SLI is the first one we have in the notes, not alphabetically. In case you're wondering, I does come after A in the alphabet that I use, anyway.
But service level indicators are a very carefully defined metric of some aspect of the service or system. So an example might be response latency, maybe error rates, or system throughput, typically aggregated over some period of time. And the idea here is that this is information you can use to determine... you know what I'm trying to say... it's kind of their health indicators, right? It's some sort of metric or number that gives you an indication of how your system is doing.
Yeah, I mean, it's the quantitative thing that you can put your finger on, right? Like, you can actually measure it.
So, I wanted to call out, though, one thing that we've said in the past: there was a lot of similarity, a lot of overlap, between this book and the DevOps series that we've covered, be it the DevOps Handbook or... what was the other one?
No, that was the big...
It's 12-factor app, maybe, or...
No, I thought there was another one. It'll come to me.
Minimum CD. Minimum viable CD.
No, no, no. There was another one. It was like the big machine or something. I can't remember.
Phoenix Project?
No. At any rate, the point is that in this specific section, as it relates to indicators, objectives, and agreements, there's a lot of overlap with the Designing Data-Intensive Applications coverage. Specifically, I think it was in the maintainability portion of that book. There was some overlap, but this is coming at it from Google's perspective. So some of these terms, if you listened to that series, you might think, wait a minute, I've heard this term before. Where did I hear it?
I will say this particular chapter I actually liked a lot, especially coming from Google, because they have so much data and they have so many services that they really had to focus in on what mattered.
And I think that was super important.
I mean, we'll get into more of that a little bit later, but we'll wait for it, because there's some stuff that I have comments on.
So one of the things that they pointed out here with the SLIs is: there are things that are really easy to measure in a system, right? But sometimes it's not possible to measure exactly what you want with what you have in front of you. Like, if you own the server farm or something, that's easy to do. But let's say that you have a service and you're getting complaints from a customer. It might be that the client side is experiencing issues that you're not aware of, so that might be a little bit more difficult to measure. So sometimes you actually have to go outside of your own purview to try and measure things externally as well.
They also said probably the biggest SLI that they have is availability.
And what's interesting is, we talked about this, I think, on the last episode: Google doesn't necessarily measure uptime the same way that other people do. Theirs is basically yield, a ratio of the number of requests that succeeded versus the total number. That's how they do it, so that they can measure things in different regions and all that.
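A tiny sketch of that request-based availability measure, just our own illustration of the idea, not code from the book:

```python
# Availability as "yield": successful requests / total requests, rather
# than wall-clock uptime. This works per region, per service, per window.
def availability(successful: int, total: int) -> float:
    return successful / total if total else 1.0

# e.g., 9,999,412 of 10,000,000 requests succeeded in this window:
print(f"{availability(9_999_412, 10_000_000):.4%}")  # 99.9941%
```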
Yeah, I was curious. I went back to find where we covered this, and it was in episodes 121 and 122, I believe, related to scalability and maintainability. We were talking about using your SLIs and SLOs to know how to deal with scaling your application, and defining, well, let's do this by the numbers, but what do we mean by the numbers? And so in this particular chapter, Google's talking about the SLIs and having those metrics, even defining what that SLI is, so you know you're even measuring the right thing. So the indicator might be, well, how long is a page taking to load, or how many queries are you able to return per second, or whatever. Those might be your indicators.
That number by itself doesn't mean anything bad or good, right? And that's why you need to be able to capture it first and then trend it over time, so that you can then make a decision, like, hey, we're doing good, or we're doing bad, or whatever. And going back to our conversations from the DevOps Handbook about the importance of visibility and observability, and tracking and having those metrics: all of these concepts that we've been talking about for years now, it's like they're tying together, right, from multiple different perspectives.
Yeah.
And I think what you just hit on, too, is really important. When you hear SLI, that's basically your measurements, right? If you're going to simplify things in your mind, this is the thing that actually goes and gets the numbers for you. What was my latency? What was my number of requests? What was the success ratio? All that. And they even called out that for storage purposes, it's more about durability, right?
We talked about the number of nines, and actually it's funny, because I may have said something wrong on the last episode. I can't remember. I think I did. I think I said 99.99% would have been two nines, and that's wrong. It's the number of nines: basically, if you take it away from a percent and just count the nines after the decimal, that's how many nines it is. So 99.99% is the same thing as 0.9999. That's four nines. There was actually a discussion on Slack around it. I was like, I don't know what I said.
Yeah, whatever it was, it was probably wrong.
So that clears that up a little bit.
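To make the nines concrete, here's a quick sketch, ours, not from the book, that turns an availability target into the downtime budget it allows per year:

```python
# Convert an availability target (the "nines") into an approximate
# allowed-downtime budget per year.
MINUTES_PER_YEAR = 365 * 24 * 60

for target in (0.99, 0.999, 0.9999, 0.99999):
    nines = str(target).count("9")
    budget_minutes = (1 - target) * MINUTES_PER_YEAR
    print(f"{target} ({nines} nines): ~{budget_minutes:,.0f} minutes of downtime/year")
```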
So if we take all these indicators, right, like I said, just a random number, you know, like, hey, select count star of new users that have been created in the last hour, right? That number means nothing by itself. But now you want to take these numbers and put an objective on them, to say, well, I never want my error rate to go above a certain number, right? Or, think about it from a sales or marketing point of view: I want new users coming to my page, I want a certain number per hour or whatever. That might be their kind of objective. So this is where we take the indicators, and now we start talking about service level objectives and how we can use those indicators.
Yep. So, I mean... go ahead, Jay-Z.
I was going to say, one thing that was interesting is they mentioned sometimes you'll have two ends. It'll be a range. So you'll have a minimum and a maximum, and you'll want your service level objective to be in between those two numbers. I just thought it was kind of interesting, because every example I could think of off the top of my head is generally one side or the other. So you either want to be more than this or less than this. I couldn't think of an example where you wanted to be right in the middle.
I think that goes along with what they were talking about, where you sort of have your internal SLO and then your external SLO. So that range, I think, is in between those two. Meaning, hey, internally, we want all of our requests to be served within 200 milliseconds, right? But what we want for our users in another department is for them to never experience anything over 300 milliseconds. So as long as we're somewhere in that range, then we're good. That's the only thing I could think of.
Yeah, I just pulled it up. They mentioned having a lower bound and an upper bound, but they didn't give an example. The only example they gave was for search, where presumably you're fine with search being faster than whatever. So they didn't give an example of it. I just thought it was interesting.
Yeah, so what they say in here is that it's the range of values that you want to achieve with your SLI, right? So latency would be one. For a range, I guess you could say we want our response time to be between 100 and 500 milliseconds. And you might ask, well, why wouldn't you want less than 100? Because that might mean we're paying too much. They actually did call that out later, right? We'll get to that one too.
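Here's a toy illustration of a two-sided SLO like that; the bounds, metric, and function name are all made up for the sketch:

```python
# Hypothetical two-sided latency SLO: below the lower bound we're probably
# over-provisioned (paying too much); above the upper bound users suffer.
SLO_LOWER_MS = 100
SLO_UPPER_MS = 500

def check_latency_slo(p95_latency_ms: float) -> str:
    if p95_latency_ms < SLO_LOWER_MS:
        return "below range: consider scaling down, we may be paying too much"
    if p95_latency_ms > SLO_UPPER_MS:
        return "above range: users are feeling it, investigate"
    return "within SLO"

print(check_latency_slo(320))  # within SLO
print(check_latency_slo(42))   # below range
```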
Now, here's one that was really interesting. As I said, choosing the SLOs can be difficult, mainly because you might not have any input into what it actually is, right? It's the business that might be driving what your SLO is. So for latency, we just gave... sorry, go ahead.
Well, I was just going to say, that would go back to the example that I gave of how many new users you want, right? That's something for the business owners to decide. It's out of your control.
Yeah. And some of these might be out of your control, and they actually did call that out in some of these things, too. Like queries, for example. I mentioned queries as an example a moment ago, related to search. Google has no control over how many people actually start executing searches on their service. That's going to be based on popularity and whatever. But what they can control is how many they can return within a given timeframe. And so that's why they would target the queries per second. They're not trying to increase the queries, necessarily, because that's out of their control. I don't know, that probably didn't make sense the way I said that, did it?
No, I mean, you're right. You can't control the load at any given time. That's why they just try and make sure that their systems can handle a certain amount, right?
Yeah, okay, said better. You're right, dog, you're right.
I think we mentioned it's morning time. We all have the groggy voice
right now. So, yeah.
Well, one thing they mentioned that I thought was interesting, or just a good thing to call out, was that these SLIs aren't necessarily independent. So if you get more requests, for example, your latency might go up. If you have multiple SLOs based on these, you might have multiple alarms going off at once, because these things are related, correlated.
It's reminding me of that scene in the movie where someone's flying a plane and every dial is just going nuts and everything's going wrong. It's this kind of funny example. I've definitely seen that in production problems. One little thing can cause a cascading failure, and then it feels like everything is blowing up.
Yeah, I'm thinking of scenes from Airplane!. Was that what you were thinking of?
Yeah, okay. Literally.
But you realize that's such a dated reference, though.
I know, but the movie really predates all of us anyway.
But then on top of that, anyone new listening is like, what?
That movie?
Yeah.
What?
Rent it.
Wait, can you rent movies anymore?
Yeah.
Anyway.
I don't know.
I think planes now just have, like, an iPad, and you kind of rotate it around like a Wii controller.
Oh, no.
Another dated.
So you can play Angry Birds.
Oh, wait, that's also dated.
Dang it.
Dang it.
Yeah, we're old.
Yeah.
So one thing that they mentioned here, and I love this, I absolutely love this, is that the SLOs need to reflect a realistic understanding of what the availability and the reliability of a system are, so that you can actually publish that information. Because if you don't define these well enough, what you're going to get is feedback like, oh, the system's slow, or, oh, this isn't working. And that feedback is nearly useless, right? When somebody says, oh, the system's not working, it's like, well, what part of it? Can you log in? Can you get to a page? Can you do this? So, well-defined is helpful.
Just the publishing of it is also kind of crazy.
Like, have you ever... I've never worked in an environment where, let's say, I worked on the team responsible for the front end of the website, and another team was responsible for the back end. I've never worked in an environment where either team was like, okay, here's our expected uptime, and what's yours? Like, I don't know, a hundred percent? We need to keep the site running.
Yeah, exactly.
But that's also the key, too. Some of this is kind of interesting, because when you work for smaller companies, being down for some period of time can be super costly to you, percentage-wise, in terms of how much it impacts the operations of the company, versus a much larger corporation. Sure, the dollar amount might be more for any kind of unplanned downtime, but they can offset that a little bit better. So it's easier for them to say, hey, we're going to have this planned downtime of this other percentage, and we can accept that, we can eat that, right? Whereas having been at smaller companies, it's like, no, no, no, we need to stay up and running. We're always up.
Yeah. So there's a careful balance there. Even in the beginning of this book, Google was like, hey, don't follow this as a blueprint, right? This isn't going to be applicable to every company. But you can see what we did and apply it how it works best for you.
So there was a chapter that we skipped on the podcast, Chapter 2, which talked about Google's internal services.
And one of those services was named Chubby. And I just wanted to mention that because, looking through the notes, I was like, wait, what? But there was a cool little breakout in the book, and I normally hate breakouts, famously, I've said this, but this one was good. It was about Google basically having this service, Chubby, that internal teams had grown to depend on. They built these services kind of assuming that Chubby would never go down, because Chubby was so good. It had a great track record. And then whenever there was an outage with Chubby, they noticed that all these other systems would go down. So it was kind of a cool example of what can happen if you do too good, I guess is what I'm trying to say, that people just come to expect it. And so they started doing planned outages in order to let those other teams get used to the notion of this not always being there, and having to figure out how to deal with it.
Yeah, and to shake out those dependencies.
Yep. And that's a good example of somewhere you want a range, where you're not really aiming for 100%, and where, in fact, 100% is even kind of bad.
Well, I mean, it's weird, right? Because, like you said, they did so good that everybody just expected them to always be perfect. And so they had no published SLO, so there was no way for them to indicate to people, hey, you really shouldn't depend on this, right?
Yep. They even had a big planned... what did they call it? I forget what they called it, but it was like Chubby outage day or something.
Global Chubby planned outage.
Yeah, so really, really interesting. So now this leads us into the SLAs. You hear this term a lot, I think, in business, especially if you're using cloud services; you'll see what the SLAs are. These are the agreements of what's to happen if the SLOs aren't met. And this is why they said people interchange these terms all the time, right? So really, an SLA is the consequences that happen if something doesn't meet the SLOs. If there is no consequence, then you're likely talking about the SLO. That's really the big difference. And if you've looked at your cloud services, whether you're using Azure or AWS or Google or whatever, typically you'll see there are so many nines attached to something like your storage, and if that's not met, there's usually a monetary consequence, meaning they're going to credit you back a portion of your bill, or whatever the case may be. That's typically what you're going to get.
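As a rough sketch of how a tiered availability credit might be computed, with tiers and percentages invented for illustration, not any particular provider's actual terms:

```python
# Hypothetical SLA credit tiers: if measured monthly availability falls
# below a threshold, credit back a percentage of the bill.
CREDIT_TIERS = [
    (0.9999, 0.00),  # met the SLA: no credit owed
    (0.999, 0.10),   # below four nines: 10% credit
    (0.99, 0.25),    # below three nines: 25% credit
    (0.0, 1.00),     # below two nines: full credit
]

def sla_credit(availability: float, monthly_bill: float) -> float:
    for threshold, credit_pct in CREDIT_TIERS:
        if availability >= threshold:
            return monthly_bill * credit_pct
    return monthly_bill

print(sla_credit(0.9995, 1000.0))  # 100.0, i.e., a 10% credit
```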
There was a section, and I have been hunting for it so far while we've been recording, trying to find it, where they said one of these was technically a measure of the other. Do you remember that? It was like SLAs were a measure of the SLOs, or SLOs were...
I didn't see that when I was reading this, but I don't know that SLAs would be a measure of anything, because SLAs are really just your, almost like your legal obligation to whatever you've promised the customer, right?
Yeah. I mean, they called out, you know, what's the difference between an SLO and an SLA? And you just ask the question: okay, what happens if the SLO isn't met? If there is no explicit consequence, then you're talking about an SLO. But if there is a consequence, then you're likely talking about an SLA.
Yeah.
And what they say here is that SLAs are decided by the business, but as a site reliability engineer, your job is to try and make sure that you don't miss the SLO that would trigger that SLA. And sometimes there are interesting time constraints built into SLAs, too, that are a little bit different. So, for example, if there is an outage, you might be contractually obligated to respond within 15 minutes, or provide some level of support. Or if someone opens a ticket at a certain priority, then you might have a service level agreement around that. And that's getting more into the business side of things, a little different than what we're talking about, but those are frequently associated, at least in my mind.
Yeah. I mean, and again, those are going to be... but have you ever noticed how some companies will have that SLA where it's like, well, we have to respond within X amount of time? It doesn't mean they have to do anything with it. They don't have to solve the problem; they just have to respond, and their response could just be an automated system email, like, yes, we acknowledge that there is a problem.
Okay, you are down. Yeah. SLA, meh.
Right.
So the SREs, they're also sort of responsible, or I guess tasked, with helping come up with the objective ways to measure this stuff, which basically means finding the right SLIs in order to make these SLOs something they can work on. And... go ahead.
I was going to go on to the next point. It was kind of cool to see them mention here that Google doesn't have an SLA for most of the services consumers use directly. There's no SLA on search, for example. They're not going to pay you if their search is slow. But for their business customers, companies that buy, you know, the documents, whatever their business suite is called, they do have SLAs. So if, say, internal search is down, or Gmail for business users is down, that's where those SLAs come into play. But they still have SLOs for those other things we mentioned, like for general consumers, because obviously they have a stake in making things fast and good for you. But they're not going to pay you for it. You didn't have to sign some agreement, you know?
Well, I didn't put all this stuff in the notes, but what was interesting about search not having an SLA is the reason they still have the internal SLOs. For one, they want their search to be fast, because that's where the customer trust is, right? That's one of the big things. But the other is, if their search is slow, that means their Google ads are getting served slower, and so they actually take a monetary hit internally because search is slower. There's a lot of things that happen there.
Right.
I found the statement I was looking for, and I'm just going to tease it right now, because we're going to come back to it later, and we'll talk about it then. But I did find it; I'm not crazy, the statement does exist.
All right. That's it?
Yeah, that was it. I was just teasing you with it.
Yeah, it was a straight-up teaser. Not even any contest, just teaser.
All right. So this section right here is the reason why I liked this particular chapter, because it's very relevant to the kinds of things that we've been working on lately. So, what should you care about? And this very first statement is so important: you should not care about every metric you can find as an SLI, right? So an example I can give is, we all use Kafka, and we both love and hate it at certain levels.
You have hate for Kafka? Can we wait? Pause.
I mean, I have hatred for anything that makes my life more complicated to a certain degree, right? And Kafka enables so many things, but it is also a decently complex system.
I haven't gotten to that with it, but okay.
Just for instance, we work in the Kubernetes world. If you need to resize a volume, you need to make it bigger for some reason, and trying to make it smaller is just not easy. So there's a lot of things, but that's true with any system. In general, I'd say I really do like Kafka, and it does enable a lot of cool stuff. But where I was going with this is, there are some amazing dashboards out there for Grafana and Prometheus for tracking every single metric inside Kafka. But how many of those are actually relevant to you meeting a service level objective?
Yeah.
There might only be two out of, like, 50 that they can provide you, right?
That's cute. You thought there were only 50.
Oh, yeah. I don't even know. There's probably way more, right?
There's so many. Way too many. Yeah, there's a lot.
You can see how it's kind of tempting to say, okay, well, here are all my 50.
Like, what should each of these be? And so you start setting up alerts or service level objectives around, like, CPU and memory and stuff like that. And those things don't make sense, because it's okay if they go high, and it's okay if they go low. The service level objectives really need to be around the business cases.
Well, here's the thing. We've all seen this situation. Let's say you decide to start using Prometheus and Grafana for the first time, and you're like, oh, where do I even get started? Alan was picking on Kafka for a minute, so you're like, oh, well, where do I do this? And you go and you find there are already some dashboards out there. Maybe, if you're using Strimzi, they've already put out some for their operator, or you can find other people's dashboards that they've published, and you're like, okay, I'll start here. And so you basically start with everything under the sun, like, okay, it's all in my face now, right? And now you have a problem, and you're like, well, I can't see the needle through the haystack. There's too many things going on. I don't know which one of these things really matters.
Right.
Or you take the flip side, where you're like, hey, you know what? I was listening to Coding Blocks, they were talking about this DevOps Handbook thing, and they were talking about observability and getting metrics, so I'm going to do one of those for my custom web app. And you're like, okay, well, this might be a good metric to know. So you spit that one out and you build a little dashboard panel for it. And then you're like, oh, hey, here's another indicator, another metric that I can easily put out, right? So now you build a panel for that thing. And before you know it, you've recreated the other situation, where you have too much data happening in your face.
And the problem is, and this is the thing that I liked about this portion, they were calling out that just because you can put the metric out there doesn't mean it's a metric that helps you in any way, shape, or form. Like, imagine, okay, you mentioned Kubernetes, so let's say in your Kubernetes cluster, and I don't even know if you could do this, but let's say you had a metric being spit back out to Grafana on your dashboard that showed you the temperature of the CPU for that pod. Why do you care? Why do you care? I mean, temperatures are an easy thing; there are a lot of solutions out there for getting the temperatures of the different components on your system. But in this situation, is that a metric you care about? It might be easy to do, but who cares? Because in your Kubernetes cluster, in theory, if that node dies because that CPU got too hot, you're going to get moved to another node, right? It does not matter to you at all. Shouldn't. And some of these are good indicators after the fact, right? Like, hey, why did my pod die? And you go in, like, whoa, this temperature was way too high. Good info. No objective needed, though, right?
None. Yeah.
And I was going to say that some of these things might even be indicators you don't care about by themselves, but combined with other indicators, then you do care, to know that, oh, well, the temperature rose in this small amount of time, that's a problem. But in general, I don't care; never show me that.
So that's one end of the spectrum, right? You have so much there that, like you said, it's a needle in a haystack. That's a problem. But there's the other end of it, too, that could be a problem: if you just have one or two metrics, you may be missing entire gaps in your observability of the system, because you don't really have enough to give you the picture of what's happening. So it's a balancing act.
I would say, though, my take on it now, because we've gone through this for a minute, right? And I think this might even be consistent with some of the things that came out of the DevOps Handbook. But to me, less is more. So start with, you know, I know I need to track the 500-level errors coming out of my web app. So I want that indicator being presented, and then I can trend that over time, blah, blah, blah. And then maybe I can know that, oh, hey, there was a high level of them because we took the system down for an upgrade or whatever. So you start with one metric, the 500s, right? But then over time you're like, oh, you know what? I also needed to know... I had some crashing on my database because I ran out of space for the write-ahead log. So I need to monitor the size of that log, or maybe I want to know the free space available on that disk. Either way, now you've learned something, and so you're like, oh, let me add a new metric for that. So the point is, as you learn that you need some metric, add the metric, but don't start up front with every metric known to man and then try to whittle down to the five you care about.
Right.
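For that start-small approach, here's a minimal sketch of what the first metric might look like using the prometheus_client Python library; the metric name, labels, and port are our own choices:

```python
from prometheus_client import Counter, start_http_server

# One metric to start with: count HTTP responses by status code so the
# 500s can be trended over time.
http_responses = Counter(
    "myapp_http_responses_total",
    "HTTP responses served, labeled by status code",
    ["status"],
)

def do_work():
    pass  # stand-in for the real request handling

def handle_request():
    try:
        do_work()
        http_responses.labels(status="200").inc()
    except Exception:
        http_responses.labels(status="500").inc()
        raise

start_http_server(8000)  # exposes /metrics for Prometheus to scrape
```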
So along those lines, though, I think the important part, to get to what you just described, is you need to know the service level objective, right? Once you have the service level objective in mind, then you can at least intelligently say, hey, these are some important metrics that I need to track in order to even be able to see if we're meeting these SLOs.
Well, yeah. Said another way, the way I was just describing it, you'd have to have the problem first to know that you needed the indicator, right? And I guess I'm just saying I embrace that; I'm fine with that. But what Google's doing here, with thinking about the service level objectives up front, is trying to get in front of that and say, well, what are the things I care about to know whether or not it's working correctly? And correctness might not be the only thing, either. You could be correct 100% of the time, but if you're really slow at being correct, nobody's going to care to use it, right? So those kinds of factors, they're trying to get in front of those things by thinking through this.
Which is cool, because they sort of have some templated layouts for what their SLOs are. Jay-Z, you want to grab a couple of them?
Yeah. So the examples they gave are availability, latency, and throughput. Availability we've talked about a bunch of times: is the service up or not. Latency, here, their example is specifically talking about web requests, how long something takes, basically how slow something is. But in our world, we talked about Kafka: latency also has a different meaning, because it can mean how long it takes for something to get processed in your pipeline. And what I like about this is that if you have requests that come in every, say, 30 seconds, but it takes you more than 30 seconds to process them, then you've got an outage waiting to happen in just a matter of time, because you're not fast enough to keep up with the data coming in. So you're going to have a problem. So latency there just means something different.
But that's kind of a great example of something where you might only have a service level objective on availability, because that's what the customer sees. But if you don't have one on latency, it can take you out and get you in a really bad spot, where you're hours behind or whatever, and never able to catch up.
And throughput was the last one, which is how many requests were able to be serviced. And this is a good example, too, where zero is not good. If you've got zero throughput on the system, you might want to have an alert on that, just like you might want to have one if it's too much.
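That falling-behind failure mode is easy to model. A toy sketch, with our own numbers: if events arrive every 30 seconds but take 35 seconds to process, the backlog only ever grows, even though nothing is technically down:

```python
# Toy model of pipeline lag: arrivals outpace processing, so the backlog
# grows without bound long before anyone sees an "outage".
ARRIVAL_INTERVAL_S = 30
PROCESSING_TIME_S = 35

for minutes in (10, 60, 480):  # check the backlog after 10 min, 1 hr, 8 hrs
    arrived = minutes * 60 / ARRIVAL_INTERVAL_S
    processed = minutes * 60 / PROCESSING_TIME_S
    print(f"after {minutes} min: ~{arrived - processed:.0f} events behind")
```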
Hey, so real quick, on those three that he just mentioned, the availability, latency, and throughput: that was an example of their template for user-facing services, right? Anytime they stand up a service that an external customer is going to use, those are the three SLOs they target there.
Yeah, and then they have... go ahead.
Nope.
No. Storage systems.
My SLO was too slow, and so you beat me to it.
All right.
What's that?
We did a bit.
All right.
So storage systems was the next one. They had examples of latency: how long did it take to read? Right, that obviously makes sense for something like S3 or Blob Storage. Availability: were you able to retrieve it at all? And then durability: is it still there when it's needed? And that's where all those nines generally come in. I need to look that up, but all those nines for S3, are they for availability or durability, or both?
I think durability. I remember we talked about Wasabi, and they had the 11 nines, but it was for durability. So not necessarily availability, meaning they could take the system down for maintenance and they wouldn't lose your data. It's not available, but the durability isn't affected; it's still there on disk as soon as they boot it back up.
Yeah. So the S3 website for Amazon literally says designed for durability.
Yeah. And then for big data systems, though, the template is throughput, so how much data is being processed, and end-to-end latency, so how long from ingestion to completion of processing.
So this goes back to your pipeline example. It's taking into consideration more of the overall process, not just one piece of it. So you can almost think of end-to-end latency as a higher-level objective, because there might be components within your objective.
Like, we were picking on Kafka, for example. So let's say you were doing a Kafka pipeline, and maybe you have something like Debezium that's reading from one source and putting it into a Kafka topic. You have Flink that might be reading it out of that Kafka topic and maybe writing it out into another topic, or a database, or, you know, Elasticsearch or whatever. And so you might have an SLO on how fast Kafka can read and write to a given topic, and you might have an SLO on the availability of Elasticsearch and when newly updated documents are searchable again. But none of that... well, let me not say it that way. Let me say that that doesn't paint the full picture of the end to end: I get a new document in my source, and there are all those touch points that I mentioned. It has to go through Debezium, Kafka, Flink, Elasticsearch, four different technologies, and that's the overall end to end. So yeah, when they talk about the big data pipeline or big data system there, the template includes the end-to-end latency.
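One common way to get that end-to-end number is to stamp each record with an ingestion time at the source and compare against the clock when the final stage finishes. A minimal sketch of the idea, with made-up field names; note that in a real pipeline the stamping and measuring hosts differ, so clock skew matters:

```python
import time

# Sketch: the first stage stamps each record at ingestion; the last stage
# computes end-to-end latency when processing completes.
def ingest(payload: dict) -> dict:
    return {"payload": payload, "ingested_at": time.time()}

def on_final_stage_complete(record: dict) -> float:
    end_to_end_s = time.time() - record["ingested_at"]
    # In practice you'd feed this into a histogram metric, not print it.
    print(f"end-to-end latency: {end_to_end_s:.3f}s")
    return end_to_end_s

record = ingest({"doc_id": 123})
# ... Debezium -> Kafka -> Flink -> Elasticsearch would happen here ...
on_final_stage_complete(record)
```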
Yeah. And of course, everything should care about correctness. There's a little section here on collecting indicators, too. Most of the things we've talked about have been server side, and so something like a Prometheus or InfluxDB is going to scrape those and store those, and sometimes get those from logs. But there are also client-side metrics. There are things you can do with a mobile app, or just on a website, where you can collect metrics. And what's important there is you might catch something, some sort of bug or something else going on in the app itself, that makes the customer experience bigger response times or latency than you're seeing in the server-side metrics.
I do want to add, though, related to the correctness: they say that that's typically a property of the data in the system, rather than necessarily the infrastructure.
Right.
So they don't measure that.
That's not something that the SRE is responsible for.
Right.
Like a database.
Like it should work.
Right.
Yeah.
It did make me wonder there, because, have you ever written a query that returned back incorrect results, but it did it fast? I mean, you could select the wrong columns, or have some error in your predicate, right? And so that's an example where the correctness is assumed to be there. If your predicate was wrong, or you're selecting the wrong columns, that's just a bug in the system.
Yeah, in your application code. And remember, we're separating out the product development teams from the SREs in this Google world, right? And so that's why that would be an issue that the product team would be responsible for, not the SRE.
Hey, so do you guys remember?
I don't even know if this is how web pages work anymore.
You don't know?
If you don't know, we're up the creek, man.
So, I mean, the thing is, I haven't done any UI work for the web in a while. But you guys remember back in the day, and I say back in the day as in a long time ago, you were supposed to put all your JavaScript in the head, in the head section of your website, right? And this goes to the client-side latency. At some point, they told you to stop doing that, because it would block the rest of the page. It would wait for all of the JavaScript to be loaded up in the head section before the rest of the page could be rendered. Now, the reason I say I don't know if it's this way anymore is I don't know if Chrome and Firefox and all those have gotten smarter and do things a little differently now. But at some point they said, take that stuff out of the head. For instance, the Google AdSense or ad tracker stuff: if you wanted to track your stuff in Google Analytics, they would tell you to take that script block and put it in the body, at the bottom of the page, so that your page could render first, before it fired off and loaded that JavaScript to let Google know that, hey, there's been a visit to the page.
And this is why they say tracking the client-side latency actually matters. Because in the old days, and like I said, maybe it happens this way today too, with those scripts up in the head, it might be three seconds before your page would render, because it was loading up all this stuff in the head even though it already had the content for the rest of the body. Whereas when you move that stuff out of the head and put it down at the bottom of the body, your page could start rendering immediately. So within, I don't know, 300 milliseconds of the request being made to the server, you can start painting the page, and then that hit at the bottom would be unnoticeable. And so that's what they're talking about. That's why sometimes it's important to go down to the client to find out what's going on, because there may be things happening that you're not even aware of that will require some investigation.
Yeah.
Have you ever gone to a website and you see the content come in, but then it starts moving around and taking shape as it's loading? And that's why: maybe the CSS or the JavaScript to fill in and define what those things were supposed to look like wasn't loaded until later, but you might have already had some Hello World kind of messaging or whatever popping in.
It probably wasn't Hello World, though. It was probably something... an image?
Render the image server side, just send one image, and then you know it's going to look the same.
Or PDF, even better.
There you go. Yeah, get rid of HTML altogether. I like it. That's the Web 3.0 right there. There you go.
So the next section was on aggregation, which was really nice. Typically, you're going to aggregate raw numbers and metrics. An example would be web requests: it doesn't really make sense to say, like, 13; it's a rate. So you would say 100,000 over the last 15 minutes. But aggregations are dangerous, because they can sometimes hide the true behavior. I was thinking about sensor data here. If you're looking at a large window, you might look and say, okay, well, the temperature over the last five minutes has stayed the same. Great. But what it might be hiding is that it may be spiking really badly. So, I said five minutes there: maybe minute one was way high, minute two was way low, minute three... and so it averaged out the same, and maybe you might even have the same median, I don't know. But the resolution of those metrics can hide what's going on and things you might care about. Same with latency.
Well, hold up. So you used temperature. Temperature is kind of hard to equate to a machine-level type thing. But if you had something similar with requests, like what you were just saying, super high and then super low and then super high and super low: the problem is the average might look the same, like you said, but your system's getting taxed way more when those bursts come in. And that's the thing: if it's hidden from you and you don't have metrics to actually look at that thing properly, then you're just saying, hey, the average looks like this, so I don't know why the system's having problems. But in reality, what's happening might be two to three times worse, and it's hidden because of how you're showing your metrics.
Yeah. And they use this example: if you had 200 requests per second on the even-numbered seconds and zero otherwise, compared to a system that had a constant 100 requests per second for every second, they would both average the same, but their burstiness is definitely different. So they're basically calling out the importance of not using averages and instead going after percentiles.
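That even-seconds example is easy to verify. A quick sketch, ours, showing the two streams share a mean but not a peak:

```python
from statistics import mean

# Bursty: 200 requests/s on even seconds, 0 on odd seconds.
bursty = [200 if s % 2 == 0 else 0 for s in range(60)]
# Steady: a constant 100 requests/s.
steady = [100] * 60

print(mean(bursty), mean(steady))  # both 100: the average hides the bursts
print(max(bursty), max(steady))    # 200 vs 100: peak load is double
```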
And this definitely goes back to the conversations we had in the Designing Data-Intensive Applications episodes related to scalability, where they refer to it as P95, P99... what was it? P98? Or, I forget the different ones.
Oh, it's P95 deviations, right?
No, it was P95, P99, and P999. The P marking the percentile, and then the numbers being, like, the 95th percentile or whatever. So basically, how do you know how well your system is doing? If you're going after the 95th percentile, then you're saying, okay, 95% of the time it's in this acceptable range, but 5% of the time it's bad, and when it's bad, it can be really bad, right? And so you use these percentile ranks for whatever your metric might be. In this case, if we're talking about latencies, if you're targeting a 99.99th percentile latency, that would be an extreme one. But let's just say the 99th percentile for the latency: then 1% of the time, the latency might be unacceptable. But 99% of the time, it's fine, right?
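Here's a quick sketch of what those percentile ranks look like over a long-tailed latency sample; the numbers are synthetic, our own example:

```python
import random

random.seed(7)
# Synthetic long-tailed latencies: mostly ~100ms, with a slow 1% tail.
latencies_ms = sorted(
    [random.gauss(100, 15) for _ in range(9900)]
    + [random.uniform(1000, 10000) for _ in range(100)]
)

def percentile(sorted_values, p):
    # Simple nearest-rank percentile, good enough for illustration.
    idx = min(len(sorted_values) - 1, int(p / 100 * len(sorted_values)))
    return sorted_values[idx]

for p in (50, 95, 99, 99.9):
    print(f"p{p}: {percentile(latencies_ms, p):,.0f} ms")  # the tail jumps at p99+
```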
Well, here's something to be careful with on these. And I actually threw in some notes regarding Prometheus and how this stuff works, because I've been messing with this a little bit. So if you do that P99, right, they call them quantiles in Prometheus. But if you do 99%, that means that 99% of your requests all happened within the given amount of time. So if we're talking about latencies, for instance, let's say that your 99th percentile is five seconds. That means 99% of all your requests happened, or were serviced, in five seconds or less. If you go up to 99.99, you just added two additional nines of tracking, and it might jump from five seconds to five minutes, right? So that's the thing you kind of have to wrap your mind around. Typically, when you set up these things, one of the things Google mentioned is they want 95% of their requests for a particular service to be served in 300 milliseconds or less. So they would set up a quantile of 95%, and then hopefully the number they see comes in under 300 milliseconds, because that means they're meeting their service level objective. You might put in the double nines, the quantile of 99, just to see what's happening for your long-tail users. It might be that they're having an absolutely horrible experience, right? It might've jumped from 300 milliseconds to 10 seconds, and they may want to address that. I mean, they may not, but at least it paints the picture. But then you also want to drop down to your 50% quantile to see what the typical requests are doing. It might be that at your quantile of 50%, you're serving most requests, or a lot of requests, in under 100 milliseconds. So I guess the important part is, if you don't go all the way up to 100% in your quantile, you won't see what the absolute worst request was, right? You're only seeing what that slice of the population is hitting.
Right.
And that's what you kind of have to wrap your mind around; it's where this is different than averages.
Well, yeah, definitely different than averages. But I think that's also a difference of the tool, though, too, right? Because in this particular case, you're talking about Prometheus. But you could target the 95th percentile and still show over time, in graph form, that you went over it, right? You could show that you went over that metric. I think the important thing, though, and this goes back to the start of this book, is that in that 99th percentile example you just gave, where you said it went up to five minutes: that's definitely bad.
Super bad.
Nobody's going to argue that, especially if your target is, I think you said, 300 milliseconds in your example, right?
Yeah.
But maybe that's stuff out of your control, right? That user could be on a cell phone in a really bad area. They could be trying to browse your site from the Amazon rainforest, and cell reception isn't so hot there, right? So I guess my point, where I was going as you were describing this, is: that's all fine and dandy, and I agree with it, but don't dare set an alert just because you crossed the line once. You'd want to say, okay, we've crossed it for a period of time, rather than just one occurrence.
Agreed. And that's why the percentiles actually work out in your favor, right? Because when you have that, you should never trigger on one, assuming you have more than two events happening in the system. It should only happen if it starts trending that way over a given amount of time.
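A rough sketch of that trend-not-a-single-sample idea, our own illustration: only fire when the p99 stays over the objective for several consecutive evaluation windows:

```python
from collections import deque

# Hypothetical: alert only when p99 latency exceeds the SLO target for
# N consecutive evaluation windows, never on one bad sample.
SLO_TARGET_MS = 300
WINDOWS_REQUIRED = 5

recent = deque(maxlen=WINDOWS_REQUIRED)

def observe_window(p99_ms: float) -> bool:
    recent.append(p99_ms > SLO_TARGET_MS)
    return len(recent) == WINDOWS_REQUIRED and all(recent)

for p99 in (280, 950, 310, 320, 340, 360, 410):
    if observe_window(p99):  # the lone 950 spike never fires by itself
        print(f"ALERT: p99 {p99}ms over {SLO_TARGET_MS}ms for {WINDOWS_REQUIRED} windows")
```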
Yeah. All right. So where were we? So, studies have shown that users prefer a system with low variance but slower, over a system with high variance but mostly faster. So this kind of goes back to what I was saying before about, you know, if you're 100% correct but you're really slow, people would prefer some response time over the absolute correctness.
Actually, it's the inverse of what you just said. People would prefer consistency. Now, if it's super slow, obviously nobody's going to like it. But people would prefer knowing what they'll get every time they use the service; they'd rather it be relatively decent all the time, as opposed to screaming fast and then sometimes really slow. If there's a road and you can go 35 miles an hour, people are fine with it. But if there's a road that's 45 miles an hour, but every once in a while there's a bus that stops and people have to wait, even though overall it averages higher, people really feel that time it stops. So they're going to complain about it more, they're going to rate it lower, and they're going to say it feels slower, even though on average it ends up being faster.
Yeah, I'm replaying what I said in my mind, and I'm thinking that I need to go back to bed.
When you said it, I was like, yeah, this is totally not... I'm like, I said that out loud?
I like the bus example. That's really good.
Yes. And did you say that at Google they prefer distributions over averages? Just like we said, because they kind of let you get at those long tails. There's better representation of the data. If you tell a data scientist, hey, the average of my numbers is 50, you're not telling them much of anything. If you tell them the median, it's a little bit better. If you can tell them percentiles, then suddenly you have a really great way of describing a set of numbers. There's more overhead, whatever, but it's much better.
You know, my argument against averages has always been: if you want a really easy-to-comprehend version of why averages fail you at times, and anybody can follow this, they don't have to be in computer science, talk about wealth, or just income, right? If you talk about averages, you have some extreme outliers, like a Bezos or a Musk or a Gates or whatever, that will totally throw off an average. And so it's like, I don't know how helpful that is, right?
Yeah, totally. It's like the worst way to describe a set of numbers. But, I mean, it's still helpful, and certainly it's got its uses. In my mind, it's average if that's all you've got, median preferred in almost all cases, and percentiles preferred even more than that, by far.
Yep.
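The income example is easy to demonstrate. A tiny sketch, ours, of how one outlier drags the mean while the median barely moves:

```python
from statistics import mean, median

incomes = [45_000, 52_000, 48_000, 61_000, 55_000, 50_000]
print(mean(incomes), median(incomes))  # ~51,833 vs 51,000: close together

incomes.append(200_000_000_000)  # add one Bezos-scale outlier
print(f"{mean(incomes):,.0f}", median(incomes))  # mean explodes; median barely moves
```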
So the one thing that they say here is, if you don't really understand your distribution of data, it could be a problem, because you might be taking actions that aren't right for the situation. Right. Like Outlaw mentioned: don't you dare alert me if just one thing hopped over this number, right? Don't do that. Well, if you don't have the right distributions in place, you might be restarting systems because you think, oh, the CPU is too high now, right? And it just spiked for one thing. You might be doing things that are more harmful to the situation than helpful. And for those that are into statistics, they're saying, don't assume that the data is normally distributed. So, the bell curve. Yeah, don't assume it's a normal bell curve. It might be a skewed curve. The income example that I gave is an example of a skewed distribution, where the spike is going to be on the left-hand side and you're going to have this very long tail out towards the right-hand side, right? And so you need to understand what your data is, because if you go after it assuming that it's a normally distributed data set, then where the tip of that bell curve is might not be where the median is, in the case of a not normally distributed data set. So it'll throw off all your metrics.
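To make the alerting point concrete, here's a hedged sketch using a made-up log-normal (skewed) latency distribution and an invented threshold: the single worst sample screams, but the windowed percentile shows the distribution as a whole is actually fine:

```python
import numpy as np

rng = np.random.default_rng(42)

# Log-normal latencies: skewed right, like the income example
latencies_ms = rng.lognormal(mean=3.0, sigma=0.8, size=10_000)

print(f"mean:   {np.mean(latencies_ms):6.1f} ms")    # pulled right by the tail
print(f"median: {np.median(latencies_ms):6.1f} ms")  # the typical request
print(f"p99:    {np.percentile(latencies_ms, 99):6.1f} ms")

THRESHOLD_MS = 200  # hypothetical alerting threshold

# Alerting on the single worst sample fires on one-off spikes...
print("worst sample over threshold:", latencies_ms.max() > THRESHOLD_MS)
# ...while alerting on p99 over a window only fires when the
# distribution as a whole has shifted.
print("p99 over threshold:", np.percentile(latencies_ms, 99) > THRESHOLD_MS)
```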
Speaking of being thrown off: as a human, I'm often thrown off when, sometimes, you see a page of graphs or charts where the units aren't the same. So maybe the top one on the right is in days of the week and the one on the left is in minutes. The times don't line up, so that can be a real problem as a human. Especially time zones, I've seen that, where two different charts will have different kinds of time ranges. But, go ahead.
Well, I was going to say, can I go off on a tangent for a minute? Because when we talk about humans and readability and charts and things like that, one thing that super irritates me, and I actually appreciated that it was called out in this book, specifically in this chapter, because it happens so often. We've talked about Grafana here, so let's pick on Grafana for a minute, because Grafana
does this, and it irks me at times. You draw a graph, and maybe the left bottom corner is zero-zero, but also maybe not. Maybe we've scaled the graph, right, and we've super zoomed it in. Or, in the case of this chapter, the Y axis is logarithmic, meaning that the bottom half of the graph might only represent, say, 50 points of data, but the top half of the graph might represent a billion. So it's totally scaled weird, changing the scale of that Y axis as it moves along the graph. And Grafana will do this thing where, depending on what the data is, it will just zoom in altogether, right? And so your bottom left corner, instead of being 0-0, might be like 98, right?
And so you'll see these large jumps in your graph, in the chart, right?
And you'll think like, oh, the world's on fire.
Like, look what just happened.
Like, look how steep that climb is.
And you're like, oh, wait a minute.
The axis is like, it's super zoomed in.
Yeah, it's showing two different points, right? Like 98 to 99 now, instead of 0 through 99. Or maybe even like 98 and 98.005. And yet it looks like the world just caught on fire because of the way it zoomed in.
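You can reproduce that effect in a few lines; this is just an illustrative sketch with made-up values, not a Grafana configuration. The same near-flat data looks like a cliff when the y-axis is auto-zoomed and looks calm when pinned at zero:

```python
import matplotlib.pyplot as plt

# Hypothetical metric hovering between 98 and 98.005
values = [98.001, 98.002, 98.001, 98.003, 98.002, 98.005, 98.001]

fig, (ax_zoomed, ax_honest) = plt.subplots(1, 2, figsize=(8, 3))

ax_zoomed.plot(values)           # auto-scaled: tiny wiggles look like cliffs
ax_zoomed.set_title("auto-zoomed y-axis")

ax_honest.plot(values)
ax_honest.set_ylim(0, 100)       # pinned at zero: the same data looks flat
ax_honest.set_title("y-axis starting at 0")

plt.tight_layout()
plt.show()
```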
And so that's why, I get it, I look at these graphs and charts so often and I'm like, wait, what? And then I have to go back in like, hold on, what are the axes here?
Have you ever seen it where sometimes the charts have defined inputs and stuff? So maybe you'll have a thing in the top right where you can shrink the time range to say, like, show me the last three hours, right? And all the charts on the page that take that input will adjust. But maybe one of them still shows, like, the total count per day or something. And so it wasn't set up with that input, and there's not a good way to see that it's not respecting that field.
The woes of Grafana.
Yeah.
Yeah, it's really user error, right? It's setting up the charts properly. But still, it'll freak you out, like what you said. Yeah, it's definitely not cool. Why is this bad? Why does this look like this? It's definitely not a problem with the tool; it's a problem with the tool using it. Yes, totally. Exactly.
Also, another good point: you can imagine if Google had just one dashboard, so that the CEO or whoever could just log in and be like, how are we doing on our SLOs today? And you can see the different teams use different measures. So maybe the search team is like, hey, we've got requests per second. But then the office stuff is like, well, this is our email delivery rate per minute. And the next one is like, this is our uptime per hour. It just makes things difficult to read as a human. So the more you can keep those units the same, and standardized, the better.
Yep. Oh, that was actually leading into
this last little bit here, which was the standardization. They almost don't even describe it when they're setting up new services, because they do have a standard set of things that they measure on every single service. Right. So those are almost assumed. And the primary reason for that is so you don't have to convince or describe that same thing to everybody. Every time you set up a new service, these are the SLOs that the service has to meet, done, right? Everybody is on the same page already.
So, you know, that's helpful for both the business and the SREs. And you imagine having a dashboard and it's like, well, this one's latency, but this one is latency-plus-plus, and latency-plus-plus also measures... well, I can't compare these two now, right? Right.
This episode is sponsored by Shortcut. Have you ever really been happy with
your project management tool? Most are either too simple for a growing engineering team to manage everything, or too complex for anyone to want to use them without constant prodding. Shortcut is different, though, because it's better. Shortcut is project management built specifically for software teams, and it's fast, intuitive, flexible, and powerful. Let's look at some of their highlights. Team-based workflows. Individual teams can use
Shortcut's default workflows or customize them to match the way they work. Organization-wide goals
and roadmaps. The work in these workflows is automatically tied
into larger company goals. It takes one click to move from a roadmap to a team's work to individual
updates and vice versa. Tight VCS integrations. Whether you use GitHub, GitLab, or Bitbucket,
Shortcut ties directly to them so you can update progress from the command line.
And a keyboard-friendly interface. The rest
of Shortcut is just as keyboard-friendly with their power bar, allowing you to do virtually
anything without touching your mouse. Iterations planning. Set weekly priorities and then let
Shortcut run the schedule for you with accompanying burndown charts and other reporting. So give it a
try at shortcut.com slash coding blocks. Again, that's shortcut.com slash coding blocks.
Shortcut.
Because you shouldn't have to project manage your project management.
All right, who's doing the beg here?
Want me to do it?
Well, I think Alan did such a great job last time.
Well, didn't you do a funny voice last time?
What was it?
Oh, you did like a Johnny Cash voice last time or something.
I think I did. I don't remember. We also didn't get any reviews last time. Yeah, we did. So maybe we don't let Alan do it. Yeah, no, I think maybe it should be Jay-Z this time. Yeah, I think he needs to beg.
All right. Hey, y'all. I would like to ask you for reviews, because we're doing really bad on reviews lately.
We haven't been getting any.
We haven't been getting many.
And yeah, I mean, even if you got a bad one,
just let us have it.
We're so desperate.
Whoa, whoa.
Who let this guy talk?
Sorry, this is why I don't use it. So I'll take over from here.
If you haven't left us a review,
we would greatly appreciate it
if you would leave us a review,
especially a positive review.
But if you do want to leave a negative review, hit up Joe on the Slack. He apparently
likes that. You can find some helpful links at www.codingblocks.net slash review. And also,
I don't know how long we're going to keep saying this, reminding people of this, but I guess we'll
continue. But apparently, this is a thing in Spotify too. So you can, I guess it's just like a thumbs up or something? Or no, it's like a star or a plus or something, I forget. You see how often I use Spotify. I'm like the one out of 10 people that don't use Spotify.
Yeah, that's crazy. Maybe that should be a survey: do you use Spotify? And everybody's gonna be like, yeah, duh. Yes. It's just like sock, sock, shoe, shoe. And I'll be like, get off my lawn.
Yeah.
Well, I mean, sock, sock, shoe, shoe.
Come on.
Right.
Everybody does that.
That's crazy.
Sock, shoe, sock, shoe.
We already established this.
We had a poll.
That's right.
Yeah.
And it was sock, shoe, sock, shoe, right? Okay.
I still think about that every time I put my shoes on.
I'm like, I'm still baffled that that was like such a thing.
I never would have guessed it.
It never would have dawned on me that it was like any kind of controversial statement.
All right.
The little things that we take for granted in life are sometimes funny, right?
All right.
Well, we move on to my favorite
portion of the show: Survey Says. All right. So a few episodes back, we asked: how awesome was Game January? And your choices were: I learned so much. Or, I forgot how much time I need to play other people's games. Or, I thought my game was good, but oh my, some of these are pro-fesh. Or, I now know that I want to be a game developer. Followed by, I now know that I do not want to be a game developer.
All right, this is episode 183. Alan, you're up first according to Tucko's trademark rules of engagement.
Yeah, I'm going to say, I thought my game was good, but oh my, some of these are so profesh, because that would have been my takeaway. I'm gonna go with 30 percent.
All right. Well, I'm gonna go with 31 percent.
Oh man. Because you felt the same way, huh?
Well, I didn't say which answer, though.
Oh, that was not assumed?
Yeah, I was just trying to be funny. It totally didn't work. But I'll stick with it, though. Let's go with the profesh.
Yeah, you're both wrong.
Really? What was it? I learned so much?
I now know that I do not want to be a game developer. 75 percent of the vote.
Oh, wow. Oh my gosh. All right. Yeah, awesome. I thought it was pretty funny. So,
you know, we've been talking about measurements and everything, though, and it made me think, because here in America we use what's either referred to as the standard or imperial system. Do you remember... well, we wouldn't remember it, it was technically before our time, but in the history books you might remember having come across it: there was a time where America was trying to switch to metric, back in like the 60s, I think it was, or something like that. It was either the 60s or 70s. They literally did put in various legislatures, or whatever, a concerted effort: we're switching America to the metric system to be like the rest of the world, right? But they didn't. It failed miserably. And it made me think, you know, Americans can't switch from pounds to kilograms overnight. It just can't happen. It would cause mass confusion.
Ah. Excellent long lead-up, but poor execution. But whatever. It's morning.
It's morning.
Yeah, I'll take it.
So how about this?
We're talking about all these metrics and how to identify these things in SLIs and SLOs.
So for this episode survey, SLIs and SLOs sound awesome, but does your team use them?
And your choices are, of course,
how else would we track our error budget? Or, I mean, they sound great, but yeah, we don't have
those. Or, oh, wow, we have so many slow parts. Oh, it's an acronym. Never mind. Or, we're on
our way and it's looking promising. That's for the optimistic people out there.
Or I'm convinced and we'll implement them in the near future.
And that one's for the procrastinators.
Oh, man.
This is a...
I can almost guarantee you I know what it won't be.
I'm so curious, but I don't want to like, I mean, I really want to know.
I mean, do you have an error budget, sir?
I just want to know.
Me personally?
Sure.
Sure.
Right.
Right.
All right.
Well, let's get back into this. Just a quick reminder, though, if you want a chance to get a copy of the book... well, Google made it freely available on the web, probably so that they can keep metrics and track who's reading the book and how often it's being read and things like that. But that said, I did notice this week there was actually an update available to the book, on the online version. You just, you know, you get it. Well, it was kind of nice.
So, objectives in practice.
Yeah.
I don't know that I like this one.
No, really?
No.
Find out what the users care about.
Not what you can measure.
That's so much harder.
Well, it is.
But I mean, this is similar to what I was describing earlier, where it's like, you know, everything that you can measure under the sun isn't necessarily a thing that matters.
And so this is kind of flipping it on how do you define what matters?
Well, what do your users care about?
What's the user experience?
And let's start with that.
And I'm totally kidding, right?
It should absolutely be driven by what the users care about, because they even say, just because it's easy to measure doesn't mean it's useful to your SLO at all, right? It doesn't mean anything. Yeah, right. If you have a static website, let's keep it in a Kubernetes world, a container world. That website isn't changing until you do a deploy, so why do you need to know how many free inodes you have available on that system? Like, who cares? Doesn't matter.
Yeah. Now, I like the next section, on defining objectives.
SLOs should define how they're measured and what conditions make them valid. And so here's an example of a good SLO definition: 99% of RPC calls, averaged over one minute, return in under 100 milliseconds, as measured across all back-end servers. That's fantastic. Yeah, and it's up to you to type that into Prometheus or Grafana or whatever; you can kind of define these things.
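As a rough sketch of checking that kind of definition in code (the latency numbers are invented, and the metric name in the PromQL comment is an assumption, not something from the book):

```python
import numpy as np

def slo_met(latencies_ms, percentile=99, threshold_ms=100):
    # True if <percentile>% of the sampled RPC calls returned
    # within threshold_ms, matching the SLO definition above
    return np.percentile(latencies_ms, percentile) <= threshold_ms

# One minute's worth of hypothetical RPC latencies, in milliseconds
window = [12, 35, 48, 80, 95, 99, 150]
print(slo_met(window))  # False: the 99th percentile lands above 100 ms

# Roughly the same idea in PromQL, assuming a histogram metric
# named rpc_latency_seconds exists:
#   histogram_quantile(0.99,
#     sum(rate(rpc_latency_seconds_bucket[1m])) by (le)) <= 0.1
```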
But that tells you so much. So how many times have you seen something that's just like, latency: five?
Like, well, wait.
What does that mean?
Is that good?
I'm confused.
I thought we said the averages were bad.
Oh, no.
All right.
My head just exploded.
That's right.
But hey, at least you know what it is.
Yeah.
And so, you know, as we mentioned before, it's unrealistic to have your service level objectives be 100%. Yeah, striving for 100% takes too long, it's expensive, and it's just not worth it.
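One way to see why: the error budget is just 100% minus the SLO, and it shrinks brutally fast. Quick back-of-the-envelope math:

```python
# Allowed downtime per 30-day month for a given availability SLO
MINUTES_PER_MONTH = 30 * 24 * 60  # 43,200

for slo in (0.99, 0.999, 0.9999):
    budget_minutes = (1 - slo) * MINUTES_PER_MONTH
    print(f"{slo:.2%} SLO -> {budget_minutes:7.1f} minutes of downtime budget")

# 99.00% SLO ->   432.0 minutes
# 99.90% SLO ->    43.2 minutes
# 99.99% SLO ->     4.3 minutes
```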
the start of the show or early in the show about oh hey this was the section teasing yeah this was
the this was the section where like it made a reference to something else so we have previously talked about error budgets i believe in the last episode if i recall and so this they they made a
point of saying an error budget is just an slo for meeting other slos and that was the thing that was
the like one thing referring back to the other thing that was like wait was it this chapter it
was that one okay so it wasn't measuring measurements it was observing observability
it was some it was some sort of recursive call to itself yeah all right so we should probably
google recursion and you know it'll all go back i'll go well uh here's another section i like because
this is instantly what I want to do. I know it's wrong, but they talk about, when you're choosing targets, one thing you should not do, and this is something I desperately want to do, which is to choose SLO targets based on current performance. And what I mean by that is, if I'm setting up metrics for the first time on a system that's already existing, and I'm trying to figure out what the numbers should be, the first dang thing I want to do is take a look at what they are now. And that is such the wrong answer, right?
Yeah. Well, because, for example, let's go back to that web server in Kubernetes example. If you were to say, okay, I just spun up my Apache or Nginx instance on this pod.
How many times can I hit a static index HTML file?
Maybe it's even a default one or whatever.
And whatever that number is, that's the metric
for like how well I want my web server to perform. But in that example, like that's not even relevant
to what you're doing because that's just a static webpage. Whereas your other one's really dynamic,
has a bunch of API calls to make. There's authentication to deal with things like that.
So like, you're not really comparing apples to apples. So who
cares what its maximum was in this one particular scenario? What's really more important is like,
well, what's realistic. And, you know, those high, those crazy high numbers might not even
matter to the user going back to Google's point from the beginning of this book, right?
They might not even matter. So let's come up with something that's, what am I trying to say here, more representative of what
the users are going to care about. Yeah. And that's definitely what I'm thinking too, is,
uh, it's about what your users expect and what your users want, not about what you have now,
but I guess you can make the argument and be like, nobody complained last week, so let me see, here's our average from sometime last week, I guess it's fine. I mean, it's a shortcut, but it's coming at it from the wrong direction.
You... you three, or two? I can't count this morning. I'll average that out and it'll come out to, like, you one and a half. Ah, dang, it wouldn't even be that. You won. So I'm going nowhere fast.
You might recall, do you remember way back in the day, we had a server environment where we were trying to decide, okay, well, what do we want to be able to serve, and how many servers do we need, blah, blah, blah? And we ended up with a really, really high number. Do you remember that?
I remember. I totally remember this.
And at the time, we were trying to say, okay, how many concurrent users do we want to be able to
maintain on a given server, right? And so we had this formula of, okay, here's the average think time that a person's going to stay on a given page, here's the pages per second, you know, divided by the CPU, times CPU, blah, blah, blah, equals concurrent users. And we ended up with, like, many more web servers than we needed.
A few.
By orders of magnitude. Yeah.
I mean, that makes it sound way worse. But, yeah.
It was bad. It was a lot. Yeah.
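The general shape of that kind of capacity math is a Little's-law-style estimate: concurrent users divided by think time plus response time gives a request rate. This isn't the exact formula from the story, just a sketch of how sensitive the answer is to the inputs:

```python
def servers_needed(concurrent_users, think_time_s, response_time_s,
                   requests_per_server_per_s):
    # Little's law: arrival rate = users / (think time + response time)
    requests_per_s = concurrent_users / (think_time_s + response_time_s)
    return requests_per_s / requests_per_server_per_s

# Hypothetical inputs: overestimating load and underestimating capacity
print(servers_needed(10_000, think_time_s=5, response_time_s=1,
                     requests_per_server_per_s=50))   # ~33 servers

# More realistic think time and per-server throughput
print(servers_needed(10_000, think_time_s=30, response_time_s=0.5,
                     requests_per_server_per_s=200))  # ~1.6 servers
```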
But this portion made me think back to that. That was an example of, you know, we at least tried, though, right? We had a metric that we wanted to start with, and we used that metric to try to define how to build out from there, but then ended up overshooting and had to go back and revisit things. So I guess the point there being, even with these SLOs, you're going to try to come up with a target number, rather than what the system is capable of, necessarily. But it's also okay to re-evaluate as time goes by, right?
Yeah, and having
a calculator is fantastic, because you can go back and adjust those numbers. So I'll take a calculator that gets it way wrong any day over a "well, 10 is too much, let's try four." Right, exactly. Well, that kind of leads into the next thing that they said, which is keep your SLOs simple, right?
If you make it too complex, then it's hard for people to even understand.
And when there are changes made in the system, it might be difficult to even see what that impact was on what your original SLO was anyway.
So, you know, the simpler, the better.
And avoid absolutes. I like this one, too. And I'm sure we've heard this with Kafka and other things too, right?
You can scale indefinitely, right?
Like this thing will handle everything.
Don't say that, right?
Like, because as soon as you say that you're going to hit a tipping point to where it doesn't
scale indefinitely without a ton of work, right?
And making those statements means that you're going to be spending a ton of time
trying to make sure that the thing will do what you tried to promise up front.
I like this too. They say to have as few SLOs as possible. You want to have just enough to ensure that you can track the status of your system, and they should be defendable. And I think that's really cool too. It's like, take away until you can't take away anymore.
Yeah, I mean, this goes back to kind of what I was thinking before with the Grafana thing. You can definitely find some easy-to-start-with dashboards out there for a given system that are totally generic, right? Like, metrics for a Postgres or a
Kafka or Zookeeper or whatever, right? And, you know, they know obviously nothing about what your
business needs are in your application. So all of those metrics are super generic and not helpful.
And, you know, if you were to use those as your starting point, definitely start whittling away
at it, and the things that you don't care about, get rid of them, because you don't want to have more things in your face than you need. Have you ever been in a situation where maybe you thought you had something really nice together that had a bunch of metrics, like, oh, I can know exactly what the health of the system is doing, right? And maybe some things are red and on fire, you know, alarming, but you've trained yourself like, okay, well, I mean, I kind of care about it, but it's not the end of the world or whatever. And then your boss's boss's boss's boss happens to stop by and is like, hey, what have you got there? And you're like, oh, I can see this, I can monitor the whole thing. He's like, oh my God, why is it on fire? And you're like, oh, well, that one doesn't matter. And then his immediate reaction is like, why is it there?
Yeah. And you're like, well, because I want to get to it eventually.
You just said it doesn't matter.
Yeah, it kind of doesn't. Then you're not going to ever get to it, because I'm never going to let you get to it, right? Yeah. Good point.
They also say here, perfection can wait, right?
It's basically what we just said with the web server thing, right? We started out with this crazy high number. Well, that wasn't right. Refine it, right? Trim down the numbers. What arguments did we get wrong here? Let's fix those and try again. Yeah, don't shoot for perfection, man. I've heard it so many times, and I actually like the statement: perfection gets in the way of good enough. And good enough is usually what you want and all you need. So, you know, go that way.
And this, I actually liked a whole lot too.
The SLO should be a major driver for what the SREs are
actually working on because that SLO defines what the users care about. So if the users care about
it, then you should make sure that you're meeting those users' needs. And so that should be what the
SRE is focusing on. So said another way, let's think back to the purpose of this book and what
the SRE title was, right?
This means that these are a group of people who aren't necessarily going after new features for the product or the service, and instead are saying, oh, I see this thing trending toward what could become bad for us in a dashboard or whatever, and I'm going to go ahead and get ahead of it. I'm going to put in a fix to address that before it becomes a problem, right?
Right. Or even automate the fix, right? Which is what we talked about earlier on with this. That's the whole goal.
Hey, real quick, before we go on to the next section:
So I went looking for the S3 SLA because we were talking about that earlier
and I was curious how they define it. They don't talk about durability at all in the SLA. The SLA
is only for uptime. So I think around their SLOs, they may have durability, but I'll put a link here
in the page. It was kind of interesting to me.
I'm going to stick it down here.
Well, S3 definitely, I mean, it might not be part of their SLA,
but they talk about a durability of like,
I can't even count how many nines it is.
It's like five nines or something.
No, it's like more than that.
11.
Yeah.
The durability is 11 nines.
Okay.
But that's what was interesting to me in their SLA: the only thing that has a consequence is if their uptime is down. I put it down there in the resources we like, if you guys want to check it out. But yeah, they actually have monetary returns, right? So they have a service credit percentage. If the uptime goes down below 99% but stays above 98%, then you get a 10% credit.
If it goes lower than 98 but above 95, you get a 25% credit.
And if it's lower than 95%, then you get all your money back.
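Those tiers translate directly into a tiny lookup. A sketch of the credit schedule as described above (check the actual AWS SLA for current terms):

```python
def s3_service_credit_percent(monthly_uptime_percent):
    # Tiers as described in the S3 SLA discussion above
    if monthly_uptime_percent >= 99.0:
        return 0    # SLA met, no credit
    if monthly_uptime_percent >= 98.0:
        return 10   # below 99% but at or above 98%
    if monthly_uptime_percent >= 95.0:
        return 25   # below 98% but at or above 95%
    return 100      # below 95%: all your money back

print(s3_service_credit_percent(99.5))  # 0
print(s3_service_credit_percent(98.2))  # 10
print(s3_service_credit_percent(94.0))  # 100
```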
So it's pretty interesting.
But everything that I looked at on this page has nothing to do with durability.
It's all about the service being
available when called by the end application or whatever. Yeah. And I just saw this. I Googled
that because I just had read about the durability. And so S3 does claim 11 nines of durability,
but it's not part of the SLA, just like you said. Yeah. And that's kind of why I wanted to bring it up is because, again, we mentioned earlier, the key differentiator between an SLA and an SLO is the consequences of it going down, right?
And the only consequences here is if the uptime is not what they claim it to be.
So it's interesting, I mean, how people define these things or companies define them.
Just to close the loop on where I was going with the SRE stuff we were working on. I mean, this goes back to the maturity level of the type of company that you're going to work for, as to whether or not this is going to work for your team. Because it's definitely a certain level of maturity for a company to be able to afford to have a team focused only on SRE-type initiatives, right? And not focused on new features and whatnot for the product.
Yeah, totally. And the next section was on control measures, which is
basically the kinds of knobs that you can tweak in order to fix things when they go bad. So imagine you're monitoring your system's SLIs. You've constructed SLOs over those SLIs to know when things are going wrong, when you have a problem. And then, when that service level objective is out of compliance,
basically if there's an alert, something's wrong, you need to take action.
Then control measures are the actions that you take.
So, for example, if you see latency going up,
and so your SLO is kind of in violation,
and you've got an alert glowing, blaring,
then you can go and see that maybe your CPU is too high.
So then you can go increase your CPU capacity
and you should see that latency go down,
assuming that's actually what's going on there.
And so it was just kind of cool to talk about that.
And I think that ties in with the playbooks.
So knowing what to do about these things
when those service level objectives are going wrong.
But of course, you can't write out every permutation of what's going on,
but even just knowing that CPU is one of the things you can tweak and how to
do it is a good thing.
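A hedged sketch of that control loop, where all three hooks are hypothetical stand-ins for your own monitoring and orchestration stack, just to show the SLI to SLO to control-measure shape:

```python
def check_and_act(get_p99_latency_ms, get_cpu_percent, scale_up):
    # Toy control loop: SLI -> SLO check -> control measure.
    # All three callables are hypothetical hooks into your own stack.
    SLO_LATENCY_MS = 100  # the objective set over the latency SLI

    if get_p99_latency_ms() > SLO_LATENCY_MS:   # SLO out of compliance
        if get_cpu_percent() > 80:              # diagnose a likely cause
            scale_up()                          # turn the knob we have
        else:
            print("latency SLO violated but CPU looks fine; dig deeper")

# Wired up with fake readings for illustration
check_and_act(lambda: 140, lambda: 92, lambda: print("adding capacity"))
```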
Also remember, though, that Google specifically, at the start of this book, said that rather than having a playbook of, hey, here's how you fix it, they prefer to automate that process so that it can fix itself, right? I forget how they referred to it. It was like, they didn't want automated, they wanted automatic. Or do you remember the phrasing that they used for that?
Yeah, it was the automatic over automated, I think. I forget.
Yeah, I think that's what it was. And I think that's the goal of the SRE, right? To automate the things that can be
actionable based off some sort of playbook. If you have a playbook for it, then you should be able to automate it, more or less, right? I mean, Kubernetes is a great example of this. Think back to pre-Kubernetes days, right? Go back to the example that I gave earlier about think time and pages per second and whatnot, where you're trying to decide, hey, how many web servers do we want for our application? You had to think about that. Then you'd be watching your traffic load and go, oh hey, we don't have enough. And so you as a person had to go watch that metric and be like, oh, you know what, based on CPU utilization or I/O or whatever, I think I need to add in another web server. And then you would have to go in and handle the provisioning of that yourself, and the configuration and whatnot. Now, with Kubernetes, for example, you can just say, hey, here's the health metric to watch, and if this resource gets above, like, 90% CPU, go ahead and spin up another pod for it.
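For what it's worth, the core calculation the Kubernetes Horizontal Pod Autoscaler documents is roughly proportional scaling; here's a simplified sketch (the real controller also applies tolerances, stabilization windows, and min/max replica bounds):

```python
import math

def desired_replicas(current_replicas, current_cpu_percent, target_cpu_percent):
    # The HPA's core rule: scale replicas in proportion to how far the
    # observed metric is from the target
    return math.ceil(current_replicas * current_cpu_percent / target_cpu_percent)

print(desired_replicas(3, current_cpu_percent=95, target_cpu_percent=90))  # 4
print(desired_replicas(3, current_cpu_percent=45, target_cpu_percent=90))  # 2
```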
Right. And that's an example of something that can now be automated, that I'm sure some SRE at Google was tasked with figuring out, or probably responsible for implementing and bringing to the world.
Yeah, Kubernetes is the best until it's not. And then it's... yeah, it's awesome. There's room for it to be simplified, for sure, but right now I wouldn't want to work somewhere that didn't have Kubernetes. You know, unless you've got a three-tier system with literally one server for each tier. I can't imagine.
I had this thought on it, though. Another tangent, tangent alert. Because the one downside that I thought about this week with Kubernetes is that everybody has to become an expert at every layer of networking and security, and how to deal with firewalls or whatever. It's no longer just as simple as "it works on my machine." You're spinning up a cluster, a server farm, everywhere you go, every time. It's got its goods and its bads, you know.
It's all good. What are you talking about? Oh, you don't want to learn about networking?
I totally messed up that part then. Yeah, let me rephrase this. I'll figure it out. I totally agree with you, you're right. I love it, but it is complex. There's so much. And in fact, there was that thing that, well, I guess we'll share it with the rest of the world too, but Jay-Z shared this picture of the Kubernetes glacier, right? It's like all the things I don't know about Kubernetes, and how depressing it is to be like, oh, look at how much stuff is really in there. Yeah, it gets deep.
So, all right. Well, going back to what we said before about publishing the SLOs: these SLOs set expectations. And so it's great for teams to be able to publish
those so that other users can know what to expect.
So going back to the Chubby example, which, by the way, can we admit that's a horrible service name?
Yeah. These guys.
So, yeah, because you don't want people to become too reliant on it. You want to set some kind of expectation of, hey, it's okay if we're down every now and then for maintenance or whatever. It's great that our service is good enough to where we only have to take it down when we want to, but don't be so reliant on it.
Hey, so this next part is what I was mentioning earlier, and I forgot it was at the end of this chapter: they were talking about having a safety margin. One approach to setting these expectations is you have your internal SLOs, right? Like, hey, I want all of my requests for my service to come back in 200 milliseconds. But for the external, customer-facing SLO, we're going to publish that they'll come back in 300 milliseconds, right?
So you've got a buffer of 100 milliseconds there. And they were talking about if you do this,
then this kind of protects you because you can aim for your internal targets.
And that way you're always pleasing your external customers. So that's one way to do this.
And then the other is, don't overachieve, right? They talked about this early on: you are not trying to get consistently better than your SLO, because then people come to rely on that. Like the Chubby service that never went down, but when it did, there were outages everywhere
because people thought it was there. And so they actually said that you should consider doing failure injection,
which is a thing that they have there to where they actually introduced
downtime on purpose X amount of time throughout a quarter.
So that seems weird to me,
but I get it.
But this isn't failure injection as in like the chaos monkey from Netflix.
No, that's totally different.
This was just like, I almost hate to call it like failure injection.
It's just like we're going to, we're introducing forced downtime.
Right?
Yeah.
That's all it is.
It's not necessarily a failure.
The system didn't crash, but we're, we're bringing it down for one reason or another.
Yeah. It's going to be unavailable.
And it may not even be planned, right?
Like we don't want people to know that, hey, we're taking the system down tomorrow night at midnight.
No, it's just it's going to be down at some point and we're going to take it down so that people will understand that our reliability is what we published.
So it's interesting.
It's weird, but it sort of makes sense.
The failure injection might be on the other teams that have the dependency, right? Right. You know? So, yeah, they need to make sure that they can work around that. And so it'll call out the issues in their system, in the case of the Chubby example.
Yep.
So, agreements in practice. We've pretty much covered SLIs and SLOs pretty good, but we've only kind of scratched the surface with the SLAs, right? But that's also fair, because
the SRE's role is not to set the SLA. That's a business decision. Going back to the consequences
that Alan gave with the
Amazon
and the S3 and the cost there.
Us as
developers, we're not going to say,
hey, guess what we're going to do?
If this service goes down that I wrote, I'm going to give you
back this amount of money. We're not going to
do that. That's not for us to decide.
If you could, how much would you give back?
Uh, zero. Zero? Wow, Joe's greedy. Jay-Z's grinching it. That's awesome.
But the SRE's role: they may not be defining those, but they are supposed to at least inform the business about how difficult it's going to be to meet those SLOs or the SLAs that are being put up there. Right? Like, if somebody says, oh yeah, we're going to guarantee an SLA of 99.9999% uptime, the SREs are going to go, dude, you're crazy, we can't make that happen, right? And so that's where they kind of
come into play.
So they get to call people crazy all the time? Yeah, exactly. We could have summed up this book a lot easier: SRE is getting to call people crazy. That's it.
I feel like, sorry, maybe this should be my title, though. Like, you know, there were managers talking about the planned outage, and they were just like, wait, you're going to take this down, mess with my stuff, give me a bunch of extra work? Like, I've got to figure out how to squeeze this stuff in with all my normal goals, because that might happen, right? So maybe, right, get out of here, you crazy person. It's crazy. So people think the SREs are crazy too. Yeah.
Well, just because you call me crazy, Alan, doesn't mean you get to be an SRE.
There's more to it than that.
Wait, that's not what that is? I could have sworn that's how you got the title.
I'm not even saying that you're not within reason to say that I'm crazy, but, you know, there's more to the job than that.
That was awesome. Can you imagine if Google wrote a book like, How to Be an SRE: first, call Michael crazy? And like, I was specifically called out. That was the end of the book. End of chapter one, here's the credits, thank you, and here's the copyright date. You've arrived.
All right, so the last few that we've got here: you should be conservative with the SLOs and the SLAs that you make publicly available, because otherwise you're setting yourself up for pain.
This goes back to your buffer, your SLO buffer.
Yep. Buffer, and also not trying to make it so unsustainably reliable that you can't work on anything else. That's really what it is. Or, to put it in their terminology, the safety margin. But yeah.
So the big thing that they call out here is you want them to be conservative, because as soon as you make that stuff public, it's hard to backtrack, right? If you publish your SLOs and your SLAs, then people are going to hold you to it. And if you slip on that, it's going to be a problem.
So, we talked about this earlier.
They called it out explicitly here in the footnotes of this chapter.
SLA is typically misused when talking about an SLO.
Basically, if there's an SLA breach, it probably triggers a court case.
Unless it's something like these
credits from Amazon, where they're talking about, hey, we were down X percent, then you're
just going to see it back on your bill. Otherwise, if it's something big, then you're probably going
to see something go to court. And then I think Outlaw mentioned this earlier. If you can't win
a particular argument or if you need to justify the SLO that you have and it's not like
really gaining any traction, then it's probably not worth having, because you don't want your SRE team to work on it. So if your boss comes by and is like, hey, why is that graph so bad? You're like, oh, take it off, right? It doesn't matter. Get rid of it.
Yeah. Do you remember, have you ever been in an environment where the development team had literally a stoplight, where you could see, like, oh, the builds are green, the builds are red? You know, I don't know why the light would ever be yellow. But imagine somebody puts in a commit, you know, and it broke the build. And so it's red, and your boss's boss's boss's boss is like, hey, why is it red? And you're like, oh, there's a bug, we're going to fix it. And he's like, oh my God, where's the fire? And you're like, do you know how many bugs we fix every day?
I mean, it was definitely a different world, though, that I'm kind of thinking of. Because now I can't imagine not having a PR gate that would prevent the build from ever getting broken like that, you know.
Oh, I remember having arguments about that stuff. You remember, people were like, it'd be too slow to merge code. And it's like, dude, I don't want you to merge code that's broken. Yeah, but I don't have time for that right now, man. I just need to get this in there and be done with it. And it's like, but you're not done with it, it's broken. Yeah, but it's fine for now. But it's not fine for now, because I still need to do my job. Yeah.
I like PR gates, even if they add five minutes to an approval. Well, get ready for an hour. Oh, wait, sorry. Okay, so now we go full circle back to my monolithic build complaint.
Right. All right.
Hey, Mergify. Mergify. Yeah, there you go.
So, yeah, we'll have some links there in the resources we like. Obviously, this book will be in there. There might be a link to the Kubernetes glacier. You'll probably cry when you see it, especially when you realize how far you have to scroll to see all of it. And that's okay. Leave a comment. Maybe we'll send you a box of tissues to help wipe away the tears. Kubernetes-sponsored Kleenex.
All right. Well, how about I ask you this,
though? How does Darth Vader like his toast?
On the dark side.
It's got to be on the dark side.
Good. All right. Well, with that, we head into Alan's favorite portion of the show. It's the tip of the week.
So it's usually my favorite portion of the show,
except for when I can't think of anything.
So I've actually got two; I'll probably type the other one in here in a second. So the first one is, we kind of forced somebody onto a Mac recently, and this is a, you know, 20-, 30-plus-year user of Windows. He complained nonstop, like, why did they switch the command and the control button? And I'm like, dude, you've got to realize Mac has been around for a long time too. They didn't switch it. They didn't just decide, oh, we don't like Windows and we're going to do this. This is how they've been doing it forever.
It didn't matter.
Yeah. Since the beginning.
So, you know, where Windows had control, Mac had command and control, which is really confusing. And then Windows decided to add a Windows button a few years back.
So whatever the key is, if you find yourself switching from something like Windows to a Mac and you're finding Control-C to be hyper frustrating, there are software drivers that you can download a lot of times for your keyboard, for a specific OS. Like in this example, I think he likes the Natural Keyboard 4000; he's got a stack of them, right?
Well, you can download software for the Mac to basically remap the keys.
So the problem is there is a feature in Mac OS to say, hey, I want to swap my command and my control key.
But the problem is a lot of the software that you use is going to map Command-C to control or whatever.
So even though you try and do it at the OS level, your software is going to overtake it at some point.
So if you do it at the keyboard level, then now you can stick with your Control-C or whatever, and it'll make your life easier.
So I highly recommend, if you find yourself in that situation and find it very frustrating, go look that up. A lot of times there are Mac software downloads for your keyboards.
Just embrace change, that's what I said. I was like, dude, it takes like a week for you to mentally remap your thumb from one key to the other. And he wasn't having it. And I was like, all right, whatever, I don't want to hear about it anymore. I don't know what to do.
Yeah. But, so, an example of that: I'm using this Kinesis gaming Freestyle keyboard, right? And one of the annoying things to me, I don't know why they made this decision, I think it's ridiculous: the keyboard has an option so that you can put it into, quote,
gaming mode. And that way, when you're in like your full screen game and, you know, if you were to accidentally hit the Windows key, it'll just ignore it, right? It doesn't like pop you out of
the game, right? Because especially if you're in a multi-level, sorry, a multiplayer game, and you're in an epic battle, like, you're in the fight, right? And you accidentally hit that key. You don't want to be taken out of it, right? But for this keyboard, one of the options that you can buy is the Mac keys. It swaps out the size and position of the keys so that it's more Mac-like versus the Windows experience, right? So it's not just a reprint of the labels; it actually changes the size of the keys as well. And then in
the software, you can tell it, hey, now I'm using the Mac switches, right? Awesome. But for the love of God, I don't know why, they disable the gaming feature.
Well, because people don't game on Macs. Come on, man.
That's probably what their rationale was. Like, oh, you're only going to use this keyboard on one computer.
And I'm like, you know, 99% of the time I'm on the Mac, but when I am on my PC, I like to play games. And so, yeah, it bothers me that I can't have the Mac keys and have that gaming feature.
Well, they've got to figure, if you're trying to go over to the Mac, you're not gaming. And if you're going to go through the trouble of replacing the keys to be on the Mac, then you're not going to be gaming on the Windows machine, right?
That's not true for 0.001 percent of the population that buys that keyboard.
Yeah, there probably is a super small percentage of the population that has it for that purpose, for sure.
Oh yeah. So the other tip I had was,
when you're doing the implementation of metrics and stuff, and I know you guys have seen it, they're not all free, right? So what I mean by that is, we talked about picking your SLOs in a manner that makes sense, right?
right?
Like don't just put a ton of them out there.
Have a handful of them and make it that way.
Well, with metrics, a lot of times when you're adding in metrics, you have a tendency to
just start adding them everywhere.
And the thing is, that stuff is collected by a server,
whether it's Prometheus or whether it's InfluxDB
or whatever it is that you're using.
And so you could be adding a ton of data
that you don't even realize you're doing, right?
So for instance, in Prometheus,
if you have, we were talking about the quantiles earlier,
if you have a 95% quantile and a 99% quantile, that's just two, right?
But then they're going to be bucketized as well.
If you start adding labels to those, every label value that you add for a Prometheus metric is essentially a new series. So if you're adding a customer name as a label and you have a hundred customers, that's a hundred times the two that you had previously. If you add a super-high-cardinality thing, like the size of a request, now you could have a million different sizes that come in, and that's a million times 100 times two. So the number of series that are gathered grows by just insane amounts.
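The multiplication being described compounds fast; a quick sketch with the numbers from the example:

```python
# Series count multiplies across every label's cardinality
quantiles = 2              # p95 and p99
customers = 100            # customer-name label
request_sizes = 1_000_000  # a super-high-cardinality label, e.g. raw size

series = quantiles * customers * request_sizes
print(f"{series:,} time series for one metric")  # 200,000,000
```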
So pick and choose the metrics, or the SLIs, that you're looking for, because it can actually impact the systems that need to gather and serve up those metrics. So, you know, just a warning for when you're adding these things into your metrics gathering for your applications.
All right. Well, I have a tip for you. So, have you ever moved browsers and exported your bookmarks and then re-imported them in another browser?
I mean, no, I've never... you know you can do that, though? Yeah, I know you can do it. I've never done it. You just sign in with Google, and it automatically brings it all over.
Yeah, but if you wanted to switch to Firefox or something,
then you can export your bookmarks and actually import it because Firefox supports that, right?
So I always knew that was a feature, but I didn't really think too much about it.
Well, sometimes at work I have to switch environments, and that switches a bunch of different links, you know. So I like to have a script that I run where I type the environment name, and it generates all the links for me: the website, maybe the logs, stuff like that. And I always think, wow, wouldn't it be nice if I could generate these bookmarks on the fly? But sometimes these environments are not common or whatever. So it'd be nice to be able to either open them all up, or just kind of have a temporary bookmark. But it's, you know, it's not cheap. It's a pain to add bookmarks, right? Especially a lot of them. So it got me thinking, well, you know what, maybe I can reuse this format that you use to export and import to generate a bookmarks file. And then whenever I'm working on an environment for maybe a couple of days, I could import this as a new folder, have it around for a couple of days, and then delete it.
And I can do that if it's easy. And so I looked, and sure enough, the format is dead simple. It's basically an HTML file with just a couple of links, and it's got this weird DT tag. But I found an article that shows how you can use it, and you can easily generate it.
So what I'm going to do is I'm going to update my script.
So it generates this HTML file whenever I run it.
So I can just kind of go into Chrome and import it.
And it's going to be easy.
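Here's a minimal sketch of generating that format; the folder name and links are hypothetical, but the DT-heavy structure is the Netscape bookmark layout that Chrome and Firefox import:

```python
# Minimal generator for the Netscape bookmark format (the "weird DT tag"
# file that Chrome and Firefox can import). Links here are made up.
links = {
    "Env dashboard": "https://example.com/dashboard",
    "Env logs": "https://example.com/logs",
}

def bookmarks_html(folder_name, links):
    items = "\n".join(
        f'        <DT><A HREF="{url}">{title}</A>'
        for title, url in links.items()
    )
    return f"""<!DOCTYPE NETSCAPE-Bookmark-file-1>
<META HTTP-EQUIV="Content-Type" CONTENT="text/html; charset=UTF-8">
<TITLE>Bookmarks</TITLE>
<H1>Bookmarks</H1>
<DL><p>
    <DT><H3>{folder_name}</H3>
    <DL><p>
{items}
    </DL><p>
</DL><p>
"""

with open("bookmarks.html", "w") as f:
    f.write(bookmarks_html("dev-environment", links))
```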
So I never thought about this, but this is actually really nice for like onboarding, for example.
So you imagine like someone new starts the company, you can be like, hey, check it out.
Here is the HTML file of bookmarks that you can just import, and it'll add them. So it's not totally wiping out whatever you've got; it's not a total import/export. But you can add the 20 common bookmarks that most people should probably have for working here.
I like that. That was pretty cool. So I'll have a link to a script that somebody actually put together already, too, so you can see how to do that.
I'm also really bad, though, with bookmarks. I mean, I love
the idea of the onboarding example, but I'm really bad about bookmarks, because I am not a heavy user of them. Instead, I'm like, well, I could just easily re-Google that, or go back to wherever the thing was. So I rely on search capabilities, either through whatever that site or service is, or, you know, Google, if it's public stuff. Because otherwise, I'll forget that I have a bookmark for it. You know, do you ever do that? You're like, oh, I'll just go query the wiki for that page, rather than realizing, oh, I already bookmarked the link to that page in the wiki.
Yeah. Where I stopped doing that is, well, number one, wiki searches are notoriously terrible. But two, logs. So if you have a bunch of different environments in a cloudy environment, and you're trying to get at certain pods or certain queries that are common, you know, it's such a pain to navigate to the right project and the right cluster, and then the normal kind of filters you add. And so I abuse those.
Yeah. You know what I do
like about this? I'd never thought about that in terms of onboarding. If you did it like that, then everybody's bookmarks will be in the same place.
So when you're communicating about what page to go to, you can be like, hey, go to this bookmark folder and this bookmark and you're good.
Like that's actually really nice for communicating.
Yeah, it goes back to which book was it where we were talking about having a common language?
Ubiquitous language.
Yeah, yeah, yeah. Domain-driven design, sir.
There you go.
It kind of adds to that, right?
Because then you're all talking the same thing.
All right, so for my tip of the week, we got a remix!
Yeah, so I didn't realize that we had already called this one out before, but then Jay-Z saw it and he's like, hey, you realize I've already talked about that, right? So my tip of the week, well, I guess it's more of a reminder, a remix, of using the Powerlevel10k theme for Zsh. And the reason why this became a thing for me,
why I called it out is I forget which theme I was using for ZShell, but I'm working in a pretty
large repo, not the 20 gig repo that I think Alan mentioned at the start or Joe, I forget which one
of you mentioned it, but not quite that big. But it's big enough to where every time I would execute any command on the command line while in that repo, the theme was trying to also report back git status kind of information: what branch you're in and if it's dirty or not, you know, if there's any changes in it or not. Fairly simple thing, but literally it would add a few seconds of time to execute any command. You'd just do a simple ls, and it was so frustrating. And Joe was like, hey, you should try this other theme, and that pain will go away. And sure enough, with Powerlevel10k, I no longer have that pain.
So I didn't realize that it had already been given away as a tip.
But yeah, so now you got a remix of it.
Looks cool too, right?
It does do a bunch of cool things, to the point that I'm questioning, like, oh God, did I go overboard with how much stuff I have available to me on the command line now? Because, you know, you see the execution time of everything, and the path, and the path is all color-coded and pretty and whatnot, and the branch you're on. And yeah, I really do like it, but I also might go back and redo my configuration for it.
But that also made me question something, too. We got a comment, I forget how it came in, either email or on the site or whatnot, but since I'm talking specifically about the command line here, the person who wrote in was saying something along the lines of, hey, you guys give so much love to the command line, but there's a lot of advantage to knowing tools like GitKraken. GitKraken was specifically the example given in the comment. You know, just knowing the UI tools, because of how much faster you can move around in those tools. And I don't discount that, right? Knowing the UI tools that are available to you, you can certainly move faster, and that's their whole point. That's their reason for being, right?
But I was curious about your guys' perspective, though. I guess maybe it's just career-wise what kinds of things you've had to work in, or the luxury of what you've had to do. But I've never found myself in a situation where I can only ever use that tool, right? I also many times find myself where I need to be able to do things by the command line, like, this is all the access that I'm given to a particular system. And so I guess, for me, I like to use the command line as a practice, to keep those skills sharp, right? So that when I do get into those environments where that's the only thing I have available to me, it's a habit. Does that make sense?
Yeah, totally.
And the argument that... I mean, that's not even an argument, but I love k9s, the UI for using Kubernetes, and one thing that's kind of...
A terminal UI. Yeah, a terminal UI. So, yeah, it feels like a CLI, but it's not. It's kind of a mix of that and vi. But, you know, there are definitely times when I want to script something and I realize I don't have that muscle memory built. But you can go so much faster and cover so much ground, and it kind of teaches you too. So I have mixed opinions. You know, I think it's important to know both, and I think it's fine to use a UI if you like it. So I don't know, I'm torn on it.
I 100%
agree with what Jay-Z just said. The reason why I do the command line as much as I do is, as soon as you want to automate something, that's kind of what you have to do, right? So that's one thing. But I have found myself, and I think I mentioned this, I don't know, several episodes back, especially with git, using things that are built into IntelliJ, right? So for instance, I did this with Jay-Z and somebody else recently: I
had a fairly large PR that I needed to walk through and looking at that inside GitHub's pull request was useless because there's no context, right?
So what I could do is I could go to the UI tools in IntelliJ and say, hey, show me the files that
change. And it would actually give me a list of all the files on the file system that changed.
And I could click the file and it would show me the diffs between both of them on the page. And so I could easily navigate in a way that's easy for people to consume and say, hey, I made this change here because it relates to this over here.
Right. And this is where it is in the file system.
So I can show you that, hey, this was in this module and these changes were made.
And this is how it relates to this module. So, you know, I like UI tools and I definitely find myself using them, especially when I need to communicate with other people.
But I agree, man, like doing things from the command line are ultimately how you end up automating things in a lot of places.
So you need that muscle to even know how to go about doing it.
Yeah, I mean, I like that. I guess, said another way, it's good to know and use both. There's a time and a place. I don't want to say that we're harping on the command line as much as we're trying to keep that skill sharp. I think that's it for me. So, yeah.
All right. Well, subscribe to us on
iTunes, Stitcher, you know, Spotify, wherever you like to find your podcasts. And, as Jay-Z tried but failed to say earlier, and I had to go back and correct and make it better...
I'm pretty sure that's what happened. That's the way it went down. Yeah.
Send your bad reviews to... wait. Send your reviews to... okay, here are some helpful links, available at www.codingblocks.net slash review. Hey, and while you're up there at www.codingblocks.net, make sure you check out our show notes, which are extensive: examples, discussions, and more. And send your feedback, questions, and rants, apparently, to Joe Zack on Slack.
And definitely check out our Slack channel.
There really are just some awesome people in there,
which I have not been in this past week
because I've been crazy busy.
So, yeah.
Yep.
Also, we're on the Bird site, Twitter,
at CodingBlocks. If you've got some mean
tweets to sling at us, you can do it over there.
And also
CodingBlocks.net. You can find all our social links at the
top of the page if you want to hit us up on those
dillies.