PurePerformance - SLO Adoption and Usage in SRE with Sebastian Weigand
Episode Date: June 22, 2020

Keep hearing the terms SLIs, SLOs, SLAs, and error budgets and finally want to understand what they are, who should be responsible for them, and how they fit into SRE (Site Reliability Engineering)? Then listen to our conversation with Sebastian Weigand, who has been helping organizations modernize not only their application stacks but also embrace DevOps & SRE. Learn who is responsible for defining SLIs, what the difference between SLOs and SLAs is, and what the difference between DevOps & SRE is in his opinion!

Sebastian, who calls himself "That DevOps Guy" (@ThatDevopsGuy), also suggests checking out the latest free report on SLO Adoption and Usage in SRE, as well as the SRE books from Google, to get started with the practice.

https://www.linkedin.com/in/thatdevopsguy/
https://twitter.com/ThatDevopsGuy
https://landing.google.com/sre/resources/practicesandprocesses/slo-adoption-and-usage-in-sre/
https://landing.google.com/sre/books/
Transcript
It's time for Pure Performance!
Get your stopwatches ready, it's time for Pure Performance with Andy Grabner and Brian Wilson.
Hello everybody and welcome to another episode of Pure Performance.
My name is Brian Wilson and as always my co-host Andrew Grabner.
Hey Andrew.
What's wrong with you? Why do you call me Andrew?
I just wanted to mess with your name today because it's been a while since I messed with it.
My favorite messing with your name was Andy Candy Grabner from a Halloween episode several years back.
Several years, Andy.
Can you believe it?
No, it's amazing.
Wow.
Yeah.
Anyway, Andreas Grabner.
Andy, to everybody else. To answer your question, I'm doing actually pretty well.
Week seven or eight or nine of the lockdown is the time of the recording, and I am not only seeing the light of the sun that is shining through the clouds now, but we're also seeing the light at the end of the tunnel, at least in Austria here.
Australia? You just said Australia.
I said Austria.
No, I think I'm gonna edit this and play it over and over again in this episode and see what it sounds like. But...
At least in Austria here.
At least in Austria here.
At least in Austria here.
Yeah, well, we'll see how things go.
You know, starting to see things.
Yeah, it may be the end, may not be.
It might be relapsing.
We'll see how everything goes.
But glad to hear things are going decent
on your side of the pond.
I imagine you don't have so many people going out with guns and awful hate flags in your country as we do here. But yeah, interesting things going on. I'm glad to be talking to people, you and our guest today. It's lifted my spirits a bit. So let's go. We have a fun show today,
a hot topic. I think we're seeing it more and more all over the place. It's been around for
a while though, I think.
I think this goes back.
Well, our guest can tell us, right?
So why don't you go and introduce our guest and we can get into it.
Sure.
So, well, the first thing I asked our guest when we got on the mic, just before the recording, was that he must have some German background, because his name, Sebastian Weigand, sounds very German.
And on his LinkedIn profile,
he's called that DevOps guy
and he's currently working at Google.
So that's an interesting, great combination
to talk about all things DevOps,
all things SRE, SLIs, SLOs, SLAs.
I'm sure we'll find a lot of things.
But I think I want to hand it over to Sebastian now
to introduce himself to the audience.
And then Sebastian will take it from there.
Definitely what we want to learn from you is,
what is this whole thing about SLIs, SLOs, SLAs?
Because that's what's really hot on our minds these days.
So take it away.
Sehr gut (very good).
I'll get started then.
Yeah, so thanks for having me on.
I really appreciate that.
Like people said, my name's Sebastian.
I work at Google as an application modernization
specialist in our customer engineering department.
Essentially what that boils down to
is as organizations try to think about moving up to the cloud
or realistically just modernizing their application
stack and all of the infrastructure that's associated
with it, they take an interesting path.
And part of that is a digital transformation that's spurred on by things like DevOps and SRE culture.
Part of that is understanding new technologies. So things like Kubernetes and containers and
advanced monitoring and distributed tracing and things like that. I've been doing DevOps since
before DevOps was a term because I was born in the fires of systems administration way back when.
So I feel like if you started as a Linux admin, you naturally sort of progress to like, you know, I have to go set up a server.
And then it becomes, well, I'm getting asked to set up a server every other Tuesday.
So I might as well write a script to do that too.
Well, I seem to be getting asked to do this every day now.
So maybe I should move to some different configuration management type options.
And that inevitably leads into this like programmatic approach to systems administration and operations,
which is kind of interesting because it sort of naturally dovetails with like the concepts
and the components inside of site reliability engineering, which is always a fun
topic because SRE and DevOps and all of these fun industry terms are more than just buzzword bingo,
but at the same time, very difficult to define. Depending on who you ask, you're going to get a
variety of different definitions for pretty much everything. And even if you define them,
you get really different interpretations of what they mean. So someone's like, Oh yeah,
I know what that is. But when you ask them, how do they, you know, use a technology or use a
components there, they're like, Oh yeah, I had really no idea. Or they use it completely
differently than you were expecting. So it's kind of interesting to see how this like comes together
and informs our opinion of how we should tackle, like, modern... I don't want to say business, because this also affects academia and government and things like that, but it just affects your ability to leverage it to solve organizational problems.
You know, it's funny you mentioned that you came from being a systems administrator. My favorite use of the word DevOps is when you have a traditional sysadmin who changes their title from sysadmin to DevOps admin and does nothing different.
Absolutely, right?
It's equivalent to like, hey, do you guys have security implemented in your system?
And they're like, oh yeah, we bought security.
It's that commercial off-the-shelf product that's right next to, like, stability and scalability, right? You just buy it. It's not a thing like that.
Yeah. So, um, it's interesting, you said you were basically born into DevOps before DevOps was a thing. Do you think, now that there are these new kinds of hype terms like SRE (and I'm sure there's another thing coming up soon, just around the corner), do you think, if DevOps hadn't been coined as DevOps, but if site reliability, which obviously covers things that people have done in the past too, if this term had caught on earlier, that we would have just used this term from the beginning and then it would have kind of evolved into something else? Because in the end, it's very adjacent anyway.
It's interesting.
I don't think so.
And because I think that you need to have
an iterative approach to how we tackle operations tasks.
So I think the reason why DevOps became a term
was because you really needed to think of
what you're doing in an IT space
from a more holistic perspective.
So it's not just operations. So like way back in the day when I worked in managed hosting,
you had like a net ops team and you had like an infra ops team or a web ops team or things like
that. But we really didn't focus on the developer interactions and the appropriate feedback loops that need to be established to really, you know, cohesively form a high-performing team that can release things, you know, on schedule and, more importantly, like on budget.
So I think that, whatever it is that's going to happen in the future, you'll continue moving up the abstraction stack, but we needed to figure
out how to essentially like cut your teeth on the nuts and bolts of systems administration and operations.
But at the same time, developers needed to cut their teeth on programming languages and frameworks
and recomposable units and modularity and distributed systems and things like that.
And it's only until you move forward and the tools progress that you can start taking advantage of newer ideologies and newer methodologies that then empower the next generation.
So you kind of like level up as you as you get to a certain point.
And then from there, you take it and you go do something else that's even more advanced or more capable.
So whatever you want to call it, you still have to do one before the other, before you realize that you have to do both of them together.
Very well put.
So let me dive into one topic which, you know, when we reached out to you to do a podcast episode, we said was a big topic. A lot of terms that are thrown around these days, by the industry and also by us at Dynatrace, are the concepts of SLIs and SLOs and SLAs. And I think there's a lot of people that are just using them, like the terms DevOps and SRE and NoOps, to make buzz and say, hey, we know the latest shit and we obviously have a solution for that. But then a lot of people may not even know what that means. At least I get a lot of feedback when I just put out the terms SLIs and SLOs. A small group of people sometimes at least has the courage to say, what does this mean again? Some people just don't, and they don't actually know, and just assume that later on they may learn about it. But would you do us the favor and give us the best, or like a good, description of what this is all about, what these different terms are and what they mean, so that we can, you know, at least establish the baseline knowledge that we all need to have?
Absolutely. And also keep in mind, too, that there's a broad term of site reliability engineering, which is sort of, you know, a subject matter, right? But then there's also Google's approach to it and Google's opinionated implementation of a set of practices and principles that work really well for us, which we've sort of codified into site reliability engineering. And if you Google
like SRE book, I think the first one and the workbook are available for free if you want
to read them online. But to get to SLIs, SLOs, and SLAs, it's always fun to go through all of those.
And I think we're actually leaving one out too, which is important, which is error budgets,
at least with our specific interpretation of site reliability engineering. So let's start with the SLI, right? An SLI is a service level indicator, and I like to think of it as a good enough measure, or a well-defined measure, of successful enough.
You know, for example, an availability SLI could be the proportion of requests that resulted in,
say, a successful response. In other words, this is
the metric that determines a service's reliability or a service's performance. So if you want to
think of it like, you know, take like a CPU metric, right? If my CPU is high, is that a good thing or
is that a bad thing? Well, it depends, right? If I'm serving a bunch of web requests, then maybe if the CPU is high, I will cease to be able to serve web requests. But if I'm doing a bunch of video transcoding, if my CPU is low, then something is not working properly, because I need to use all of the cores on my system to encode video. So since CPU on its own isn't really tied to a service's availability, we want to focus on what it actually means
that gives us a good indicator that a service is performing properly.
So a good indicator would be, let's say if you're a big, say, search company or whatnot,
if all of the requests that are coming in on the web are being served properly,
and the definition of properly depends on your business goals, right? So like,
what if every request that comes in is being served, but it takes three seconds
in order for us to give them a response?
Well, maybe that would be a little bit below the threshold.
So we can set an SLI to determine what that is.
And keep in mind, you can have multiple SLIs for individual services.
So for example, we want most of the web requests coming in to be, let's say, 200s.
We want them to be successful requests.
But at the same time, we also want maybe latency to be under a specific time.
And keep in mind, those are all sort of sliding metrics. Now, when we get into SLOs, which is the service level objective, that's a top-line target for a fraction of successful interactions. So if we establish the SLI as, like, the number of successful requests, the SLO could say, okay, over a specific period of time that we've established, the average number of requests that are coming in needs to be above a certain, say, watermark or something like that. So if you have a 97% availability SLO and you get, you know, a million requests over some amount of weeks, in order to meet our SLO we would need like 970,000 successful requests. And that's how the SLIs relate to the SLOs. Now, SLAs are kind of interesting, because SLAs tend to be what we see in a lot of corporate documentation or contracts or things like that. Realistically, an SLA defines what you're willing to do if you're failing to meet your SLO.
So it's more of a contractual thing than it is a technical thing.
So, for example, if we fail to meet an SLO for any cloud service provider or any third party
software as a service or something like that, if, you know, you went to the website and it's down,
well, that's a problem. So your SLA establishes, well, if it's down for such and such amount of
time, we're going to refund you such and such amount of credits, or maybe, you know, give you
some additional money or refund whatever it is you paid for or something along those lines.
So that's how all of these relate to each other.
Do you have any questions on those?
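To make the arithmetic concrete, here is a minimal Python sketch of the SLI-to-SLO relationship described above; the 97% target and the one million requests are the made-up numbers from the conversation, while the function names and sample counts are purely illustrative:

```python
# Illustrative only: the 97% SLO and 1,000,000 requests come from the episode's
# made-up example; everything else here is an assumption for the sketch.

def required_good_events(total_events: int, slo_target: float) -> int:
    """Minimum number of successful events needed to meet the SLO."""
    return int(total_events * slo_target)

def availability_sli(good_events: int, total_events: int) -> float:
    """SLI: the proportion of requests that were successful."""
    return good_events / total_events

total = 1_000_000   # requests received over the SLO window
slo = 0.97          # 97% availability objective

print(required_good_events(total, slo))    # 970000 successful requests needed
print(availability_sli(968_500, total))    # 0.9685 -> below the SLO
print(availability_sli(980_000, total))    # 0.98   -> SLO met
```

In other words, the SLI is the measured fraction of good events, and the SLO is the target that fraction has to stay above over the agreed window.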
Go ahead, Brian.
Okay.
So, well, one thing, some comments and some things to just confirm that I grasped them well enough.
The one comment I wanted to make about SLAs is that every time I hear that definition,
I think about how far away from that definition the use of SLAs has gotten, right?
Because most of the time, I think people are looking at an SLA and say, oh, my SLA is my
90th percentile needs to be under 500 milliseconds, right?
Which is the measure.
It's not the repercussion or what you're going to do.
It's not the contractual bit, but it's like, oh, we promised our customers X kind of response time. And it's just
looked at as the metric. But yeah, but anyway, so the thing with SLIs and SLOs, if I understand
them correctly, just to make them super, super simple, just to keep the difference between
them, because to me, it's always like, well, which is the SLI and which is the SLO?
So the SLI,
that was the first one you talked about, right?
The indicator?
So I think the easy way, at least
for me to think about them, and I want to make sure I think about them
correctly enough at least, is on a
very basic level, the SLI is
what it is that you're
going to measure, and the
SLO is what the acceptable measurement is.
Yeah.
Right.
That's a good way to put it.
And then obviously you have,
it should be tied to actually availability and all that kind of stuff,
but just in terms of keeping them,
yeah.
Keeping them straight in the head.
All right.
Good.
Good.
Yeah.
So I wanted to just confirm,
cause I kind of always thought that that was an okay way to think of it.
But now that you're on, I want to confirm because you're what I'll call the expert.
Yeah. And I think the important thing to take away from this is a lot of people focus on the I or the O or the A or that sort of thing.
And I really want to focus on the S part of that. It's per a service.
Yeah.
And when we have to think about a service, we have to think about users.
And this is not something that I think a lot of people grasp sort of intuitively because a lot of systems administrators and ops people tend to jump in and then start to think of like, okay, well,
I'm intimately familiar with whatever it is that I've been asked to manage or spin up or something
like that. And as a result, I can think of like a bunch of metrics that I want to capture in my mind and a lot of like
alerting policies that I can create and things like that. But realistically speaking, none of
these matter from a user's perspective, right? So if I'm a user and I want to go to your site
and it's up, that's all I care about. I don't care about, you know, additional CPU overhead.
I don't care that your disc is at 98%
capacity. I just care that I can actually get to the, um, get to the site and I can start doing
something. Yeah. The interesting thing about this as well is the, um, the service level indicator
doesn't need to be, uh, just focused on let's say good or bad. Well, everything kind of breaks down into good and bad,
but there's different quantities of bad. So for example, let's say your site is either up and available, or it's down and not available. We kind of think sort of binary, but the problem
with that is like retail sites in particular. Uh, when I talked to a lot of the leaders in that
space and I asked them like, what's the most important like measure of availability? They actually care about latency more than they care about
site availability, which is kind of interesting. It's, it's not sort of intuitive, but when you
think about it, it makes a lot of sense, right? If I'm a user and I'm a, you know, a customer of
yours and I want to go to your website to go buy something, if it's down, I might think, Oh, well,
maybe they're doing maintenance. I'll check back later. But if they go to the website and they get
served some content, but then they, you know, they, they search for like a product
and then it just goes down or the latency is, is just abysmal or they can't add something to
their checkout site. They might get really frustrated. And as a result, they might leave
the site entirely and never become a customer where they might say, well, the heck with this
site, I'm going to go somewhere else and grab it from over there.
So it's really focused on your business and it's really ultimately focused on your customers
and your clients that are actually accessing the service.
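As a rough illustration of a latency-oriented SLI of the kind described here, a small sketch might look like the following; the 500-millisecond cut-off and the sample latencies are invented, since the episode only says that what counts as fast enough depends on your business goals:

```python
# Hypothetical latency SLI: the fraction of requests answered within a threshold.
# The threshold and the sample latencies below are invented for illustration.

LATENCY_THRESHOLD_MS = 500  # what "fast enough" means is a business decision

def latency_sli(latencies_ms, threshold_ms=LATENCY_THRESHOLD_MS):
    good = sum(1 for ms in latencies_ms if ms <= threshold_ms)
    return good / len(latencies_ms)

samples = [120, 340, 980, 210, 3100, 450, 400]   # observed request latencies in ms
print(f"latency SLI: {latency_sli(samples):.2%}")  # 71.43% of requests were "good"
```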
And do you ever extend the definition of customer? Because I know we do it sometimes in context, but I wonder if, from the SRE point of view that you're talking about, you would sometimes
define a service customer as the service above the service you're on? So not necessarily the
end user, but the service that's consuming your service. In the way you're talking, it seems like
that does not necessarily apply in the way you're talking. Obviously, in the long chain of command
of everything, that ends up impacting the actual human user. But when you talk about customers, is that part of the mindset,
or is it just strictly the human at the end? It's wrapped up in the mindset, but there's
different ways of approaching this. And realistically, what you want to try to figure
out is how to have service-level observability that is aware of multi-service
topologies, which I think is the underlying question that you're kind of getting at.
And in that situation, let's say, you know, you have a web front-end service.
You might have an SLI that's associated with, you know, its latency and the number of requests
that are coming in that are being returned as like a 200 successful
requests, that sort of thing. But what if that web front end needs to talk to five different
microservices at the application layer? And then let's say two of those microservices need to talk
to two additional microservices also at the application layer. And then those two services
each, so four total, need to talk to the database, at like a database tier.
So what happens when we have, you know, nested sort of like SLIs, that sort of thing?
And what's interesting about that is up until very recently, it was very hard to establish a hierarchy of like connected services and what they could each tolerate. Because if we have massive amounts of traffic
on the front end, the web front end needs to scale very differently from the application tier,
which needs to scale differently from the database tier. So how do you establish these things?
What I like to do is I like to focus on the individual services themselves and start there
so that you understand how this ends up working, but then it largely is dependent upon
the underlying application infrastructure. So for example, if you have a database and the database
can, you know, it can do, let's say, you know, a thousand requests per second, you know, I'm just
making numbers up. Um, if you know that it can do that per instance, and you know, that you can
scale horizontally, then once you get to, you know,
900 or so requests a second, you can then say, Oh, I should create another instance.
And then maybe like load balance between them or something along those lines.
So when you do that, you're able to have a better view of how all of this stuff kind of like plays
together. And then you can start doing, like, topology graphs. And that sort of gets into distributed tracing, where you can start to, like, recursively construct what an actual request does on the back end, and then take that information and funnel it back into whatever observability system you have.
It's interesting you mention this, because for people that are familiar with Kubernetes and microservices and containers and things like that,
but are not familiar with service meshes like Istio, I would always, whenever I'm teaching
a class or talking to a bunch of customers, or if we have a community event (we had a bunch that happened in New York City back when people could go to such events, right?), I would always ask them, like, well,
how do you know the total amount of communication from one service to another service? Like,
what do you do in that situation? How do you calculate that? And it turns out it's actually
really difficult to do that if you don't have this overarching view of everything that goes
inside of your cluster. And that's exactly what service
meshes like Istio provide. So instead of just focusing on specific SLOs that are set like per
service, it takes per service as it's implemented in potentially multiple backends or multiple pods
or multiple components, and then aggregates all of that information together so you can better hit those types of SLOs that you want to hit.
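To illustrate the aggregation idea in a toy form: the service-level SLI is computed from the summed counters of every pod backing the service, not from any single instance. In practice a mesh like Istio collects these counters for you; the pod names and numbers below are invented:

```python
# Toy aggregation of per-pod request counters into one service-level SLI.
pod_counters = {
    "frontend-pod-a": {"good": 49_200, "total": 50_000},
    "frontend-pod-b": {"good": 48_700, "total": 50_000},
    "frontend-pod-c": {"good": 12_050, "total": 12_500},  # a freshly scaled-up pod
}

good = sum(c["good"] for c in pod_counters.values())
total = sum(c["total"] for c in pod_counters.values())
print(f"service-level availability SLI: {good / total:.4f}")  # 0.9773 across all pods
```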
So that's basically, well, thanks for bringing this up
because that's also the way we have kind of approached teaching
and educating people about how to enforce SLIs on an individual service level
because we at Dynatrace, we've been doing distributed tracing
at least since I've been with the company
and that's been 12 years
and the company has been around for 15.
Thanks to obviously service meshes now,
certain things come out of the box,
certain metrics for certain environments
like Kubernetes.
But I completely agree with you.
So what we always said is
not only look at your response time, your failure rate, and your memory consumption on your service, but also look at how many back-end calls you make. And then not only aggregate it across the whole service over a particular time frame, but also look at these metrics split by, let's say, the business function that the service provides. So if you
have a shopping cart function that can add a cart item, delete a cart item, provide the sum or
whatever else, then these are individual business functions that probably have a different call
pattern to the backend. So we also try to educate people. You not only need to look at the number of
database calls you make, but look at them per add to cart, per delete from cart, per login, per logout, per whatever else it
is. And then, you know, establish a baseline first of all, and then see how that behavior
also changes, you know, either from build to build, from release to release. Because you were earlier saying,
if you know how much load your backend database can handle
and then you can scale it based on incoming demand, that's great.
But what if the incoming demand is due to a coding issue?
What if a code mistake is increasing the number of backend service calls by 50%
because you're omitting
the cache or you're using a misconfigured library that is normally doing your OR mapping
and therefore you're making too many calls to the backend.
So, uh, yeah.
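A minimal sketch of the per-business-function comparison described here might look like this; the baseline numbers, the trace data, and the 50% regression threshold are all made up for illustration:

```python
# Count backend calls per business function from (hypothetical) trace data and
# flag functions that now make far more calls per request than the baseline.
from collections import Counter

REGRESSION_FACTOR = 1.5  # flag if a function makes 50% more backend calls than before

baseline = {"add_to_cart": 3, "remove_from_cart": 2, "login": 5}  # calls per request

# Pretend these came from distributed traces of the new build: (function, db_calls)
new_build_traces = [("add_to_cart", 3), ("add_to_cart", 7), ("login", 5),
                    ("add_to_cart", 8), ("remove_from_cart", 2)]

totals, counts = Counter(), Counter()
for function, db_calls in new_build_traces:
    totals[function] += db_calls
    counts[function] += 1

for function, old_average in baseline.items():
    if counts[function] == 0:
        continue
    new_average = totals[function] / counts[function]
    if new_average > old_average * REGRESSION_FACTOR:
        print(f"{function}: {old_average} -> {new_average:.1f} backend calls per request")
```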
Oh, absolutely.
It's funny you bring that up, because that touches on another point, which is services. If we focus on the SLI and the SLO that we're creating, the SLI is more technical, but the SLO is really more about making the users or the clients of the service happy. You mentioned, like, a shopping cart service that might have a very low SLO, and it's completely okay for that to happen. And the reason that's okay is, if I'm a user... well, maybe not the shopping cart service, maybe like the order processing service, let's
differentiate the two of them. So in other words, if I can go to your website and you, you know,
you're selling, you know, like the classic examples, like the hipster shop or the book
info application, um, that like ships with Kubernetes and Istio. Um, if I want to buy a
couple of things and I hit that checkout button, I don't really care if the order processing system, you know, is a beautifully scalable synchronous system and my order is immediately processed because there's going to be additional latency because someone has to put a product in a box and then ship it to me.
Right. That's going to take a lot of extra time.
So if that goes down for like a day, realistically, I'm not necessarily impacted.
Like it might affect,
you know, you know, depending on if the person paid for, you know, expedited shipping and things
like this, but it's a very different metric that we want to establish. Um, I want to go back just
a second though, and mention the other concept that at least Google takes with respect to SRE, which is this concept of an error budget, which is directly tied into SLOs. So if you have an
error budget, which basically says, this is the number of failures that we're allowed to have,
that's a very different way of thinking about things than, than just establishing like,
everything has to be this amazing amount of uptime. And that's what we're going to shoot for
because there's always a constant struggle between, you know, releasing new features and making things stable. Cause like in the
perfect world, if you ask like an operator, what's, what's your perfect scenario, the operator's going
to say like, well, I want nothing to change ever. Like I want the number of requests to stay the
same. I don't want new versions. I don't want new features. I know how to run what I have.
And that's that. But if you ask developers, what would you want? They say, well, I want to be able to release every brand new feature as fast as humanly possible. And those are kind of at odds with each other, but an error budget allows you to programmatically define, or codify, the rate of innovation that you can have with
respect to each service. So in the case of
like the shopping cart checkout backend system, you know, a lot of that is like kind of like big
data processing, right? You know, I get, you know, a handful of orders in, I do this like massive,
let's say like, you know, MapReduce operation or whatnot to be able to process all of this stuff
and then send it through the appropriate systems. And then bam, I have like a bunch of orders that
are processed. If I want to have innovation on that, because I want to take my time to figure out how best to maybe clump or group some of the orders from geographically disparate locations that are similar in terms of, like, the routes or whatnot. So like,
let's wait until we have a bunch of people ordering this in roughly the same area so that we can optimize shipping to maybe that location or that distribution center
or whatever. If you want to be able to do that, you can have massive amounts of innovation occur
on something and all of your users are still happy at the end of the day, even if that service has
different types of downtime. So it's something to kind of keep in the back of your minds when designing systems of systems and what works where and when and how and who it's affecting and
who it's affecting upstream and downstream and all over the place. It's quite fun, quite complex.
Just on the error budget again. So you said it's basically a different way of looking at it. How
much trouble can you still afford in a particular time frame? And I would assume, let's say, a typical time frame is a month. And if you say, within a month we allow an error budget of an hour, that means in one hour out of 30 days, or a month, we allow certain requests to fail. Is this the right way of looking at it?
Yeah, pretty much.
And when you exceed your error budget,
then you have a plan that's defined beforehand,
which then signals what you should then focus on.
So for example, if everything is smooth sailing and you have no downtime whatsoever
and all of your customers are super happy, that's great.
But are you innovating enough to be able to attract more customers?
Do you have better product features?
You know, you're basically going to either stagnate or you're going to slowly drift off into obscurity if you don't innovate.
Your error budget is a good way of being able to sort of gut check the innovation speed that you have such that you can make sure
that you're releasing new features,
but are not ruining the stability of the system.
And it kind of goes the other way around.
Like if your system is inherently unstable,
in a very sort of simplified example, right?
If every day the system is crashing,
this is not the time to be rolling out new features.
Your priority should be fixing the system
so it stops crashing every day. So that's kind of the way that we think about things and the pace of innovation
that we want to be able to release at. And it's interesting because the error budget applies to
specific components that you have inside of your system. And if you have unforeseen downtime,
let's say everything's
going well, you know, you're well within your error budget, you're, you're, you're exceeding
your SLOs by decent margin. And then all of a sudden something completely unexpected happens.
And it just, you know, your system is down for hours and it's only supposed to be down for like
a minute or something like that, right? In that scenario, the next phase of engineering design needs to be focused on never letting that happen again. So the error budget is that counterbalance towards what your priorities should be. So it's another way to think about it.
That's very cool. Thanks for the explanation. I really like what you said in the beginning, that the error budget is kind of the rate of innovation. That's also a great way to put it, because that's obviously something that the business cares about.
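For the arithmetic behind that exchange, here is a small worked sketch; Andy's one-hour-in-a-month example is reproduced, and the 99.9% objective and the two million requests are illustrative assumptions, not figures from the episode:

```python
# Error-budget arithmetic over a 30-day window. Only the one-hour example comes
# from the conversation; the other numbers are assumptions for illustration.

WINDOW_MINUTES = 30 * 24 * 60   # 43,200 minutes in a 30-day window

def downtime_budget_minutes(slo: float, window_minutes: int = WINDOW_MINUTES) -> float:
    """How many minutes of 'bad time' the SLO leaves you per window."""
    return window_minutes * (1 - slo)

def request_error_budget(slo: float, expected_requests: int) -> int:
    """How many failed requests the SLO leaves you per window."""
    return int(expected_requests * (1 - slo))

print(downtime_budget_minutes(0.999))            # ~43.2 minutes per month at 99.9%
print(downtime_budget_minutes(1 - 60 / 43200))   # ~60.0 minutes: Andy's one-hour budget
print(request_error_budget(0.999, 2_000_000))    # 2,000 failed requests allowed
```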
So coming to, and thanks for the explanation of SLIs, SLOs, SLAs, error budget.
You put a big emphasis that it's about the S.
It's about the service.
We look at services and on the service level, we define these things. Now, in organizations, at least when we approach people, they say, well, who is responsible
for that?
Who defines these SLAs, SLIs, and SLOs?
Well, where do we start and what do we then need to take and break it down into other
pieces?
So can you talk a little bit about this based on your experience on who is actually responsible for coming up with
these and how does it kind of trickle down to the other members of, I don't know, engineering and
testing and site reliability engineering, with the ultimate goal, and I think this is what we try to get to, of how we can actually start and end up with, let's call it, an SLX-driven culture.
I like that SLX driven culture. Sounds like a car. Like, you know, it's the something SLX,
you know, get it today. It's, it's on sale. It's a new sales event, right?
That's a really good question, because a lot of times I've seen a lot of different enterprises, a lot of different organizations, implement culture sort of backwards. It's where it's just like, okay, that's not how we're going to solve this thing.
DevOps and SRE and perhaps it might be good to kind of talk about how they're related maybe after this.
But it's interesting because I've seen people create like DevOps
teams.
So it's like, what are you?
It's like, oh, I'm on the DevOps team.
And it's like, that kind of goes against my understanding and my view of what DevOps should
be, which is cross-functional teams that have shared fate responsibility.
Right.
So, let's say, everyone in a company... let's make up a fictitious company.
It's a small startup.
It's like 100 some odd people.
You got a bunch of developers, a bunch of operators, a bunch of marketing people and everybody else that's working in the organization.
And let's say the site's down.
Okay.
Imagine yourself as the CEO of the company.
If you walk in and you're like, okay, well, the site's down.
We're not making any money. And if a bunch of people start arguing over, like, well, you know, I pushed the code, it worked fine, but now it broke in production; and someone else says, well, you know, it wouldn't break in production if the code were more stable; and then someone else says, oh, well, it's not the code, it's actually the hardware; and, you know, the hardware people are like, well, it's not the hardware, like the CPU or anything, it's the network, so blame the network. Ultimately, at the end of the day, all of them are getting less money in their paychecks
because the company's not making money because the site is down, right?
We have to take a shared fate responsibility model where everyone understands that we're
on the same team here.
So it doesn't make sense for one person to try to dictate what the other person should be doing if it's not in service
to the customer, right? To the user who's actually accessing the services. So having said that,
the SLIs are very, they're very technical, right? So it's a measure of what we would consider to be
good for this particular service. It's not extraneous noise that's entering the system.
It's specific to whatever it is that we want to be able to measure, right? So for example,
like the web front end, we need to make sure that it's serving up proper requests, or maybe it's the database. We need to make sure that every query that's coming out is served in X amount
of milliseconds or something like that. So who's the best person to set that? Well,
realistically speaking, it's the person who knows the most about how this service is used by the services that are calling it or the customers that are calling it. Now, whoever that person is,
should be the person who's helping establish these types of SLIs and these types of additional
metrics that we want to be able to gather.
That could be a developer who's intrinsically, you know, understands all of the ins and outs
of the system, but maybe not because they don't necessarily know the usage patterns,
right?
Because they're focused on code and making sure that, you know, functions work and we
have really good data structures and things like that.
You could go to the ops people and you could say, hey, you guys, you know, what's an appropriate response code? But if you ask a bunch of ops people, you know, as long as it serves, they're fairly happy.
So the answer to this is kind of interesting because the SLO that you want to actually
establish is really up to the business need, or it's up to the customer-focused understanding of what that business goal should be. So the SLI could be driven by the people who understand the technology, but the SLO should be driven by people that understand the business outcomes that we want to establish.
That's really interesting, then, yeah, breaking it into the two teams. And then, so, I know you explained it earlier, but I think you need to, again, help me now understand.
If you say the SLO is more the business view of things and the SLI the technical, isn't the SLA... I always thought the SLA would be more like on the business side, where we talk about, is the site available?
So maybe you just brought in another set of confusion for me,
or maybe I just misheard and you actually said SLA,
but I heard SLO.
That's the SLO.
Can I take a crack at it to see if I understand it?
Sure, by all means.
Correct me, I might get this totally wrong,
but from what I was getting with the SLA and the comment I made earlier about the misuse of it, is an SLA is an actual contractual agreement of we're going to deliver this type of performance or this metric to you.
And if not, this is the actual repercussion of what we're going to have to service back. So if we're a third-party payment processing and we promise you we'll process your credit cards within 200 milliseconds and we violate that, we're going to have to pay you.
Part of our SLA agreement is at 200 milliseconds.
If not, we're going to have to pay you or give you some kind of discount towards your monthly bill.
And that's really what an SLA is. Whereas I think a lot of
people just kind of use it as a metric. They just took the metric part out of it and ignore the fact
that it's an actual something that's written up in a contract with repercussions. Whereas what
you're saying is the SLI is, you know, the bit or starting with the SLO, the business people are going to say, we need to have the site up and running and our customers, you know, whatever, we're going to have a promotion.
And we want to make sure our promotion can handle a 30% increase in traffic with less than a half a percent of errors.
Right.
So then the technical team has to figure out the slis
that they can use to measure and deliver that.
Yeah, that sounds good. It's kind of a fun one. I think a lot of people tend to confuse SLAs and SLOs because they don't really make a nuanced distinction between the O and the A, right? So what we'd like to do is differentiate them, because sometimes we really want to focus on SLAs as a relationship between, like, you know, a client
and a provider or like a provider and a customer, right? Um, if you're, you know, a retail company
and you, and you're entirely B2C, right? You're a business that services customers as opposed to like other businesses, right?
There's no service level agreement that's established with the public, right?
If I go to, you know, Sebastian's amazing sock shop.com, it's not a website.
If it is, kudos to whoever's putting that together.
But let's say if you were to go there and it's down, I'm not going to have to pay all of my
customers some stipend because the website is down. However, if I'm a provider of like a third
party transaction API that does credit card processing and I go down, my business being
down now affects your business in the example that you mentioned, because you can't clear any checkouts. So in that
situation, you need to have a contract that says, if I miss my SLO, which is purely what we want to
define as the proper target for a service to operate at, I need to go do something. So that's
why we have that little bit of a distinction. And naturally, if you kind of expand on that,
you want your SLOs to be a little bit tighter than your SLAs. So say you want to maintain, you know, a 99% uptime SLA. So if we're down for 1%, that's fine; if we go over 1%, then we have to compensate you for something or other. Then we better be sure internally that we're not hitting that 1% territory.
So we might have a 99.9% SLO, but our SLA is a little bit looser than that.
So we can still break it, but we can recover and not have any sort of, you know, legal ramifications or something like that happen.
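A quick back-of-the-envelope comparison of those two numbers, assuming a 30-day window, shows why the tighter internal SLO leaves a buffer before the contractual SLA is breached:

```python
# The 99% SLA and 99.9% SLO are the figures from the conversation; the 30-day
# window is an assumption for the sake of the arithmetic.

WINDOW_HOURS = 30 * 24   # 720 hours

def allowed_downtime_hours(target: float) -> float:
    return WINDOW_HOURS * (1 - target)

sla, slo = 0.99, 0.999
print(f"99%   SLA allows ~{allowed_downtime_hours(sla):.1f} hours of downtime")   # ~7.2 h
print(f"99.9% SLO allows ~{allowed_downtime_hours(slo):.2f} hours of downtime")   # ~0.72 h
# The gap between the two is the recovery buffer before contractual penalties kick in.
```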
Cool. Thank you for that. Hopefully that cleared my SLI, SLO, SLA mix up in my head.
That's good. Hey, earlier you made a great point about DevOps where you said
just because an organization now has a DevOps team, most of the time that means they got DevOps
wrong because it should be shared responsibility of multidisciplinary people in the team
delivering value to the customers. What about SRE teams? I see people that now say, well, I'm in the SRE team.
Is that the same thing?
Or what do you see from an SRE perspective?
What an SRE team should do?
That's it.
Yeah, that's really interesting.
I think that there's a, first of all, it should be noted,
there's a difference between titles and actions, right? So, you know, if I want to claim that I'm a senior managing director of, you know,
a tiny one-person dog walking company, I can claim that, but what that actually means is, like, I'm the only employee inside the company. So the title doesn't necessarily match with what ends up happening. And every now and again, we'll stumble across that, where, you know, a director of a cloud reliability engineering center of excellence or something or other turns out to be very new to the position or doesn't really have the background that we were expecting and whatnot.
So I really like defining it based on actions. So like, what does this person or this group of people do? And does it align with what my sort of definition of what they should be
thinking about does? An SRE team is different from sort of like a DevOps team. So first of all,
let's differentiate the two of them and then talk about like why an SRE team is actually an okay
thing to have, I think, in my opinion. So first of all, DevOps: when did the term emerge? I think it emerged in
like late 2008, I want to say. It's essentially like a set of principles, right? It's a set of
practices or guidelines, some cultural stuff that's thrown in that's really designed to break
down silos, right? So like we don't have IT operations existing in a bubble where the rest of the people
that need to have IT functions be isolated for some reason, right? So that's sort of what DevOps
is kind of focusing on. And it's really focusing a little bit more on like release engineering,
if it's what I see out in the field. So like CICD or, you know, as soon as anyone mentions Jenkins,
that's usually like, oh, they're the DevOps team, right? So that's where that's kind of, you know, relegated to. Site reliability engineering, on the other hand, is like a set of practices that, um, you know, different companies, particularly Google, who kind of coined the term, have found to work, that facilitate our ability to actually get something done.
So I'd say DevOps is more of a specific set of practices,
and SRE is a little bit broader definition of a way of thinking about something.
There's a great t-shirt that a few of our developer advocates have
where it says, class SRE implements DevOps.
And if you're a coder, I think that's a perfect way of sort of thinking about this, right?
DevOps is like a set of tools that you have
and a set of practices,
but site reliability engineering is the pursuit
of something bigger than just what DevOps has
as its purview, which is kind of fun.
So in that aspect, right, if you have a DevOps team,
most of the DevOps teams I've seen
are focusing on basically release engineering.
They're an incorrectly named release engineering team that focuses on infrastructure as code and like CICD pipelines.
That's really what it kind of boils down to.
They're not thinking about necessarily breaking down silos, particularly if it's like their own team that has to maintain like uptimes and things like that.
Site reliability engineers, on the other hand, tend not to be on a separate team, but instead
tend to be on individual product or service teams, but then they also correspond with each other, which is a little bit different.
So you think of it as like a distributed team that has like a core set of functionality, but that's embedded into another
product or service. So another way of thinking about this is if you're, you know, that, that
retail company that we keep going back to just, you know, fictitious company that sells socks or
whatever, the payments team might have an SRE on their team who is solely tasked with making sure that we're implementing, you know, proper reliability features into whatever the product is that we're building.
You know, if it's a handful of different services, if it's the appropriate monitoring,
the appropriate scalability, and so on and so forth. In that situation, they're embedded on
that team, but at the same time,
they're a member of the site reliability engineering organization, if you want to
think of it that way. So the SRE team is actually just a collection of people that are on a bunch
of other teams that are doing something else. So it's a different way of thinking about it,
but they're tasked with making sure that whatever area they're working on or whatever service
they're aligned to is reliable and working appropriately.
And because they're a scarce resource, teams tend to vie for their attention.
And it might be a case where, you know, as the organization grows, we can't hire a bunch of SREs, because they have to have a skill set in software engineering and in systems operations and large system design, and there's very few people in that middle section of that Venn diagram that kind of have a foot in both. So teams with big, complex projects and big, complex, you know, pieces of infrastructure or architecture that you have for whatever company it
is that you're a part of will delegate them to the most critical components of
their infrastructure and then say, make sure that this thing, you know,
like never fails and then other teams can implement similar practices,
but maybe you don't have someone who's dedicated specifically to implementing
some of the features.
Well, the way I understand it, I think so, yeah.
So basically they come in and hopefully not just, you know,
build something and fix it, but also kind of mentor and onboard
so that the team later on can kind of take over what they've done
and continue these practices and make sure that the system stays reliable
if they then have to go off to another team to help them kind of get all these things established, right?
Yeah, exactly. And keep in mind that they're engineers that focus on system reliability, and the entire practice is essentially about applying software engineering principles to solving, you know, scalability issues. So it could be just, like, writing a bunch of tools to make sure that they can do their job better, you know, investigations or monitoring or leveraging tracing products, for example.
DevOps could be seen the same way, right? If you do DevOps right, you would also have people that kind of enable other product teams to become fully responsible end-to-end, help them build their pipelines, build their reliability,
build their monitoring,
their SLIs and their SLOs and SLAs,
and then kind of leave them in a state
where they become self-sufficient,
where they are autonomous in the end.
And then, so I think we should apply this to DevOps as well,
at least if you think about it the way I think about it.
It should, but usually when you do that, you consider yourself a little bit more of an SRE
than a DevOps practitioner necessarily. Because the other thing to keep in mind is
development and operations and establishing these feedback loops and making sure that we can release
smoothly. On the one hand, site reliability engineering does focus on the ability for us
to like release code successfully
because if we release something and it doesn't work,
then it's not reliable and we got to fix it.
But at the same time, like there's infrastructure
to be associated with the release process.
And that role really doesn't fall on any individual team.
When you think about it, it's not really a bunch of developers that are writing code, but it's not necessarily ops, who host the applications that we're writing as a company. It's this middle ground that helps facilitate
the release of additional code, which is why I think DevOps tends to be a little bit more
relegated to different components of release engineering, and then they throw additional cultural components into it.
And then you slap an SRE sticker on the side of it, like DevOps, now with SRE, and kind of go from there.
So it's always interesting to see how companies implement these sorts of things.
And what's funny, too, is in the field, when I engage with customers,
and we're seeing a lot more people now saying, oh, we're with the SRE team, the difference between the concept of it and the practice that I'm seeing so far is that a lot of the SRE people I'm interfacing with are mostly engineers who are tasked with looking at the performance of their system
and finding ways through code to improve it.
So they're looking at, you know, everything's running,
but they want to take a look, okay, this process is taking five seconds.
What can we identify as a hotspot to then have the development team fix?
So they'll look at the trace, they'll look at the execution and say, okay, we see this is making, as a bad example, an N plus one query on the database.
Here, give it to Sal or whoever to fix it and make improvements on that side. So it's still,
at least from what I'm seeing in the field, it's still sort of in its infancy in a lot of ways, obviously not at Google and in some places, but I think a lot of people are starting to embrace it.
But based on the definition and the explanation you gave on there, I think there's quite a
large gap that companies have to fill.
And I think that's always the challenge, right?
Because I'm sure this is, hey, you're on the SRE team now, but none of your other
responsibilities have changed.
And now you also have to do this new task. So I think it's probably a lot more of just trying to figure it out.
Their only guidance is probably reading the SRE handbook and trying to take it from there
without having someone, an expert come in. A lot of times when you have Agile or other things like
this, you get people in and help train the company and help train people on how to do these things. I'm sure there's not a lot of that going
on, uh, in the SRE push at this moment though. Yeah. It's, it's interesting you mentioned that.
So there's a great book. It's like one of those little mini O'Reilly books that you can get that are usually compliments of someone, so like someone sponsors it. I would be remiss if I didn't plug one that actually just came out, which is called SLO Adoption and Usage in Site Reliability Engineering. I believe it just came out in April, and it's an O'Reilly report, compliments of Google Cloud. It's free. If you just search for
that title, you can download the PDF, but
there's fascinating insights inside of it that I was actually reading in sort of preparation for
the podcast. Google has, we acquired DORA, the DevOps Research and Assessment organization, I think that's what it stands for; I should probably double-check that. But it produces this massively complex and incredibly
insightful
really thick document
which talks about where a business is at
with respect to DevOps and
how do high performers outperform
low performers and what are the qualifications
and how many different dimensions do you want to slice and dice
things. I mean, it's proper, like, it's essentially a research paper. It just happens to be published, you know, just on our website.
Absolutely. Yeah. And it's really interesting. Now this book is kind of like a mini version of
that that's specifically focused on SLO adoption, which I would highly encourage people to take a
look at. And there's some interesting statistics. And I wanted to mention a couple of these,
because I think they're super apropos. You mentioned this tends to be kind of new.
43% of businesses that responded actually have an SRE team. Let me say that again,
because I kind of stumbled through it. 43% of businesses have an SRE team. So right off the
bat, there's not a lot of people that are
actually implementing SRE that would classify themselves as an SRE team. And most are actually
under three years, meaning they haven't been investing in this for the past decade.
This is a relatively new thing. And when we talk about SLOs and SLIs and things like that,
34% of the people that were surveyed actually implement SLOs, but 31% have SLIs, which actually define their SLOs. In other
words, people just kind of come up with a number and they're just like, oh, it should be like 99%
uptime. But then when you ask them like, well, have you, do you have proper, you know, defined
metrics and SLIs that can actually inform this? They go, oh yeah, no, not really.
We just sort of set one and then, you know,
we yell at a person if the site goes down,
which is not really the purpose of site reliability engineering.
So there's tons of extra insights inside of a little booklet.
Very cool.
So SLO adoption, SRE.
We'll put a link to this as well in the proceedings.
I think there's a lot of more stuff to talk about.
I just want to get your opinion on one quick thing, and I think then we can wrap it up.
But we've been promoting and pushing a lot and wrote a lot of blogs and did a lot of presentations on shifting left SRE.
And what that actually means is shifting left
the enforcement of SLIs and SLOs
as part of your delivery pipeline.
So if you use the same concept of what metrics are important for me and what my contracts to the users of my service are, and I have a pipeline where every build gets properly tested with a representative amount of load using load testing tools, then I believe, or we believe, we can also look at SLIs and SLOs
and let the pipeline fail in case a code change is either jeopardizing the response time that
we promised or the number of database calls we make to the backend system, or any of these.
Have you seen this as a movement, as a push to enforce SLIs and SLOs in the delivery pipeline?
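A minimal sketch of such a quality gate might look like the following; the metric names, thresholds, and exit-code convention are assumptions for illustration, not a description of any particular pipeline tool:

```python
# Compare a build's load-test results against SLO-style limits and fail the
# pipeline if any limit is exceeded. All metric names and limits are invented.
import sys

SLO_GATES = {
    "p90_response_time_ms": 500,   # the response time promised to users
    "error_rate_percent": 0.5,
    "db_calls_per_request": 4,     # guards against an accidental N+1 call pattern
}

def evaluate_build(load_test_results: dict) -> list:
    """Return a list of human-readable SLO violations for this build."""
    violations = []
    for metric, limit in SLO_GATES.items():
        value = load_test_results.get(metric)
        if value is not None and value > limit:
            violations.append(f"{metric}={value} exceeds limit {limit}")
    return violations

results = {"p90_response_time_ms": 640, "error_rate_percent": 0.2, "db_calls_per_request": 9}
problems = evaluate_build(results)
if problems:
    print("Quality gate failed:", "; ".join(problems))
    sys.exit(1)   # a non-zero exit lets the CI/CD pipeline mark the build as failed
```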
It's really interesting that you bring that up.
Something that I'm really interested in and really passionate about is the programmatic execution of, like, increased business functionality, which is kind of an all-encompassing term for this.
So I've seen it internally, but the difficulty there is you have like an incredibly niche market in the sense that you have to have customers that are super well
established, that understand how to do like SRE to begin with. They have to understand how to have,
you know, proper CICD pipelines. They have to be okay with things like canary deployments,
where I can actually roll this out to receive production traffic and then monitor for changes
to actually get insights into whether or not this version of the code is in fact better than the previous version of the code and so on and so forth. So in that scenario, a really
interesting future looking statement on how all of this stuff would work would be a machine learning
derived automatic heuristics system that would automatically take in every single SLI that's inside of your
multi-layered system and automatically calculate what is a considered good norm. So like this is
the expected operations of all of these things. And then when you release a new version of whatever
component you would possibly release in a fashion that will automatically understand how to do
canary deployments,
but then also canary promotions. So in other words, like we're going to release a new version
of the code, we're going to slowly send traffic to it in very small percentages, like maybe like
1% in a specific geographic region. And then we'll go a little bit larger and a little bit larger.
And then once we're confident that the new version actually works, slowly roll that out across the
fleet. And when we do that, we're constantly monitoring and seeing how this
performs versus the old version. And then from there, we can start to take actions
and have signals based on whether or not this is a good thing or a bad thing. So if, for example,
if we roll it out and total latencies go down because, you know, we've optimized some sort of networking library, but, let's say, as part of this implementation it has a little in-memory cache or something like that, and memory consumption is going through the roof, then as a result there's unintended consequences
of it. We can choose to make a decision based on parameters that we've specified ahead of time.
What's interesting about this scenario is that it's doable with today's technology, actually.
If you're leveraging Kubernetes for, let's say, microservice management,
if you're leveraging Istio for your service mesh to get that service level observability that you want,
you can then take actions on any of the signals that are coming out of the system. And if you have, you know, distributed tracing frameworks built on top of
it, then you can get even better insights into how all of this stuff is operating. The problem is that in defining what the idealized state would be, you have this really, really large,
like a non-exhaustive problem. So you have to think to yourself, like,
what are all of the different possible combinations of things that would constitute a
quote unquote, well-running system? And how do I define these, uh, you know, ahead of time
so that the system can then be informed as to how to take actions afterwards.
So you can either solve that deterministically ahead of time, in which case you're going to
spend more time defining how you want your system to run than actually running your system.
Or you can define it as a series of probabilistic heuristics that are derived from statistics that are collected in the system or statistics that are collected and then predicted using some like machine learning models. And what we're starting to see, especially a lot of interesting monitoring companies and
service-based companies start to do, is implement a little bit sharper, like, ML
models that can actually predict whether or not you will have a successful code release based on
code coverage, based on additional signals in the build, based on additional like unit testing,
integration testing, and canary feedback, and then can automatically take that sort of response for you. But that is so cutting edge that we're still working on what that looks like from, you know, just a pure engineering
perspective, let alone an implementation spec, right? So maybe if we're optimistic and we get
a lot more people excited about DevOps and SRE type culture,
maybe we'll see that in, I don't want to make a prediction, but like maybe five,
10 years from now. But until then, like we're not even monitoring half of the stuff yet. Right.
So let's focus on the low-hanging fruit of making sure that this darn service is monitored, and make sure that we can keep it up, before we start going super crazy.
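The canary comparison described above could be sketched roughly like this; the metrics, tolerances, and numbers are invented, and a real rollout controller would of course be far more involved:

```python
# Compare the canary's measurements against the stable version and only promote
# if nothing regresses beyond pre-agreed tolerances. Everything here is made up.

TOLERANCES = {            # maximum allowed relative regression per metric
    "latency_ms": 0.05,   # canary may be at most 5% slower
    "error_rate": 0.00,   # no increase in error rate tolerated
    "memory_mb": 0.20,    # up to 20% more memory is acceptable
}

def should_promote(stable: dict, canary: dict) -> bool:
    for metric, tolerance in TOLERANCES.items():
        if canary[metric] > stable[metric] * (1 + tolerance):
            print(f"hold rollout: {metric} regressed ({stable[metric]} -> {canary[metric]})")
            return False
    return True

stable = {"latency_ms": 180, "error_rate": 0.002, "memory_mb": 512}
canary = {"latency_ms": 150, "error_rate": 0.002, "memory_mb": 900}  # faster, but memory spiked
print("promote" if should_promote(stable, canary) else "keep at 1% / roll back")
```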
That's an awesome idea.
I want to put one word in your head, maybe for some research for you,
maybe for a future conversation,
and Brian knows exactly what that word is going to be now.
Look at Keptn.
That's an open source project we are curating.
So K-E-P-T-N.
With your German background, you can probably figure out what that means.
But I'll leave it with that.
We are trying to contribute something like this
to the open source community
that is tackling some of these problems.
Interesting.
Cool.
Hey.
Super cool, man.
Well, Sebastian, I know you probably have
another hour or two or five to talk about
because you have a lot of experience in that field. But I think we want to wrap it up now and, you know,
invite you for another episode at a future time,
because I'm definitely sure there's more we can learn from you.
Yeah.
Oh,
absolutely.
Thank you so much for having me.
It's been an absolute pleasure talking
to you.
And Brian, we'll skip the summary later today, because normally, Sebastian, I kind of try to summarize, but I think I would not be able to. I think people should just listen again to that SLI, SLO, SLA description that you kind of walked me through multiple times until I hopefully finally got it.
So awesome.
Thank you so much.
There are some summaries in there as well.
It's funny, too, because this is normally the part of the show when I ask, oh, any appearances you're going to be making.
Virtual appearances?
Not today.
Yeah.
Oh, yeah, there might be. Are you doing anything virtual online?
Um, no. We were going to do some stuff with Next, and we're working on kind of figuring out how to transition Next over to, like, digital content. We have a bunch of, like, presentations and things like that figured out, but we wanted to do an entirely digital version.
And then we're trying to figure out like what,
what are the best logistics for that?
And how are we going to meet with people?
And we're taking a slightly different approach where we want to like reach out to specific people that we know have specific concerns and have it be a little bit more personal, which would be kind of fun.
And then maybe release some recordings on YouTube under like Google Cloud Platform and talk about some stuff there. But for me, I'm unfortunately
not speaking anywhere in the near future
because they all got canceled
and they all did not go digital.
Yeah.
Oh, well, we'll see.
Hopefully next year, 2021, right?
Fingers crossed.
It'll be a nice good year.
Yeah, hopefully.
All right. Well, again, thank you so much for coming on the show.
And we look forward to having you back.
If anybody wants to follow you, they should just look you up on LinkedIn.
We'll have a link in there.
Do you do like Twitters or anything like that?
Or do you mostly do LinkedIn for...
Yeah, it's just ThatDevopsGuy on all of them, which is pretty cool.
Oh, nice. I got a lot of things, though.
All right. Well, appreciate it. And we'll talk to you soon.
Thanks a lot. Thanks to everyone for listening. Bye bye.
Bye bye.