PurePerformance - SLO Adoption and Usage in SRE with Sebastian Weigand

Episode Date: June 22, 2020

Keep hearing the terms SLIs, SLOs, SLAs, and error budgets and finally want to understand what they are, who should be responsible for them, and how they fit into SRE (Site Reliability Engineering)? Then listen to our conversation with Sebastian Weigand, who has been helping organizations modernize not only their application stacks but also embrace DevOps & SRE. Learn who is responsible for defining SLIs, what the difference between SLOs and SLAs is, and what the difference between DevOps & SRE is in his opinion! Sebastian, who calls himself "That DevOps Guy" (@ThatDevopsGuy), also suggests checking out the latest free report on SLO Adoption and Usage in SRE, as well as the SRE books from Google, to get started with that practice.

https://www.linkedin.com/in/thatdevopsguy/
https://twitter.com/ThatDevopsGuy
https://landing.google.com/sre/resources/practicesandprocesses/slo-adoption-and-usage-in-sre/
https://landing.google.com/sre/books/

Transcript
Starting point is 00:00:00 It's time for Pure Performance! Get your stopwatches ready, it's time for Pure Performance with Andy Grabner and Brian Wilson. Hello everybody and welcome to another episode of Pure Performance. My name is Brian Wilson and as always my co-host Andrew Grabner. Hey Andrew. What's wrong with you? Why do you call me Andrew? I just wanted to mess with your name today because it's been a while since I messed with it. My favorite messing with your name was Andy Candy Grabner from a Halloween episode several years back.
Starting point is 00:00:47 Several years, Andy. Can you believe it? No, it's amazing. Wow. Yeah. Anyway, Andreas Grabner. Andy, to everybody else. To answer your question, I'm doing actually pretty well. Week seven or eight or nine of the lockdown is the time of the recording, and I am not only seeing the light of the sun that is shining through the clouds now, but we're also seeing the light at the end of the tunnel, at least in Austria here. Australia? You just said Australia. I said Austria. No, I think I'm gonna edit this and play it over and over again in this episode and see what it sounds like. At least in Austria here. At least in Austria here. At least in Austria here. Yeah, well, we'll see how things go. You know, starting to see things.
Starting point is 00:01:31 Yeah, it may be the end, may not be. It might be relapsing. We'll see how everything goes. But glad to hear things are going decent on your side of the pond. I imagine you don't have people going out with guns and awful hate flags to try in your country going on so much as we do here. But yeah, interesting things going on. I'm glad to be talking to people,
Starting point is 00:01:52 you and our guests today. It's lifted my spirits a bit. So let's go. We have a fun show today, a hot topic. I think we're seeing it more and more all over the place. It's been around for a while though, I think. I think this goes back. Well, our guest can tell us, right? So why don't you go and introduce our guest and we can get into it. Sure. So, well, the first thing I asked my guest when we got on the mic just before the recording
Starting point is 00:02:20 is that you must have some German background because your name, Sebastian Weigand, sounds very German. And on his LinkedIn profile, he's called that DevOps guy and he's currently working at Google. So that's an interesting, great combination to talk about all things DevOps, all things SRE, SLIs, SLOs, SLAs.
Starting point is 00:02:37 I'm sure we'll find a lot of things. But I think I want to hand it over to Sebastian now to introduce himself to the audience. And then Sebastian will take it from there. Definitely what we want to learn from you is, what is this whole thing about SLIs, SLOs, SLAs? Because that's what's really hot on our minds these days. So take it away.
Starting point is 00:02:56 SEBASTIAN SCHMIDT- Sehr gut. I'll get started then. Yeah, so thanks for having me on. I really appreciate that. Like people said, my name's Sebastian. I work at Google as an application modernization specialist in our customer engineering department. Essentially what that boils down to
Starting point is 00:03:12 is as organizations try to think about moving up to the cloud or realistically just modernizing their application stack and all of the infrastructure that's associated with it, they take an interesting path. And part of that is a digital transformation that's spurred on by things like DevOps and SRE culture. Part of that is understanding new technologies. So things like Kubernetes and containers and advanced monitoring and distributed tracing and things like that. I've been doing DevOps since before DevOps was a term because I was born in the fires of systems administration way back when.
Starting point is 00:03:47 So I feel like if you started as a Linux admin, you naturally sort of progress to like, you know, I have to go set up a server. And then it becomes, well, I'm getting asked to set up a server every other Tuesday. So I might as well write a script to do that too. Well, I seem to be getting asked to do this every day now. So maybe I should move to some different configuration management type options. And that inevitably leads into this like programmatic approach to systems administration and operations, which is kind of interesting because it sort of naturally dovetails with like the concepts and the components inside of site reliability engineering, which is always a fun
Starting point is 00:04:27 topic because SRE and DevOps and all of these fun industry terms are more than just buzzword bingo, but at the same time, very difficult to define. Depending on who you ask, you're going to get a variety of different definitions for pretty much everything. And even if you define them, you get really different interpretations of what they mean. So someone's like, Oh yeah, I know what that is. But when you ask them, how do they, you know, use a technology or use a components there, they're like, Oh yeah, I had really no idea. Or they use it completely differently than you were expecting. So it's kind of interesting to see how this like comes together and, and, uh, informs our opinion of how we should tackle like modern i don't want to say business because this also affects like academia
Starting point is 00:05:09 and government things like that but it just affects your ability to leverage it to solve organizational problems you know it's funny you you mentioned that you came from uh being a systems administrator my my favorite use of the word devops is when you have a traditional sysadmin who changes their title from sysadmin to DevOps admin and does nothing different. Absolutely, right? It's equivalent to like, hey, do you guys have security implemented in your system? And they're like, oh yeah, we bought security. It's that commercial off-the-shelf product that's right next to like stability and scalability right you just buy it it's not a thing
Starting point is 00:05:49 like that yeah so um it's interesting you said so you were basically born born into devops before devops was a thing do you think now that these new kind of hype terms like s3 and i'm sure there's another thing coming up soon just around the corner do you think if um if devops wouldn't have been coined as devops but if site reliability which obviously is things that people have done in the past too but if this term would have caught up earlier that we would have just used this term from the beginning and then it would have kind of evolved into something else because in the end, it's very adjacent anyway? It's interesting. I don't think so.
Starting point is 00:06:29 And because I think that you need to have an iterative approach to how we tackle operations tasks. So I think the reason why DevOps became a term was because you really needed to think of what you're doing in an IT space from a more holistic perspective. So it's not just operations. So like way back in the day when I worked in managed hosting, you had like a net ops team and you had like an infra ops team or a web ops team or things like
Starting point is 00:06:57 that. But we really didn't focus on the developer interactions and the, and the appropriate feedback loops that need to be established to really, you know, cohesively form a, a, a high-performing team that can release things, you know, on schedule or, and, and more importantly, like on budget. So I think that whatever it is, that's going to happen in the future. I think you're, you'll continue moving up the abstraction stack, but we need to figure out how to essentially like cut your teeth on the nuts and bolts of systems administration and operations. But at the same time, developers needed to cut their teeth on programming languages and frameworks and recomposable units and modularity and distributed systems and things like that. And it's only until you move forward and the tools progress that you can start taking advantage of newer ideologies and newer methodologies that then empower the next generation.
Starting point is 00:07:50 So you kind of like level up as you as you get to a certain point. And then from there, you take it and you go do something else that's even more advanced or more capable. So whatever you want to call it, you still have to do one before the other, before you realize that you have to do both of them together. Very well put. So let me dive into one topic, which, you know, when we reached out to you to do a podcast episode, we said a big topic or a lot of terms that are kind of thrown around these days and by the industry also by us at dynatrace is the concept of slis and slos and slas and as i think there's a lot of people that are just using it as the term devops and sre and noops just to make buzz and say hey we know the latest
Starting point is 00:08:41 shit and we can we we obviously have a solution for that but then a lot of people may not even know what that means at least i get a lot of feedback when i just put the terms as allies and as slows on you know a small a small group of people sometimes at least has the the courage to say what does this mean again some people just not and they don't know actually and just assume that later on they may learn about it but would you do us the favor and um give us the the best or like a good description of what this is all about what these different terms are what they mean um so that we can kind of you know at least establish the baseline knowledge that we all need to have absolutely um and also keep in mind too that there's a broad term of like site reliability engineering, which is sort of a, you know, it's a subject matter, right? But then there's also sort of Google's approach to it and Google's opinionated implementation of a set of practices and principles that work really well for us that we've sort of codified into site reliability engineering. And if you Google like SRE book, I think the first one and the workshop book are available for free if you want
Starting point is 00:09:50 to read them online. But to get to SLIs, SLOs, and SLAs, it's always fun to go through all of those. And I think we're actually leaving one out too, which is important, which is error budgets, at least with our specific interpretation of site reliability engineering. So let's start with SLI, right? And SLI is a service level indicator, right? And what I like to think of this as a good enough measure or a well-defined measure of successful enough. You know, for example, an availability SLI could be the proportion of requests that resulted in, say, a successful response. In other words, this is the metric that determines a service's reliability or a service's performance. So if you want to think of it like, you know, take like a CPU metric, right? If my CPU is high, is that a good thing or
Starting point is 00:10:39 is that a bad thing? Well, it depends, right? If I'm serving a bunch of web requests, then maybe if the CPU is high, I will cease to be able to serve web requests. But if I'm doing a bunch of video transcoding, if my CPU is low, then something is not working properly because I need to use all of the cores on my system to encode video. really tied to a service's availability, we want to focus on what it actually means that gives us a good indicator that a service is performing properly. So a good indicator would be, let's say if you're a big, say, search company or whatnot, if all of the requests that are coming in on the web are being served properly, and the definition of properly depends on your business goals, right? So like, what if every request that comes in is being served, but it takes three seconds in order for us to give them a response? Well, maybe that would be a little bit below the threshold.
Starting point is 00:11:30 So we can set an SLI to determine what that is. And keep in mind, you can have multiple SLIs for individual services. So for example, we want most of the web requests coming in to be, let's say, 200s. We want them to be successful requests. But at the same time, we also want maybe latency to be under a specific time. And keep in mind, those are all sort of sliding metrics. Now, when we get into SLOs, which is the service level of objective, that's a top-line target for a fraction of successful interactions. So if we establish the SLI as like the number of successful requests,
Starting point is 00:12:07 the SLO could say, okay, over a specific period of time that we've established, the average number of requests that are coming in need to be above a certain, say, watermark or something like that. So like if you have a 97% availability SLO and you get, you know, a million requests over, you know, some amount of weeks in order to meet our SLO, and you get, you know, a million requests over, you know, some amount of weeks in order to meet our SLO, we would need like 970,000 successful requests. And that's how the SLIs relate to the SLOs. Now, SLAs are kind of interesting because SLAs tend to be what we, what we see in a lot of like a corporate documentation or contracts or things like that. SLAs are really, realistically, it defines what you're willing to do if you're failing to meet your SLO.
Starting point is 00:12:54 So it's more of a contractual thing than it is a technical thing. So, for example, if we fail to meet an SLO for any cloud service provider, you know, any cloud service provider or any third party software as a service or something like that, if, you know, you went to the website and it's down, well, that's a problem. So your SLA establishes, well, if it's down for such and such amount of time, we're going to refund you such and such amount of credits, or maybe, you know, give you some additional money or refund whatever it is you paid for or something along those lines. So that's how all of these, all of these relate to each other. Do you have any questions on those?
Starting point is 00:13:31 Go ahead, Brian. Okay. So, well, one thing, some comments and some things to just confirm that I grasped them well enough. The one comment I wanted to make about SLAs is that every time I hear that definition, I think about how far away from that definition the use of SLAs has gotten, right? Because most of the time, I think people are looking at an SLA and say, oh, my SLA is my 90th percentile needs to be under 500 milliseconds, right? Which is the measure.
Starting point is 00:14:00 It's not the repercussion or what you're going to do. It's not the contractual bit, but it's like, oh, we promised our customers export kind of X response time. And it's just looked at as the metric. But yeah, but anyway, so the thing with SLIs and SLOs, if I understand them correctly, just to make them super, super simple, just to keep the difference between them, because to me, it's always like, well, which is the SLI and which is the SLO? So the SLI, that was the first one you talked about, right? The indicator?
Starting point is 00:14:32 So I think the easy way, at least for me to think about them, and I want to make sure I think about them correctly enough at least, is on a very basic level, the SLI is what it is that you're going to measure, and the SLO is what the acceptable measurement is. Yeah.
Starting point is 00:14:48 Right. That's a good way to put it. And then obviously you have, it should be tied to actually availability and all that kind of stuff, but just in terms of keeping them, yeah. Keeping them straight in the head. All right.
Starting point is 00:14:59 Good. Good. Yeah. So I wanted to just confirm, cause I kind of always thought that that was an okay way to think of it. But now that you're on, I want to confirm because you're what I'll call the expert. Yeah. And I think the important thing to take away from this is a lot of people focus on the I or the O or the A or that sort of thing. And I really want to focus on the S part of that. It's per a service.
Starting point is 00:15:21 Yeah. And when we have to think about a service, we have to think about users. And this is not something that I think a lot of people grasp sort of intuitively because a lot of systems administrators and ops people tend to jump in and then start to think of like, okay, well, I'm intimately familiar with whatever it is that I've been asked to manage or spin up or something like that. And as a result, I can think of like a bunch of metrics that I want to capture in my mind and a lot of like alerting policies that I can create and things like that. But realistically speaking, none of these matter from a user's perspective, right? So if I'm a user and I want to go to your site and it's up, that's all I care about. I don't care about, you know, additional CPU overhead.
Starting point is 00:16:03 I don't care that your disc is at 98% capacity. I just care that I can actually get to the, um, get to the site and I can start doing something. Yeah. The interesting thing about this as well is the, um, the service level indicator doesn't need to be, uh, just focused on let's say good or bad. Well, everything kind of breaks down into good and bad, but there's different quantities of bad. So for example, uh, let's say your site is up and it's available and it's down and it's not available. We kind of think sort of binary, but the problem with that is like retail sites in particular. Uh, when I talked to a lot of the leaders in that space and I asked them like, what's the most important like measure of availability? They actually care about latency more than they care about
Starting point is 00:16:47 site availability, which is kind of interesting. It's, it's not sort of intuitive, but when you think about it, it makes a lot of sense, right? If I'm a user and I'm a, you know, a customer of yours and I want to go to your website to go buy something, if it's down, I might think, Oh, well, maybe they're doing maintenance. I'll check back later. But if they go to the website and they get served some content, but then they, you know, they, they search for like a product and then it just goes down or the latency is, is just abysmal or they can't add something to their checkout site. They might get really frustrated. And as a result, they might leave the site entirely and never become a customer where they might say, well, the heck with this
Starting point is 00:17:22 site, I'm going to go somewhere else and grab it from over there. So it's really focused on your business and it's really ultimately focused on your customers and your clients that are actually accessing the service. And do you expect extend customer? Because I know we do it sometimes in context, but I wonder if from the SRE point of view that you're talking about, will sometimes define a service customer as the service above the service you're on? So not necessarily the end user, but the service that's consuming your service. In the way you're talking, it seems like that does not necessarily apply in the way you're talking. Obviously, in the long chain of command
Starting point is 00:18:02 of everything, that ends up impacting the actual human user. But when you talk about customers, is that part of the mindset, or is it just strictly the human at the end? It's wrapped up in the mindset, but there's different ways of approaching this. And realistically, what you want to try to figure out is how to have service-level observability that is aware of multi-service topologies, which I think is the underlying question that you're kind of getting at. And in that situation, let's say, you know, you have a web front-end service. You might have an SLI that's associated with, you know, its latency and the number of requests that are coming in that are being returned as like a 200 successful
Starting point is 00:18:45 requests, that sort of thing. But what if that web front end needs to talk to five different microservices at the application layer? And then let's say two of those microservices need to talk to two additional microservices also at the application layer. And then those two services each, so four total, need to talk to the database, added like a database tier. So what happens when we have, you know, nested sort of like SLIs, that sort of thing? And what's interesting about that is up until very recently, it was very hard to establish a hierarchy of like connected services and what they could each tolerate. Because if we have massive amounts of traffic on the front end, the web front end needs to scale very differently from the application tier, which needs to scale differently from the database tier. So how do you establish these things?
Starting point is 00:19:35 What I like to do is I like to focus on the individual services themselves and start there so that you understand how this ends up working, but then it largely is dependent upon the underlying application infrastructure. So for example, if you have a database and the database can, you know, it can do, let's say, you know, a thousand requests per second, you know, I'm just making numbers up. Um, if you know that it can do that per instance, and you know, that you can scale horizontally, then once you get to, you know, 900 or so requests a second, you can then say, Oh, I should create another instance. And then maybe like load balance between them or something along those lines.
Starting point is 00:20:13 So when you do that, you're able to have a better view of how all of this stuff kind of like plays together. And then you can start doing like topology graphs. And that sort of gets into like distributed tracing where you can start to start to like recursively construct what an actual request does on the back end and then kind of take that information and then funnel it back into whatever observability system that you would have. It's interesting you mentioned this because for people that are familiar with Kubernetes and microservices and containers and things like that, but are not familiar with service meshes like Istio, I would always, whenever I'm teaching a class or talking to a bunch of customers, or if we have a community event, I have a bunch that happened in New York City back when people could go to such events, right? I would always ask them like, well, how do you know the total amount of communication from one service to another service? Like, what do you do in that situation? How do you calculate that? And it turns out it's actually really difficult to do that if you don't have this overarching view of everything that goes
Starting point is 00:21:22 inside of your cluster. And that's exactly what service meshes like Istio provide. So instead of just focusing on specific SLOs that are set like per service, it takes per service as it's implemented in potentially multiple backends or multiple pods or multiple components, and then aggregates all of that information together so you can better hit those types of SLOs that you want to hit. So that's basically, well, thanks for bringing this up because that's also the way we have kind of approached teaching and educating people about how to enforce SLIs on an individual service level because we at Dynatrace, we've been doing distributed tracing
Starting point is 00:22:05 at least since I've been with the company and that's been 12 years and the company has been around for 15. Thanks to obviously service meshes now, certain things come out of the box, certain metrics for certain environments like Kubernetes. But I completely agree with you.
Starting point is 00:22:20 So what we always said is not only look at your response your response time at your failure rate at your memory consumption on your service but also look at how many back-end calls do you make and then not only aggregate it across the whole service of a particular time frame but we also look at these metrics split by let's, the business function that the service provides. So if you have a shopping cart function that can add a cart item, delete a cart item, provide the sum or whatever else, then these are individual business functions that probably have a different call pattern to the backend. So we also try to educate people. You not only need to look at the number of
Starting point is 00:23:02 database calls you make, but look at them per add to cart, per delete from cart, per login, per logout, per whatever else it is. And then, you know, establish a baseline first of all, and then see how that behavior also changes, you know, either from build to build, from release to release. Because you were earlier saying, if you know how much load your backend database can handle and then you can scale it based on incoming demand, that's great. But what if the incoming demand is due to a coding issue? What if a code mistake is increasing the number of backend service calls by 50% because you're omitting
Starting point is 00:23:45 the cache or you're using a misconfigured library that is normally doing your OR mapping and therefore you're making too many calls to the backend. So, uh, yeah. Oh, absolutely. I think that's, it's funny you, you, you bring that up because that touches on another point, which is services. If we, if we focus the, the SLI and the SLO that we're creating, the SLI is more technical, but the SLO is really more about
Starting point is 00:24:10 making the users or the clients of the service happy. Um, you mentioned like a shopping cart service, um, that might have a very low SLO and it's completely okay for that to happen. And the reason why that's, that's okay is if I'm a user, um, if I can, well, maybe not the shopping cart service, maybe like the order processing service, let's differentiate the two of them. So in other words, if I can go to your website and you, you know, you're selling, you know, like the classic examples, like the hipster shop or the book info application, um, that like ships with Kubernetes and Istio. Um, if I want to buy a couple of things and I hit that checkout button, I don't really care if the order processing system, you know, is a beautifully scalable synchronous system and my order is immediately processed because there's going to be additional latency because someone has to put a product in a box and then ship it to me.
Starting point is 00:24:57 Right. That's going to take a lot of extra time. So if that goes down for like a day, realistically, I'm not necessarily impacted. Like it might affect, you know, you know, depending on if the person paid for, you know, expedited shipping and things like this, but it's a very different metric that we want to establish. Um, I want to go back just a second though, and mention that the, the other concept that's that at least Google takes with respect to SRE is this concept of an error budget, which is directly tied into SLOs. So if you have an error budget, which basically says, this is the number of failures that we're allowed to have,
Starting point is 00:25:33 that's a very different way of thinking about things than, than just establishing like, everything has to be this amazing amount of uptime. And that's what we're going to shoot for because there's always a constant struggle between, you know, releasing new features and making things stable. Cause like in the perfect world, if you ask like an operator, what's, what's your perfect scenario, the operator's going to say like, well, I want nothing to change ever. Like I want the number of requests to stay the same. I don't want new versions. I don't want new features. I know how to run what I have. And that's that. But if you ask developers, what would you want? They say, well, I want to be able to release every brand new feature as fast as, as, as humanly possible. And those are kind of at odds with each other, but an error budget allows you to
Starting point is 00:26:16 programmatically like, like define or like codify the rate of innovation that you can have with respect to each service. So in the case of like the shopping cart checkout backend system, you know, a lot of that is like kind of like big data processing, right? You know, I get, you know, a handful of orders in, I do this like massive, let's say like, you know, MapReduce operation or whatnot to be able to process all of this stuff and then send it through the appropriate systems. And then bam, I have like a bunch of orders that are processed. If I want to have innovation on that, because I want to like take my time to figure out how best to maybe clump or group some of the orders so that geographically disparate locations,
Starting point is 00:26:56 but that are simple or similar in terms of like the routes or whatnot. So like, let's wait until we have a bunch of people ordering this in roughly the same area so that we can optimize shipping to maybe that location or that distribution center or whatever. If you want to be able to do that, you can have massive amounts of innovation occur on something and all of your users are still happy at the end of the day, even if that service has different types of downtime. So it's something to kind of keep in the back of your minds when designing systems of systems and what works where and when and how and who it's affecting and who it's affecting upstream and downstream and all over the place. It's quite fun, quite complex. Just on the error budget again. So you said it's basically a different way of looking at it. How much trouble can you still afford in a particular time frame and i
Starting point is 00:27:45 would assume let's say a typical time frame is a month and if you say within a month we allow an error budget of an hour which means out of an hour in one hour out of 30 days or a month we allow certain requests to fail is this the right way of looking at it? Yeah, pretty much. And when you exceed your error budget, then you have a plan that's defined beforehand, which then signals what you should then focus on. So for example, if everything is smooth sailing and you have no downtime whatsoever
Starting point is 00:28:22 and all of your customers are super happy, that's great. But are you innovating enough to be able to attract more customers? Do you have better product features? You know, you're basically going to either stagnate or you're going to slowly drift off into obscurity if you don't innovate. Your error budget is a good way of being able to sort of gut check the innovation speed that you have such that you can make sure that you're releasing new features, but are not ruining the stability of the system. And it kind of goes the other way around.
Starting point is 00:28:52 Like if your system is inherently unstable, in a very sort of simplified example, right? If every day the system is crashing, this is not the time to be rolling out new features. Your priority should be fixing the system so it stops crashing every day. So that's kind of the way that we think about things and the pace of innovation that we want to be able to release at. And it's interesting because the error budget applies to specific components that you have inside of your system. And if you have unforeseen downtime,
Starting point is 00:29:24 let's say everything's going well, you know, you're well within your error budget, you're, you're, you're exceeding your SLOs by decent margin. And then all of a sudden something completely unexpected happens. And it just, you know, your system is down for hours and it's only supposed to be down for like a minute or something like that. Right. Uh, in that scenario scenario the next phase of engineering design needs to be focused on never letting that happen again so the error budget is that like counterbalance towards what your priorities should be so it's another way to think about it that's very cool thanks for the explanation i really like the what is i think what he said in the beginning error budget is kind of
Starting point is 00:30:02 the rate of innovation that's also a great way to put it because that's obviously something that the business cares about. So coming to, and thanks for the explanation of SLIs, SLOs, SLAs, error budget. You put a big emphasis that it's about the S. It's about the service. We look at services and on the service level, we define these things. Now, in organizations, at least when we approach people, they say, well, who is responsible for that? Who defines these SLAs, SLIs, and SLOs? Well, where do we start and what do we then need to take and break it down into other
Starting point is 00:30:38 pieces? So can you talk a little bit about this based on your experience on who is actually responsible for coming up with these and how does it kind of trickle down to the other members of, I don't know, engineering and testing and site reliability engineering and with the ultimate goal. And I think this is what we try to get to is how can we actually start and end up with an S and then let's call it SLX and SLX driven culture. I like that SLX driven culture. Sounds like a car. Like, you know, it's the something SLX, you know, get it today. It's, it's on sale. It's a new sales event, right? That's a really good question because a lot of times I've seen a lot of different enterprises, a lot of different organizations implement culture sort of ask backwardly.
Starting point is 00:31:29 It's where it's just like, OK, that's not how we're going to solve this thing. DevOps and SRE and perhaps it might be good to kind of talk about how they're related maybe after this. But it's interesting because I've seen people create like DevOps teams. So it's like, what are you? It's like, oh, I'm on the DevOps team. And it's like, that kind of goes against my understanding and my view of what DevOps should be, which is cross-functional teams that have shared fate responsibility.
Starting point is 00:31:59 Right. So, um, if let's say, uh, everyone in a company, let's, let's, let's make up a fictitious company. It's a small startup. It's like 100 some odd people. You got a bunch of developers, a bunch of operators, a bunch of marketing people and everybody else that's working in the organization. And let's say the site's down. Okay. Imagine yourself as the CEO of the company.
Starting point is 00:32:20 If you walk in and you're like, okay, well, the site's down. We're not making any money and if a bunch of people start arguing over like well you know i i pushed the code it worked fine but now it broke in production and someone else says well you know it wouldn't break in production if the code were more stable and then someone else says oh well it's not the code it's actually the hardware and you know the hardware people are like well it's not the hardware like on like cpu or anything it's the network so blame the network Ultimately, at the end of the day, all of them are getting less money in their paychecks because the company's not making money because the site is down, right? We have to take a shared fate responsibility model where everyone understands that we're
Starting point is 00:32:56 on the same team here. So it doesn't make sense for one person to try to dictate what the other person should be doing if it's not in service to the customer, right? To the user who's actually accessing the services. So having said that, the SLIs are very, they're very technical, right? So it's a measure of what we would consider to be good for this particular service. It's not extraneous noise that's entering the system. It's specific to whatever it is that we want to be able to measure, right? So for example, like the web front end, we need to make sure that it's serving up proper requests, or maybe it's the database. We need to make sure that every query that's coming out is served in X amount of milliseconds or something like that. So who's the best person to set that? Well,
Starting point is 00:33:51 realistically speaking, it's the person who knows the most about how this service is used by the services that are calling it or the customers that are calling it. Now, whoever that person is, should be the person who's helping establish these types of SLIs and these types of additional metrics that we want to be able to gather. That could be a developer who's intrinsically, you know, understands all of the ins and outs of the system, but maybe not because they don't necessarily know the usage patterns, right? Because they're focused on code and making sure that, you know, functions work and we have really good data structures and things like that.
Starting point is 00:34:22 You could go to the ops people and you could say, hey, you guys, you know, what's, what's an appropriate response code. But if you ask a bunch of ops people, you know, as long as it's, it serves there, then they're fairly happy. So the answer to this is kind of interesting because the SLO that you want to actually establish is really up to the business need, or it's up to the customer focused understanding of what that business goal should be so the sli could be driven by the people who understand the technology but the slo should be driven by people that understand the business outcomes that we want to establish that's really interesting then yeah breaking it into the two teams. And then, so I've, I know you explained it earlier, but then I think you need to, again, help me now understand. If you say the SLO is more the business view of things and the SLA the technical, isn't the SLA, I always thought the SLA would be more like on the business side, we talk about, is the site available?
Starting point is 00:35:26 So maybe you just brought in another set of confusion for me, or maybe I just misheard and you actually said SLA, but I heard SLO. That's the SLO. Can I take a crack at it to see if I understand it? Sure, by all means. Correct me, I might get this totally wrong, but from what I was getting with the SLA and the comment I made earlier about the misuse of it, is an SLA is an actual contractual agreement of we're going to deliver this type of performance or this metric to you.
Starting point is 00:35:59 And if not, this is the actual repercussion of what we're going to have to service back. So if we're a third-party payment processing and we promise you we'll process your credit cards within 200 milliseconds and we violate that, we're going to have to pay you. Part of our SLA agreement is at 200 milliseconds. If not, we're going to have to pay you or give you some kind of discount towards your monthly bill. And that's really what an SLA is. Whereas I think a lot of people just kind of use it as a metric. They just took the metric part out of it and ignore the fact that it's an actual something that's written up in a contract with repercussions. Whereas what you're saying is the SLI is, you know, the bit or starting with the SLO, the business people are going to say, we need to have the site up and running and our customers, you know, whatever, we're going to have a promotion. And we want to make sure our promotion can handle a 30% increase in traffic with less than a half a percent of errors.
Starting point is 00:37:02 Right. So then the technical team has to figure out the slis that they can use to measure and deliver that yeah that's that sounds good yeah it's kind of a fun one um i think a lot of people tend to confuse slas and slos because they don't really make a nuanced distinction between the O and the A, right? So like, um, what we'd like to do is we like to differentiate them because sometimes we really want to focus on SLAs as a relationship between, uh, like a, like a, you know, a client and a, and a provider or like a provider and a customer, right? Um, if you're, you know, a retail company and you, and you're entirely B2C, right? You're a business that services customers as opposed to like other businesses, right?
Starting point is 00:37:49 There's no service level agreement that's established with the public, right? If I go to, you know, Sebastian's amazing sock shop.com, it's not a website. If it is, kudos to whoever's putting that together. But let's say if you were to go there and it's down, I'm not going to have to pay all of my customers some stipend because the website is down. However, if I'm a provider of like a third party transaction API that does credit card processing and I go down, my business being down now affects your business in the example that you mentioned, because you can't clear any checkouts. So in that situation, you need to have a contract that says, if I miss my SLO, which is purely what we want to
Starting point is 00:38:32 define as the proper target for a service to operate at, I need to go do something. So that's why we have that little bit of a distinction. And naturally, if you kind of expand on that, you want your SLOs to be a little bit tighter than your SLAs. So if we, if you want to maintain, you know, a 99% uptime SLA. So if we're down for 1%, that's fine. If we go over 1%, then we have to compensate you for something or other, then we, we better be sure internally that we're not hitting that 99% or we're, we're not hitting that 99% or we're not hitting that 1% territory. So we might have a 99.9% SLO, but our SLA is a little bit looser than that. So we can still break it, but we can recover and not have any sort of, you know, legal ramifications or something like that happen. Cool. Thank you for that. Hopefully that cleared my SLI, SLO, SLA mix up in my head.
Starting point is 00:39:21 That's good. Hey, earlier you made a great point about DevOps where you said just because an organization has now a DevOps team, most of the time means they got DevOps wrong because it should be shared responsibility of multidisciplinary people in the team delivering value to the customers. What about SRE teams? I see people that now say, well, I'm in the SRE team. Is that the same thing? Or what do you see from an SRE perspective? What an SRE team should do? That's it.
Starting point is 00:39:56 Yeah, that's really interesting. I think that there's a, first of all, it should be noted, there's a difference between titles and actions, right? So, you know, if I want to claim that I'm a senior managing director of, you know, a tiny one person dog walking company, I can claim that what, what that actually means is like, I'm the only employee inside the company. So the title doesn't necessarily match with, with what, what, what ends up happening. And every now and again, we'll stumble across that where, you know, a director of cloud reliability, engineering center of excellence or something or other turns out to be very new to the position or doesn't really have the background that we were expecting and whatnot. So I really like defining it based on actions. So like, what does this person or this group of people do? And does it align with what my sort of definition of what they should be thinking about does? An SRE team is different from sort of like a DevOps team. So first of all,
Starting point is 00:40:55 let's differentiate the two of them and then talk about like why an SRE team is actually an okay thing to have, I think, in my opinion. So first of all DemOps is, when did the term emerge? I think it emerged in like late 2008, I want to say. It's essentially like a set of principles, right? It's a set of practices or guidelines, some cultural stuff that's thrown in that's really designed to break down silos, right? So like we don't have IT operations existing in a bubble where the rest of the people that need to have IT functions be isolated for some reason, right? So that's sort of what DevOps is kind of focusing on. And it's really focusing a little bit more on like release engineering, if it's what I see out in the field. So like CICD or, you know, as soon as anyone mentions Jenkins,
Starting point is 00:41:42 that's usually like, oh, they're the DevOps team. Right. So that's where that's a kind of, you know, relegated to cyber liability engineering on the other hand is like a set of practices that, um, you know, different companies, particularly Google who kind of like coined the term have found to work that, uh, that, that, that, that facilitate our ability to actually get something done. So I'd say DevOps is more of a specific set of practices, and SRE is a little bit broader definition of a way of thinking about something. There's a great t-shirt that a few of our developer advocates have where it says, Class SRE Implements DevOps.
Starting point is 00:42:23 And if you're a coder, I think that's a perfect way of sort of thinking about this, right? DevOps is like a set of tools that you have and a set of practices, but site reliability engineering is the pursuit of something bigger than just what DevOps has as its purview, which is kind of fun. So in that aspect, right, if you have a DevOps team, most of the DevOps teams I've seen
Starting point is 00:42:43 are focusing on basically release engineering. They're an incorrectly named release engineering team that focuses on infrastructure as code and like CICD pipelines. That's really what it kind of boils down to. They're not thinking about necessarily breaking down silos, particularly if it's like their own team that has to maintain like uptimes and things like that. Site reliability engineers, on the other hand, tend not to be on a separate team, but instead tend to be on individual product or services teams, but then they also correspond with themselves, which is a little bit different. So you think of it as like a distributed team that has like a core set of functionality, but that's embedded into another
Starting point is 00:43:29 product or service. So another way of thinking about this is if you're, you know, that, that retail company that we keep going back to just, you know, fictitious company that sells socks or whatever, the payments team might have an SRE on their team who is solely tasked with making sure that we're implementing and, you know, proper reliability features into whatever the product is that we're building. You know, if it's a handful of different services, if it's the appropriate monitoring, the appropriate scalability, and so on and so forth. In that situation, they're embedded on that team, but at the same time, they're a member of the site reliability engineering organization, if you want to
Starting point is 00:44:11 think of it that way. So the SRE team is actually just a collection of people that are on a bunch of other teams that are doing something else. So it's a different way of thinking about it, but they're tasked with making sure that whatever area they're working on or whatever service they're, they're, they're aligned to is reliable and working appropriately. And because they're a scarce resource, teams tend to vie for their attention. And it might be a case where, um, you know, as the organization grows, um, we can't hire a bunch of, of SREs because they have to have a skill set in software engineering and in systems operations and large system design. And there's very few people that are in that middle section of that Venn diagram that kind of have a foot in both. big complex projects and big complex, you know, pieces of infrastructure or architecture that you have for whatever company it
Starting point is 00:45:07 is that you're a part of will delegate them to the most critical components of their infrastructure and then say, make sure that this thing, you know, like never fails and then other teams can implement similar practices, but maybe you don't have someone who's dedicated specifically to implementing some of the features. Well, the way I understand it, I think so, yeah. So basically they come in and hopefully not just, you know, build something and fix it, but also kind of mentor and onboard
Starting point is 00:45:32 so that the team later on can kind of take over what they've done and continue these practices and make sure that the system stays reliable if they may have to go back to another team to help them kind of get all these things established right yeah exactly and keep in mind that they're they're they're engineers that that focus on system reliability and the entire practice is essentially about applying software engineering principles to solving you know scalability issues so it could be just like writing a bunch of tools to make sure that they can do their job better you know investigations or monitoring or leveraging
Starting point is 00:46:08 tracing products for example devops could have could be seen as the same way right if you do devops right you would also have people that kind of um uh enable other product teams to become fully responsible end-to-end help help them build their pipelines, build their reliability, build their monitoring, their SLIs and their SLOs and SLAs, and then kind of leave them in a state where they become self-sufficient, where they are autonomous in the end.
Starting point is 00:46:38 And then, so I think we should apply this to DevOps as well, if you think at least the way I think about it. It should, but usually when you do that, you consider yourself a little bit more of an SRE than a DevOps practitioner necessarily. Because the other thing to keep in mind is development and operations and establishing these feedback loops and making sure that we can release smoothly. On the one hand, site reliability engineering does focus on the ability for us to like release code successfully because if we release something and it doesn't work,
Starting point is 00:47:09 then it's not reliable and we got to fix it. But at the same time, like there's infrastructure to be associated with the release process. And that role really doesn't fall on any individual team. When you think about it, it's like, it's not really the, it's not a bunch of developers that are writing code, but it's not necessarily ops to host the applications that we're writing as a company. It's this middle ground that helps facilitate the release of additional code, which is why I think DevOps tends to be a little bit more
Starting point is 00:47:40 relegated to different components of release engineering, and then they throw additional cultural components into it. And then you slap an SRE sticker on the side of it, like DevOps, now with SRE, and kind of go from there. So it's always interesting to see how companies implement these sorts of things. And what's funny, too, is in the field, when I engage with customers, and we're seeing a lot more people now saying, oh, we're with the SRE team, the difference between concept of it and practice, I'd say, that I'm seeing is so far, it seems like a lot of the SRE people I'm interfacing with are mostly engineers who are tasked with looking at the performance of their system and finding ways through code to improve it. So they're looking at, you know, everything's running, but they want to take a look, okay, this process is taking five seconds.
Starting point is 00:48:38 What can we identify as a hotspot to then have the development team fix? So they'll look at the trace, they'll look at the execution and say, okay, we see this is making a bad example, N plus one query on the database. Here, give it to Sal or whoever to fix it and make improvements on that side. So it's still, at least what I'm seeing in the fields, it's still sort of in its infancy in a lot of ways, obviously not at Google in some places, but I think there's a lot of people are starting to embrace it. But based on the definition and the explanation you gave on there, I think there's quite a large gap that companies have to fill. And I think that's always the challenge, right?
Starting point is 00:49:18 Because I'm sure this is, hey, you're on the SRE team now, but none of your other responsibilities have changed. And now you also have to do this new task. So I think it's probably a lot more of just trying to figure it out. Their only guidance is probably reading the SRE handbook and trying to take it from there without having someone, an expert come in. A lot of times when you have Agile or other things like this, you get people in and help train the company and help train people on how to do these things. I'm sure there's not a lot of that going on, uh, in the SRE push at this moment though. Yeah. It's, it's interesting you mentioned that. Um, so there's a great book. Um, it's like a little, you know, those like little mini O'Reilly
Starting point is 00:49:59 books that you can get, um, that are, are usually compliments of someone. So like someone sponsors it. I would be remiss if I don't plug one that actually like just came out, which is called SLO Adoption and Usage Insight Reliability Engineering. I believe it just came out in April and it's an O'Reilly book report compliments of Google Cloud. It's free. If you just search for that title, you can download the PDF, but there's fascinating insights inside of it that I was actually reading in sort of preparation for the podcast. Google has, we acquired the DevOps Research Association or DORA. I think that's what it stands for. I should probably double check that. But it produces this massively complex and incredibly insightful
Starting point is 00:50:46 really thick document which talks about where a business is at with respect to DevOps and how do high performers outperform low performers and what are the qualifications and how many different dimensions do you want to slice and dice things. I mean, it's a proper like, it's essentially a research
Starting point is 00:51:03 paper. It just happens to be published not in, you know, not published just like on our website. Absolutely. Yeah. And it's really interesting. Now this book is kind of like a mini version of that that's specifically focused on SLO adoption, which I would highly encourage people to take a look at. And there's some interesting statistics. And I wanted to mention a couple of these, because I think they're super apropos. You mentioned this tends to be kind of new. 43% of businesses that responded actually have an SRE team. Let me say that again, because I kind of stumbled through it. 43% of businesses have an SRE team. So right off the bat, there's not a lot of people that are
Starting point is 00:51:46 actually implementing SRE that would classify themselves as an SRE team. And most are actually under three years, meaning they haven't been investing in this for the past decade. This is a relatively new thing. And when we talk about SLOs and SLIs and things like that, 34% of the people that were surveyed actually implement SLOs and SLIs and things like that, 34% of the people that were surveyed actually implement SLOs, but 31% have SLIs, which actually define their SLOs. In other words, people just kind of come up with a number and they're just like, oh, it should be like 99% uptime. But then when you ask them like, well, have you, do you have proper, you know, defined metrics and SLIs that can actually inform this? They go, oh yeah, no, not really.
Starting point is 00:52:27 We just sort of set one and then, you know, we yell at a person if the site goes down, which is not really the purpose of site reliability engineering. So there's tons of extra insights inside of a little booklet. Very cool. So SLO adoption, SRE. We'll put a link to this as well in the proceedings. I think there's a lot of more stuff to talk about.
Starting point is 00:52:47 I just want to get your opinion on one quick thing, and I think then we can wrap it up. But we've been promoting and pushing a lot and wrote a lot of blogs and did a lot of presentations on shifting left SRE. And what that actually means is shifting left the enforcement of SLIs and SLOs as part of your delivery pipeline. So if you have, if you use the same concept of what metrics are important for me and what are my contracts to the users of my service,
Starting point is 00:53:19 and I have a pipeline where every build gets properly tested with a representative amount of load using load testing tools, then I believe, or we believe, we can also look at SLIs and SLOs and let the pipeline fail in case a code change jeopardizes the response time that we promised, or the number of database calls we make to the backend system, or any of these. Have you seen this as a movement, as a push to enforce SLIs and SLOs in the delivery pipeline? It's really interesting that you bring that up. Something I'm really interested in and really passionate about is the programmatic execution of increased business functionality, which is kind of an all-encompassing term for this.
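To make the quality-gate idea Andy describes concrete, here is a minimal sketch of what such a pipeline step could look like. The SLO names, budgets, and measured values are hypothetical; in a real pipeline the measurements would come from the load test run rather than being hard-coded:

```python
# Hypothetical SLO quality gate: names, thresholds, and the metrics
# source are illustrative, not taken from any specific tool.
import sys

# SLOs the build must honor (assumed values for illustration).
SLOS = {
    "response_time_p95_ms": 200.0,   # 95th-percentile latency budget
    "db_calls_per_request": 3.0,     # backend call budget per request
    "error_rate": 0.01,              # max fraction of failed requests
}

def evaluate_build(measured: dict) -> bool:
    """Return True only if every measured SLI stays within its SLO."""
    ok = True
    for sli, budget in SLOS.items():
        value = measured.get(sli)
        if value is None:
            print(f"MISSING  {sli}: no measurement, failing the gate")
            ok = False
        elif value > budget:
            print(f"VIOLATED {sli}: {value} > {budget}")
            ok = False
        else:
            print(f"OK       {sli}: {value} <= {budget}")
    return ok

if __name__ == "__main__":
    # In a real pipeline these numbers would come from the load test run.
    measured = {"response_time_p95_ms": 245.0,
                "db_calls_per_request": 2.1,
                "error_rate": 0.004}
    # A non-zero exit code makes the CI stage, and thus the pipeline, fail.
    sys.exit(0 if evaluate_build(measured) else 1)
```

Where exactly to enforce this is a design choice; a gate like this would typically run right after the load test stage, so a violating build never reaches production.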
Starting point is 00:54:12 So I've seen it internally, but the difficulty there is that you have an incredibly niche market, in the sense that you have to have customers that are super well established, that understand how to do SRE to begin with. They have to have proper CI/CD pipelines. They have to be okay with things like canary deployments, where I can actually roll something out to receive production traffic and then monitor for changes, to get insight into whether or not this version of the code is in fact better than the previous version, and so on and so forth. So in that scenario, a really interesting future-looking statement on how all of this stuff would work would be a machine-learning-derived automatic heuristics system that takes in every single SLI inside of your multi-layered system and automatically calculates what is considered a good norm. So this is
Starting point is 00:55:14 the expected operation of all of these things. And then when you release a new version of whatever component, you release it in a fashion that automatically understands how to do canary deployments, but then also canary promotions. In other words, we're going to release a new version of the code, and we're going to slowly send traffic to it in very small percentages, maybe 1% in a specific geographic region. Then we'll go a little bit larger and a little bit larger, and once we're confident that the new version actually works, slowly roll it out across the fleet. And while we do that, we're constantly monitoring and seeing how this
Starting point is 00:55:48 performs versus the old version. And from there, we can start to take actions and have signals based on whether or not this is a good thing or a bad thing. If, for example, we roll it out and total latencies go down because we've optimized some sort of networking library, but, let's say, as part of this implementation there's a little in-memory cache and memory consumption is going through the roof, then there are unintended consequences, and we can choose to make a decision based on parameters that we've specified ahead of time. What's interesting about this scenario is that it's actually doable with today's technology. If you're leveraging Kubernetes for, let's say, microservice management,
Starting point is 00:56:33 if you're leveraging Istio for your service mesh to get that service-level observability you want, you can then take actions on any of the signals coming out of the system. And if you have distributed tracing frameworks built on top of it, you can get even better insight into how all of this stuff is operating. The problem is that in defining what the idealized state would be, you have a really large, essentially non-exhaustive problem. You have to ask yourself: what are all of the different possible combinations of things that would constitute a quote-unquote well-running system, and how do I define these ahead of time so that the system can then be informed as to how to take actions afterwards?
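As a rough illustration of the canary-promotion loop Sebastian describes, here is a sketch under stated assumptions: fetch_slis and set_canary_traffic are hypothetical stand-ins for whatever your monitoring stack and mesh actually expose (for example, Prometheus queries and Istio traffic-split updates), and the 10% regression tolerance is an arbitrary parameter specified ahead of time:

```python
# Sketch of automated canary promotion with rollback on regression.
# fetch_slis() and set_canary_traffic() are hypothetical placeholders;
# the metric values here are simulated with random numbers.
import random
import time

TRAFFIC_STEPS = [1, 5, 25, 50, 100]  # percent of traffic sent to the canary

def set_canary_traffic(percent: int) -> None:
    # In reality this might update an Istio VirtualService weight.
    print(f"routing {percent}% of traffic to the canary")

def fetch_slis(version: str) -> dict:
    # Placeholder: pretend to query per-version SLIs from the mesh.
    return {
        "latency_p95_ms": random.uniform(150, 250),
        "memory_mb": random.uniform(400, 900),
    }

def canary_is_healthy(canary: dict, baseline: dict) -> bool:
    # Parameter specified ahead of time: no SLI may regress by more
    # than 10% versus the currently running (baseline) version.
    for metric, base in baseline.items():
        if canary[metric] > base * 1.10:
            print(f"regression in {metric}: {canary[metric]:.0f} vs {base:.0f}")
            return False
    return True

def promote_canary() -> None:
    for percent in TRAFFIC_STEPS:
        set_canary_traffic(percent)
        time.sleep(1)  # soak time; minutes or hours in a real rollout
        if not canary_is_healthy(fetch_slis("canary"), fetch_slis("baseline")):
            set_canary_traffic(0)  # roll back: all traffic to the baseline
            print(f"rolled back at the {percent}% step")
            return
    print("canary promoted to 100% of traffic")

if __name__ == "__main__":
    promote_canary()
```

The interesting design question Sebastian raises is where the health check's parameters come from: specified ahead of time, as here, or derived from the system's own statistics, which is where the conversation goes next.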
Starting point is 00:57:19 So you can either solve that deterministically ahead of time, in which case you're going to spend more time defining how you want your system to run than actually running it, or you can define it as a series of probabilistic heuristics derived from statistics collected in the system, or collected and then predicted using machine learning models. And what we're starting to see, especially from a lot of interesting monitoring and service-based companies, is sharper ML models that can actually predict whether or not you will have a successful code release based on code coverage, additional signals in the build, unit testing, integration testing, and canary feedback, and then automatically take that sort of action for you. But that is so cutting edge that we're still working out what it looks like from a pure engineering perspective, let alone an implementation spec, right?
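To ground the "probabilistic heuristics derived from collected statistics" option, here is a minimal sketch of one common approach: derive an expected band from historical samples and flag anything outside it. The window size and the three-sigma tolerance are illustrative assumptions, not a prescription:

```python
# A minimal statistical baseline: learn "normal" from history, then
# flag new observations that fall outside mean +/- 3 standard deviations.
import statistics

def build_baseline(samples):
    """Derive an expected (low, high) band from historical SLI samples."""
    mean = statistics.fmean(samples)
    stdev = statistics.stdev(samples)
    return (mean - 3 * stdev, mean + 3 * stdev)

def is_anomalous(value, band):
    low, high = band
    return not (low <= value <= high)

if __name__ == "__main__":
    # Historical p95 latencies (ms) collected from the running system.
    history = [182, 190, 175, 188, 179, 185, 191, 177, 183, 186]
    band = build_baseline(history)
    print(f"expected band: {band[0]:.1f} to {band[1]:.1f} ms")
    for observed in (184.0, 240.0):
        verdict = "anomalous" if is_anomalous(observed, band) else "normal"
        print(f"{observed} ms -> {verdict}")
```

Real systems would use far richer models than a three-sigma band, but the shape is the same: the threshold is derived from observed behavior rather than declared up front.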
Starting point is 00:58:15 So maybe if we're optimistic and we get a lot more people excited about DevOps and SRE-type culture, maybe we'll see that in, I don't want to make a prediction, but maybe five or ten years from now. But until then, we're not even monitoring half of this stuff yet, right? So let's focus on the low-hanging fruit of making sure that this darn service is monitored, and making sure that we can keep it up, before we start going super crazy. That's an awesome idea. Well, I want to put one word in your head, maybe for some research for you, maybe for a future conversation,
Starting point is 00:58:50 and Brian knows exactly what that word is going to be now. Look at Keptn. That's an open source project we are curating. So, K-E-P-T-N. With your German background, you can probably figure out what that means. But I'll leave it with that. We are trying to contribute something like this to the open source community
Starting point is 00:59:12 that is tackling some of these problems. Interesting. Cool. Super cool, man. Well, Sebastian, I know you probably have another hour or two or five to talk about,
Starting point is 00:59:27 because you have a lot of experience in that field, but I think we want to wrap it up now and invite you for another episode at a future time, because I'm definitely sure there's more we can learn from you. Yeah. Oh, absolutely.
Starting point is 00:59:42 Thank you so much for having me. It's been an absolute pleasure talking to you. And Brian, we'll skip the summary later today, because normally, Sebastian, I kind of try to summarize, but I think I would not be able to. I think people should just listen again to that SLI, SLO, SLA description you walked me through multiple times until I hopefully finally got it. So, awesome.
Starting point is 01:00:06 Thank you so much. There are some summaries in there as well. It's funny, too, because this is normally the part of the show where I ask, oh, any appearances you're going to be making? Virtual appearances? Not today. Yeah, oh, there might be. Are you doing anything virtual online?
Starting point is 01:00:24 No. We were going to do some stuff with Next, and we're working on figuring out how to transition Next over to digital content, because we have a bunch of presentations and things like that figured out, but we wanted to do an entirely digital version. So we're trying to figure out the best logistics for that,
Starting point is 01:00:45 and how we're going to meet with people. We're taking a slightly different approach, where we want to reach out to specific people that we know have specific concerns and have it be a little bit more personal, which would be kind of fun. And then maybe we'll release some recordings on YouTube under Google Cloud Platform and talk about some stuff there. But for me, I'm unfortunately not speaking anywhere in the near future, because the events all got canceled and they did not go digital. Yeah. Oh, well, we'll see.
Starting point is 01:01:18 Hopefully next year, 2021, right? Fingers crossed. It'll be a good year. Yeah, hopefully. All right. Well, again, thank you so much for coming on the show, and we look forward to having you back. If anybody wants to follow you, they should just look you up on LinkedIn. We'll have a link in there.
Starting point is 01:01:34 Do you do Twitter or anything like that, or do you mostly do LinkedIn? Yeah, it's just ThatDevopsGuy on all of them, which is pretty cool. Oh, nice. All right. Well, appreciate it, and we'll talk to you soon. Thanks a lot. Thanks to everyone for listening. Bye bye. Bye bye.
