PurePerformance - Why SREs are not your new Sys Admins with Hilliary Lipsig
Episode Date: May 16, 2022
"The most significant body of my SRE work is architectural reviews, disaster and failover planning, and help with SLIs and SLOs of applications that would like to become SRE supported."
This statement comes from Hilliary Lipsig, Principal SRE at Red Hat, as her introduction to what the role of an SRE should be. Hilliary and her teams help organizations get their applications cloud-native ready, so that the operational aspect of keeping a system up and running, and within error budgets, can be handled by an SRE team. Listen in to this episode and learn about the key advice she has for every organization that wants to build and operate resilient systems, and understand why every suggestion she makes has to be, and always will be, evidence-based!
In the talk we mentioned a couple of tools and practices. Here are the links:
Hilliary on LinkedIn: https://www.linkedin.com/in/hilliary-lipsig-a5935245/
KubeLinter: https://docs.kubelinter.io
Listen to the talk "Helm and Back Again: an SRE Guide to Choosing" from DevConf.cz: https://www.youtube.com/watch?v=HQuK6txYS3g
Transcript
It's time for Pure Performance!
Get your stopwatches ready, it's time for Pure Performance with Andy Grabner and Brian Wilson.
Hello everybody and welcome to another episode of Pure Performance. My name is Brian Wilson and as always I have my fantastic co-host Andy Grabner with me.
Hi Andy and happy Performance Awareness Day.
Is it Performance Awareness Day?
Yes, at least the way we write the date in the United States.
What's the date if you just do the month and day?
Well, it's 5-5.
Oh, crap. I'm wrong.
It's just 5-0-3.
It was two days ago.
I was so excited. I had this all set
for this. We'll keep this in anyway because it
shows how stupid I am. 5-0-3
is performance awareness day.
So, May 3rd. We're making that a new holiday.
Yeah, well, we could always...
What's 5-0-5? Isn't that an error? I'm sure it's some HTTP status code that means something.
Maybe it's the Cinco de Mayo error code: if you drink too much, not sangria, maybe a Corona, today, then you feel sick tomorrow. But maybe not from corona, but from Corona.
Yeah. And very quickly on Cinco de Mayo, you know, people love putting up decorations for holidays, and it's getting really crazy now. I saw my first Cinco de Mayo lawn decorations and I was like, okay, that's getting a little out of hand now, especially since it's got nothing to do with the United States, really. Anyhow, just a big drinking holiday. Speaking of drinking holidays, I'm terrible
at segues, Andy. I think so too. Let me take this over. Speaking about resiliency, because I think
if you drink a lot, you want to be resilient against the alcohol. But resiliency not only
applies to drinking and then sobering up, but it also applies to software. And we've been talking
a lot about software engineering and site reliability engineering lately.
And we have an amazing guest today who actually tells us a little bit more about how to not
just throw like new software over the wall.
And then the SREs are taking over to just handle everything and keeping it running and
monitoring it.
But what we can actually do as organizations, or at least, as I think she will tell us,
what good SRE teams should do to make sure
that the apps that they are getting to run
are actually adhering to certain standards.
And I think I'm not doing a good enough job to explain this.
That's why I want to welcome Hilliary on the stage.
Hilliary, how are you?
Hi, I'm well. You two are a delight of terrible dad jokes. I just love every second of that.
I had to go on mute so that people wouldn't hear me cackling like a witch while you guys were doing that.
So just, you know, fantastic intro and thank you so much for inviting me on here today. By way of introduction,
my name is Hilliary Lipsig. I am a Principal Site Reliability Engineer at Red Hat and a
global SRE team lead. So at Red Hat, we have several different SRE teams, all of them around
managed OpenShift, so OpenShift dedicated and managed services running on OpenShift dedicated.
And that's where I sit.
So I'm on the managed services.
These are software as a service offerings that Red Hat has.
And they are backed by site reliability or SRE support.
So basically, if I understand this correctly,
we talked about Kubernetes a lot in the past, about OpenShift.
I think people understand that this is the de facto platform
of the future where we will run most of our cloud-native workloads.
And you as Red Hat, you provide OpenShift as a service.
So that means I assume there's a lot of critical apps
running on these platforms.
And you want to make sure that the underlying platform itself
really runs stable because otherwise your customers would not be happy with you. Yeah, exactly. So it's all about
how OpenShift runs and performs and how apps run on OpenShift and interact with OpenShift.
And a lot of that, that's where the architectural reviews come in. So kind of like you said,
so for the services, site reliability engineers,
or SREs, which is my group, we actually have, we call them the tenants, and they are software development engineering teams that would like to offer their services as managed services,
meaning full SRE support plus full CEE, you know, customer experience engineer support.
And so we onboard these tenants, and part of
that onboarding process is
a comprehensive architectural review,
where we basically provide them with our standards.
This is what makes an app observable.
This is what makes an app supportable.
This is how we know that we can support you with confidence.
Then we help them and coach them to, you know, reach these standards.
So, you know, some of the things, when I'm talking to a team, and a lot of these,
there's like—
That was awesome. For people listening: we heard the dogs, but now we saw a cat running through
the picture and jumping around. I think we will probably continue once she's back, because she just rushed out.
Yeah, sometimes it'd be good to do these on video, but oh well.
So actually the dogs are outside.
My office is against my back wall of my house.
There's a back door.
The dogs are out there.
And what happened is they are roughhousing.
And that is just dog playing noises.
I have a Shiba Inu, two Siberian Huskies, and a German Shepherd.
So it sounds like they are fighting to the death.
That's actually just what they sound like when they're playing.
Those were all happy noises.
But they slammed into the door.
So the cat went from the floor to the bookshelf right quick,
just in case.
That was amazing.
I don't care if we leave that in. That was, that was hilarious.
No, we should, we should leave it in. That's awesome. Come on.
Oh gosh. But what was I saying? Right. Architectural reviews and standards.
My dogs don't meet any of them. They're a hot mess. They're not sustainable at all. But right. So there's several things that
we ask our tenants about in addition to some of the lower level into the weeds pieces. There are
some higher level places that we start. And so some of the things that we initially look at is,
of course, well, does your offering even run on OpenShift?
The usual answer is yes,
we've done that much. Great.
We usually actually also have a requirement that their offering,
their managed service offering is a level four or above operator.
For folks who are not familiar with an operator, it's a continuously running piece of software; it'll usually be a pod in OpenShift. What it is doing is managing its operands, so it extends the statefulness that Kubernetes provides you all the way through to the application layer: making sure that the application is highly available and always has the requisite number of pods to meet that, making sure that all the secrets exist and
they've rotated on time. Anything that an operator watches is called an operand. And there are levels
to the operators, and a level four means that it must be highly observable. So it must do several
day-one and day-two actions. So it'll deploy and upgrade your code,
and then it's also managing the code, or the application rather, to a degree. An operator
impersonates a person. And if you've heard me talk about operators before, I'll tell you that
a good operator requires a good story. So you need to know what the person who would be operating
your software by hand would
be doing. And that is the standard that we as SREs coach our tenants through. So we look at,
you know, what is your upgrade story? How are you going to accomplish them? Have you thought through
keeping your application highly available during those? You know, is it doing rolling restarts?
Is it upgrading things one at a time, versus just destroying everything and recreating it,
which would cause outages? In general, is your application highly available, meaning it has at
least three replicas of everything that's running?
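For readers who want a mental model of "an operator impersonates a person," here is a minimal sketch in Python of the reconcile idea: continuously driving actual state toward desired state. Real OpenShift operators are written in Go against the Kubernetes API; the fields and helper functions below are hypothetical stand-ins, purely illustrative.

```python
# Hypothetical desired state, the kind of thing a custom resource's spec declares.
DESIRED = {"replicas": 3, "version": "1.4.2", "secret_max_age_days": 30}


def observe_actual_state():
    """Stand-in for reading live cluster state from the Kubernetes API."""
    return {"replicas": 2, "version": "1.4.1", "secret_age_days": 45}


def reconcile(desired, actual):
    """One pass of 'what would a human operating this by hand do right now?'"""
    actions = []
    if actual["replicas"] < desired["replicas"]:
        actions.append(f"scale up to {desired['replicas']} replicas to stay highly available")
    if actual["version"] != desired["version"]:
        actions.append(f"roll out upgrade to {desired['version']} one pod at a time")
    if actual["secret_age_days"] > desired["secret_max_age_days"]:
        actions.append("rotate the expiring secret")
    return actions


if __name__ == "__main__":
    # A real operator watches its operands and loops forever; a single pass is shown here.
    for action in reconcile(DESIRED, observe_actual_state()):
        print("operator action:", action)
```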
And then we also start looking more in terms of continuity. So if you're running an application and you're pulling in open source upstream repositories
and tools, what's your story around that?
Do you have people who are maintainers
of that open source upstream tool?
How are you going to guarantee bug fixes
for the customer experience?
How are you going to guarantee the response
to critical vulnerability alerts?
These are the types of things that we're walking tenants through.
How are you going to disaster plan?
We're actually kicking off a new effort that is just called the Disaster Planning Office Hours.
And it's all RCA-based, so root cause analysis.
We are presenting on a few really catastrophic incidents where data was lost to help get our tenants thinking through these scenarios.
And then is it observable?
We do a really comprehensive SLO, SLI, and SLA service definition review of these applications to make sure that they're thinking through, you know, what do we want to measure to know we're doing well?
How are we going to measure to know we're doing well?
And then lastly, what do we actually want SRE to do? And that's usually the hardest
question for a tenant team to answer. They're just like, well, I just want SRE to keep it alive.
But there's more to it. What does that mean to you? In a situation where we have a catastrophic outage and we can restore service at the expense of data, which do you want me to choose?
Which is more important to you, service uptime or data continuity?
These are decisions tenants have had to make.
And we've had to coach them through these decisions and, you know, sometimes learn the hard way.
And so everything that we learn the hard way gets scaled back up and becomes a permanent standard.
You must tell us, you know, this by this time. And we run them through phases.
So you have to be this tall to ride in order to go into staging.
And then you must be this tall to ride in order to start offering in production.
You must be this tall to ride in order for us to assume the pager from you.
So and we're with them every step of the way, coaching them,
helping them, advising them,
and helping them to do things like write
their SOPs or standard operating procedures,
helping them to think through their testing and
making recommendations on how to improve their upstream.
I would say my only open-source contributions
have been recommendations for how to fix the upstream.
My name's not on that,
but that is my open source contribution.
And that is typically the types of things
reliability engineers are doing
is we are holistically looking at your offering
and helping you make it better from head to toe.
Wow.
To recap, I need to recap a little bit
and just ask that I understand this completely.
So if I'm company X, I have a certain service offering.
In order for me to run my software on your managed OpenShift, you are expecting a certain level of quality.
I think you call it operator level four, at least, that defines things like, is my system observable?
Is it highly available?
Do you have to have certain architectural settings in place?
And you're basically coaching them to get to that level
so that you then actually would take ownership of it,
that you can then actually run it for them.
Just to recap quickly, because I think this is fascinating.
Because you are, not only do you provide a managed service
offering where they can run their software on your systems,
but you're also in that process helping them to make
their systems actually resilient,
to make them observable.
And now let me ask another question.
When you talk through them, or when you talk with them about, do you have runbooks or do
you have like, if then, what do you do in case of a certain problem?
Do you then also try to automate the remediation of these situations?
Like if they have something written and this is how to do it manually right now, do you
try to automate the remediation steps that are described kind of in
that runbook in an automated way in the future? Or do you still see just refining the steps in
a written way and then you as an SRE team would then take over and say, okay, we have everything
written and we are very happy with this. What do you also expect that most of the remediation actions need to be automated?
It's a mix. Most of that is actually the automation gets put
into a single source place.
A lot of the stuff that I am
supporting actually will run on a Red Hat customer's OpenShift cluster.
So not even Red Hat-owned infrastructure, right?
That's the customer cluster.
We don't run anything there.
We don't have to because they are paying for those resources.
And we have to respect that and keep our footprint as small as possible.
So there are other places where we store our automated remediation. Then when we have to access
the customer's infrastructure to resolve an issue,
then that's when we trigger our automated remediation.
Of course, that access has several layers of
security and so forth that we have to go through,
we have to prove our identity.
Everything we do is logged and auditable.
It has very strict RBAC, so my role-based access is very, very strict. It's all, you know, highly secure, very tight, because of course protecting people's information is extremely critical to Red Hat.
So there's a degree of automation which we expect. Some of our managed
services offerings are run on Red Hat-owned infrastructure, actually. And we can do more
there since we're footing the bill for those resources. And so there is a lot of automation,
and there are some things that must be done by hand. And anything that requires access to a
certain degree of information must be done by hand because I also then have to go
and justify why I did that for security and compliance reporting.
So it's a mix, and we get our tenant engineers
to help us write the automated remediation,
which might be a bash script, it might be a Python script.
I mean, I think with anything, it's pretty much scripting.
So it's pretty easy for anybody to contribute to.
But there are some things
that you don't really think about, right?
When you're writing a script to resolve an alert
and you're saying, okay, well, you know,
have the cluster just give you all this information
and then pick the correct component, right?
This really happened.
That sounds great until the cluster spits out 5,000 results.
So this script needs to be a little bit smarter. Some of that you learn from experience, from running the scripts, from testing them, from actually having production-level workloads where
there could be 5,000 results instead of the six you did in testing. And those are the types
of things that we take back to the tenants. Those are the types of things that we peer review in
their SOPs, and we peer review their automation of those SOPs.
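To make the "5,000 results" lesson concrete, here is a hedged Python sketch of the kind of guard rail peer review tends to add to automated remediation. The `fetch_candidates` helper is a hypothetical stand-in for whatever cluster query an SOP uses; the point is filtering to exactly what you mean to touch and refusing to act on an implausibly large result set.

```python
def fetch_candidates():
    """Stand-in for an SOP's cluster query; production might return 5,000 items
    where testing only ever saw six."""
    return [{"name": f"component-{i}", "namespace": "tenant-a", "failing": i % 997 == 0}
            for i in range(5000)]


def pick_remediation_targets(items, namespace, max_targets=20):
    # Filter down to only the namespace and condition this SOP actually covers.
    targets = [i for i in items if i["namespace"] == namespace and i["failing"]]
    # Refuse to act on an implausibly large blast radius; escalate to a human instead.
    if len(targets) > max_targets:
        raise RuntimeError(
            f"{len(targets)} targets matched; expected <= {max_targets}. "
            "Aborting automated remediation, escalate to on-call."
        )
    return targets


if __name__ == "__main__":
    for target in pick_remediation_targets(fetch_candidates(), "tenant-a"):
        print("would remediate:", target["name"])
```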
Like I said, we're with them from the very beginning all the way to
the very end because at least especially my team,
we review every Prometheus query to make sure it makes
sense and to make sure we understand the time windows the error budget is covering.
Is your error budget a five-minute window?
Is it a rolling five minutes, or is it a five-minute period and then the next five minutes is a new five-minute period? Those are the types of things that we as SREs need to know, because it allows us
to make informed decisions on whether your SOP allows us to resolve your alert within the time frame
to stay within the SLO, so that we do not, you know, blow the error budget by being too slow.
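As a toy illustration of why the rolling-versus-fixed question matters (the data and budget below are made up), here is a small Python comparison of how the same stream of bad minutes looks under back-to-back five-minute buckets versus a rolling five-minute window.

```python
from collections import deque

# 1 = a "bad" minute that burns error budget, 0 = a good minute (made-up data).
minutes = [0, 1, 1, 0, 0, 0, 1, 1, 1, 0]
BUDGET = 2  # allow at most 2 bad minutes per 5-minute window


def fixed_windows(samples, size=5):
    """Back-to-back buckets: minute 5 starts with a brand-new budget."""
    return [sum(samples[i:i + size]) for i in range(0, len(samples), size)]


def rolling_windows(samples, size=5):
    """Sliding window: every minute is judged against the previous 5 minutes."""
    window, burns = deque(maxlen=size), []
    for s in samples:
        window.append(s)
        burns.append(sum(window))
    return burns


fixed = fixed_windows(minutes)      # [2, 3] -> only the second bucket breaches
rolling = rolling_windows(minutes)  # under a rolling view the last two minutes breach
print("fixed buckets :", fixed)
print("rolling burns :", rolling)
print("rolling breaches at minutes:", [i for i, b in enumerate(rolling) if b > BUDGET])
```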
So we do all of that.
We do the SOPs.
We test them by hand.
There's a fun effort my team does.
It's a chaos engineering game
where we set up standalone copies of the infrastructure.
And we have three teams, red, blue, and green.
I call it combat, Kubernetes chaos combat.
So the three teams play against each other.
So the red team is the saboteurs.
So they've got in, they've broken things,
and then they will complain at you like a customer,
hey, this isn't working, I don't know why.
The green team has to kind of be like the equivalent to the CEE
or the customer experience engineering team,
where they talk about, okay, here's the things that I can see from my perspective,
here's the things the customer is saying.
And then, of course, the blue team is the actual SRE team,
so we have to go in and find and fix what was broken.
And so we've run these games a couple of times.
We're running them again.
Actually, I'm putting together the next round of games
for one of our managed services offerings,
and we've created it as a repeatable pattern. And, you know, it's cross-functional.
The Red Hat teams are joining in and participating
and, you know, generating, you know,
just being an observer
and generating production data for us
if that's, you know, something that's feasible
with the application.
So that we can all be better trained
on how to handle incidents, how to communicate,
how to do escalations if it's needed,
how to grab engineering if we have to
because it's a bug in the code and we can't fix that, which is actually an interesting point that I'll
come back to in a second.
And so we do these things as part of the training and part of the onboarding.
And we say, we know, we know your system.
We are ready for your pager.
We are trained.
You've met our requirements.
Let's go.
It's go time.
And then the service is generally available.
It sounds like a murder mystery party
It's awesome. It sounds like a fun time.
You know, I thought it was really fun. To be honest with you, some of the team found it really stressful, because it was like, oh gosh, I have to do my job in front of 30 people I've never seen before. And, you know, as reliability engineers, we're not usually people persons like that. So the feedback I got was: smaller groups, please. So more games, smaller groups. That's the new plan going forward, to allow people to feel a little bit more safe about failing or not doing something right.
So no live stream on Twitch for those then, huh?
I could convince a few of the people to do a live stream on Twitch, I absolutely could. But especially the very first time we did it,
where the game format was new, we've never done it before.
That was, you know, maybe 30 people on the call was a little much.
It was a little overwhelming.
And I took that feedback to heart.
So we're doing smaller games at this point,
which is actually easier to coordinate, to be honest with you.
If I just get six people going in a room,
that's way better than trying to get 30 people across three geographic regions.
I got a question for you.
If you look back at the last couple of projects
you ran where you helped your customers to get to that level,
in hindsight, do those have something in common?
Like, are these, let's say, certain mistakes, or not mistakes, but certain situations
that are kind of not good, for all of them?
And kind of, can we tell our audience who would like to think about, you know, coming
to you and having you look at the environment and actually bring it to the certain resiliency
state?
Are there certain things where, you know, hey, for every customer that comes to us,
we always have to tell them this and this and this, because this is always not there? I don't know,
good SLOs, or they have certain architectural principles that they completely neglected?
Are there certain things you see consistently across your customers, like your top three
list of things that we could discuss?
Because that would be interesting, right?
Because it could be interesting: hey, before you are calling Hilliary,
make sure that you've got these three things covered at least.
Yeah.
So the first thing I have to tell everybody is SRE is not your sysadmin team.
We are not sysadmins.
We will not be doing your upgrades for you.
We will not be doing your maintenance for you.
You need to build that into your operator.
You need to build that into your CICD pipelines.
You need to know how that's going to be done
in a reliable and automated way
that leaves your system highly available.
And that is usually the most interesting bar
for teams to have to meet because there is a
little bit of a, oh, SRE will just run it. It's fine. And the answer is no. And I've had to tell
people as gently as possible, no. And so one of the first things I tell people is just so you know,
SRE is not your sysadmins. And I think that's a really important difference that I think people forget.
The next thing is some of the teams have never been cloud native before.
And that is just, you know, the world is changing, so they're changing too.
So there'll be some things that they don't, even if they've learned about doing something
for Kubernetes, OpenShift is opinionated.
So we have to kind of coach them through the differences,
the minutia of, okay, but you can't really capacity plan like that because OpenShift has certain other constraints
that you're going to have to think about.
So you need to adjust.
So there's just a little bit of coaching people
through the differences of OpenShift and vanilla Kubernetes.
At the end of the day, that's not a very big lift. It's just very small things,
check mark items that we can get people through. I think it's a very important thing though,
right? Because I think there's obviously a big debate about opinionated versus giving you all
the freedom. But the benefit of being opinionated is that you're getting put on a certain trajectory that makes certain things easier, if you know the rules. On the other side, people may say, well, but then I have to stick to certain rules. But yeah, certain rules make it easier. So I understand that.
Yeah. And the last thing is: etcd is not a database. I don't care what anybody else tells you. I don't care what the training says. Etcd is not a database.
A lot of our worst problems have been around people using etcd without care to, you know, cleaning up after themselves, deleting old CRs and so forth. My main open source contribution is auto-pruning in Tekton, not because I wrote it, but because I had
to go to that team and say, hey, people are not cleaning up after themselves, you don't have auto-pruning on by default, and it's bringing down infrastructure because
etcd is getting completely full. And so, you know, they were really great. This is an example
of working with, they're actually not a managed service.
So, but working with an open source team,
you know, we have folks on that team in Red Hat.
So I went to talk to them and I was like,
we need this timeline adjusted on this, on this.
And they were so great.
They were so responsive to, you know,
enabling auto-pruning by default
and improving their documentation around
why auto-pruning is important and how to do it.
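The auto-pruning point generalizes to any custom resources that pile up in etcd. Here is a hedged sketch of the idea in Python driving `kubectl`; the resource kind, namespace, and retention period are placeholders, and where a project ships its own pruner, as discussed above for Tekton, that is the better option.

```python
import json
import subprocess
from datetime import datetime, timedelta, timezone

KIND = "pipelineruns"  # placeholder: any CR kind that accumulates
NAMESPACE = "ci"       # placeholder namespace
KEEP_FOR = timedelta(days=7)


def old_resources():
    """Yield names of resources older than the retention period."""
    out = subprocess.run(
        ["kubectl", "get", KIND, "-n", NAMESPACE, "-o", "json"],
        check=True, capture_output=True, text=True,
    ).stdout
    cutoff = datetime.now(timezone.utc) - KEEP_FOR
    for item in json.loads(out)["items"]:
        created = datetime.fromisoformat(
            item["metadata"]["creationTimestamp"].replace("Z", "+00:00"))
        if created < cutoff:
            yield item["metadata"]["name"]


if __name__ == "__main__":
    for name in old_resources():
        # Dry run by default: print what would be deleted instead of deleting it.
        print(f"kubectl delete {KIND} -n {NAMESPACE} {name}")
```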
And so, you know, it's just one of those things where a reliability engineer is the type of person who can come to a team and say,
hey, this is what's happening because these things were not considered and we improve them.
And I know you said three things, but the fourth thing is not necessarily understanding some of the ways to set up your workloads and best practices.
So resource requests and limits on CPU and memory.
I know that there is a subset of folks in the Kubernetes world who don't think that CPU limits are necessary as long as you're setting requests.
I disagree with them.
I have RCAs for why.
Some of that is secret sauce, though.
So, you know, that's what it is.
But that's sort of our typical stance on this, is that you need to be setting those resources.
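For anyone unsure what "setting those resources" refers to, here is the shape of the stanza in question, expressed in Python for consistency with the other sketches; the numbers are placeholders, and real values should come from capacity planning and measurement, as discussed above.

```python
import json

# Placeholder sizing: pick real numbers from measurement, not from this sketch.
container_spec = {
    "name": "my-service",
    "image": "registry.example.com/my-service:1.0",
    "resources": {
        "requests": {"cpu": "250m", "memory": "256Mi"},  # what the scheduler reserves
        "limits":   {"cpu": "500m", "memory": "512Mi"},  # the ceiling before throttling/OOM
    },
}


def has_limits(container):
    """A tiny sanity check in the spirit of the linting discussed later: no limits, no ship."""
    return bool(container.get("resources", {}).get("limits"))


assert has_limits(container_spec), "container is missing resource limits"
print(json.dumps(container_spec, indent=2))
```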
Yeah.
That reminds me of Christian Heckleman, Brian, if you remember him.
We had him on the show.
He did a talk at KubeCon, one of our colleagues.
And it was labeled things, how not to start with Kubernetes,
all the things that you probably do wrong,
and not setting limits or not the right limits.
It was definitely very, very up there on the list.
Yeah.
I think that one falls into the trap of moving to the cloud, right?
Because a lot of people think, I moved to the cloud, therefore I have unlimited resources
and these are things I don't have to think about anymore.
So I'm just going to push my stuff and it's going to run and it'll be magic.
But there's a responsibility you need to still own and maintain. So at the very least, you're not
just consuming tons of money in resources, but there's also the health of the platform that you
need to consider. But there just seems to be a hand-in-hand forgetfulness that
there are still physical limitations to everything, right? Physics doesn't go away.
Yeah, I definitely, that's exactly what I see.
I also think that, you know, there are some people who...
I've done very low-level, very close-to-the-hardware stuff,
where I've had to be super conscientious of resources of all kinds, literally even to the point of:
if we're using a CPU and the IoT server is getting too hot, it will fry,
and it affects the longevity of that device, right?
So I've had to consider it all the way down to that kind of degree.
I think there are a lot of folks who have spent their entire career in the cloud.
And so it's just a little bit of a lack of exposure
to what's actually happening on the hardware
that leads to some of the rules that we've designed.
And in some cases, when you have a healthy budget and you can have auto-scaling on,
you can kind of get away from that.
One of the things about OpenShift Dedicated
is it really doesn't auto-scale
because we're not going to just automatically make decisions
about people's budget
and what they're going to spend on their infrastructure.
So in our environment,
setting all of those, and properly doing
your capacity planning if you're a managed service offering,
and therefore knowing what to correctly set
is really mission critical for the end customer experience.
It's interesting.
I wonder if with this idea, when we move to cloud stuff
and all that, there's a lot of abstraction.
Even if you take it to the extreme to serverless functions, people have no clue what's running underneath there.
But I'm wondering if on the cloud side, there should be more of an effort to expose resource utilization.
Obviously, you might not have any control, but making the users aware of what's going on underneath everything
so that they're forcing
them to just see it and be hopefully conscientious about it because it's so abstracted and no one
does see it. No one is getting exposed to it. I wonder if there should be, I don't want to quite
say a movement, but should there be some sort of movement for lack of a better term to force people
to be aware of: here's what you're doing, right, your code's running, but here's what it's doing to everything else, right?
Yeah. So there's actually something a little bit like that happening within the Red Hat SRE groups. We are working on some
projects that kind of give customers more insights and information. There's a project that I've
been helping out with for a little over a year now called the Deployment Validation Operator,
which basically goes through and just does
a sanity check of your workloads to
make sure they're meeting Cloud-native best practices.
That uses an open source upstream
by StackRox called KubeLinter.
This is something that we actually evaluate
our tenant workloads against and make them pass
the checks for this or get an
exception before, you know, they're allowed to go to production. And actually, my team
owns a service, a managed service that is internal only. And we've had to go through these, the same
pain points that our tenants go through, we're going through it as a tenant as well. And so the
deployment validation operator is a great tool that we use for that. I can't say that that's, I mean,
I've contributed to the documentation there.
I'm the current technical lead of the project,
although that will be handed off to
somebody else for long-term ownership.
But KubeLinter is a great tool for it.
It's a static analysis tool.
You can put it into your GitOps workflows,
like in a GitHub action,
or you can use the deployment validation operator.
So you can continuously have things just checking.
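As a hedged example of the CLI route (without the operator), this is roughly what a pipeline step wrapping KubeLinter can look like; the manifest directory is a placeholder, and the KubeLinter docs linked in the show notes are the authority on current flags.

```python
import subprocess
import sys

MANIFEST_DIR = "deploy/"  # placeholder: wherever your rendered Kubernetes manifests live

# Run the kube-linter CLI; a non-zero exit code means one or more checks failed.
result = subprocess.run(["kube-linter", "lint", MANIFEST_DIR])

if result.returncode != 0:
    print("kube-linter found issues; fix them or request an exception before shipping.")
    sys.exit(result.returncode)

print("kube-linter checks passed")
```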
Can you spell that KubeLinter again? How do you spell that? Do you know?
Yeah. So, kube: K-U-B-E, dash, L-I-N-T-E-R.
Okay. Got it. Yeah.
Because it reminds me of what we are doing.
And I know, Hilliary, the way we actually met initially was through a podcast where we talked about Keptn.
And as part of our open source project Keptn, we are also automatically validating the health of a system by looking at a list of metrics.
We call them SLIs, right?
And then compare them against the objectives. And I was just wondering if we should
think about integrating KubeLinter
as one of the data sources, because we're
open to any type of data source.
And I think that could be really great to integrate it
into the orchestration as we're pushing out new builds.
Then not only looking at Prometheus,
but maybe also looking at stuff that
Kuplintr comes back with. That sounds really interesting. I think that it's a really fantastic
tool. That's part of why I got involved in the operator that Red Hat is writing. And, you know,
like I said, we ask our tenants to pass these checks. And so again, from a perspective as a tenant,
actually to one of my sister SRE teams,
my team has had to go through this
and we've had to do the same things.
Well, we were asked to run a tool
that actually is an open source tool,
but Red Hat has no ownership of at all.
It's a really great tool.
It is a fork of Sentry.
It's called GlitchTip.
And so when we're looking at running it, we're like, okay,
but unfortunately we have to make it meet our own standards.
So my team has made some open source contributions there,
just a few changes so that it would run on OpenShift,
some minor tweaks, and then we implemented an observability API.
Because even though this is a tool to help you have observability,
it also must be observable so that the other team that will be the SRE support for it,
if anything does go wrong,
can see what's happening with Prometheus and then run an SOP to fix it.
I'm having to eat my own dog food on this.
We have to meet our own standards.
There's no exceptions just because we're internal and it's an internal only tool.
We still have to be this tall to ride.
Hey, I know you have a lot of kind of secret sauces that you don't want to reveal.
And I'm not going to ask you about it, but I want to ask you one thing.
I was just in the States, traveling through two different states, interstates, now let's get it right. I went to visit customers,
and the big topic we discussed there was SLOs, service level objectives. And I do have a little,
let's say, workshop where I try to educate them: first of all, what are SLIs and SLOs, what are
good ones, what are bad ones. And then we go through a little exercise,
kind of as a group exercise, to define good SLOs.
Oh, your cat is back.
Look at that.
Oh, yeah, she likes to drop down onto my head during meetings
and scare people.
It's really a shame this is not on camera,
because everybody would have just seen this tabby cat just
drop down.
And now she's on my chair.
The good news is I can take a screenshot,
and we'll get it in somehow.
But yeah, coming back to the discussion of SLIs and SLOs,
this was a big topic.
And people are struggling a lot with defining SLIs and SLOs.
Good ones that actually are meaningful.
And I wonder, what is your take on how do you approach an SLO definition?
Like, where do you start?
Do you have kind of some recommendations
where you say it doesn't make sense at all
to define SLOs at a certain level?
Where should you start defining SLOs?
Who do you need to get into a room
to actually define SLOs?
Just, it's a hot topic for me and for us.
And therefore, I would be very interested in your take. And without revealing any secrets, obviously, of what you normally do in your work.
So unfortunately, none of that is secret sauce, right?
And so it's so not secret sauce that we're writing open source trainings about exactly this on our end as well.
So there's a couple of things. So because of, I think unlike a lot of
SREs, I actually very deeply care about what the SLA says, because I know that we're going to have
end customers using the workloads that I support. So it's multi-level. So what are we promising
customers, right? That's the SLA. So we might say we've got four nines of uptime, right? Great.
How do we make sure we give them four nines of uptime? And when do my alerts happen? So my
objectives are not actually to make sure our SLA is met. My objective is to be better than our SLA,
right? We should be better than our SLA. That's the goal. And then if we're not, that's when we
start firing alerts and getting SRE involved. So SRE should be there well before the customer
is like, knows or is upset or, you know. So that's really a lot about what it is. It's really
around what's the customer going to experience. And, you know, coming from a background of quality engineering, which I know you do as well, I often
consider SRE very much a quality-like function
because we have to help people think about that
from a lot of different angles.
So it's really about being better than your SLA
with your SLOs.
So whatever you want to promise to your customers,
measure for better than that.
And then alert if you're not doing,
if you're not doing,
if you're not performing that well. And that can be around things like we're guaranteeing, you know,
you know, latency, no latency, right? You can't, you can't ever guarantee a hundred percent
because you don't control all of the factors. You don't control global DNS. You don't control
AWS or GCP or, you know, a lot of other pieces that you're running on. So you should never be promising better than what your dependencies are promising.
And then you should never be promising 100%, even if you did control all your dependencies
because things happen.
The power goes out, you know, whatever, like things happen.
So those are some like general guidelines that we tell people about.
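To put rough numbers on "be better than your SLA" (the targets below are illustrative, not anyone's actual commitments), a quick downtime-budget calculation:

```python
MINUTES_PER_30_DAYS = 30 * 24 * 60  # 43,200 minutes


def downtime_budget(availability_target, minutes=MINUTES_PER_30_DAYS):
    """Minutes of allowed downtime per 30-day window for a given availability target."""
    return minutes * (1 - availability_target)


sla = 0.999   # what is promised to the customer (illustrative)
slo = 0.9995  # what the team holds itself to internally, deliberately tighter

print(f"SLA {sla:.2%}: {downtime_budget(sla):.1f} min of downtime per 30 days")  # ~43.2
print(f"SLO {slo:.2%}: {downtime_budget(slo):.1f} min of downtime per 30 days")  # ~21.6
# Alerts fire against the tighter SLO, so SRE is engaged well before the SLA is at risk.
```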
And then, of course, when you're looking into measuring it, you're saying, okay, if we want to make sure that we're not experiencing latency more than, I don't know, two requests a second taking longer than 200 milliseconds to come back, right?
Then that's what you start measuring.
And then you do it over a period of time.
And you say, okay, you know, we're seeing two requests a second coming back slower than we want over a period of an hour, right? That
might be like an error budget type. And I didn't actually sit there and do the math on that at all,
so that might be nonsense. But, you know,
those are the types of things that we're looking at: how do we want to measure it?
How do we want to aggregate those measurements? So we're looking at our overall picture, not just
our moment by moment picture. And we typically actually layer up the Prometheus
alerts in that kind of way to say, okay, if we want this much availability and,
if we're in breach of that, we want to alert SRE, you should probably have multiple layers
of measurement of that. So you're measuring it a few different ways,
so that you're capturing what the customer experience would be.
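A hedged sketch of what "layering up" the measurement can look like, independent of Prometheus syntax: the same latency SLI is evaluated over a short and a long window, and SRE is only paged when both agree something is burning. Thresholds and windows here are arbitrary examples.

```python
def bad_ratio(latencies_ms, threshold_ms=200):
    """Fraction of requests slower than the SLO threshold."""
    if not latencies_ms:
        return 0.0
    return sum(1 for l in latencies_ms if l > threshold_ms) / len(latencies_ms)


def should_page(short_window, long_window, short_limit=0.10, long_limit=0.02):
    """Page only when both the 5-minute and the 1-hour views look unhealthy,
    which filters out brief blips while still catching sustained burns."""
    return bad_ratio(short_window) > short_limit and bad_ratio(long_window) > long_limit


# Illustrative samples: a short spike that the hour-long view does not confirm.
last_5m = [150, 250, 300, 180, 220]
last_1h = [150] * 99 + [250]
print("page SRE?", should_page(last_5m, last_1h))  # False: a blip, not a sustained burn
```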
Yeah.
So basically, I mean, this is also
kind of the way I kind of advised our customers
when I was on the road.
The first thing that I came up with or I suggest
is start from where it matters most.
And that's kind of the business perspective
or the end user perspective.
Because it doesn't do me any good
if my backend services are
perfectly fast and available if the end user, for whatever
reason, cannot access your system,
obviously for reasons that you still have under control.
I mean, I understand your argument with you cannot
control global DNS.
But from that point on, from kind of as close as possible
to the consumer of your services, then break it down into leading indicators.
And then, as you said, you have leading indicators that tell you: if you fail here, you will then start failing, getting closer to where it matters most, which could be availability, your end-user experience, and so on and so forth.
So starting from the top and breaking it down, this is kind of what I advise people.
The challenge that I always have is
if you think about these complex systems
we are operating right now, you have applications,
there's hundreds of microservices potentially.
How granular do you go, or how long
do you walk through that exercise of defining SLOs for every single
service? Does it even make sense? Where does it not make sense to define SLOs?
And I think this is, for me, a challenging question as well. Does it make sense if I
have 100 microservices to define SLOs on every microservice? Or do I rather find
interface microservices that are then accessed by a third party or by your consumers.
So I'm just trying to figure out what your best practices are
there, what your approach would be.
Do you define SLOs on every service,
or it doesn't make sense?
So we have SLIs on every service.
And a lot of times, you'll see an SLO to an SLI
having a one-to-one relationship.
Not always.
And that's not, I think,
typically by design. I think it just sort of happens. But there are some things that are just
central pieces. Like if you think about like a spider web and it all comes down into this like
center circle, right? There's going to be that some like central piece that everything connects to. And that is really like when you,
when you can identify components like that,
those are probably the best components to have objectives around because it
doesn't really matter if some of the underlying pieces are not working or are
working great rather. If that piece is not working,
like if you have a pod that controls SSO, and that pod
is down, well, who cares about the rest of it, right? That should be one of your most
heavily measured things. Anything that's like ingress or egress, right,
should have all of your objectives. Some of the other pieces, you'll probably
want indicators around, because that guy is going to build up to your objective.
But in terms of this is an objective,
that's probably really what you're looking at.
Integration points in general, like, okay,
we are going to have to call out to this third-party API.
You can't have an objective around that API.
You can only have an objective around how
fault tolerant you are in cases of that API is down.
So a fault tolerance objective is probably
one that I don't see a lot, and I really wish I saw more.
We are fault tolerant, right?
We're not crashing because something came back.
We surface that something wrong came back
and that we can't do anything about it because of that.
But we don't crash.
We don't fail.
Those are the types of things that people don't think about.
But that's the place where
I would put objectives around.
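A hedged sketch of what a fault-tolerance objective can look like in code. The third-party call and counters are hypothetical; the point is that a dependency failure is surfaced and counted rather than allowed to crash the service, and the SLI becomes "what fraction of dependency failures did we absorb gracefully?"

```python
import random

# Hypothetical counters feeding a fault-tolerance SLI.
counters = {"dependency_failures": 0, "handled_gracefully": 0}


def call_third_party_api():
    """Stand-in for an external dependency we do not control."""
    if random.random() < 0.3:
        raise ConnectionError("third-party API unavailable")
    return {"status": "ok"}


def handle_request():
    try:
        return call_third_party_api()
    except ConnectionError as err:
        # Don't crash: degrade, tell the caller why, and count it.
        counters["dependency_failures"] += 1
        counters["handled_gracefully"] += 1
        return {"status": "degraded", "reason": str(err)}


if __name__ == "__main__":
    for _ in range(100):
        handle_request()
    failures = counters["dependency_failures"]
    tolerated = counters["handled_gracefully"]
    # Objective: tolerate (not crash on) essentially all dependency failures.
    print(f"fault tolerance: {tolerated}/{failures} failures handled gracefully")
```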
That's great.
So I guess you figure out those critical components as part of your architectural review, which
makes, I would assume, sense.
I really like the fault-tolerant objective.
That means how tolerant are you in case systems that you're relying on break?
And again, this is a classical use case where chaos engineering
comes in, right? Because you want to actually figure that out as part of your testing.
Like, I mean, Brian, right, we had podcasts with Ana Medina, I remember her call when she
talked about chaos engineering.
Yeah, a few of them. It's a fantastic topic. Andy and I both almost immediately fell in love
with the concept of chaos engineering. It's a lot more exciting than traditional performance testing.
Even though you can argue that traditional performance testing is chaos engineering,
because it's the first time when the system was actually brought under stressful situation,
because there was more than one user sitting in front of it, hitting it with requests.
Yeah.
Again, I don't want to take credit for this because I know other companies do it too,
but we took it to the next level
with chaos engineering as training, right?
And so I think that that has been one of the highlights
of my career as the tech lead,
was designing and organizing and implementing that game.
One of the things, one of the terms that I like that came out of the work I did with
Ana Medina was the term test-driven operations, because that's basically what chaos engineering
allows you to do. If you use chaos engineering, you can actually validate, does your monitoring work?
Do you get the right alerts?
Do the right people get notified,
even though it's obviously in an experiment?
And then actually, do people react in the right way
if these alerts come in, or the tools,
and the kind of test-driven operations?
It's like we do test-driven development; we can take this concept to operations,
or to SRE: test-driven operations.
Does everything work as expected?
Do my remediation scripts actually
work when they actually get executed?
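In that test-driven-operations spirit, here is a hedged sketch of a chaos check that breaks something deliberately and then asserts the expected alert actually fires. The pod selector, alert name, and Prometheus URL are placeholders; the `/api/v1/query` endpoint and the built-in `ALERTS` metric are standard Prometheus, but verify against your own setup.

```python
import json
import subprocess
import time
import urllib.parse
import urllib.request

PROM_URL = "http://prometheus.example.com"   # placeholder
EXPECTED_ALERT = "MyServicePodsUnavailable"  # placeholder alert name


def inject_failure():
    """Deliberately delete pods of the service under test (placeholder selector)."""
    subprocess.run(["kubectl", "delete", "pod", "-n", "my-service",
                    "-l", "app=my-service", "--wait=false"], check=True)


def alert_is_firing(name):
    """Ask Prometheus whether the named alert is currently firing."""
    query = f'ALERTS{{alertname="{name}",alertstate="firing"}}'
    url = f"{PROM_URL}/api/v1/query?query={urllib.parse.quote(query)}"
    with urllib.request.urlopen(url) as resp:
        data = json.load(resp)
    return bool(data["data"]["result"])


if __name__ == "__main__":
    inject_failure()
    deadline = time.time() + 300  # give alerting five minutes to catch up
    while time.time() < deadline:
        if alert_is_firing(EXPECTED_ALERT):
            print("PASS: alert fired as expected")
            break
        time.sleep(15)
    else:
        print("FAIL: alert never fired; the monitoring or alert routing needs work")
```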
I'm taking that term
and using it from now on, test-driven operations.
I love that.
But actually, so on that,
I actually want to go back to something I said I'd come back to and I haven't yet,
is my SRE team,
unlike a lot of other SRE teams
when you talk to, you know,
across various companies,
we can't fix production bugs.
And that's because of the ProdSec requirement on where builds originate,
how they originate, and so forth. So we actually don't do that, which is a very different type
of SRE than you find. I found that there was a mix. There are people like us who SRE without
being able to fix the production bugs. So I can tell you what the production bug is,
but my ability to fix it is going to be fairly limited. And if it requires a code change... well, we typically are able to devise ways of working around
production bugs and getting services back up and restored to where they need to be.
But so we actually have to, one of the maturity requirements from our tenants is engineering
escalation path.
How do we page engineering at 3am and say, hey, your code has this bug? And again,
like I said, a lot of that comes down to it's, you know, there's like a lot of project requirements.
This allows us to be language agnostic in what types of workloads we support. So I actively
support Java and Ruby and Python code.
And, you know, I can debug just about anything, as I think most SREs can.
So we can, oh, and Golang, of course,
because the operators are written in Golang.
So, you know, we're debugging
across various different stacks.
You know, I have to dump a Java heap.
I have to, you know, go get some logs
around certain workloads, from certain parts within the pod even,
not just the pod logs.
So we can do all of that, but we don't fix the bugs.
So one of our maturity requirements
is actually being able to page engineering,
raise engineering, raise the BU even
in case that they are gonna need to do some sort of
discussion with the customer or what have you,
or just make them aware of,
hey, there's no way to not blow
our error budget for this customer, sorry.
That's part of our requirements
is having emergency escalation paths.
Then like I said, just general disaster planning.
Like, okay, I have to know I can restore the service,
but it requires sacrificing this data.
What's more important? Mm-hmm.
Yeah.
Have you, for the fixing production bugs,
how often do you see organizations
use feature flagging to actually shield off
or being able to turn off certain codes in case
you're looking at it and you see, hey, this
is actually the vulnerable code.
You cannot change it. But do you advise, for new features especially, to wrap things behind a feature flag, or is this not what you see?
It's not really what I see. There's a degree to which that can exist, but to be honest with you, when you have an operator maintaining state, it basically creates... it's a double-edged sword. And I talk about this in a talk I give called Helm and Back Again. I did this with Christian Hernandez at DevConf.cz, in the Czech Republic.
And so you have no snowflakes when your workloads are operator-based. And that's great, because I have a fleet that is changing size on the daily.
I never know how big my fleet is until it's like time to go look.
So I know exactly what to expect always.
The problem is if something has gone terribly wrong,
I know exactly what to expect always. And that always is very true.
And so you have to be very careful
about how you work around things.
I'm very sensitive to what the operator will
and will not try to maintain.
It's one of the reasons we do such a thorough
software review, actually, is because we will never be able to guarantee a snowflake.
You know, we turn something off, but then, you know,
nodes restart
and we lose those configuration changes.
The operator is going to restore things to the previous state,
which might break stuff.
So it's interesting.
And so feature flags, I see them from time to time,
but typically not.
And that's really the reason why.
That's good to hear.
Hey, Hilliary, I think, first of all,
thank you so much for getting on the show.
It's amazing to hear from you directly
and your day-to-day work and how you help organizations
to really bring the systems into a state
where SRE teams from your organization
are taking it over.
And it seems there's a, I mean, the service
that you deliver just alone by getting them there
is just amazing because you help them to just have better
systems that are more resilient by default.
Is there anything else maybe as a final conclusion
that you want to make sure this is a final thought from you
that people need to understand and need to know?
Gosh, you know, we really covered so much. I don't think so. I think at the end of the day,
you know, the SRE team, we are your allies, right? We are trying to help you have the best system possible. We're never intentionally combative. And one of the things that I would say,
if anybody is sitting in my shoes and they're like, I want to do the same thing Hillary does.
When I'm making recommendations to my tenants, I bring them evidence. All of my recommendations
are evidence-based. I'm like, this is why we recommend this. This is why we say this is a
repeatable pattern. That's actually something I should have covered. We basically have predefined patterns for services architecture, like, we know this works.
And we bring them why we know this works and why we know other things don't work.
And a lot of times we'll bring them RCAs.
And I'll talk to them about some RCAs.
And I'll talk to them about, you know, why.
Like the disaster planning workshop I said, it's all RCA based.
So pretty much everything we do when I'm bringing this to the tenants, I'm bringing them an RCA so they understand the why of it.
Because really, when you bring people evidence,
I find they're usually very willing to make some changes to their code or their architecture,
because they don't want the same consequences that the other team experienced.
Yeah.
And I think this is also a great place to say if people want to follow up,
we will obviously share your social media links. Is there any besides LinkedIn, Twitter, is there any other place or to follow up with you in case people want to know more about what you do and maybe they are now excited and want to do something similar or even join your team?
Really, Twitter and LinkedIn are the best place.
I'm a very private person.
I don't have a lot of public-facing social media.
That is by design.
So those are my preferred new friend ingress points.
And then after, like, I feel like I know somebody, I'm like, okay, you're fine. And then I might, you know, give them other avenues of contact to me.
Then the service mesh will route you to the other, yeah.
Yes, exactly.
The service mesh will route you to the other points of contact, yes.
You'll get behind the firewall.
I just got to say, I think this is amazing, too, because it's about maturity models, you know, and I think one of the challenges for other people trying to
do SRE at their own organization is they probably get told, Hey, we need SRE, go ahead and do it.
Right. And what you're proving out here is that there are a lot of best practices,
there are a lot of requirements, that can really help make it a stellar thing:
instead of just having a title, actually being impactful
and helping the organization reach the height
of what SRE is intended to.
So hopefully anybody listening who's in that SRE realm
took something away from this along those lines.
But yeah, thank you so much.
All right, it's our secret open source contributions,
pushing services to be better all the way upstream.
Awesome.
And remember folks, SREs are not your sysadmins.
I think that's a great line.
I will write all of this up in the summary.
Can you upgrade my Windows?
Yeah.
I need Microsoft Word installed on my...
No, it's a whole different one.
Anyhow, thank you so very much.
And I wish I could be a fly on the wall for the Rogue One V Empire Strikes Back conversation.
But that'll be another day.
Maybe I'll get to hear some of that.
Thank you for being on the show today.
And if anyone has any questions or comments, you can reach us at pureperformance@dynatrace.com or on Twitter at @pure_DT.
Thank you so much for being on, Hilliary.
It was a pleasure.
Thank you so much for having me.