PurePerformance - From Postmortems to true SRE Culture with Steve McGhee

Episode Date: July 6, 2020

Steve McGhee (@stevemcghee) is an expert in postmortems and SRE. He learned the craft at Google, applied it at MindBody, and is now sharing his experiences with the larger SRE community while back at Google. Listen to this episode and learn how postmortem analysis can be the starting point of your SRE transformation, and how it can help reliability engineering build systems that fail gracefully instead of causing full crashes or outages. Steve also talked about monitoring what matters and only defining alerts on leading indicators with an expiration date – a fascinating concept to avoid a flood of custom alerting in production! If you want to learn more from Steve or about SRE, check out these additional resources he mentioned in the podcast: The SRE I aspire to be (SRECon19) and his two-part blog series on blameless.com.
https://twitter.com/stevemcghee
https://www.youtube.com/watch?v=K7kD_JfRUY0
https://www.blameless.com/blog/improve-postmortem-with-sre-steve-mcghee

Transcript
Starting point is 00:00:00 It's time for Pure Performance! Get your stopwatches ready, it's time for Pure Performance with Andy Grabner and Brian Wilson. Hello everybody and welcome to another episode of Pure Performance. My name is Brian Wilson and as always my wonderful co-host Andy Grabner. How are you doing today Andy? Brian, hello. Very good. Sunshine outside and not as high in temperatures as I know you told me earlier on your end. But at least the rain is gone for one day,
Starting point is 00:00:45 even though we've been hoping for rain because we had a too dry period. And now I think the farmers are happy again. Now it's time for some sun and some warmth. You know, it's funny. I do a lot with music. I've really listened to music a lot. And in that, I've been noticing a lot more lately
Starting point is 00:01:02 when people talk, their words trigger songs in my head. And I think there are about three or four that went off in my head as you were just saying that. So that was just really disturbing to me because I don't know what's wrong with me. So fortunately for everybody listening, I'm not going to start singing any of them. I'll leave the singing to you, Andy, because I'm sure you have a better singing voice than I. I'm glad your weather's fine. I was telling you before we started, we have this whole Miller moth problem in Denver and it's really, really bad this year. It's horrible. So anybody who knows about these moths,
Starting point is 00:01:33 yeah, it's really, really, really rough right now because not only are we stuck with kind of being pseudo-quarantined and all, but now you can't even turn on your lights at night because they all come in your house. So real fun times right now, but it'll pass it. This too shall pass as we say,
Starting point is 00:01:48 but we have better things going on. Well, yeah. And then maybe also better weather and no moth problems wherever Steve is from. Yes. Steve. Steve who? Steve Jobs on the show today? Nah, I don't think so. We can't bring him back onto the show from beyond. So, Steve McGhee, hopefully I pronounced this correctly. Yep. Cloud solutions architect,
Starting point is 00:02:18 at least that's what I read from I'm looking at just one of your blogs on blameless.com and Steve you can probably just do a better job in explaining, but telling the world, or at least the listeners, who you are, where you are from, and also how the weather is
Starting point is 00:02:34 and where you're from right now. Yes, I'm Steve McGee. You got it right. First try. Congratulations. There's a secret little H in my name, which is completely silent. You can just ignore it um so i live in san luis obispo california which is also known as slow or slo and i'm an sre and i find that hilarious that i'm an sre that lives in slo um get that joke i think it's great
Starting point is 00:03:00 i actually visited that place in sixth grade yeah Yeah, it's got a mission right downtown. I'm about 200 feet from the mission right now, actually, sitting in my car. That's pretty awesome. It's a beautiful day in Central California. My background is I worked for Google for 11 or 12 years. I was an SRE for about 10 of those and i worked on a bunch of stuff so like android and play and compute engine and youtube and all all over the place um and then and then i left i came here to beautiful central california to work for a company called mind body online
Starting point is 00:03:40 which is uh it's like an app that helps you find gyms and salons and haircuts studios and stuff like that and i helped them like do the cloud thing uh so i became like a cloud customer and then i'm back to google to basically help other customers other other companies do exactly that like figure out what this whole cloud thing is and how to like use it properly and my sort of like specific angle on that is like how to develop reliable systems on public clouds so how to do sre on the cloud basically is kind of my goal so i help a bunch of companies do that all day and it's pretty fun. Yeah. So awesome. That's pretty cool. So you said you were at Google for 11 years before you did the other gig and now you're back.
Starting point is 00:04:31 So you've been doing SRE for what, 15 years? Yeah. So I was when I first started Google, I was worked in the partner team, but I was working on reliability stuff then. So there wasn't SRE back then, but looking back, it was awfully SRE shaped. So I developed a monitoring system that let us understand when other companies weren't showing our ads or search results properly. So one of my favorite stories from that is when MySpace, if you remember MySpace, it was a big deal at the time. They showed ads, Google ads, and we would find out when MySpace was down before they would. And I remember calling their NOC saying, hey, your site is down. And they said, no, it's not. I said, yes, it is.
Starting point is 00:05:16 And they went, hang on a second. And then they were, oh, you're right. Hang on. And they would go reboot something. So we were kind of their pager. Yeah, because you saw a drop obviously of requests coming in from them yeah yeah because i mean if our ads weren't showing on their site we were losing money and they were losing money so we you know it was like
Starting point is 00:05:34 our responsibility to make sure that they kept their stuff running which was kind of a funny way of doing reliability i wouldn't recommend doing it that way ever again but it worked but that's it but that's a great motivation if you are obviously not only a cloud vendor but you're also one of the largest ad vendors in the world right and you want your ads to be shown on websites that run on your cloud you better make sure that these sites are ours are reliable and yeah one thing that has always been interesting we didn't call tom no i i considered like hey is this tom but no it was it was it was the operation center uh but well one thing that we've always kind of or the way i've tried to think about this kind of stuff at google
Starting point is 00:06:15 is that like forever google was an advertising company but like they also made like chrome and like what what's the deal there like why would an advertising company make a web browser but the idea was simply like if you make the whole internet better and more reliable people are going to use it more to go you know do whatever it is they want to do on the internet and that's that's just going to naturally drive up basically ad clicks right this this background radiation of a way of making money on the internet so like i try to think that like in gcp like the background radiation is not ad clicks but but it's VM hours, essentially. And so if we make it easier for people to run businesses successfully on the cloud, we're just going to drive up more of that background radiation.
Starting point is 00:06:58 We're going to have more VM hours happen as a consequence of making the system better. If you can have an e-commerce business on the cloud that is more successful than it would have been on-prem, it's better for you, but it's also better for the cloud, and it's better for your customers, and it's a win-win-win kind of situation. So that's the way I try to take it. It's interesting you mention that, because I don't want to mention the competition by name, but there's this monolithic, very old-school company that's also a big competitor in the cloud. We'll leave it at that.
Starting point is 00:07:31 But their model used to always be operating systems. And as you said, the idea is if you make things easy and usable, people will use it and you can put new revenue models. So I think you nailed it on that. You really just nailed it on that. You really just nailed it with what you said there in terms of make everything easy and people can use it without thinking and the money will flow in. In fact, we see that all the time, right,
Starting point is 00:07:53 Andy, where people just spin things up, don't even realize they're still running and like, oh, that's right. I have a VM running because it's too easy until the finance team comes screaming at you. Yeah. We don't want that to happen either, too, because then they get mad and the pendulum swings the other way and no one likes that. Making sure that doesn't happen, too. And Google Cloud
Starting point is 00:08:16 actually has a great modeling to say you can save money on this VM by reducing slides because you're not using it. I think that was really, really innovative of you all to do. Yeah, that was a fun one. I was on the Compute Engine SRE team when that came out. That came out of the Poland office, and it was just one person in the Poland office
Starting point is 00:08:33 who just kind of had this idea. We can just show these little pop-ups and say, you know, you're not really using this VM. What if you did something else? And the original idea was to automatically change it and be like all dynamic. And we said, well, what if we just pop up like a little sticky note that says hey and so we did that and just that was was a big deal so people loved it they still do hey steve so with all your
Starting point is 00:08:57 experience on sre's uh even though you live in slow but uh you are an sre um the you know reading it reading a blog i mean there's a couple of things you put out there but i what i found interesting is your approach of how do people how can people get started with sres by starting with something they should do hopefully anyway in case something is not reliable which is post-mortems. And it would be great from your perspective because we've been, Brian, right? I think in the last couple of episodes, we focused a lot on SLIs and SLOs and SRE. We had a couple of your colleagues on
Starting point is 00:09:34 and some other folks talking about site reliability engineering. And the question then always comes up, okay, but how do we get started? What's a good model? Because we're not born in the cloud.'re not cloud native yet i mean give me something right and and it would be interesting from to hear from you because i believe you you did a great job in the blog explaining the approach and um so i want to i want to hear from you how can
Starting point is 00:09:58 people how do people get started what do you advise thanks yeah i'm glad you like that that article that was kind of a partnership with Blameless. So this was while I was at MindBody actually. And it's entirely just out of the experience at MindBody. So this wasn't a Google sanctioned official way to do anything. This was just like my experience at the time. It happened to be published later once I was back at Google, but it was actually, I believe the interview happened while I was unemployed actually between the two things. But anyway, um, but the idea was essentially that like, you know, how did this happen everywhere and everyone is always, you know, dealing with them.
Starting point is 00:10:35 And the idea of postmortems are pretty understandable. Like you might call them retrospectives or learnings or, you know, there's all kinds of words for them in Google SRE. We call them postmortems for a long time, which is maybe a little bit dark, but I think the idea comes across reasonably well. But I found that in this one company that I was working for, MindBody, they were doing something that was pretty close to postmortems, but they were doing it inconsistently. And they were, you know, kind of everyone would have their own way of writing up the document, and then they wouldn't necessarily always follow through. And some people would, and some people would kind of forget about it, and blah, blah, blah. So basically, I just kind of introduced the idea of, well, we're already doing something like this, why don't we just add a little more rigor to it. So I introduced like a postmortem template, which was really, really simple.
Starting point is 00:11:31 And it's, you know, it's only like, one page long, and it's a couple headings, just sort of like to give you some sense of what to write, because I found a lot of people would just ask me like, hey, you wrote this postmortem, and it looked pretty good. Like, how do I do that? And I just, I just gave them, you know, a halfway filled in document and that it helped a lot. Um, so being able to just start with something you're already doing and just improving it slightly, uh, is a really good way to start improving a practice across the board. Um, the other, the other thing you can do is, is if you already have seen good postmortems in practice, like I had at Google for 10 years or so, um, I would just write postmortems in practice, like I had at Google for, you know, 10 years or so, um, I would just write postmortems for incidents that I was kind of observing. And I would say,
Starting point is 00:12:16 Hey, uh, I'm, I'm not really, I wasn't really involved in, you know, the outage, but like, I'd like to write it down or help you write it down for you. Um, so I, I did write a few postmortems for incidents that I was just kind of like sitting on the sidelines of. And then I made sure that they were, you know, well, well, more well written and that there are at least they were complete. But the most important thing that I've kind of found or like the most helpful thing that like in terms of outcomes from doing this was making sure that you write at least at least three types of kind of i would call them bugs but you know just kind of tickets maybe you know action items and those would be uh you know detection how did how did it how can we detect this better or sooner or with more precision uh there's prevention how can we prevent this better or sooner or with more precision? There's prevention, how can we prevent this from ever happening again? And then there's mitigation. So how can we, if this does happen again, how can we survive it better? And so if you can at least write one of each of these, you're in good shape. Generally, there's more of some than another. And often the prevention one
Starting point is 00:13:21 is like very difficult. But you know, you should still be able to write it down and put it into a persistent ticket queue or bug filing system of some kind. I have heard from some companies, they would do this and they just kind of say, well, but we don't have a bug filing system or a ticket queue that's like this. And I say, well, okay, this is a good reason to get one.
Starting point is 00:13:43 Sometimes just doing this process sort of exposes other capabilities that are missing as well. So it just starts to kick off a few things. That excuse is like a kid saying, I can't eat my peas because I don't have a spoon. Oh, here's a spoon. There you go. Exactly. I think it's interesting too because what you're saying about having those plans and having that edict in there, I haven't worked on the, I've been in sales engineering since 2011, because what you're saying about having those plans and having that edict in there, I haven't worked on the,
Starting point is 00:14:09 I've been in sales engineering since 2011, but before that I was in the performance side and we started doing some post-mortems. We were actually calling them back then, but it was just a meeting where we'd all get in a room and discuss what happened. There was no organization. There was no folk. It was just
Starting point is 00:14:25 a bunch of people kind of partially blame game but i think they called it a post-mortem because we were trying to figure out what happened but it was very very loose no focus no goals and i think the key as you're saying there's to have some sort of goal even if it's you know something as simple as that and yeah the other thing that's important to do is like once you have the basics of oh we should be writing these down in a consistent way and having like a template that at least has all of the fields that you care about and you start getting some practice writing these uh the next step in this is then to have a uh a, almost, or a meeting. So the suggestion that I have, I hate meetings,
Starting point is 00:15:08 but the suggestion I tend to put out there is just put something that is regular, regular cadence, like once a week, once a month, whatever it is that you think makes sense, depending on how many outages you have, I guess. And say, it's every Thursday at 10 o'clock, we're going to have the postmortem review session. And there is a standing agenda document. And all it is, is there are, you know, it's a one hour meeting,
Starting point is 00:15:33 and there's three 20 minute slots. And you just sign up for a slot. That's it. So if there's already three scheduled, then just wait for next week. But what you do is you show up with your postmortem that you wrote, you send it in ahead of time, and the sort of curators of the meeting will, you know, read it and kind of get a sense of what happened. And then during that 20 minutes, all they do is you just kind of have a discussion on either the document itself, where they'll say, well, you know, you forgot to, you know,
Starting point is 00:15:59 add these types of outages, or I mean, no resolutions are like, we don't really understand the impact, you know, they'll sort of help you write a better document. But But even better is they'll give you an opportunity to sort of explain what happened, and what you think the resolution is, and you know, what the prevention steps and all these are to like, a semi third party, who has almost no interest in the actual outcome, aside from making sure that it's a good learning experience for everyone, and that the system becomes better over time. So this sort of forum is, you know, it's still employees within your same organization, but no, they're not the authors of the system that broke. And they're not the people who got paged during the incident. No,
Starting point is 00:16:42 they're, they're, they're semi uninterested. They can be pretty neutral parties, I guess. Having a forum like that is super helpful for driving up the quality of these postmortems as well as making sure that the outcomes make sense. They're not just
Starting point is 00:16:59 super inside baseball and are super, super deep. They'll help you kind of focus on the things that will actually help the company that's another kind of step you can take yeah one quick question so i guess maybe that's part of that forum or that outside body but if you start writing post-mortems you know with every problem that you see and if multiple people start writing it how do you make sure that you're just not creating a large list of things that in the end nobody really yeah not cares about but at least but how can you make sure that a you are catching duplicates how
Starting point is 00:17:38 can you actually then hey this happened before did happen before yeah is what is there are any good systems for that or is this part of i don't know is part of the best practice part of the um the body you put in there's there's a couple problems with this uh that are really easy to um fall into and you have to be careful of um one is uh i don't want to spend that much time writing documents and like how much time i don't want to make the perfect document every single time like we had an outage like what's the big deal like why do i have to write you know people feel like they're being punished because they have to like write this big document um and that that that is definitely not the case that's that's not the point you're making but that's like a similar a similar point um having a good uh template helps a lot here and
Starting point is 00:18:18 then you can say like all you need to do is fill in these like 10 fields and you can be done in 10 minutes like and at the at the simplest case but then further along is like if it was more complicated sure you should spend more time and explain it a little bit more um but but what you're asking about is is more like um it's the same happening same uh event keeps happening over and over again like are we catching it um or how do we know that the outcomes that were sort of the resolutions or bugs that we're filing aren't just being ignored? How do we prioritize them? Or how do we make sure that they're going into the right bucket? And really what we're doing here now is we've moved into the realm of operations into software engineering.
Starting point is 00:18:58 So we're talking about we're raising essentially feature requests or actual bugs in our feature tracking system or our bug tracking system. So the real question is, is your company good at prioritizing work? And is your company able to take bugs from customers or from product managers or, you know, from internal engineering or QA? How do you how do you do that process today? Like, how do you determine if something is as a duplicate bug or overlapping or something like that? And you just use the exact same system. So really, all we're talking about here is just another form of defect. So this is just generic defect tracking. And again, this is the spoon problem, right? So if you don't have a good way of tracking defects
Starting point is 00:19:52 and you try to do the system, it's not going to work. And it's not because this process is broken. It's because you need a good defect tracking system. Now, what if I don't meet, hopefully this isn't jumping ahead at all or properly, which requires not a fix in the code, but in the fix in your monitoring coverage or what might be acceptable limits, jumping a little bit into the SRE side, what is an acceptable limit of number of database calls that we can handle and we weren't monitoring those and blah, blah, blah.
Starting point is 00:20:43 So that thing goes into fixing your monitoring and yeah you know is that something that you would handle through like let's open a ticket to change the way we monitor so that we can avoid doubles and all that too or totally um i mean i i don't think i don't distinguish between the monitoring and the software because turns out monitoring is made out of software and what you really care about is the entire system right like your ceo doesn't care if it was the ruby that broke or the prometheus rule that doesn't exist like he wants to know why the system itself isn't working you know like okay uh so the idea is like don't treat uh you know configuration and operations and monitoring as like something that isn't software
Starting point is 00:21:25 because it is software. It should be tracked in a similar kind of way. There's the phrase configuration as code or configuration is code. I'm a firm believer in that. Even better is when you have code that outputs configuration that goes into other code. There's code everywhere. So this is, these are all just forms of defects. And I think trying to introduce a different process for operations
Starting point is 00:21:53 is just going to be confusing. So if you have like a consistent way of handling defects through your company, whether you're writing YAML or Java is, I think it's super important that you treat all these system defects similarly. That way you can prioritize work in a consistent way. Right. That's really interesting. Never thought of... So yeah, everything's a defect.
Starting point is 00:22:18 It just gets categorized to different teams. Yep, totally. Because you also find if you're changing a system, often you're like you're introducing a new piece of functionality into like the Java, for example. And then you know, you say, well, actually, at the same time, we want to introduce some monitoring. So you want to be able to introduce these things in a consistent way and say, here's the new feature set. And here's the monitoring that goes with it. And like, if you have to span multiple systems to do this, people are going to be less likely to do it just because it's annoying but also they're there's going to be it's going to be a lot harder to enforce or to track that this is
Starting point is 00:22:53 happening so if you have sort of one system for doing both things you can say here's the new feature and in the same you know change set or whatever the system is you know that you're using calls it you know here are also the tests that with it, and here are the monitoring rules that go with it, and here's the on-call team alias who should be in charge of it. And it's all in one package suite. So that way when something goes down, you can track it back to that one set of changes and say, ah, we introduced this feature,
Starting point is 00:23:20 but we forgot to add some bit of observability or something like that. And I think another great advantage of that is you'll get to test your monitoring or your system change early on to make sure that, number one, it works. Number two, that it doesn't introduce any other kind of problem. If you suddenly say, hey, we're going to monitor heavily on something, that could cause an issue. But if you put that in early in the system, that's part of what Andy and I talk about, the whole idea of shifting left for not only your code changes, but I shouldn't say Andy and I, everyone talks about it, right? But also, your monitoring is code, your system, your code is code, your deployment is code. It gets tested throughout the entire cycle before you get to production to make sure that it's doing what it's supposed to do. It's giving you the results or the information that you need to act upon. And you're not just guessing and throwing in production and then the defect happens again because you picked the wrong metric to look at.
Starting point is 00:24:21 Yeah. metric to look at yeah um so one thing that you uh kind of i think you referred to before um where it's kind of around the cause you think a little bit but like when when you do have oh no this is from the article that you guys sent me about um dynatraces like auto healing capabilities and things like that i thought that was a neat article um one thing that you referred to was about um sort of automated remediations um and i think like uh like i like i said kind of in the email chain like i'm really leery about that concept of automated remediation because it sounds it sounds too good to be true because like it is like you can't you can't rely on it uh to work all the time for all the things um but one thing that you can do which i think is kind of what you guys,
Starting point is 00:25:06 what your system does and is good at, or at least the one that was described in that blog post, I would describe them as short-term remediations. So like if you have an on-call team who is in one time zone and something goes off in the middle of the night, it's a total hassle to wake someone up to do something dumb, right? To you know memory or like restart a process
Starting point is 00:25:28 so being able to distinguish like an outage uh when when you have an outage being able to distinguish a remediation from like a quote real fix you know like a prevention mechanism or like a an actual you know bug uh fix is super super important so um i think automated remediation when it comes to these short-term fixes is great it's super uh helpful it helps people like sleep at night and uh gives you signal into the system of what's kind of going wrong and maybe it's not even middle of the night things sometimes it's just like during you know high load or something like that but the the really, really important thing that, that, that doesn't happen sometimes is people think that when the remediation happens, like, Oh, we have a remediation. Great.
Starting point is 00:26:12 We're done. Uh, like no, no way. Like you need to have a, basically like a due date on these things. Like this is a short-term fix and it will actually expire in two weeks. This remediation will no longer work two weeks from now. So, you know, software engineering, you, you have two weeks. This remediation will no longer work two weeks from now. So software engineering, you have two weeks to fix this properly. So in Google, we had some things like this where if you put one of these short-term fixes in place, it has a required expiration field
Starting point is 00:26:39 and you can't put infinity in that field. There was a maximum of two weeks or something like that. I think that's super important to emphasize that these short-term fixes need to be enforced by not just good intentions, but also the automation of the system itself to say, like, required field is when when does this remediation get lifted so if you can do that i think it's a really good incentive to actually fixing it properly which is the only really righteous path forward yeah it also brian this reminds me a little bit um one i think the first episode we had with uh guaranca bieto from facebook when she was still working at facebook and she talked about the fixit ticket so she basically said it's allowed to deploy something not perfect into production but then you have a you know like a due date until you have to you know optimize it in a way so that it's truly
Starting point is 00:27:36 fit for production um and i thought totally that was also interesting uh interesting content also um yeah thanks for reading the blog. And I totally agree with you. These are, you know, English is not my first language. So whether you call it remediation or maybe mitigations, I think what you're talking is probably mitigating the impact of a problem long enough so that people get some time to truly look at the problem, find its root cause, and then fix it. It's like one mitigation would be we just restart your application server every time it runs at 80% memory because we have a memory leak, but it's obviously not a solution to the real problem. Or like, I don't know, some other, as you said earlier, scaling up in case of peak load,
Starting point is 00:28:22 but maybe we need to come up with a different solution to deal with this uh maybe we need to optimize the system to to be able to scale better or something like that yeah i agree with you yeah i think it's really interesting because i can see people abusing that right it's when you mention it when you talk about it that way steve you know if you know one thing i think a lot of people encourage and we encourage, it's great to automate your playbooks for production outages. Whatever your standard operating procedures are, automate that and get things back up and running. And that then can be triggered based on metrics coming out of your observability tool, aka Dynatrace. But I think you bring up a good point that people, especially if you have it really well locked down on the automation side, that people might just rely on that and keep on letting these things fix and not fix it. sort of expiration or if there's a way to put that into those playbooks um so that where where it does apply people can leverage that is that like i could just see that being a real become a real
Starting point is 00:29:32 liability yeah just these scripts running all the time band-aiding all you know putting plugging all the leaks yeah the problem is that it looks like an asset and it's actually a risk so when you have a dynamic system especially a dynamic distributed system like when you get out of monolith land and you have like this uh system that is always slightly broken and always slightly changing uh every band-aid that you apply is like very specially shaped for the problem at hand during that moment in time and so when anything changes which it will uh you know undoubtedly that band-aid is going to become a liability and you're not even going to remember that it's there so being able to sort of like keep a running tally of how many band-aids are you know in production at any given time and making sure that they get burned away with time is super critical, 100%.
Starting point is 00:30:28 So, I mean, this is not hard stuff in terms of like reporting and, you know, coming up with goals and like being able to keep track of how many they're out there and when they expire and like telling your team, okay, let's try to, you know, get next week, let's try to get two or three of these down. You just gotta know that that's the idea because otherwise, the good intention is like, oh, we have a mitigation system in place, let's use the heck out of it. Let's use it so much when really you kind of wanna use it as little as possible. Hey, got a question for you. So you mentioned monitoring being obviously, it has to be part of the, let's say, software
Starting point is 00:31:08 delivery, because monitoring is just software as well. And whoever builds it and configures it in obviously needs to take care of it. Do you have a strategy on testing monitoring? Because on the one side, we write unit tests and functional tests to test the functionality of the software is there anything kind of in the software engineering practice that also includes testing if the monitoring data itself is actually what we expect and then going further is there anything that you have as best practices on how do we test the remediation because it would be kind of risky to say well
Starting point is 00:31:46 we have remediation in place in production but we've never been able to test it so we just assume it works so kind of these two things can we is there anything i you know that kind of extends test-driven development to test-driven monitoring i'm not sure and i don't think this is the right term but i think you get the idea right so testing the monitoring and testing the remediation as part of the pipeline. So inside of Google, there was a system, I'm sure it still exists too, but there was a system that was exactly for this. It was like a monitoring rule test framework. And it worked fine.
Starting point is 00:32:20 Like you give it input data of, you know, the counter goes from zero to one to two to three. worked fine. You give it input data of the counter goes from zero to one to two to three, and then over the course of time, if it exceeds this rate, then fire this alert. And yeah, it worked. It was notoriously hard to use, but people did it. However, the idea of doing um tests like test-driven monitoring um sort of flies in the face of of the kind of slightly newer at least you know sre uh type of methodology which is that you don't actually want to uh focus on every possible cause but instead you want to focus on symptoms. So kind of what you, you know, the false goal is to be able to enumerate every possible cause and then define an alert for it,
Starting point is 00:33:12 and then maybe write a test to make sure that the alerts work or that the monitoring exists for each possible cause, and then proceed from there, which is totally intractable, especially, again, for a distributed system, which is constantly changing because you're always moving those causes. You know, you're introducing new causes and getting rid of old ones. And so you're just going to be on this treadmill of adding and removing alerts as well as those tests, which have enforced the validity of those alerts. So we found this to be the case many years ago, that this was just like this impossible task.
Starting point is 00:33:53 So it took a long time to realize what was going on, but essentially we realized this is kind of what we would call praying to the wrong god. So what you do instead is look at symptoms. This is where SLOs come out, right? So this is looking at what the end user sees, if it's fast enough and correct enough, and it's available enough and all these things like that. So this is kind of a rehash of probably things you've already talked about, but just because you add SLOs and availability and latency alerts
Starting point is 00:34:24 doesn't mean you get rid of all the other monitoring it just means that you don't react to it directly um so the the way that this is actually a pretty good parlay into the uh the cause tree we were talking about but like uh if you if you think about uh when when an slo violation occurs it's you know something got slow or something became uh slightly unavailable like it started issuing errors more errors than we like um that's like the top of a tree uh and i mean like a you know computer science type of tree right um and so like there's this huge graph of of nodes that may have caused it uh why did it get slow well i don't know is it was the load balancer was it you know the container ran out of memory? There's a million things that could have possibly caused it. And do is you want to be alerted by the root of the tree,
Starting point is 00:35:26 which is, you know, the system is slow or having errors. And you want to be able to run down this potentially huge cause tree. And you want to be able to prune that tree, right? You want to say, well, you know, this entire right branch of potential causes is not the case. We've ruled that out. So you don't have to go down any of this. And then you go down the tree a little bit further and perform some tests and say, well, we've ruled out these branches as well. So this is, you know, if you're into computer science and stuff, you know about tree branch pruning. Pretty quickly, you've reduced your search space
Starting point is 00:36:00 to something totally achievable. And you can actually then go to each of the leaves and test them individually and say, was it this? Was it this? Was it this? Nope. Okay, there's the one. Okay, it's finally we figured it out. We found the cause. And it might be a novel cause might be one you've never even considered before. So this is what I mean when I talk about pruning the cause tree. I used to call it pruning the blame tree. But I think the word blame has like, you know, personal consequences, which which is not not what I mean I just mean like what part of the system is is the cause or is to blame so being able to do this generally isn't the case that you have perfect monitoring rules in place that you've written tests for. What it is, I think,
Starting point is 00:36:45 really is that you have a fully deployed, fully usable observability suite, which isn't something that you have to keep writing new rules for. So it's something that you're able to introduce new code, which has been instrumented by the developers of the code saying, yeah, these are important, you know, counters, these are important distributions.
Starting point is 00:37:09 And then you just sort of step back. And then when the moment comes that you're that you need to look something up in the observability suite, like, what is the distribution of this bucket, you know, over over time, you can just call call for it from the observability suite. I think that's that's the kind of the righteous path forward as well. So it's, it sounds weird. But I think spending a lot of time on making the perfect alerts and the perfect dashboard is kind of the old, the old invested way. And what you really want is a observability suite that allows you to make sort of just-in-time dashboards for interrogating the system based on symptoms based on based on slo alerts if you can build that i think you're miles ahead i think you're you're in good shape yeah i mean you are this it's it sounds like uh it's music to our ears uh what you're saying because basically you know we we're good from our perspective you know we we work for dynatrace so this is exactly the um the approach that that we are taking that where it matters most and that's
Starting point is 00:38:16 either the end user because we also include end user monitoring data in our alerting on the top or service level right your slos and then in case we see an anomaly and you know we do i think that's what you guys are doing as well and others right we have the combination of you have let's say hard-coded slos let's say a certain threshold or you which is baseline and basically learn from from the past and then alert on anomalies and then we basically walk our dependency tree um into all directions because we are combining distributed tracing with full stack monitoring information and then we basically walk that tree and figure out where we in which path do we really have the problem and then we can leave out the other things and then we can also
Starting point is 00:39:02 see you know okay how did this historically where did it start and what's the true root cause so that's pretty it's it's good to hear from somebody like you that this is you know obviously really solves the problem and we've known this from our customers that they also like it but i also have to say a lot of people you know that have been living in the monitoring space for too long and are very proud of their beautiful dashboards, they still, a lot of them still start with, okay, what can I put on the dashboard so that I can, you know, I don't know, just, you know, be, feel like I'm under control if, if, if there's a problem. So this, this is a great example of really good intentions, but not really focusing on the right problem.
Starting point is 00:39:50 I actually just had a customer the other day who was asking me, they were performing a load test, and they were using a system that we have at Google. It's a managed Redis, basically. And they were saying, the CPU is too high. Why? And I said, okay, okay well that's weird uh let me check it out and they said why why is the cpu pegged and i kind of went well i don't know
Starting point is 00:40:11 what are you doing with it kind of thing and basically i said well what's the problem instead well the problem is the cpu is too high well that's that's not a problem that means that you're using your computer as well you know like you're using exactly as much of the computer as you need what's the actual problem they said say, well, there's nothing. We're just worried. And I understand that. Like, that's, you know, there's plenty of good attentions because they've seen many times in the past where elevated CPU leads,
Starting point is 00:40:36 it's a leading indicator of an actual problem later. And that's no fun. However, high CPU doesn't really mean anything by itself until you actually have a user-facing indicator that actually says, yes, this is a problem. I mean, again, Brian and I, we come from a performance engineering background. And one of the aspects of performance engineering
Starting point is 00:41:02 is obviously performance optimization. So if you see something like this, you may wonder, is this normal? And if not, how can we optimize it? But then you have to compare it either to historical data. So was it always at 100% CPU or did it just jump from 80% to 100%? Still not having an impact to the end user,
Starting point is 00:41:19 but still, why did it jump from 80% to 100% because of the last build, right? So these are obviously fair questions but i i i i understand what you're saying that people too often are just freaking out on there's a metric that looks strange to me so let's alert let's ring the alarm bells without knowing if it's if it's really impacting at the bottom line which is am i are we meeting our our slos um i think that's what it is in the end right right and and again it's always it's always based on good intentions sometimes it's it's due to uh you know past experiences that you know they've seen this
Starting point is 00:41:57 type of pattern before sometimes sadly it's just kind of cargo culting you know they've heard that this is bad so that we should totally change it um and but but really the the great thing about having something like us in place is that no matter what their intentions are what their you know maybe misunderstandings are is that you can point back to this one number and say like is this number okay and they'll go oh yeah that number is okay and then you can say okay so you know stand down everyone take a deep breath we're good um and so they can it's really a matter of practice yeah so if people have these habits you kind of have to give them a way
Starting point is 00:42:30 out you can't just tell them to stop yeah you have to give them something that's better yeah hey let me ask a question there though because you bring it almost sounds like we're talking to extremes and i'm wondering about the middle ground here. Obviously, if you're just pointing out CPU's higher, even if CPU's higher, you don't always want to jump on that. But if you're taking an extreme end-user view of it and say,
Starting point is 00:42:56 alright, CPU's higher, but there's no impact to the end-user, there is the potential to be glossing over problems that are going to arise. If the CPU went from 20% normally at the same load to 80% suddenly with the new code release, but still there's no impact to the end user, you've just pushed your system to a place much further than it was,
Starting point is 00:43:23 which could be bringing it closer to the edge. That means if you have suddenly an increase in, so what is that middle ground and how do you define that? Because in the last discussion we had about SLOs and SLIs, a lot of it came down to, again, what's the impact on the end user and defining your performance or your reliability from that perspective. But I feel like there's been this unspoken whitewashing of what the system's doing underneath. Yeah, so you have to have this first discussion that we just had at first,
Starting point is 00:43:56 and then you can get to exactly this point, which is, yeah, so what happens, like a great example of this is quota issues, right? So you can be humming along just fine. And then you're going to hit some quota somewhere. That's, you know, 200 somethings per second, whatever it is. And it's like somewhat arbitrary. It's just, it's just whatever the quota was, right? And then your system is just going to fall off a cliff, right? And then you're going to, you're going to have this big fire drill and you're going to freak out. And you're going to say, why didn't we know about this? You know, how come we didn't have a leading indicator, which is kind of what you're getting
Starting point is 00:44:28 at is some of these non SLO metrics could be used as leading indicators. And so we should pay attention to them. Right. So my, my response to that is like, uh, that's true. However, there's always going to be leading indicators that you're not watching and there's always going to be leading indicators that you're not watching. And there's always going to be indicators that occasionally lead, right? So do you want to have a lot of false positives and false negatives? Probably not. So the solution to this is not easy. But the solution to this is, you know, stay the path with SLOs, except you need to be able to now start investing in what we kind of call reliability engineering or resilience engineering, which is basically,
Starting point is 00:45:14 how do we let our system fail gracefully? So when we do hit one of these thresholds, whether it's a quota issue or CPU capping out or some other form of plateau, how do we let our system not completely collapse when that happens? Because that's really the higher level concept that you're referring to here is when something funny happens and we didn't have a leading indicator that we were tracking, how do we make sure the whole system doesn't collapse?
Starting point is 00:45:42 Or how do we make sure we don't even have a pretty big outage due to it or, or, you know, have some badness? What, you know, the ideal scenario is something unexpected happens, you know, so you hit some, either it's a quota or it's a scaling limitation that you didn't even know you had in your system and you had no indicator for it, but you hit it because, you know, you're doing great in your business and you have a lot of customers and you're making a lot of money. And one day you hit some number. Ideally, what happens is graceful degradation happens.
Starting point is 00:46:11 You know, 99% of users still continue to send you checks every second or whatever it is, you know, make you money. But like some small number of users can't. Then you can notice that because it's showing up in your error budget and then you can find the scaling limit and adjust it in due course without it being a complete panic right that would be ideal but i think the path forward to that is really that the way to get there is by uh essentially sadly it's by finding those like the hard way a few times, and having postmortems and saying, Oh, we keep following the same pattern, we keep having a bunch of client retries, maybe we should introduce, you know, exponential back off with jitter. And that will prevent this type of failure. Or maybe we need to have our load balancer send excess traffic to dev null,
Starting point is 00:47:04 right? And that way, we can keep sending real traffic to the back end systems and not completely, you know, crash the entire system. So essentially, what I'm getting at is like, it's hard, I understand that. But like, the answer isn't monitor more things. It's understand the system better and keep monitoring you know the end user's perspective but find a way for find a way to to allow your system to adapt which which is not easy but it's it is possible this is exactly resilience engineering so so i just to uh making some notes here but i'm if we come back to the if i hear this correctly if i can kind of repeat what i what what we what the perfect system should be or what a system can be that makes this
Starting point is 00:47:51 all possible is we obviously start with our slos and once we have problems we use the the cost tree analysis to figure out okay what's the potential what's the real root cause because we get faster with it and then if we detect hey you know the last three times we had this outage of slowness it was always the uh the full the 90 full disk on this particular machine or it was always the connection pool on this particular one then this obviously is new knowledge we'll convert it into a leading indicator and say, hey, we've observed that once this indicator crosses a certain threshold, 10 minutes later, there will be a problem.
Starting point is 00:48:31 So let's include this as not an SLO, but as a leading indicator for pre-alerting or for preventive alerting. I actually wouldn't go that direction. I would say instead, how do we make this thing, which is currently a leading indicator to outage, how do we prevent this from causing an outage? I wouldn't say let's alert. Because the point of an alert is to have a human do something or have a piece of code do some sort of mitigative step, right? Perform some, some,
Starting point is 00:49:06 some change. Um, why not have the system just do that change? Um, and if you, if you look at the big picture of the system, maybe, maybe, maybe this is what you're describing is that you just want the system to sort of like do its own sort of maintenance, I guess. Um, but describing that as an alert, uh, can be taken the wrong way by some operational teams and that what they'll do is they'll they'll assign a ticket to an on-call human right and expect them the phrase we use inside google which is kind of silly is this is this is called feeding the machine human blood you don't want to do that you don't you don't want to make sure that the systems persists or subsists entirely on human effort in fact you want to do
Starting point is 00:49:45 as little of that as possible so anytime that you say let's add an alert you know you should pinch yourself like that if instead you say like let's have a mitigation that's that's fine like um so having a you know cloud function or something like that or like a lambda that like steps in and performs some mitigative action uh that's it, that's a good mitigation. But again, like what you said before, that should be a, there should be an expiry on that, that should give you time, that should give you two weeks to solve what the actual problem is. And if the actual problem is, every time you run out of disk, then you know, we're unable to write the buffer out. The real solution is stop writing the buffer to a single disk. The real
Starting point is 00:50:24 solution is start writing to a distributed disk abstraction of some kind which has like an auto scaling back end and has or maybe you know change the disk that it's writing to to have like a cleanup function so that way you know we always have plenty of space and like these are all preventions they're not you know mitigations so but it's this this again goes back to the post-mortem issue of of how do you take a problem that happened in production even a kind of not a big deal like not a real outage but like an almost outage how do you get the monitoring you know how do we uh get the prevention how do we get the mitigation yeah figured out perfect so so i think what i they're all related yeah but related. Thanks for challenging me on that thought
Starting point is 00:51:05 because I think I just... Again, I'm taking notes here. It's just my opinion. I need to put on... I need to forward this list to the product team here because I really like the thing that you just said. The number of alerts is a reverse indicator to your maturity.
Starting point is 00:51:22 Well, I'm not sure if reverse is the right, but it basically is not a good indicator if your maturity. Well, I'm not sure if reverse is the right, but it basically is not a good indicator if that grows. However, obviously alerts might be necessary for a certain period of time in case you have not yet found the right solution, long-term solution to fix
Starting point is 00:51:38 so you can define an alert on a leading indicator, but it should have an expiry date because otherwise they're lingering around and then you're just freaking out certain people because more and more alerts are coming in and it's also not manageable anymore. So that's why this makes a lot of sense. Yep.
Starting point is 00:51:54 And you know what I really love about that last example just goes to show you the level of or the lack of maturity in my thought process on this, I guess, is a lot of times when i see what we would call maybe a performance anti-pattern you know my initial suggestion is well you should stop doing that then right but in your in your example with the you know like well don't change your code so it doesn't do that so you're not filling up the disk why are you filling up the disk well
Starting point is 00:52:21 there's probably reasons why you need to do something to disk and i like the idea of just, don't think of the solution as stopping the behavior that you're doing, because it might be a required behavior. Think of the solution outside of, well, you're writing to a single disk, write to distribute it. Taking it to there are new options out there, especially in the cloud providers. There's always another way you can mitigate that problem. So it's not necessarily that there's a problem with the code or the problem with the way the thing is operating. The problem is with the way it's just being handled
Starting point is 00:52:56 or the system that it's running on. And that can be improved. Another good way of thinking about the same problem is that really what you want to do is just think about what kind of trade trade-offs you're making and maybe you need to just make different trade-offs um so for example with this with the same same example this actually comes from a colleague of mine uh you need the actin uh he gave a talk at sre con a couple years ago i think called the sre i aspire to be actually maybe it was one year ago. Esrikan Europe, I think. He basically said, do you remember
Starting point is 00:53:27 there's this little thing called RAID that we all know about? Basically, whenever we would write to a disk, it was fine until it wasn't. There was occasionally a problem where the whole disk just died. How did we solve it? We added a second disk. Oh boy. Or maybe four disks
Starting point is 00:53:44 or something like that. And so what we're doing here is we're still writing to the disk. We're not really changing the code that says, I need to open a file and write to it, but we're changing the system underneath. And what we're really doing is trading something for reliability. In the base case, you're just adding an extra disk, so you're really just paying twice as much for your disks; you're trading off dollars, or euros, for reliability, which is a great trade-off. We didn't even know that was an option before, and then RAID showed up and said, well, if you just spend twice as much on your disks and put them together in this particular way, you never have that type of failure ever again. And everyone with a checkbook says,
Starting point is 00:54:23 no problem, let's do it, this is totally worth the trade-off. So finding more trade-offs like that, where you're saying, I want more reliability and I'm happy to pay for it with either money or time or compute costs, essentially, finding those trade-offs is really where reliability engineering lives. And I highly recommend looking up that talk. If you can put it in the show notes or something, I'll send you the link. I just found it, so it's easy to find when you Google it.
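As a back-of-the-envelope illustration of that kind of trade-off, with made-up numbers and the simplifying assumptions that disk failures are independent and rebuild windows are ignored:

```python
# Crude model of the RAID 1 trade-off: assume an illustrative 3% annual
# failure rate per disk and (unrealistically) independent failures.
single_disk_afr = 0.03

# A mirrored pair only loses data if both disks fail in the same window,
# which is roughly p * p under these simplifying assumptions.
mirrored_pair_afr = single_disk_afr ** 2

print(f"single disk:   {single_disk_afr:.4%} chance of data loss per year")
print(f"mirrored pair: {mirrored_pair_afr:.4%} chance of data loss per year")
# Spending twice as much on disks buys orders of magnitude more reliability,
# which is exactly the kind of explicit money-for-reliability trade-off
# being described here.
```

The 3% figure and the independence assumption are only there to make the arithmetic visible; real failure rates are correlated and depend heavily on rebuild times.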
Starting point is 00:54:53 The SRE I Aspire To Be, we should add it to the show notes. SREcon 19. There you go. Oh, so it was a couple of years ago. Oh, last year, 19. So SREcon 19, I think that was, as far as I remember,
Starting point is 00:55:07 if I look at... oh geez, it's only 2020. It's still 2020, so that's less than a year ago. Oh man. I know, nobody knows what time it is anymore,
Starting point is 00:55:15 time travels differently. We all want to forget 2020 because of the whole thing that is going on, but it's still 2020. Yeah, we're past it, right? Yeah. Nope. You know, it's funny, this whole conversation,
Starting point is 00:55:25 so much of this conversation, I was saying earlier that I was thinking of a lot of songs when Andy was talking, and so much of the conversation has been about people just doing things because that's the way they've always done them. It's come up a lot, and I just have to mention, it keeps making me think of the Monty Python sketch, the operating room one where the guy's like, bring me the machine that goes ping. They reveal this machine, it goes ping, and it's
Starting point is 00:55:49 like, there, it made the ping noise. Because you're doing what you've been told to do and what you've historically done, and you're not doing anything different, just marching on in the same old way, which we know is a path to failure. Yeah. That's cargo culting. A little dramatic there, yeah. What is that, cargo culting? Oh yeah. Have you guys heard of it?
Starting point is 00:56:11 It's a great little silly story. The idea is that during World War II there were Pacific Island
Starting point is 00:56:16 tribes that were basically invaded; this war showed up on their doorstep. And so all these planes started landing on these
Starting point is 00:56:25 islands and setting up sort of refueling stations. They'd set up an airstrip, and they would come in with all these planes with all this stuff in them, and the people who were indigenous would kind of benefit, because they would give them food and stuff. And then all of a sudden the war was over, the planes stopped showing up, and the airstrips were still there. The indigenous people didn't really understand what the war was, or what planes were, or anything, so they said, well, if we make these control towers, and we put our arms up like this and move them back and forth like the people did before, the planes will show up. And so this is what they would do, and there are these pictures of
Starting point is 00:57:03 these people who had built control towers, small ones, but control towers, out of essentially wood and sticks, sitting on them. And they look like air traffic controllers, and they're trying to get planes to show up, because they want the planes to come back with tons of food and people and supplies. They called this the cargo cult, because they wanted these cargo planes to show up, and they were just trying to redo the actions that drew them in before, without really understanding what was happening underneath. I'm actually not even sure how true the story really is, but it lends itself to this concept that often, and I hate to say it, the operations centers within these enterprises are filled with teams who are not well informed, they're not well paid, and they're not experts at the system they're operating. So they just do, like I
Starting point is 00:57:58 said, with the best intentions, what they've seen work. But at the end of the day, often the entire system has been entirely changed under their feet and no one told them, and so they do something which is basically the wrong solution, because they don't know any better. No one told them, we don't even use that database anymore, so why are you spending all this time making sure its RAM is high, or something like that. So this is a really common pattern, which is totally preventable. And this is part of the problem of cause-based thinking, as well as throwing code over the wall. These are all bad practices that lead to this kind of unfortunate position that some of our colleagues are in, where they're just desperately trying to keep the lights on by throwing RAM at whatever server
Starting point is 00:58:49 has a red dot next to it on the monitoring system, which is no good. The real solution is to have the people who understand the system be able to interrogate it and make changes directly to the system that they wrote, and make improvements directly. That works much better. It's a lot more rewarding, frankly, as well. Don't forget, the key is to change everyone's title to include the word DevOps.
Starting point is 00:59:15 Please don't persist that. We know that's a bad idea. I don't even want to put that out as a joke. Please do not rename your NOC the SRE team or the DevOps team. This is a bad plan. All of you managers out there, this is not a good idea. All right.
Starting point is 00:59:32 Hey, it's amazing, we've been talking for almost an hour, I think. And Steve, the way we typically do this, and Brian, I know you're probably wondering, what is he doing today,
Starting point is 00:59:48 is he doing the summary or not? Do it now. Yeah. No, I think so. Let me summarize quickly, because, Steve, at the end I always try to summarize what I learned today in a short fashion. Hopefully I got it right; otherwise you can scold me later on or correct me. But I really liked, in the beginning, when you told your story about how important it is to start with a post-mortem culture, and that you have to make it easy for people to fill it in.
Starting point is 01:00:14 And it shouldn't be a burden, or feel like a burden; make sure people can figure out and write down how to detect the problem, how to resolve it, how to prevent it, and how to mitigate it, and then have kind of a culture around it to constantly review it and learn from it. I think another thing I learned, and Brian and I have also been talking about this in the past, is the importance of monitoring as part of your code, because in the end we need to make sure that everything we do can be monitored. It should not just be a siloed exercise where somebody else may or may not provide some monitoring data; it should be part of the package you deliver. Now, the other thing that I really liked from you is coming to the cause tree analysis.
Starting point is 01:00:56 So instead of defining many, many different alerts on many different metrics that we may think are good indicators for problems, start at the top, start at your SLOs, which are typically close to the end user. And in case there is a problem, then follow the path of figuring out where the true problem is coming from, by following your distributed tracing tree, or whatever monitoring tool you have in place, your observability platform, which can then get you to the root cause. And I also like your definition of what reliability engineering in the end really is.
Starting point is 01:01:33 It's really about how we can learn from these things and how we can make the system fail gracefully in case there is a true problem. How can we mitigate problems so we have more time to fix them for good? That's what I liked: reliability engineering, the way I took my notes, is about how we can help build a system that fails gracefully in uncertain or unknown conditions and doesn't fully crash. That's also very, very important. Last thing, and this is really what I'm going to take back to the product team on our end: while custom alerts on certain metrics might be a good idea for a short period of time, in the end they're a bad indicator for your maturity, because
Starting point is 01:02:20 custom alerts just tell you that you have a system where you're not actually investing in it becoming more reliable and more resilient, and you're still fighting more fires because you get a lot of alerts. And yeah, really. So did I get it kind of right?
Starting point is 01:02:39 Yeah, I think so. One small point I would add is when you talk about SLOs being customer and user facing, that's totally accurate. However, as you become more mature, you're going to have many layers, right? And then some people say or they ask the question, like, what if I have many layers? Do I have to have SLOs at every layer? Or how does this work?
Starting point is 01:02:57 And what if I'm a back end team? I don't have customers, front end customers. Your customer can be someone who works at your company. So if you're in charge of the document indexing service or something like that, your customer is all of the front-ends who call you. They don't need to know what your SLOs are and how you're tracking your own service. It's up to you. So when I say customer-facing, sometimes that means internal customers. So that's just a minor caveat, especially like for business systems as well. Like if you have an entirely
Starting point is 01:03:28 internal-facing system for, say, big data analysis for your marketing team or something like that, then they're your customers; the marketing team is your customer. Well, I guess that's also why they call them SLOs, service level objectives, and not site reliability objectives, because this is also a point I want to make: why do we call it SRE? Why do we call it site reliability engineering and not service reliability engineering?
Starting point is 01:03:55 The intent is that you're looking at the holistic giant system, right? Calling it systems reliability engineering is a little, I don't know, dated feeling. And then service reliability engineering is a little bit too specific, like it's a little too focused onto one service potentially. Also, it's just a historical artifact. So it's the site. So Google was referred to as the site. So all of Google, whether it's Gmail or search or ads or cloud, this was all, you know, the site that we're trying to make more reliable. I think it's just history. Awesome.
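For the internal-customer case, here is a minimal sketch of what a basic SLO check could look like for a hypothetical internal service such as the document indexing example above. The target, the request counts, and the two counters are all invented for illustration; in practice these numbers would come from whatever observability platform is in place.

```python
"""Toy SLO / error-budget check for a hypothetical internal 'document indexing'
service whose customers are the front ends that call it."""

GOOD_REQUESTS = 998_734      # indexing calls answered successfully this month (made up)
TOTAL_REQUESTS = 1_000_000   # all indexing calls this month (made up)
SLO_TARGET = 0.999           # 99.9% availability promised to internal callers

availability = GOOD_REQUESTS / TOTAL_REQUESTS
error_budget = 1 - SLO_TARGET                      # 0.1% of requests may fail
budget_spent = (1 - availability) / error_budget   # fraction of the budget consumed

print(f"availability: {availability:.4%}")
print(f"error budget spent: {budget_spent:.0%}")
if budget_spent > 1:
    # Alerting on this one number is the SLO-first alternative to piles of
    # per-metric custom alerts discussed earlier.
    print("SLO breached -- the internal customers (the front ends) are feeling it")
```

The point is that the internal team owns the SLO and the error budget; the front ends consuming the service only need to know whether the promise is being kept.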
Starting point is 01:04:32 Well, Brian, anything else from your end? No, I'm good. This was excellent, Steve. I really appreciate you taking the time to come on with us. Sure thing. It was really fun. We've been having a lot of fun SRE conversations lately, and this one was kind of a new direction,
Starting point is 01:04:46 so it was really fun for our side. Glad you were able to do it all successfully from the car. Sorry if there were
Starting point is 01:04:55 little background pings as people would drive by me and stuff; I hope there wasn't too much background noise. No, no,
Starting point is 01:05:01 everything was great. Is there any final thought you want to add before we sign off, or did you get everything? I think one thing that's just worth noting is that you can't buy SRE off the shelf. You can barely hire them. It's really something you have to build over time. It's specific to your company and your team. It's an investment. You have to take it slowly. You have to build it up.
Starting point is 01:05:27 And one of the most important things you can do is you have to have what we call an executive sponsor. Someone has to say, yes, as a company, we're going to do this. We're going to staff it. We're not going to cancel it in one year. It's not a project that has a timeline. It's not going to be done in 16 months. It is just a new job role that will be there forever. So as long as you take it that way, you will make continuous improvements, your systems will become more reliable, and you'll be more profitable
Starting point is 01:05:56 business, if that's your goal. All right, appreciate it. And what do you do, like LinkedIn, Twitter, anything like that, where people can follow you? Yeah, I'm Steve McGhee on all the things. S-T-E-V-E-M-C-G-H-E-E on the Twitters and LinkedIn and GitHub and probably all the stuff. Google Plus? Google Plus, Wave, all of them. Yep. I think I had, I'm pretty sure Google Reader
Starting point is 01:06:25 had a username functionality, and I had Steve McGhee on that too. But, you know, RIP. Pour one out for Reader. I heard that one was popular. I never played with that one. Anyway, thank you very much for being on the show.
Starting point is 01:06:39 If anybody has any questions, comments, you can, you know, reach out to Steve, obviously, or just follow him. If you have any questions, comments for us, you can get us at pure underscore DT on Twitter or send us an old-fashioned email at pureperformance at dynatrace.com. I think that's it, right, Andy? It's been a while.
Starting point is 01:06:56 Anyway, thanks for listening, everybody. And thank you, Steve, for being on again. Talk to you all soon. Bye. Thanks for having me. Bye. Bye. Bye.
