PurePerformance - Learning from Incidents is what good SREs do with Laura Nolan

Episode Date: January 16, 2023

Incidents happen! And if you ask Laura Nolan, who was an SRE at Google and Slack, healthy organizations should take proper time to analyze and learn from them. This will improve future incident response as well as overall system resiliency. Tune in to this episode and hear Laura's tips & tricks on what makes a good SRE organization. It starts with doing good write-ups of incidents and doing your research on incident reports of the software and services you are looking into using. We also spent a good amount of time discussing root cause analysis, where she highlighted an incident that happened during her time at Google and what she learned about outdated alerting. Thanks Laura for a great discussion and lots of insights.

Here are the additional links we discussed during the podcast:
Laura on LinkedIn: https://www.linkedin.com/in/laura-nolan-bb7429/
Laura on Twitter: https://twitter.com/lauralifts
Incident Template talk @ SRECon: https://www.usenix.org/conference/srecon22emea/presentation/nolan-break
What SRE could be talk @ SRECon: https://www.usenix.org/conference/srecon22emea/presentation/nolan-sre
Howie Post-Incident Guide: https://www.jeli.io/howie/welcome
My philosophy on Alerting article: https://docs.google.com/document/d/199PqyG3UsyXlwieHaqbGiWVa8eMWi8zzAn0YfcApr8Q/edit

Transcript
Starting point is 00:00:00 It's time for Pure Performance! Get your stopwatches ready, it's time for Pure Performance with Andy Grabner and Brian Wilson. Hello everybody and welcome to another episode of Pure Performance. My name is Brian Wilson. And as always, I have my predictable co-host, Andy Grabner, who's trying to throw me off as I say these words to you today, our dear listeners, because he has a very low opinion of you all. And he wants me to mess it up so that you have to suffer through my stumbling around. And I care about you though. I know Andy, Andy can be a jerk sometimes. It's the side of Andy you probably are not aware of how much of a jerk he is. He once
Starting point is 00:00:56 yelled at me for petting a puppy. I just want to let everybody know, okay? So anyhow, Andy, mean guy, how are you doing today? I'm really... you offended me when you said that I'm predictable. Well, you did the same thing again. Yeah, because if I am predictable... well, if you didn't do that today, then... So what he does is he mocks me as I'm doing my intro, for the people who can't visualize because it's audio only. So he was in a bind because if he mocked me, then he was being predictable.
Starting point is 00:01:29 And if he didn't mock me, he was caving in to my challenge to not mock me. So it was a catch-22, as they say. Yeah, I tried to find a nice segue now to our topic today, because predictability has something to do actually with what we're going to talk about today, because I want to make sure that systems become, I think, more predictable as they start failing. Is that a good segue? Yeah, yeah, yeah. And you want to be able to understand what the past is showing you about the future, hence the predictions, yes. Yeah, exactly. And talking about predicting the future, learning from the past, learning from incidents, I think is one of the topics, is the main topic of today.
Starting point is 00:02:11 And we have a lot of incidents that happen on a regular basis. But today we want to talk about experiences that Laura Nolan has. And Laura, I think we've never met in person, unfortunately. But Laura, I saw your work online. I found you when I did some research on site reliability engineering. I saw you were speaking at SREcon and I'm pretty sure at many other conferences and online venues, and also on-site venues, back in the days when we were speaking at places, and then this strange incident happened for the last two years. But Laura, instead of me trying to figure out how to best introduce you, please do us the favor and introduce yourself to our audience. Hello. Well,
Starting point is 00:02:58 yes i'm laura i am i'm a software engineer um and i guess for long time, I've been very interested in failure, what it can tell us about systems, interested in what we can learn about systems, I guess, more generally, right? I mean, every system that we work with is different and unique in its own ways, but what can we learn from one system and how it fails about other systems? And it turns out that there's actually quite a lot. There's a whole system science that's been out there for the last number of years. So I think we see things. When we look at our software systems, our production systems, we very often see repeated types of failures, sometimes within the same
Starting point is 00:03:47 system and sometimes things that look similar that happen in other people's systems. So I think there's a lot that we can learn. This is not a new insight in software. This is something that has been done in other industries, most famously, I guess, the airline industry. I mean, they spend a lot of time learning from their incidents. Incidents can teach us a lot. So, yeah, that's something I'm very interested in. It's not the only thing I do, but it's certainly something I'm very interested in.
Starting point is 00:04:15 And, Laura, I think you don't do yourself justice when you talk a little bit about your history. I just have your LinkedIn profile open, and it's fascinating. If I see who you worked for and what you did at this company. You worked for Google as a site reliability engineer, as a staff site reliability engineer. You worked at a company that Brian and I rely on every day, 24-7. You worked at Slack. And I'm pretty sure you have your fair share
Starting point is 00:04:44 of stories. And I'm really looking forward to now have some conversations on what we can learn from incidents and kind of what kind of practices you had and what kind of culture you had back in these organizations. Right now, who do you work for right now? I think you switched jobs recently. I did. In the last few months, I've switched to work for a company called Stanza. Stanza is, I guess we're still fairly stealth mode in our startup. We're in early days.
Starting point is 00:05:11 We are baking our first product. Unsurprisingly, that product has definitely things to do with reliability and helping you prevent and recover from incidents. We're not an incident management software, but it's something in that space. All shall be unveiled in the coming months. So that's very interesting work. It gives me a chance to, I guess,
Starting point is 00:05:37 apply things that I've learned through the rest of my career into a product that we hope will be good for everybody in the industry. Awesome. So, yes. And you're correct. I have many stories. One of the things I used to do at Slack, it wasn't my main job,
Starting point is 00:05:51 but I fairly often used to write engineering blog posts about some of our incidents. I think I've got three or four that are up there on the Slack engineering blog. They're quite long and detailed. So those are worth a look if you're interested in incident stories. I'm not sure if there's a good timing,
Starting point is 00:06:12 but I think it kind of is a good timing because we at Dynatrace, we just had an incident last week that also forced us. We wrote a blog about it. We had, after an update of our single sign-on, our customers were no longer able to log into our systems, which obviously is an incident we're not proud of, but at least
Starting point is 00:06:34 we could fix the problem in almost no time. And just as you just said, on the engineering blog, it's like you were covering these incidents back then to really, A, I guess, show the world that you are open about this because failures can happen. I think that's also part of kind of a blameless culture that you can just admit that failures happen. And I think the more we share about what problems can happen in complex systems, the more we hopefully contribute to making the world a more resilient place
Starting point is 00:07:05 because others can learn from us. And therefore, it's not just from our own failures that we learn. My question to you would be, Laura, how can you convince an organization to actually go down that route? How can you actually, I mean, how can you be open about things that people typically are not that proud of if something bad happens? Yeah. Yeah, that's a great question. I guess when we're looking at incidents, there are two kind of strands.
Starting point is 00:07:40 Companies will do an internal incident review, something that's just done internally. Very often with a lot of depth. You might think about kind of organizational factors as well as technical factors, how people work together, all of these things. And then there's the public blog post. So this is a really interesting beast. John Allspaw has talked about this quite a lot. So he says that external and internal write-ups serve different purposes. So he thinks that the internal write-up is generally, hopefully, geared towards learning and actually improvement. Whereas very often the external blog post is an attempt to save face
Starting point is 00:08:16 or sort of make an extended apology or sort of convince your customers in some way that you're a responsible company trying to do the right things. I think John isn't wrong, but weirdly enough, I think that the most impactful external incident write-ups are actually the ones where people are not trying to make it an apology or trying to make themselves look good, but actually the ones where people are as honest as they possibly can be in that context. And of course, there's all sorts of complications. I mean, organizations are interested in protecting
Starting point is 00:08:52 their trade secrets. Organizations don't want to create legal trouble for themselves. So it's not easy to be completely sort of honest with your external incident reports. But I do think that it's possible to be open enough and detailed enough to actually, you know, to disseminate information through the industry. And I think there's a lot of times when we do see that. There's a lot of public incident reports that are out there and some of them are amazingly detailed. Some of them are amazingly useful.
Starting point is 00:09:24 And, you know, there are some times where, particularly in the case when you're dealing with open source software, you see the same things again and again. So I think that there is a lot that we can learn from doing these public posts, and I think that if they're done well, they can be really good for a company's image, particularly wherever you're dealing with customers who are likely to be technical, you know, to be in software, to sort of understand the fact that failure happens. There's no organization that's so perfect that you don't sometimes have a failure or have an incident. You know, as you said with your SSO, it's how fast can you respond, how fast can you actually recover from that, and can you learn from it and improve? You know, these are more important metrics than do you have incidents or not. So when it comes to that,
Starting point is 00:10:18 the absolute worst thing I think that you can do is you can kind of come across as self-congratulatory. A lot of people were, for example, looking at the big Atlassian outage. They had that incident, it was last year. They had a data migration that went wrong and they had a very extended recovery period. But a lot of people thought that their blog posts that they did about that when they described that incident and the recovery was sort of an extended advertisement. So one thing that I saw that rubbed people up really the wrong way was when Atlassian said, and we used our own tracker products to help us recover faster. People were upset about that because it had been such a protracted outage. So that's definitely something I think should be avoided. But it also reminds me a little bit, Brian, remember the days years ago, and Laura, I'm not sure
Starting point is 00:11:22 if you have followed this, but I would say 10 years ago, when we were analyzing website performance and availability of websites, especially around Thanksgiving and Black Friday and Cyber Monday. Super Bowl, yeah. So websites always went down. And I remember a time when companies were kind of proud of their website crashing, because it was kind of like, hey, we did such a great job in marketing that so many people tried to visit our site and our site crashed. So they kind of turned it into a... I think it's similar to what you just said, right?
Starting point is 00:11:52 Hey, we messed up, but we try to give it a positive spin. And I don't think that's necessarily a bad thing, trying to give it a positive spin, as long as you're still honest in the end and it's genuine. It's genuine, exactly, yeah. And a big shout out to, you know, our friend James Pulley. He was always leading the vanguard of destroying people who'd be like, oh yeah, our site was so popular, our advertising was so popular, our site went down. He would like flip out on that, you know, he'd just go apeshit over that.
Starting point is 00:12:27 But the other thing I think you mentioned, which is really, really important, is one thing, just even you being on the show, Andy and I doing the show, all the guests, all the talks people do, is the IT community is so keen on sharing information and putting stuff out there to help others. Whether it's lessons learned, ways we had success, ways we learned from our failures, and the specific way you're talking about when you're talking about increasing your credibility.
Starting point is 00:12:54 So long as you got yourself covered legally, right? Because with the public post, the legal part's always going to be a big thing, because there's people waiting to find, you know, find one little door to put their foot in to sue a company. But as long as you can cover that by additionally putting as much information and explanation of what happened, how it was figured out, how it was remediated, you're then sharing with the community once again for this really pretty embarrassing incident. That, hey, we're
Starting point is 00:13:26 this big company or whatever, and we fell down, but we're going to share it all. So not only did you fix the problem, get it back up and running, but you gave back to the community as a result, which is just going to turn it into such a positive for everybody. And there's no, except for the legal side, there's no good reason to not do that because it's just going to increase your credibility. Whether you're looking at company reputation or even big picture of recruiting, people are going to look at that company and be like, hey, that was really cool. I think next time I'm on the market, I might check that company out. I can't see a net bad about it, as you say,
Starting point is 00:14:04 especially the way our industry works. And I think there are very tangible good things that come out of it as well. For example, if I'm picking up a new tool or a new SaaS service to use, and it's going to be part of my critical production systems, I search for other people's incident reports that mention that particular product, because that gives you insight into the ways it fails and the way it works. That's very hard to get out of manuals. So you can get that information, you know, pitfalls to avoid good and bad ways to use particular tools and products. So that's directly valuable, particularly in a world where, you know,
Starting point is 00:14:51 software is converging on using, I think, a smaller and smaller number of fairly popular products and services and open source projects over the years. I mean, look at the standardization on things like Kubernetes, for example, and that's a community that's been very, very open about incidents and patterns that are evolving. And then that information can make itself into the product. The product can become hardened against frequently seen failure modes. It can make its way into training and really tangibly actually
Starting point is 00:15:21 You know, the key thing you said there too that struck me is, when you're reading the details of it, you understand how it was used or misused or whatever. That just goes back to a direct correlation to the idea of reading product reviews as opposed to just looking at the stars. You know, or if you're going on a vacation and you see, oh, it got one star. Why? Oh, because I didn't like the way they made my bed. Okay, that's a dumb one-star review. I'm going to ignore that one. As opposed to, I went in and the sheets were soiled when I checked in. Okay, that's legitimate. Or someone talking about a product that,
Starting point is 00:16:10 I bought a lighter and I tried to burn down the building next to me and it didn't light a big enough fire. All right, well, that's not quite what you'd want to use to start a fire like that. So maybe you bought the wrong product to begin with. So just the point being, I think context is key, and that's what you were really getting at. It's like, what was the context of those uses? And when people have a bad experience or a good experience, what's the context of that? Because that may or may not apply to what your plans are, and that's going to give you that
Starting point is 00:16:38 additional information. And then the company obviously, as you said, can turn around and be like, hey, actually we do want to fit into that context. Let's take that feedback and work with it. That's a really, really great point. Yeah, I'm rambling now, as I do. If podcasts aren't for rambling, then what are podcasts? Oh, sorry, go on. I was going to say, just an observation about context. I think you're exactly right.
Starting point is 00:17:06 And that's something that we can get really richly from a good incident write-up. You know, a lot of context, you know, for as long as I can go back and think of, we've had public bug reports and things like that on open source products. But a well-written incident review gives you a lot more of the story, you know,
Starting point is 00:17:24 not just, you know, we did this and we saw this bug, but, you know, what's the architecture? Why are you using it in this way? And what changed to sort of trigger a problem? You know, it's much richer than a bug report. Laura, speaking about good incident reports, are you aware of any templates, any best practices? Is there anything out there?
Starting point is 00:17:49 Because I assume maybe some of our listeners, they would like to start with this, especially in their company, kind of start first internal, then maybe public incident reviews. Is there any place to get some guidance around it? There absolutely are many places to get guidance. I don't agree with them all, actually, weirdly enough. So a really good one is this company called Jeli.io, and they have a really good guide called the Howie Incident Review Guide, so H-O-W-I-E. And that won't lead you wrong.
Starting point is 00:18:22 That one's pretty good. I think a lot of people and this is I gave a talk at SREcon 2022 just about this topic. I think when a lot of people sit down to write an incident report, they tend to sit down with a very, very detailed template. So you'll get standard incident review templates that will have a lot of it's like a big form. You've got a big document and you'll have, you know, start time of impact, end time of impact. What was the impact? Executive summary, you know, root cause.
Starting point is 00:18:54 And we should talk about root cause in a moment. Then, you know, what was the resolution? All these sorts of little snippets of information. And the problem here is people will sit down and they'll sort of approach it like you're filling in your tax form, like it's a chore, like it's something you want to get through, like you just need to get all those boxes filled in with something and get done with it. And I think that that's a real anti-pattern, because it leads to documents coming out at the end that have almost no value. I think a good, well-written incident report is something that you can put aside
Starting point is 00:19:29 And that only happens if there's enough context, enough richness. And that is something that having a form with 20 or 30 sections that you have to fill in sort of militates against. So I think that people should actually sit down and first, you know, think about it as a story. You know, where does the story start? Maybe it starts at the time that the incident starts,
Starting point is 00:20:03 or maybe it starts three years before when you're making a design decision that turned out to be pivotal in that. Or maybe it's six months beforehand when you decided to add a new feature, or maybe three months beforehand when you decided you were going to have a giant flash sale on this particular day and your system gets overwhelmed. So, you know, thinking about it as a story makes you think about the, you know, where to start, how much context to give. And a good incident report, I think, does need a lot of context about system architecture and the reasons why things are how they are. You know, these are all the things that,
Starting point is 00:20:40 you know, somebody picking that up without a lot of context in a year or two years is going to need to make sense of it. And it's a big investment to write those kinds of reports. And it's not only writing; you also need to go and talk to the people who are involved, get their stories, weave those into the narrative. Because if you have six different people involved in an incident, you're going to have six different perspectives, six different understandings of what happened.
Starting point is 00:21:11 And, you know, they're not all going to be the same. People have deeper understandings of particular subsets or particular parts of the system, particular parts of the thread of what happened in that incident. And you've got to put them all together like a jigsaw puzzle. What's more, I mean, sometimes you get people who have different ideas about what actually happened. That one is challenging.
Starting point is 00:21:35 I mean, whether or not there's one truth, I mean, there probably is one truth, but it isn't always possible to go back and actually tell in detail who was right about a particular sequence of actions because you may not have records of what happened. And frankly, a lot of incident response happens in people's brains. What's pivotal is what do people understand about the incident as it happened? What do people understand about the system? What do they know and what do they not know? What made them think that something might have been the cause and take a particular action? And, you know, we're humans, we're very complicated bags of meat and we have all sorts of cognitive biases. You know, if I'm responding to an incident
Starting point is 00:22:20 that's happening in a particular system, I might be biased by the fact that six months ago, I dealt with an incident that looked similar and had a particular cause. So I might be predisposed to go looking at that particular part of the system that caused the incident six months ago. But today, that might be something else entirely triggering these similar symptoms. So my experience might lead me astray. So people are going to have very different accounts of the incidents because of these different cognitive biases. So you've got to put all this together into some form that is useful, but also rich, also has a lot of context. And you and your organization are going to have to try and learn from that
Starting point is 00:23:05 because that's the most important thing that comes out of an incident. I mean, it's not wrong to take out action items, alerts to change, feature changes, or sort of resilience changes to make in the code base. That's not wrong. But the big thing that organizations miss more often than not is figuring out what can we actually take away from this about the strengths, weaknesses of our systems
Starting point is 00:23:34 and also our processes, how we deal with things, how we interact as an organization, how we share what we know. Fundamentally, when we work in technology organizations dealing with complex systems, one of the big difficulties that we have is that we have a lot of complexity and everyone knows their own different slice of it. And when things are so tightly coupled
Starting point is 00:24:00 Now, people cannot see me, but I was just staring at you and, like, listening and taking it all in while you were talking about this. I also found the talk that you mentioned. It was called "Break Free of the Template: Incident Write-ups They Want to Read." I think that's the talk, Laura, correct? Yeah.
Starting point is 00:24:43 So for everyone that is listening, we will add the link to this talk as well. It was SREcon EMEA in 2022. Laura, you mentioned earlier already we probably will talk about root cause as well. And I think, as you were explaining in your last explanation about how important writing good stories and good incident reports is, you mentioned root cause. We need to figure out, especially in complex systems, what is the real root cause. And the root cause for a problem that seems the same as one six months ago
Starting point is 00:25:21 might be completely different because the environment has changed, the people and how they react to it have changed. Brian and I are big on root cause. I mean, we're working for an observability platform where one of our goals is to make root cause detection easier. Now, do we always do a great job? I think we can always do better and we can learn. From your perspective, what you have seen,
Starting point is 00:25:46 what do people typically get wrong when it comes to root cause or where do they get lost? Or just some tips on figuring out the root cause. What are the things that you can tell us? What I will say is that a lot of people in the industry have moved away from thinking about a singular root cause to thinking about having multiple contributing factors that play a role in an incident. So I think something that a lot of people say about complex systems is that they tend to be heavily defended. And by that, I mean, we build a lot of stuff
Starting point is 00:26:25 to try and make our systems reliable. So we build in redundancy so that we can send traffic to different machines if one fails. We build in testing so that we have automated tests that run before we put a new release of code out.
Starting point is 00:26:42 We have canarying so that we can see if the metrics for our new release seem weird. We have load shedding. We have all these things. So we've got heavy defenses. So when something goes wrong, I guess the point about a single root cause is it's not normally just that one thing that went wrong. I'll give you an example. Years and years and years ago, while I was working at Google, I worked with this very, very big distributed database. And the distributed database was, you know, a typical sort of versioned database, you know, new pieces, new little packets of data were being pushed onto this thing all the time,
Starting point is 00:27:24 and they were sort of being merged with the existing data set. And there was a periodic kind of merge process that would take two smaller chunks of data and sort of smoosh them together. So trying to optimize the reads and the writes and all this kind of stuff. So for the client consistency part of that, what clients would do is they would make queries at a particular version of that data. So if you're a client and you are doing some, you know, large, complicated queries or doing like trying to do some point in time snapshot of something, we had a lot of clients like that. This was sort of a statistics database. You would take the current kind of data version at the start, and then you would do your big complicated query looking at that data version throughout. Okay, so that's fine. One day, we started getting complaints from people that they were getting stale data in their queries. They weren't seeing what they believed to be the most up to date data that should be
Starting point is 00:28:27 in the system, and this took way too long to diagnose. But it turned out that there was basically a place where the database was publishing the most recent data version so that clients could read it and do their query against the most recent version, and this process that pushed that out had gotten broken somehow. And okay, that's a root cause, right? But there are more root causes than that. So first off, why didn't we get an alert about that? We had built an alert years beforehand, but that alert somehow broke, and so we didn't get an alert. That's the second root cause, or contributing factor as most people would say. Thirdly, why didn't any of us think to go and look at this particular mechanism? Because none of us had heard
Starting point is 00:29:21 of it. I mean, logically this thing had to exist, and when you sat down to think about how the whole thing worked, yes. But this wasn't in any of the team's training that we had. We had never done a Wheel of Misfortune exercise about this thing or any practical DiRT tests. This thing had been there for years, and we had just forgotten about it. And then one day, it silently broke, and the alert that we'd put in years before failed, and we had to figure this out from first principles. So there's a whole bunch more contributing factors that went into this incident.
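To make the stale-version failure mode Laura describes a bit more concrete, here is a minimal, hypothetical sketch in Python (a toy, not the actual Google system; all names and values are invented): a separate publish step advertises the latest queryable version, and when that step silently stops running, clients that pin their queries to the advertised version keep reading old data even though writes are still succeeding.

```python
class VersionedStore:
    """Toy stand-in for a versioned statistics database. All names are invented."""

    def __init__(self):
        self.data = {}              # version -> {key: value}
        self.published_version = 0  # what clients read to pin their queries

    def write(self, version, records):
        # New packets of data are merged with the previous version's data set.
        previous = max(self.data) if self.data else 0
        self.data[version] = {**self.data.get(previous, {}), **records}

    def publish(self, version):
        # The side process that advertises the latest queryable version.
        # If this step silently stops running, clients keep querying old data.
        self.published_version = version

    def query(self, key, at_version):
        # Clients pin the whole query to one version for consistency.
        return self.data.get(at_version, {}).get(key)


store = VersionedStore()
store.write(1, {"clicks": 100})
store.publish(1)
store.write(2, {"clicks": 250})  # the merge succeeded...
# store.publish(2)               # ...but the publish step broke, and no alert fired

pinned = store.published_version      # client reads the advertised version: still 1
print(store.query("clicks", pinned))  # 100 -> the "stale data" complaints begin
```

The sketch also shows why the bug was hard to see from the outside: every individual read and write still works, so only a freshness check on the published version (or an alert on it) would have caught it.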
Starting point is 00:29:53 So if you're going to talk about, you know, what was the sort of proximate technical cause of something not working? Sure, maybe there is one root cause, but when we think about the broader systems of, you know, how do we, how are we enforcing the invariants in our systems, which is monitoring and alerting, right? We alert when we think that some property of our system is not as it should be. Most incidents involve some aspects of, you know, alerting not being the way it should be. Extended incidents like that one typically involve some case, some reason where the responder's mental model is not actually in sync with the real system. And people having trouble figuring out what is actually happening and what should be happening and what is the difference between those two things. So there's a whole bunch of things that go into any sufficiently complicated or serious incident that it's very, very hard to reduce it down to just,
Starting point is 00:30:54 well, this line of code here or this permission here. And I think what most people who argue against a single root cause say is when you stop at one root cause, you're losing information. There's all this other stuff around that single root cause that contributed to the incident as well. And if we stop because we found the root cause, we don't think about, well, you know, why is it that that wasn't in our team training? And why is it that we never did a Wheel of Misfortune exercise about this thing? And those are questions that we should be asking. So it's the same with the five whys. I mean, you know, why only five?
Starting point is 00:31:35 Why not 10 whys? Why not three whys? So, and, you know, even with the sort of broader contributing factors approach, you can certainly miss things. But it just encourages you to take that sort of broader look at what happened, and, you know, at the sort of broader socio-technical systems, which means that the meaty humans that look after the computers are equally important. Again, a fascinating thought for me, because you mentioned the alert that you set up years ago, but it was never, I guess, updated, tested, validated, and therefore, you know, it contributed to the alert obviously not going off and alerting correctly. How can we fix this? Because I've
Starting point is 00:32:25 heard this a lot from people we talked to over the years. They're setting up alerts, an alert here and an alert there; they get forgotten, they never really get tested and re-evaluated. Brian, I think we had a conversation a year or two ago with Ana Medina. She was back then working for Gremlin as a chaos engineer. And we talked about test-driven operations. Basically, really using kind of chaos engineering also to continuously validate that the alerts that you've set up are still working, right?
Starting point is 00:33:01 By bringing the system artificially into a chaotic situation. And basically these alerts, whether it's alerts, thresholds, SLOs, remediation books, they should all be considered part of your code. Because essentially most of them, they are code, especially now as we're doing everything as code anyway, also your monitoring configuration, your alerting. So why don't we get better at also including this in our testing?
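As a minimal sketch of the alerting-as-code idea in this exchange, assuming a hypothetical in-house rule format rather than any particular vendor's API: alert definitions live in version control next to the service code, and a cheap unit test injects obviously bad values to check that every rule would actually fire. The second rule below is deliberately misconfigured (inverted comparison) to show the kind of stale or broken alert such a test catches.

```python
from dataclasses import dataclass

@dataclass
class AlertRule:
    """A hypothetical alert definition kept in version control with the service code."""
    name: str
    metric: str
    threshold: float
    comparison: str  # "above" or "below"

    def fires(self, value: float) -> bool:
        return value > self.threshold if self.comparison == "above" else value < self.threshold

# Alerting configuration as code: reviewable, diffable, and testable.
RULES = [
    AlertRule("checkout_error_rate_high", "checkout.error_rate", 0.01, "above"),
    AlertRule("login_latency_p95_high", "login.latency_p95_ms", 1500, "below"),  # bug: inverted comparison
]

def test_rules_fire_on_injected_bad_values():
    """A cheap 'wheel of misfortune' for alert definitions: feed each rule a value
    that clearly violates its intent and assert that it would actually page someone."""
    bad_samples = {
        "checkout.error_rate": 0.35,     # 35% errors should always page
        "login.latency_p95_ms": 9000.0,  # a 9-second p95 should always page
    }
    for rule in RULES:
        assert rule.fires(bad_samples[rule.metric]), f"{rule.name} would not fire"
```

Run under a test runner such as pytest, this fails on the second rule, which is exactly the kind of silently broken alert Laura's story describes; the same idea extends to checking that every metric an alert references still produces data at all.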
Starting point is 00:33:32 Is there a way we should also measure kind of test coverage of our alerts, whether our rules are still working, if this even makes sense, right? Because if you configure the alerts over years and years, maybe 50% of them don't make sense anymore, because these measures, these metrics that you defined them on, they are either no longer producing data
Starting point is 00:33:53 or have completely different value ranges. Absolutely. I was just going to add to that, though. You talk about configuring all these different alerts and everything, and I guess keeping in the spirit of SRE, shouldn't we instead be focusing on the end goal instead of alerting on subsystems? So if your alerts, let's call them, are based on availability, based on response time,
Starting point is 00:34:29 based on responsiveness and how much they're failing, basically based on whoever the consumer is, whether it's another system or an actual user, a meatbag, as you say, if that's what your alerts are based on, those are going to be a lot more consistent and valid over time, unless you want to change, instead of three seconds, we're down to two seconds now. But as opposed to monitoring CPU or memory consumption or all these other things which are going to be flipping around, if you have an end-user, end-consumer-focused set of alerts, if you have observability set up in the entire back end then,
Starting point is 00:35:11 so that when one of these do trigger, you can go through and look at everything else and find out what's going on, wouldn't that be a safer, more bulletproof, if you will, way to set that up? But I'm asking you specifically, Laura, because you come from this SRE background, right? Because this is not in my wheelhouse.
Starting point is 00:35:27 This is just my understanding of this. I'm really interested on your take on all those old system alerts versus more of an SRE approach, at least the SRE way I understand it, which could be wrong. You are not even slightly wrong about that. So what this is,
Starting point is 00:35:40 is the difference between symptom-based or SLO-based alerting, if you're using SLOs, same idea, versus your traditional sort of alerts that might be based on causes for system problems, like, as you say, high CPU, that sort of thing. There's a link. If you're putting links on your podcast, there's one I'll send you. Rob Ewaschuk is a person
Starting point is 00:36:05 who's written a really good document on this. It's very long. It's very detailed. What I would say is symptom-based alerts are good, but you have to take a broad view of what your users are expecting from your systems. So in this case, that particular story I was talking about,
Starting point is 00:36:22 having a fresh, recently published, up-to-date data version published in this particular place was actually part of the system's contract with its users. So it's a valid thing to monitor about. It's not the same as monitoring for CPU and so forth. Other things that people typically want to monitor on, things like the golden signals, you know, are you serving requests without an excessive percentage of errors,
Starting point is 00:36:53 without excessive latency, that kind of thing. So those are golden signal type alerts, as you say, as opposed to things like CPU and so forth. The reason that we don't want to typically monitor on CPU memory, that kind of thing, is because, as you say, thresholds change over time. Those alerts can get very stale. You can end up with either alerts that don't work, that don't reflect user experience, or you can end up with alerts that are noisy, that are constantly paging your on-calls. And both of those are bad. And the worst case is you can have both of those at once, which is really, really bad. So most people, I mean, there's definitely a strong trend
Starting point is 00:37:34 towards the symptom-based alerting model. The trick here is you have to be clear about what your system is providing. It's not always just RPC errors and latency. Freshness is something that people do frequently overlook, but it can be very important for a lot of systems. Correctness more generally, if that makes sense. Yeah. And I think, Brian, in some of the last episodes, we talked exactly about this, like using SLOs on kind of the system boundaries, like the system boundaries, meaning as close as possible to the end consumer, or if you have critical core systems, like an authentication system,
Starting point is 00:38:19 we're depending on. The problem is, though, when we never re-evaluate that, how do we measure the resiliency of that downstream system, right? If we never re-evaluate the thresholds we set on when we flip over the traffic, when we scale up, when we scale down, when we maybe, you know, fall back to a cache versus going all the way through the database.
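Tying the last few exchanges together, here is a minimal sketch of symptom-based, SLO-style alerting, with hypothetical numbers and a simplified single-window burn-rate check (real setups usually combine several windows): the alert keys off the user-visible error ratio at the service boundary rather than off CPU or memory.

```python
def error_budget_burn_rate(error_ratio: float, slo_target: float) -> float:
    """How fast the error budget is being consumed.
    1.0 means exactly on budget; above 1.0 means burning faster than the SLO allows."""
    allowed_error_ratio = 1.0 - slo_target  # e.g. 0.1% for a 99.9% SLO
    return error_ratio / allowed_error_ratio

def should_page(bad_requests: int, total_requests: int,
                slo_target: float = 0.999, burn_threshold: float = 14.4) -> bool:
    """Symptom-based alert: page when the user-visible error ratio over the observation
    window burns the budget far faster than the SLO allows. The 14.4x fast-burn
    threshold is a commonly cited example value, used here only as a placeholder."""
    if total_requests == 0:
        return False
    error_ratio = bad_requests / total_requests
    return error_budget_burn_rate(error_ratio, slo_target) >= burn_threshold

# A healthy hour: 50 failures out of a million requests -> no page.
print(should_page(bad_requests=50, total_requests=1_000_000))      # False
# An unhealthy hour: 2% of requests failing -> page, whatever the CPU graphs say.
print(should_page(bad_requests=20_000, total_requests=1_000_000))  # True
```

The same structure works for the freshness and correctness signals Laura mentions: swap the error ratio for, say, the fraction of responses served from data older than some bound, and alert on that instead.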
Starting point is 00:39:10 I think if we then never re-evaluate it, then we could actually end up in a situation where we should have been alerted much earlier, where the kind of self-healing resiliency should have kicked in automatically, but it didn't because we never revisited all of these settings, these self-healing things. And therefore, all of a sudden, boom, everything fails because we just never re-evaluated them. That becomes tricky too, right? Because where do you store all that information? How do you track it all? How do you even know what it's for?
Starting point is 00:39:35 Like the complexity of code, the complexity of the interdependent systems. If you think about when you went from monoliths to microservices, trying to keep track of what service is talking to what service to what service to what service, same thing goes into these thresholds and everything else. Like, keeping track of that stuff is going to be very, very difficult. Sorry, Laura, you were going to contribute something there too. I didn't mean to.
Starting point is 00:39:56 I've got a couple of comments. So first off, you're right. There is a lot of information to track there. There's a whole sort of emerging set of tools for doing that. Spotify has a service catalog product, which I have entirely forgotten the name of. Backstage. Yes, Backstage. Thank you very much. And there's a whole bunch of, I mean, nearly every sort of observability tool that you see is starting to bring in ideas now of showing your graph of service calls, all this kind
Starting point is 00:40:24 of stuff. So the tooling is starting to creep in to make this easier for you. But regarding SLOs and boundaries, there's an interesting thing there. When you divide up an organization really rigidly and you say, okay, you own backend service X and you own a different backend service Y and X calls Y. There's a tendency for service X, depending on the organization and its culture and everything like that. But I've seen situations where the people running service X say, well, service Y has a five nines SLO. So we expect it to give five nines and to never be down effectively. And what happens then is they stop thinking about, well, what can we do to be more reliable if service Y goes down? Is there a way that we can use cache data or is there a way that we can do graceful
Starting point is 00:41:19 degradation and still do useful work? And sometimes when they say, okay, well, it has a five nines SLO, we're good, they skip over that part. So when you think about the system overall being resilient or robust, you can't think about it just in terms of SLO math, like X calls Y, and, you know, these have high SLOs, so the overall system will have a high SLO. You've got to think about how can each service be more forgiving to SLO transgressions from other services because they will happen. And, you know, having thought about that and having thought about how we can be more resilient in the face of failure is the way to get to the most reliable possible overall system, not just the SLO.
Starting point is 00:42:05 So I think that's a pitfall that a lot of organizations do fall into and you can do better to... SLOs are a great tool, but they shouldn't be this rigid organizational boundary where it's just this expectation and no sort of further thinking about resilience. It kind of reminded me of an analogy, right? We are taking electricity for granted.
Starting point is 00:42:29 Why would I even start buying candles? Other than for romantic reasons, obviously. But I'm just saying, right? But now we all of a sudden see that, at least here in Europe, we don't know. We need to get prepared for a situation where maybe electricity is not there for a couple of hours or even longer. But I guess we never thought about it because it was just always there. I can turn on the light
Starting point is 00:42:56 by clicking a switch. I think, Laura, that the point you made, it sounds slightly familiar. I don't know if it came up in the past or not, but if it did, I think we completely forgot about it. So I think it's a fantastic reminder. And what you talked about was, yes, you have your SLO. But the level of forgiveness that the different services can have for other downstream services, right? If you have your first service below the browser or below the web server, right, how forgiving can that be if some downstream service is possibly starting to cause an impact to
Starting point is 00:43:37 the SLO? Is there some sort of A/B, some canary, some sort of switch that can go into effect that can then make up for it? Or if you know something else can be a problem, what can you do on these upstream ones to be more forgiving of it? And I think the danger people might fall into with SLOs is they convert over, they get a bunch of SLOs set up, and they pat themselves on the back. But the SLOs are just a measure. Still not doing anything for your system.
Starting point is 00:44:09 You're monitoring it better. You're looking at it from a better context, but you haven't made any improvements. Yeah, exactly. And that's the part that's important. You can have very, very shallow implementations of SLO where you just say, okay, well, here's some numbers, here's some metrics, job's good. But you haven't actually done that work to think about, well, what actually do my users expect from this system? So you can miss things like my client version data file from the previous example. You can miss things like freshness requirements. You can miss a whole
Starting point is 00:44:42 host of things. And as I said, you lose that opportunity to be more forgiving. Because sometimes services do have options rather than just failing because one of their downstream services is failing. You may have the option to serve something static, serve something stale, omit a piece of data from your results. In the worst case, you should at least fail fast and let your user know that you have failed. Whereas it's not uncommon to see services that will try and call that failing backend service, wait indefinitely, and then suddenly, you know,
Starting point is 00:45:16 your service has used up some resource, be it memory, be it connections, be it threads, be it something else. And, you know, now your whole system is down and needs a reboot. That's the worst of all worlds. And you can put in all the SLOs in the world, but if you don't do that work to avoid those scenarios, you're not making your system more robust.
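A minimal sketch of the forgiving-caller pattern Laura describes here, with hypothetical names and timeout values: call the downstream service with a hard deadline, serve recent-enough cached data when it fails, and otherwise fail fast instead of letting requests pile up on a dead backend.

```python
import time
from typing import Callable, Optional

class StaleCache:
    """Keeps the last good response so the caller can degrade gracefully."""
    def __init__(self):
        self._value: Optional[dict] = None
        self._stored_at: float = 0.0

    def put(self, value: dict) -> None:
        self._value, self._stored_at = value, time.monotonic()

    def get(self, max_age_s: float) -> Optional[dict]:
        if self._value is not None and time.monotonic() - self._stored_at <= max_age_s:
            return self._value
        return None

def fetch_with_degradation(call_downstream: Callable[[float], dict],
                           cache: StaleCache,
                           timeout_s: float = 0.5,
                           max_stale_s: float = 300.0) -> dict:
    """Call the downstream service with a hard deadline; on failure, serve
    recent-enough cached data; otherwise fail fast rather than hanging."""
    try:
        fresh = call_downstream(timeout_s)  # the downstream client is expected to honour the deadline
        cache.put(fresh)
        return {"data": fresh, "degraded": False}
    except Exception:
        stale = cache.get(max_stale_s)
        if stale is not None:
            # Serve something slightly old and tell the caller it is degraded.
            return {"data": stale, "degraded": True}
        # Fail fast: a quick error beats a thread stuck waiting on a dead backend.
        raise
```

The important part is not the caching itself but the decision Laura points at: the caller has thought in advance about what it will do when the downstream service misses its SLO, instead of assuming five nines and waiting forever.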
Starting point is 00:45:38 Hey, Laura, I hate to say we're getting close to the end of our recording, and it's just fascinating to listen to you and hear from your experience. I took a lot of notes that I will convert into a little description of the blog. You also mentioned that you have a couple of links maybe that you want to share. I tried to do some Googling on the site with some of the names you gave me, but it would be great if you could send them over. So everybody that is listening, if you want to follow up on some of the blogs and some of the presentations that Laura has done,
Starting point is 00:46:10 find them in the summary of the podcast. But I don't want to just cut you short and say that's it. I want to still give you, especially, oh, thank you for that. Yeah, perfect. So we'll add that link. She was just sending something over to me. Laura, is there
Starting point is 00:46:27 a final thought, anything we want to make sure that our listeners take away from this podcast? I think the big thing that I think people should take away is that in a healthy organization, an organization that's learning,
Starting point is 00:46:44 that's growing, it should be okay to take the time to properly analyze incidents, particularly unusual ones, ones that are not well understood, ones that were complicated, that involved a lot of different systems factors coming together. A healthy organization should be able to take that time to do that analysis and really do that learning. If your organization is pressuring you to, you know, come up with a detailed incident review in three days, you know, that's not healthy. So if you're an engineering leader, you know, think about giving people that time and space and think about finding ways to reward
Starting point is 00:47:22 people who are, you know, doing that work to keep your organization healthy and learning. And, you know, if you're an engineer, you know, be curious and, you know, try and learn. That's all we can do every day. Awesome. And the link is called My Philosophy on Learning from Rob Ewaschuk. I hope I kind of pronounced his name correctly. Anyway, the link is a link to a Google Doc. It's called My Philosophy on Alerting. We'll share this as well.
Starting point is 00:47:49 It's called My Philosophy on Alerting. We'll share this as well. Brian, anything from you? No, I would just say, Laura, I know, well, first of all, thank you. This has been tremendous. As always, Andy and I love this because we get to learn so much as well.
Starting point is 00:48:04 But I am curious when your new company decides to be more public about what it is they're doing. I'd love to maybe convene again because obviously you all see a problem out there and you're all coming up with a possible solution. And based on this conversation, I think that would be really fascinating to hear more details about what the problem is and what you guys are looking to do to solve that in the industry. So definitely keep in touch with us because that might be another great show to follow up on. Not so much for product placement really,
Starting point is 00:48:35 but really discussing what it is you're looking to solve. Like, what have you all identified? Because I'm sure everybody can benefit from it, hopefully from your company, but others as well. But really, that's it. I think this has been fantastic, and thank you so much for taking the time. Thank you for having me. I'm glad that all of our electricity stayed on for the hour. Yeah, that's right. All right. Then, Brian, I want to say
Starting point is 00:49:05 thanks to you. This was show number one in 2023, after five years. Oh yeah, that's right. But the first
Starting point is 00:49:14 recording in 2023, that's right, because we recorded the first show in 2022. Very much looking forward to many more episodes
Starting point is 00:49:21 with you, and getting close to 200. Closer, getting closer to 200. Closer, I think. Getting closer to 200. I think in the upper 170s. Anyway, it's been awesome. Thank you so much.
Starting point is 00:49:30 Thanks to everyone for listening. And thank you again, Laura, for coming on. Bye-bye, everyone. Bye.
