PurePerformance - Run Towards the Fire: Why we should love incidents with Lisa Karlin Curtis
Episode Date: April 28, 2025

Do you plan for incidents? Do you have a time / cost budget for them in your sprint or quarterly planning? Do you have engineers that are "interruptible"? We discussed those and more questions with Lisa Karlin Curtis, Founding Engineer at incident.io, who teaches us why we need to think differently about dealing with incidents!

In our discussion we learn why modern incident management embraces more incidents that are publicly shared within an organization to foster learning. We learn how to train more people to become incident responders, how to triage and categorize incidents, how to better plan for them, and how to best report on them. We also touch on AI, and how AI-generated code will eventually result in more incidents, which we should use as an opportunity to learn and improve our engineering process.

P.S.: This was our 10th-anniversary podcast episode!!

Here are the links we discussed in the podcast:
Lisa's LinkedIn: https://www.linkedin.com/in/lisa-karlin-curtis-a4563920/
Her talk at ELC Prague: https://docs.google.com/presentation/d/18536WBHBcPEppEeXXP7o5UQOX2XfWoGmfds2CHegHq4/edit?slide=id.g3434e0cba65_0_0#slide=id.g3434e0cba65_0_0
Incident Playbook: https://incident.io/guide
Transcript
It's time for Pure Performance!
Get your stopwatches ready, it's time for Pure Performance with Andy Grabner and Brian Wilson.
Hello everybody and welcome to another episode of Pure Performance. My name is Brian Wilson and with me I have a smiling and fire wielding Andy Grabner today.
How are you doing Andy?
I'm good.
I just don't want to play with the fire anymore because it looks funny, it makes funny noises,
it also makes fire, but it also smells really strange when I light it.
So I better put it away.
That's always a good rule: don't play with fire unless you want to get burned.
Well, in this case I want to run away from the fire and not towards the fire.
Sometimes you want to run towards the fire, right?
At least it seems there are people out there who do, and actually one of them happens
to be with us.
Oh yeah, there's s'mores there.
Oh yeah, that's a good point. Yeah. Not sure we'll talk about s'mores today, but we'll definitely talk about running towards the fire, embracing incidents because this was a talk I happened to see at a conference in Prague. I was just in Prague recently at the Engineering Leaders Conference.
And I was on stage and I think a little bit before me, Lisa, if I'm not mistaken, you were on stage
and talked about running towards the fire, embracing incidents. And now without further ado,
Lisa, thank you so much for being on the show. Can you please do me the favor and just introduce
yourself to our listeners? Like who you are, what's your background?
Yeah, of course. Thanks so much for having me.
So I'm Lisa Karlin Curtis.
I was one of the founding engineers at Incident.io.
So we build incident management software,
so that is software that helps you respond.
It also includes status pages and on-call,
so we'll have the joyful job of waking you up at 2 o'clock in the morning
when you don't want to be woken up,
which I'm sure will be delightful. And yeah, so I've
been here for about four years and, you know, thinking about
incidents for basically all of that time, talking to a lot of
customers about incidents and kind of, I don't know, I've
always been very interested in things like safety theory and
understanding how you build resilient teams, how do you get to a place where incident response is something that you feel comfortable with,
that you feel that you get value from as a company.
It's been super fun to be able to work on that problem for the last few years as we grow the team.
It was great to meet Andy in Prague and very happy to have a chance to chat about this stuff.
Yeah, big shout out to Marianne and his team for putting together this conference, the ELC, the Engineering Leadership Conference. For me it was the first time hearing about their community. It was fantastic having 300 engineering leaders in the room for one day. And I also really liked that they had a main track, right, where we both spoke, and I believe, did you also do a workshop?
Yeah, and I think you did as well, right? Where you get a chance to chat to like 25 people in a room.
It was just a really nice format for getting to have some more in-depth conversations and
do a Q&A that's a little bit less awkward than your sort of Q&A in front of 300 people where
nothing really comes up. Yeah, I also really liked their mentoring sessions. So because it was an
engineering leadership conference, you could sign up as a mentee and also as a mentor to have,
I'm not sure how long these sessions were,
maybe 15, 30 minutes, I don't know,
but I thought it was really, it was a great conference.
Sounds like a great conference, yeah.
Yeah.
I will actually be back next week in Prague as a follow-up on some of the conversations I had.
But now, back to your talk.
Lisa, what struck me in your talk, you were embracing incidents.
Kind of like, hey, you know, make them visible, make people
participate in incidents because we can learn a lot.
I don't want to tell it basically in your words,
but I would like you to give me or give us a little bit of an overview of what are some of the
maybe not so common approaches towards incident management that you were promoting.
Yeah, sure. So I think that for most people, they think of incidents as being like a cost or a tax.
So it's like, we have some code, it's in production, sometimes it goes wrong, that's sad.
So what we're going to try and do is we're going to try and reduce that tax as much as possible.
So that means we're going to have as few people as possible.
Probably we're going to have the same people because they know what they're doing and they can deal with it quickly.
And then everybody can kind of go back to doing what they're supposed to be doing.
And I think there are a few different problems with that approach.
So the first and obvious one is that you have a churn risk.
So if you're a small company, probably your first three engineers who look a bit like me, handle all the incidents.
Once you get to like, I've seen 100 person companies
where fundamentally, all the incidents are managed by maybe
three to five people. If one of those people is on holiday, or
God forbid, as we do, you know, those three people will go to
San Sebastian together on a holiday for a weekend, and
something goes wrong in that time, then you're in real
trouble. And so you also want to make sure that like you are
growing the number of people who are able to do that stuff
And the only way that's going to happen is if people practice. No one has ever got better at incident response by, like, sitting at their laptop and staring at it really, really intently and deciding they're going to get better. The way you get better is by doing it, and that's the same for everybody.
And so part of it is about this kind of organizational resilience, making sure that you can
handle like new things. The other problem is that as
your organization scales, you'll get more incidents,
and eventually those people who can handle incidents,
that becomes their entire job. And they stop being
able to add value to your business in other ways. And
because incident response is pretty highly correlated
with lots of other things, the likelihood is that
they're also very, very good elsewhere. So if those people are spending their entire lives in incidents, they're not adding that value anywhere else.
The other side of this is about the way that people write code and build software. So when you're building something and designing it, you make a load of choices
kind of explicitly and implicitly about what's your data model going to be?
How are you going to implement it? What's your architecture, if you like?
How are you going to add observability to it? What can you see? What can't you see?
What are your expected failure modes? And all of that, if you do that well,
you then reduce the burden of incidents.
If you do that badly, you create many more incidents.
And for me, the best way of getting really good at that
is also by being in incidents,
because that's what gives you the scars and the burned fingers,
so that you can sort of look at a design or look at an idea and say,
oh, I actually think that's going to go wrong in this way, because I saw that happen that one time, and I think we can de-risk it by doing x, y, z. And actually, I think it's one of the key
differentiators between more junior and more senior engineers
is like, is that first design kind of about right, it will
never be perfect, there will always be things that go wrong.
But there's a big difference between the sort of you know, v
zero that somebody without any experience would come up with
versus somebody who's kind of been there and done it and knows
the common ways in which these kind of systems fail.
And I think that second piece, I guess, is what maybe we talked about a little bit more in the talk: you can learn all of those skills by being in an incident.
But in order to do that, you probably have to bring people into incidents
who on day one are not going to be adding that much value to the response. So you'll bring people along for the ride. You're showing your working, you're doing that stuff in public, and by doing so, all the people around you are going to be leveling up. And all of a sudden, rather than these five people being sat there, you know, being the martyrs and desperately trying to keep everything running, instead they start to be able to lean on people, because they've seen it two or three times and they can actually say, oh, actually, could you roll out this fix? I'm comfortable. I can
now kind of get back to my dinner or get back to my planning meeting. And I think it makes a really
big mindset shift if you start thinking about something that you want to bring the team into
rather than something where you're constantly trying to protect the team.
So many questions.
I keep notes.
Even before questions, I would just say I would love for you to give that speech in
front of a bunch of people and see if there's any one person who would say, no, I disagree.
Because I think everything you said there was so extremely sensible.
It's like you can't even argue with what you said. But I'm sure
Andy will-
I think that was spicy.
Oh, no. I mean, it wasn't even spicy. It's just like it's so common sense and you see
it in so many other industries that it's like, of course, we probably don't think about it.
But when you lay it out like that, it would be amazing, amazing in a depressing way to find somebody
who would resist what you're saying there. I'm sure Andy will come up with some questions
that will. No, I know Andy.
My first question is what happened in San Sebastian?
We just ate loads of food. It was great. The food there is so good for anyone, particularly
if you're European based, I would so recommend spending a weekend there, particularly if you eat meat. The meat is just unbelievable.
And luckily we have great incident management software and practices at incident.io. That meant that we could all go to San Sebastian and it wasn't a terrifying idea that we were all on a plane at the same time.
So I got a couple of, I wrote a couple of notes.
You said it's a big problem if you end up having people whose sole role and purpose in the company becomes incident management or fighting incidents. I remember you said something like staring at incidents. It reminded me of the movie, guys, what was it called? The Men Who Stare at Goats. It's kind of like people that stare at incidents. And obviously they will just not go away if you stare at them. But for
me, the much bigger thing is if there's only a small number of people that are dealing with incidents and that's everything that they do, there's two things, right? A, they're not adding value to the organization, and B, they become your biggest risk in your organization, because if they are gone, then nobody can deal with these incidents anymore.
you would be my role as a host here on the podcast. I also shared the deck with him.
Lisa, if you're good with it, can we share the deck with the people here?
Folks, if you're interested in the talk, the slides are in the podcast notes.
But as you talk about embracing incidents: who are the right people to then really jump in? How do you prioritize and focus on the ones that are really important? Do
you have any recommendations on how to evaluate that?
Yeah, I think I'm going to steal an analogy that Brian was chatting about just before we
went on air, which was talking about like a medical context where you have like surgeries. And so obviously,
like if you have a hospital, and you have, you know, three
surgeons that can do a particular surgery, and then you
have 40 other surgeons, but they can't do the really hard stuff,
then you've obviously got a problem because at some point,
those three surgeons are probably going to retire or you
know, one of them might be on holiday. And so what you do is
you plan in making sure that the people that you've got
that are kind of next on the roster
to learn that particular specialized surgery,
you would say, right,
well, we want to make sure that you do like six watching,
then maybe you do six like participating with supervision,
and then you're sort of accredited
to go and do that surgery in future.
So you lay out like a plan.
The problem with incidents is that unlike surgeries
they are like unplanned and as you say you have these kind of big moments where
you have lots going on and then you might have kind of periods of time where
you have very little. But that is also true of surgery, right? Like in surgery you also have a roster of people and some surgeries are planned and some are
not. And so I think it's kind of the same approach where, first of all, you have to look at your incidents and go, you
know, what is the most important thing and what can wait and
that is like a triage process. And ideally, that probably
means you need to like categorize your incidents, you
know, if you've got 100, one person cannot load up into their
kind of RAM 100 incidents. So you have to come up with a way
of like segmenting them by product, category, severity, whatever it is. Once you've got those
segments, then you want you know, people to take each of the
segments, start to prioritize them by what's the impact,
whether that's impact on your customers impact internally. And
then you can start to look at your roster of like, who are we
onboarding? Like, who are the people that we're currently
investing in to try and teach them these skills? And maybe
that's a buddy system.
So maybe you just have like, cool, you know, Brian, you're on incidents.
So every time Lisa is in an incident, you jump in as well.
And you sidecar Lisa for a while.
Probably wouldn't recommend that long term because you'll probably learn more from working
with a few different people than from just one.
So then you start to say, right, well maybe I've got like a pool of people
and I pull someone from that pool
or I offer it into Slack and I'm like,
hey, I could do with an extra pair of hands here
as anybody who's onboarding coming in.
And I think you can treat it a bit similarly to that
or maybe the kind of interview training
or interview onboarding,
which is something that kind of everybody has a process for.
You have a pool of people that are kind of like ramping
and you try and actively pull them in.
There are going to be moments where the world is on fire
and training is just not your priority.
And that's fine.
As long as you're not in that mode
for months and months on end.
Like sometimes there will be a really bad day
where four or five different things have gone wrong
and you're just like, no, no,
this is not the day that I'm going to take any risks.
This is not the day that I'm going to be thinking
about someone else's growth. But in general, you can imagine that you have a checklist, in the same way, of what are the things that we want you to have experienced before we'd be comfortable with you getting onto that on-call rota.
Now, that might also include things like drills and game days.
So, there will be some scenarios that you can't just rely on happening organically
because hopefully they don't happen very often.
So, we don't have very many data leaks, thank goodness, touch wood, right?
But I would want every engineer who goes on call to know how to handle a data leak.
And that's maybe something that we would do a drill or a tabletop exercise to give them
that experience because they're not going to get it in real life.
The problem with a drill and a tabletop exercise is that you've got no adrenaline because there's
no risk.
And so you also need people to learn how do you deal with adrenaline?
How do you deal with being paged at two o'clock in the morning?
That's a very unnatural thing to ask a human to do, to be able to switch on that quickly
from deep sleep. And so those are all the kinds of things that you want to be practicing with people.
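To make the triage step Lisa describes concrete (segment incidents by category and severity, then prioritize each segment by impact before deciding who to pull in), here is a minimal sketch in Python. The category names, the severity scale, and the scoring weights are illustrative assumptions, not anything prescribed by incident.io or by Lisa.

```python
# Minimal sketch: segment a backlog of incidents by category, then rank each
# segment by impact, most urgent first. Categories, severities, and the impact
# scoring below are illustrative assumptions.
from collections import defaultdict
from dataclasses import dataclass

@dataclass
class Incident:
    id: str
    category: str           # e.g. "payments", "ci", "support-queue"
    severity: int            # 1 = critical ... 4 = minor (assumed scale)
    customers_affected: int

def impact_score(incident: Incident) -> float:
    """Higher score = deal with it sooner. Weights are arbitrary for the sketch."""
    severity_weight = {1: 100, 2: 30, 3: 10, 4: 1}[incident.severity]
    return severity_weight + incident.customers_affected

def triage(incidents: list[Incident]) -> dict[str, list[Incident]]:
    """Segment by category and sort each segment by impact."""
    segments: dict[str, list[Incident]] = defaultdict(list)
    for incident in incidents:
        segments[incident.category].append(incident)
    for category in segments:
        segments[category].sort(key=impact_score, reverse=True)
    return dict(segments)

if __name__ == "__main__":
    backlog = [
        Incident("INC-101", "payments", severity=1, customers_affected=40),
        Incident("INC-102", "ci", severity=3, customers_affected=0),
        Incident("INC-103", "payments", severity=2, customers_affected=5),
    ]
    for category, items in triage(backlog).items():
        print(category, [i.id for i in items])
```

The point of the sketch is only the shape of the process: once incidents are segmented and ranked, the per-segment lists are what you hand to the roster of responders and trainees Lisa talks about.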
Not to go back to promoting a TV show, but as you explain more of the analogy to the medical situation: in the show The Pit, which is on Max, HBO, whatever, the big climax is there's a mass casualty event and everyone's
going to the emergency room. So it's all people on staff and you see them doing exactly what you're
saying where the top ER docs are pairing with some of the lower people who would need to see some stuff
and explaining what they're doing.
So there's all different types of injuries coming in, like critical to medium to very
light.
And like, let's say the medium ones, which would be maybe the medium incidents, you have
the people who have experience already, but aren't top-notch, but like, I can take you
because you have enough experience and you can independently handle these middle-level
ones because I've trained you enough, I have faith in
you, and I'm going to take the people who have very little experience and let them watch me
do this crazy stuff because they haven't seen anything yet. I need to get them to the point
where I can let them go on the lower things. And it's that same cycle you're talking about
where, yeah, you're always wanting to get people involved because, you know, not just the fact of burnout, but you
always have to have that bench, right?
You know, and like if you've been working on an incident for eight hours, right, or
overnight, in the morning, you better have someone who can take over for you.
You know, I remember we used to have, you know, back when I worked on the other side,
you know, you'd have releases and you know, everyone's familiar with the release from hell that goes wrong and everyone's working all night and it's now 11 o'clock in the
morning the next day and you're still working and it's like who's going to come in and relieve you?
Right? And if you don't have, if you're not doing exactly what you're talking about,
you're going to have no one. So I don't know. Yeah. I wholeheartedly, all I can do is say I'm
wholeheartedly agreeing so hardcore with
everything you're saying. It's just... So I think what, when it gets kind of interesting,
and also to be blunt, like the place where the surgery analogy falls over is that in a surgery
context, your job is surgery. So there's not another thing that you're supposed to be doing.
Whereas that's not the case with incidents.
Incidents are primarily interrupts, they're not planned.
They're often not budgeted for.
Which is kind of fascinating to me because no one has ever had a quarter where nothing went wrong unless they didn't have a product that was out in the wild.
I've never seen a roadmap that had 10% incidents,
but it's probably about that, whatever your exact number is.
But usually you can probably extrapolate out
from past data if you collect that data
what it's gonna look like for you.
And the problem here is that these junior or mid-level,
maybe even senior engineers
who just haven't been at your company that long, those are also people who are shipping products, and they're probably shipping to deadlines.
And often the way in which that work is prioritized, which is through
like the product side, if you like, rather than the sort of resilience side,
that becomes a bit of a tussle of like, oh, can I pull this person off what they're supposed to be doing in order to help
this? And it's like, oh, well, it's not really a good time, because we've got a deadline on
Friday, so maybe next week. And so all of this stuff that we've
been talking about that sounds so simple and straightforward,
and like, it's really easy to buy into sort of when you hit
reality with it, it can become really, really challenging,
because you're suddenly saying, I want this group of people to be
interruptible. And the reason I'm interrupting them is to
teach them. And so it I'm interrupting them is to teach them.
And so it's not going to deliver direct business value today.
It's going to deliver us business value in a month, in a year.
And it's the classic kind of problem of like, you know, short term versus long term.
And that training can often, it can always wait because it's never going to be super urgent
that Andy gets one of his major incidents in today.
It just is not. Whereas if you are, particularly if you're working with customers,
if you've got frequent deadlines, or if you're a company where you care a lot about momentum,
it's really difficult to pay that tax.
The way that we do it at Incident is that we ring-fence people.
So we have a thing called Product Responder, where we basically pull people off projects proactively.
And so they're on smaller tickets that we know don't have deadlines,
that we know can be flexible.
And that means they can be interruptible
in a way that still feels safe
and doesn't derail projects.
But if you don't have that in place,
because it doesn't make sense for your team,
because you don't have enough of that kind of work,
I think it's really difficult for this
not to become a repeated pull and push
between probably realistically your tech lead
or your senior engineers and your product managers. And that's what I've seen go wrong in a lot of
places. Taking so many notes. Well, thank you, Lisa. I got another question in terms of,
there is the term, I think it's called a near miss.
It's basically a problem that in the end didn't have any negative impact.
And kind of this is the question that I have to you, when does an incident become an incident?
And when does an incident become something that you also need to report all the way up?
Because a lot of organizations are driven by also reporting, right?
How many critical incidents did you have in a day, in a week, in a month?
Do you have any thoughts on that topic?
So first of all, what defines a real incident?
When does an incident become an incident?
And also, are there categories of incidents where you say, these are the ones that you want or should report, and these are the ones where there's no need to involve your board, your C-level?
Yeah, I mean, so you kind of mentioned this at the beginning, but,
you know, my philosophy is that basically most people don't declare nearly enough incidents.
And what I mean by that is there are lots of things that happen in organizations
that require an urgent response, maybe even also a coordinated response
between multiple people, but they just happen in a slack thread.
And that kind of works, but the problem with that is that it means that
that is not visible to anybody. It's not visible in terms of reporting.
It's not visible in terms of helping other people learn from it.
And there is no opportunity for anybody to, like, share what they learned.
Whereas we would just declare an incident and treat it as one, even if it's one customer hitting one edge case.
And maybe we end up declining that incident
because it turned out it was user error.
But more often than not, our users are pretty smart
and it's probably something we've done,
so we need to deploy a fix
and we'd like to kind of continue that incident process.
But equally, we might create an incident for something
that hasn't got something to do directly with a customer.
So maybe our CI pipeline is down.
Whenever our CI pipeline goes red, we declare an incident.
And that means that the whole of engineering
can easily see what's going on.
They know whether they should merge or not merge.
Maybe they can like jump in to help out
because it's visible and it's announced to the team.
And then you get the same thing in other areas.
So when our support queue gets too long,
we declare an incident.
And that means that people can kind of mob
on support tickets.
Or if we have a really bad interaction with a customer
and we're kind of worried about that relationship,
we would declare like a customer success incident.
And the thing that all of these things have in common
is that they require an urgent and coordinated response.
And if you're going to put it down for a day or two days and it's just going to go on to
like your to-do list in your, you know, Slack DM or whatever, it's probably not an incident.
But if it's going to be your priority and you would potentially like consider moving
a meeting or skipping a meeting for it, probably it's an incident.
And if you've got incident tooling and you know, I don't want to like shill the software
too much, but like there are lots of, you know, other providers exist.
There's lots of people making really great incident management software.
If you have tooling that helps you do this, that will spin you up a channel,
which is your dedicated space, that will announce it to the people who said
they're interested in incidents of that category.
You then start to be able to just build a repeatable machine where you can deal
with these low level issues in public in a way that people can learn from.
And it also means that everyone's just super familiar with your processes. So when we're talking about like an incident process, maybe you have a list of severities,
but like what does P1 mean? What does critical mean? Like nobody knows.
But if you're using those definitions a lot, if everybody knows that you have an incident lead,
and this is what an incident lead does, and maybe you have like a particular way in which you bring
other people in. So you have escalation paths where you can
escalate to like legal if you have a compliance issue. If
you're practicing all of those all the time for these smaller
issues, when the really big ones come along, you're not trying to
learn the tooling and find the piece of paper in a drawer that
tells you what you're supposed to do in incidents, you're just
like, cool, this is second nature to me, this is just a bit more
serious. And now I've got to deal with this very complicated problem. But at
least I'm not also dealing with like a whole set of toolings or processes that I've never
seen before. And so when you start doing that, you have more incidents. And then what happens
is that you get questions from above being like, wait, has everything got really bad?
Like, is our product suddenly getting worse? And the answer is that you have to separate out the costs. Some incidents impact customers, and those might cost us trust and potentially, like, churn if we're in kind of a B2B environment particularly, and they might cost us in dollars if we're B2C; if we're e-commerce it's literally going to be dollars on the table. But there are other costs as well, which is around interruptions and, like,
how much time did we spend on incidents that we could have spent building or supporting our
customers. And you need to think of those two sets of costs as very separate and report on them very separately.
But for that second group, it's really useful to track lots of things as incidents, because
it means you get an understanding of how often your team is getting interrupted and what's
the range of people who are being impacted by that.
But on the impact side, maybe lots of those incidents had no external impact at all.
You know, if our CI pipeline is down, our customers don't care.
But we care.
And so it's important that you then report them in a way that like it is clear what are the things
that are impacting customers and in what way are they impacting customers? And which are the things
that we're tracking internally, but from the point of view of our like customer churn risk, trust,
resilience, we probably don't need to be worrying about.
I like your examples that you brought about, when to declare an incident, and you brought customer issues,
CI pipeline is down, support queue is too long.
For me, this sounds like you could define even SLOs on it.
You can say, my SLO for my CI pipeline is that the pipelines have a 99% success rate
and they need to finish within five minutes because otherwise we cannot deploy fixes fast enough.
If that SLO is breached for whatever reason, the tooling is down,
somebody messed up the pipeline, then you're breaching your SLO and therefore it could trigger an incident.
Now if this then becomes really customer impacting, as you said, is a different deal.
This is something that you need to find out in the triaging process.
Or also like the support queue is too long. That's a great SLO, right?
Maybe not the support queue, but how long does a customer need to wait until they get a first response.
And if that keeps growing, obviously then the queue also goes up and things like that.
Then you could declare an incident.
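As a rough illustration of Andy's SLO example, here is a minimal sketch that checks recent CI pipeline runs against the two targets he mentions (99% success rate, runs finishing within five minutes) and declares a low-severity incident when either is breached. The webhook endpoint, the payload shape, and the function names are hypothetical assumptions for the sketch, not a real incident.io API.

```python
# Minimal sketch: turn a CI-pipeline SLO breach into an incident declaration.
# SLO targets (99% success, five-minute runs) follow the discussion above;
# the endpoint and payload below are hypothetical, not a real incident API.
from dataclasses import dataclass
from statistics import quantiles
import requests

@dataclass
class PipelineRun:
    succeeded: bool
    duration_seconds: float

SUCCESS_RATE_TARGET = 0.99       # 99% of recent pipeline runs should pass
DURATION_TARGET_SECONDS = 300    # runs should finish within five minutes

def slo_breaches(runs: list[PipelineRun]) -> list[str]:
    """Return human-readable descriptions of any SLO breaches in the recent runs."""
    breaches = []
    success_rate = sum(r.succeeded for r in runs) / len(runs)
    if success_rate < SUCCESS_RATE_TARGET:
        breaches.append(f"success rate {success_rate:.1%} below {SUCCESS_RATE_TARGET:.0%}")
    p95 = quantiles([r.duration_seconds for r in runs], n=20)[18]  # ~95th percentile
    if p95 > DURATION_TARGET_SECONDS:
        breaches.append(f"p95 duration {p95:.0f}s above {DURATION_TARGET_SECONDS}s")
    return breaches

def declare_incident(breaches: list[str]) -> None:
    """Post a low-severity incident so the whole team can see the pipeline is unhealthy."""
    requests.post(
        "https://example.internal/api/incidents",   # hypothetical internal endpoint
        json={
            "name": "CI pipeline SLO breach",
            "severity": "minor",
            "summary": "; ".join(breaches),
        },
        timeout=10,
    )

if __name__ == "__main__":
    recent_runs = [PipelineRun(True, 240.0)] * 95 + [PipelineRun(False, 900.0)] * 5
    if problems := slo_breaches(recent_runs):
        declare_incident(problems)
```

Whether that declaration goes straight to a full incident or lands in a triage state is a policy choice; the sketch only shows the breach check feeding the declaration.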
I got another question. What I see, at least we are more and more promoting, it seems.
And I don't want to take this literally, but testing production.
Meaning, we push out into production, we hide things behind a feature flag, which is great, because feature flags are perfect for experimentation.
So I believe if we do that, and more and more people do this, we will eventually have more incidents by default, because we can assume that these things are
not maybe as well tested because it's still an experimentation phase.
So do you see this already also as a trend that we get more incidents because of kind
of the new modern way of releasing, not only deploying, but releasing software?
And if you see this, is there kind of like also a quote unquote new normal? What is the new normal versus the old normal? Were we supposed to see maybe one incident per quarter, and now in the cloud native companies, they see 10 incidents per week?
So I think that I don't think that's a shift that we have seen mainly because I think that we've turned up kind of once that shift has mostly happened. So most of the companies that we are working with are companies that are releasing into production multiple times a day.
And so that means they're already in that kind of experiment test in production type world where they tend to have quite a lot of incidents.
That's unsurprising. You're not going to buy incident management software if you do one release a quarter and have one incident a quarter.
What I do think is that AI-generated code can be much harder to debug, because it's often not as structured as a human would write it.
And I think that AI allows you to build a lot more stuff, and that's very exciting. I use AI in my day-to-day when I'm coding.
However, it also allows you to build a lot more stuff.
And if you build a lot more stuff, there will be more bugs and more things will go wrong. If you think of bugs as being really a factor of the amount of stuff that you ship,
which is not quite fair,
but is a useful enough analogy,
then what you end up with is start to say,
well, okay, we're going to take a risk here
where we're comfortable as a business to ship a lot more
and accept that we're going to have to then
do a little bit more cleanup on the other side.
So in a traditional environment, you'd have maybe a human write some code
and then a second human reviews that code. If we're moving to a world in which AI is writing code
and a first human reviews that code, all of a sudden we've gone from four eyes to two.
And that is going to be, I think, probably, and we're seeing it internally,
that there is a bit of a reduction in quality that's going to happen there. I think that is largely the right call for lots of people, but it means that we'll continue to see a trend where there are more of these smaller bugs that maybe only impact, as you say,
a handful of customers that have this thing feature flagged on, but that handful of customers are still going to be upset and you still care about it, you still want to fix it.
So yeah, I think that we will see that trend continue.
But I also think that even in that world where someone goes, oh, I only have one incident a month or one incident a quarter, if I'm being honest, my response is, I don't think that's true.
I think there are lots of things that interrupt your team
that you are not treating as incidents right now.
That probably you would benefit from treating as incidents.
Which basically comes back to what you said earlier.
You need to make things visible.
Maybe currently these things are hidden,
whether consciously or unconsciously,
that certain problems are just, I guess, pushed under the rug,
and were for whatever reason, right, not tracked.
Well, I mean, 100%. And obviously, you know, we're humans, we feel shame very
acutely, we want to be respected and liked by our peers. And what that means
is if I screw something up, and I think I can fix it in like 10 minutes, I'm not going to tell anybody. By default, my default is still going to be I'm just going to sort it and no one needs to know. And I
need to be sort of trained almost and like buy into the idea that I should put that in
public even though I'm going to look a bit silly. And if I as like a leader in my organization
do that, then that's what will empower other people to be able to do the same
But if you're not in a company which has that culture and you're the only person who's doing everything in public, then what it looks like is that you're really bad at your job and your peers are much better at their jobs. And it's very difficult to change the tide on a culture like that, where everything does happen in private and in DMs: oh, can you just help me? I've just hit the wrong button and I've made this configuration error.
Like that is happening all the time at most companies.
It's quite hard to push that out into the open.
It takes a really concerted effort from your management,
from your leadership to make that a thing
that is rewarded by your company
rather than just blaming people or mocking people
for the errors that they've made.
I had one additional thought on your comment earlier on the, with the advent,
with the rise of AI being used to generate code, we will need to expect more
problems in the end.
But I like the, like from four eyes to two eyes, it makes a lot of sense.
And we need to get better in debugging that type of code
that was not initially created by us.
And I know this is not a product pitch at all, but we have a way to debug in production where you can set non-breaking breakpoints, where you can basically say,
I want to debug like I would debug locally, but I can debug in a production environment and I get all of my variables, my stack traces and everything.
And now that she mentioned this,
I think this would be even more useful
because even more code gets generated
and committed to source code repositories
that have not been written by the person that debugs it.
And I think this is a pretty fascinating thought.
I mean, imagine a world where
whenever you're trying to debug an issue,
you can never tap someone on the shoulder
who wrote the code.
You go to the git blame,
and the git blame is an LLM that has no memory.
So yeah, you can ask the LLM for help,
but the LLM is not the same as a person who wrote the code.
Whereas often, that's the first thing you do, right?
You look at the git blame,
and you walk over to somebody's desk,
and you're like, can you just walk me through,
because I think this has gone wrong in this way,
but this kind of looks intentional in the code,
and all of that process is gone.
And so we need to replace it with something.
And I think that something probably is unfortunately
more AI, not unfortunately, but it's complicated.
But I think that we can also use AI to help us with that
because we can use AI to help us debug.
We can say, here's a load of details from my error,
here is all of what the variables were in production.
Please, can you help me aggregate across these seven different errors that I've got and tell me what is the pattern here?
And we're building products to really help you with that stuff, to help you find,
oh, well, in the last incident, Lisa ran this command, so maybe you want to do that.
Building those products is hard. We're hoping that we're making good progress. I think we've
got something that's very exciting. But I think that that does become the counter foil that you
need to deal with the fact that you can't just tap someone on the shoulder and be like, what were you thinking here? What about using one AI to write the code and
another AI to be the second pair of eyes? I mean, I think that's where we're going.
I don't think that's insane. No, it's not in a way because it's going to be a different model
looking for the same thing. But I did have a more serious question, though, going back to the idea of testing in production and
doing all this stuff in production. When you're talking about incident management tracking and
tools like yours and others, does that also track what the blast radius of incidents was?
Because I think just thinking about it on the spot, if you're going more and more to
releasing in production and having quicker releases in production, a sign of maturity
would be having a very narrow blast radius.
So is that part of the standard practices for incident management to be tracking that blast
radius or is that something that still has to get baked into it at some point?
Yeah, 100%.
When we're talking about the two different ways that incidents impact your business in
terms of impact on your users and your effort expended, if you like, I think that if you're
doing incident management well, you definitely want to know what the impact was: how many customers were affected, were they enterprise customers, how bad was it? And you capture that in the post-mortem, whatever it is, and roll that up into a "how much do I care", I guess, from a customer impact point of view.
So what you want, ideally, you want to be able to pull that report
that tells you the thing that your tech lead on the ground probably already knew,
but couldn't really prove that they really need to do some investment in workflows because it's burning a load of customer trust. But actually, Scribe maybe is fine.
There was one high profile incident,
but it only affected one customer.
And actually workflows is the thing
that is impacting your customers much more.
And having data to back that up and make those decisions
about where you invest is really powerful.
You can only get that data if you collect it,
and you can only collect it
if you track the work and the issues.
And so again, that's why I think having robust incident management,
having lots of things going through that robust process
is what allows you to kind of see that data
and then advocate for particular bits of investment.
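As a sketch of the roll-up Lisa describes, the snippet below aggregates closed incidents by product area, counting incidents, customers affected, and responder hours, so you can compare where customer trust and engineering time are actually going. The field names and the "workflows"/"scribe" area names echo her example, but the data shapes are assumptions for illustration.

```python
# Minimal sketch: roll closed incidents up by product area so you can see
# which areas are burning customer trust and responder time.
# Field names and sample data are illustrative assumptions.
from collections import defaultdict
from dataclasses import dataclass

@dataclass
class ClosedIncident:
    product_area: str
    customers_affected: int
    responder_hours: float

def impact_rollup(incidents: list[ClosedIncident]) -> dict[str, dict[str, float]]:
    rollup: dict[str, dict[str, float]] = defaultdict(
        lambda: {"incidents": 0, "customers_affected": 0, "responder_hours": 0.0}
    )
    for inc in incidents:
        area = rollup[inc.product_area]
        area["incidents"] += 1
        area["customers_affected"] += inc.customers_affected
        area["responder_hours"] += inc.responder_hours
    return dict(rollup)

if __name__ == "__main__":
    history = [
        ClosedIncident("workflows", customers_affected=12, responder_hours=6.5),
        ClosedIncident("workflows", customers_affected=30, responder_hours=3.0),
        ClosedIncident("scribe", customers_affected=1, responder_hours=9.0),
    ]
    for area, stats in sorted(
        impact_rollup(history).items(),
        key=lambda kv: kv[1]["customers_affected"],
        reverse=True,
    ):
        print(area, stats)
```

A report like this is what lets the tech lead's hunch ("Workflows is burning trust") be backed with numbers when arguing for investment.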
I think it's also a great indicator on how mature you are
at doing these production releases,
because again, if most of your, if almost all of your incidents
are only impacting a very few users,
that means however you're doing your feature flags and all,
you're catching them on the first round,
you're catching them as soon as they go in as opposed to, Oh,
we missed it until it went public. Right. So it can really give you,
that data can be used to help identify your maturity level in that,
which I think is another great thing to track overall. Like, yeah, we're doing great.
You know, of course we're going to have problems in production because of what we're doing
here.
But the way we're limiting them is fantastic, right?
Versus terrible.
All right.
Yeah, we have.
Sorry, go ahead.
No, I was just saying that that's where I was hoping it was going to go, because I think that's really important for an organization.
Yeah, 100%. One of the things that we also track internally, which I think is really interesting, is how many of your incidents are caused by things you just shipped? And
how many of your incidents are caused by latent bugs? So sometimes there'll be a bug, there's
been around for two years, but somebody finds the right incantation to hit some edge case,
and that's the thing that
causes an issue. And the way that you want to react to that
as an engineering organization is very different to if it's
something that you shipped two weeks ago. Because if it's if
most of your incidents are stuff that you're shipping very
recently, that probably means you need to change your bar
about when you ship, how much testing you do internally, how
you do your rollouts. If stuff is a latent bug, honestly,
if a bug wasn't discovered for two years,
probably it's not your fault for writing it.
And that's not something that I would be necessarily
wanting to dig into very hard,
unless there was a huge spike in one area of your code base.
And so the way that you want to respond to like,
oh, we've had quite a lot of bugs
around this area of the product,
changes a lot depending on when it was introduced. Cool. Hey, Lisa, I was also just looking at the slides
that you had, and at the very end you were highlighting a guide that you wrote. It's called Modern Incident Management, the Tactical Playbook, the incident.io guide. We will also add the
link to the podcast description. Anything else that is in that guide that will be interesting
to look into for our readers? Well, listeners actually, not readers, listeners, the listeners.
Yeah. They will read.
I think we both.
Yeah.
I'm trying to think.
I think a lot of what I've just said is very much in that guide,
particularly the response chunk.
I was very involved in us writing it back in the day.
I genuinely think it's a really good resource,
and it has most of the stuff that I would want to say in it.
I think that some of the other key points in there that we haven't discussed,
it talks a lot about on-call, how do you onboard people onto on-call, how do you set up your rotations,
how do you think about alerts and alert fatigue.
It has a really great section that my colleague Lawrence wrote on like practice and game days
and drills and how to run a good game day and what that looks like.
And then it has quite a lot about the post-incident flow, which we haven't really discussed here. So that is, how
do you learn from incidents? How do you share those learnings?
So if you imagine a V1 of an incident management world, where
basically every individual in your organization is going on
their own little feedback loop where they go have an incident,
they learn that something bad happened, and they learn not to
do it again, and then they know, but nobody else does. The point
of having like a post-incident process
and having an incident management tool
that helps you push that information out
is so that you can turn those many, many, many feedback loops
into a single, larger feedback loop
where one person gets really, really stuck
because they hit a strange behavior of Postgres locking
that they weren't expecting.
They can then share that knowledge
with all the other people who are working with Postgres, and all of a sudden,
your organization has all got better,
rather than just like each person discovering it
as they make that individual mistake.
And that sounds kind of silly,
but actually I think that's how most organizations
operate by default for most things.
So yeah, looking at that post-incident stuff,
thinking about how do you choose the right level of process?
Because again, if we're talking about trying to encourage people to declare more incidents,
a surefire way of making that not happen is to give them a 10 page form they have to fill
out at the end of every incident, because then they're just not going to do it. And
then like the final bit of the guide is all focusing on kind of insights and how do you
like understand your incidents at an aggregate level. So I guess that was some of the stuff
we were talking about.
I just particularly draw attention to workload.
So that is how much time are my teams spending in incidents?
And what features are those incidents on?
And particularly for us as a growth stage company,
that means that the opportunity cost of our engineering teams is really high.
I live in that dashboard with my team.
That's the thing that I look at every single week to make sure that we're in a good spot and that there isn't investment being left on the table. So yeah, those are some highlights, but also the guide will do a much better job than
my memory of it. Well, if you take the offer, I would like to invite you back for another episode.
I think the whole post-incident process and what an organization can learn from incidents,
I'm sure with a couple of good examples from your career, would be very much appreciated.
Yeah, I'd absolutely love to.
That stuff is super interesting and gets a very bad reputation.
Cool.
I think with this, Brian, time to close this episode.
We covered a lot of ground.
We have a lot of links for people in the description
of the podcast and we have the verbal agreement that we get Lisa back at a future episode.
Brian, anything else?
And we ended on the idea of looking at incidents, what you can learn from incidents as an organization, and analyzing
and looking for those anti-patterns, which is pretty much how we started this podcast
10 years ago now.
So I forgot to mention at the beginning, this episode is airing right around our 10 year
anniversary of when our first episode aired.
So Lisa, thank you for being part of history for us, at least. Andy and
I probably had no clue we'd be still doing this 10 years later, but it's been 10 years of amazing
guests like you, amazing information sharing. As we started, similar to the incident stuff,
we started looking at a lot of performance anti-patterns and how we can learn from them
and how to stop them propagating from one technology to the next, which is very similar and in the exact same realm as what
you want to do with the incidents when you have them, learn from them, share them so other people
can avoid them. So we're coming full circle, I think, on a topic here. But again, thank you
to all of the listeners who've, if anyone's been with us since the
beginning, thank you.
But especially Lisa, thank you.
It's an honor to have you on as our 10-year anniversary guest, especially for such a fun
podcast on such a...
It's hard not to be excited about this topic because it hits Andy and I exactly where we've
always had our cares and passions of analyzing
this stuff and sharing the data and making sure people are learning from it and growing.
And it's awesome to see the concept of incident management tools helping propagate the idea of sharing: people, share what went wrong, don't hide what went wrong, because it's a learning for everybody else. So that's my spiel. Thanks so much for having me. It's been really fun.
Thank you. Appreciate it. All right. Thank you everybody. Bye-bye.