Software at Scale 59 - Incident Management with Nora Jones
Episode Date: July 5, 2023

Nora is the CEO and co-founder of Jeli, an incident management platform.

Apple Podcasts | Spotify | Google Podcasts

Nora provides an in-depth look into incident management within the software industry and discusses the incident management platform Jeli.

Nora's fascination with risk and its influence on human behavior stems from her early career in hardware and her involvement with a home security company. These experiences revealed the high stakes associated with software failures, uncovering the importance of learning from incidents and fostering a blame-aware culture that prioritizes continuous improvement. In contrast to the traditional blameless approach, which seeks to eliminate blame entirely, a blame-aware culture acknowledges that mistakes happen and focuses on learning from them instead of assigning blame. This approach encourages open discussions about incidents, creating a sense of safety and driving superior long-term outcomes.

We also discuss chaos engineering - the practice of deliberately creating turbulent conditions in production to simulate real-world scenarios. This approach allows teams to experiment and acquire the necessary skills to effectively respond to incidents.

Nora then introduces Jeli, an incident management platform that places a high priority on the human aspects of incidents. Unlike other platforms that solely concentrate on technology, Jeli aims to bridge the gap between technology and people. By emphasizing coordination, communication, and learning, Jeli helps organizations reduce incident costs and cultivate a healthier incident management culture. We discuss how customer expectations in the software industry have evolved over time, with users becoming increasingly intolerant of low reliability, particularly in critical services (Dan Luu has an incredible blog on the incidence of bugs in day-to-day software). This shift in priorities has compelled organizations to place greater importance on reliability and invest in incident management practices. We conclude by discussing how incident management will further evolve and how leaders can set their organizations up for success.

This is a public episode. If you would like to discuss this with other subscribers or get access to bonus episodes, visit www.softwareatscale.dev
Transcript
Welcome to Software at Scale, a podcast where we discuss the technical stories behind large software applications.
I'm your host, Utsav Shah, and thank you for listening.
Hey, welcome to another episode of the Software at Scale podcast.
Joining me today is Nora Jones, the founder and CEO of Jeli, an incident management platform. Previously,
she's held senior technical leadership roles at Netflix and Slack and was a big part of the chaos
engineering movement. Welcome to the show. Thank you so much for having me. Excited to be here.
Of course. Chaos engineering is such an interesting term. Could you help listeners
understand like what is chaos engineering? Does it mean just breaking everything at your company?
What does that term mean?
Yeah, it's funny.
I don't directly work in chaos engineering anymore, but it's always like, I think chaos
engineering and incident management are all interrelated.
But the way I would define chaos engineering is deliberately creating turbulent conditions in production that could already occur in production. And doing this gives you the ability to experiment on it, control it a little bit, and see how the system reacts. So something Charity Majors always says is, you know, we're always testing in production, so we might as well embrace it a little bit. And so I really think chaos engineering is about embracing the turbulent conditions that can only occur in production.
And the way I see that relate to the whole incident life cycle is that it relates to learning from incidents and a blame-aware culture. Allowing folks to do stuff like that, allowing the safety and the experimental culture to do stuff like that, will actually benefit your business in the long run. So it's tooling,
it's philosophies, it's like a way of approaching how you build software. It's a lot of different
things. And one of the more famous tools that I remember, it was like Chaos Monkey. From what I
remember, there was this whole group of tools where like they take down instances,
they take down clusters.
Were you part of that work in any way?
I'm curious to know.
Yeah, so that set of tooling was created in like 2011,
I believe.
I joined Netflix in 2017.
So it was several years after they had created that tooling,
but the creation of that tooling
had spawned a chaos engineering team.
It was so valuable to think that, oh, this infrastructure we rely on could fail.
And that was actually pretty novel thinking back in 2011.
It was like, oh, if we're using other people's software, we expect it to be resilient.
But we all work at companies and we all manage
software at these companies. We know it's not totally resilient. And so it was kind of waking
up to that fact a little bit more. I worked on Chaos Monkey a little bit at Netflix, but I
primarily worked on a tool called ChAP, the Chaos Automation Platform, which kind of
basically set up like an A-B test. But as part of the experimental portion of the A-B test,
we would fail a particular part of Netflix intentionally
and make sure that the reactions were the same as the control portion
or that it didn't deviate from the customer experience too much.
So it actually allowed software engineers, I think, a way to understand
their end users a little bit more, which as you can imagine, at a company like Netflix,
it's really hard for software engineers to gain access and do user research, like formal user
research with end users. And so it was a nice way of understanding how different failure modes of
the service actually impacted the bottom line and the customer experience and driving software engineers closer to that.
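For readers who want a concrete picture of the control-versus-experiment comparison described above, here is a minimal Python sketch. The helper names (run_fault_injection_experiment, inject_fault, kpi) are hypothetical stand-ins, not Netflix's actual ChAP API.

```python
import random
import statistics

def run_fault_injection_experiment(requests, inject_fault, kpi, max_kpi_drop=0.05):
    """Split traffic into a control and an experiment group, inject a fault
    only into the experiment group, and compare a customer-facing KPI.
    All names here are illustrative, not the real ChAP interface."""
    control, experiment = [], []
    for request in requests:
        if random.random() < 0.5:
            control.append(kpi(request))                    # baseline behavior
        else:
            experiment.append(kpi(inject_fault(request)))   # same request, with the dependency failed

    control_kpi = statistics.mean(control)
    experiment_kpi = statistics.mean(experiment)
    degradation = (control_kpi - experiment_kpi) / control_kpi

    # Guardrail: flag (or abort) the experiment if the customer experience in
    # the experiment group deviates too far from the control group.
    return {
        "control_kpi": control_kpi,
        "experiment_kpi": experiment_kpi,
        "degradation": degradation,
        "within_tolerance": degradation <= max_kpi_drop,
    }
```

The guardrail mirrors the idea above: the injected failure is only considered safe to keep running if the experiment group's experience stays close to the control group's.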
And it sounds like a really novel tool, because not that many companies are investing so much in understanding how their own software breaks.
It almost feels like software is so buggy these days,
like especially when you go to book a flight
or something, things are always down
or there's some system that's off.
What got you interested in software reliability?
I'm wondering if there's like an origin story
of some sorts, like why focus on software reliability?
It's a great question.
I think, you know, my whole life,
I've actually always inherently been interested in risk and how it drives human behavior. And I think it's really fascinating; I'm drawn to sports that involve risk as well, because learning how other people make mistakes actually helps me be better at those sports.
And so I think I kind of naturally gravitated towards it in the software world. I started my
career in hardware. I have a degree in computer engineering, and I did some work with the Navy
for a little bit.
And then I went to a home automation and home security company.
And so as you can imagine, in those types of organizations, the stakes are really high if the software fails.
In a home security company, if the software fails, someone's house could get broken into,
right?
So the stakes are quite high.
And so I was hired to make sure that
those things didn't happen. But I was also, you know, things happen, like incidents still occur,
you can't be completely perfect there. And one of the most important things afterwards is to
ensure people feel comfortable sharing why what they did made sense to them at the time. When I was in that security job, I mean, I was pretty young.
I think I was like 23 years old.
And we didn't have a lot of folks that did what I did.
And I was in on a Saturday one day, testing a release to make sure it got out on time.
And I missed a test.
And it got released to customers and the test failed. And I remember
being interviewed by people in the C-suite at that company. And I was so young and I was so
nervous that I had really messed up, but they actually made me feel really comfortable about
the process. And the way they asked me questions was to just learn how they could improve as an
organization in the future. It's not some 23-year-old that was working on a Saturday to blame.
It's like, how did the organization enable that in ways that made sense?
And so I think it was just a culmination of things that led me to this way of thinking
and this way of doing.
And I've just over time seen that companies that embrace this sort of learning from incidents
and incident management aware culture actually have quite a big competitive advantage over
their competitors.
So we heard a little bit about the motivation behind starting Jeli. Tell us a little more about what Jeli does.
Sure.
Yeah, great question.
So Jeli is an end-to-end incident
management platform. We cover everything from when your incident starts to helping you get the right
folks in a room, to helping you keep the right folks updated, to integrating with all the tools
that you use and love, to helping you find patterns and understand from it afterwards. What really makes us stand out is that
we really focus on the human pieces behind this. So in a lot of incidents in the software industry,
if you look up any public postmortem or any public incident review, most of the time they don't mention what was hard for the people. They mention what failed about the technology. But the thing about our organizations is that our system does not just consist of technology. It consists of technology
and people. It consists of people that have to work alone, that have to work with each other,
that have to work with other pieces of technology. And none of that usually gets covered or even analyzed by a lot of companies.
And it's mainly because, you know, as software engineers, we're not really trained to do that.
We're trained to look at how technology failed. We're not really trained to look at how our
processes failed and how our documentation failed and how we can learn from it and why things made
sense to us at the time.
So with Jeli, while we do focus on the technology, we also focus on how the people intersect with the technology. A lot of the costs behind incidents are quite high because of
coordination and cognition. And we aim to help lower those costs to make it really easy for folks.
So we're not going to get rid of all your incidents for you, but we are going to help
you make them a normal part of your work.
Yeah, I think of a spectrum of companies on the software reliability maturity curve, almost, where maybe companies on one side of the spectrum don't track incidents or don't acknowledge incidents and have a fairly blameful culture, and the other side, which is slightly more utopian, where there are very few incidents, everyone learns from the incidents that do happen, and systems and processes improve. Where have you seen most companies lie? Like, of course, there's going to be like
a bell curve probably, but like, what do you think or what have you seen from customers or the market?
How are tech companies doing today? How have things changed from maybe five years ago?
In terms of what specifically, like how we look at incidents?
Or how companies think about reliability?
Yeah, I mean, how companies think about reliability? Wow, that's a really great question. I'm sure the company you're in now has had incidents, right? And all of us are customers of vendors too. I run a vendor, you work at a vendor, right? But I bet your expectations of the reliability of the vendors you use have probably gone
up in the last five years. And I really think that's how the software industry has changed: people aren't putting up with low-reliability vendors as much, especially for really important things. You know, I run an incident management company, and our reliability,
you know, even though we're helping other people with all their reliability,
our reliability becomes really important. And I know, you know, you talked about Datadog a little
bit last time, but like Datadog had an incident back in March and their reliability is of utmost
importance to folks. So I think a lot of our expectations in the software industry have
changed because we've learned so much about what it means to have good reliability that there's
just really no excuses anymore if you're not doing some of the bare minimum. Because frankly,
you're just not going to keep your customers. You're going to lose to a vendor that is a little bit more reliable than you.
And this can be maybe very similar to security and other aspects.
You need to be secure.
You also need to be reliable.
I guess your platform needs to work.
And yeah, I think with security or compliance, that's what we've seen, right? Like, how many companies even knew what something like a SOC 2 was a few years ago?
You know, I'll say, I think that's actually a really good comparison. When I was looking at getting our SOC 2 for Jeli a couple of years ago, someone who had started a company about 10 years earlier gave me the advice, oh, you don't really need to worry about that until you IPO. And it's amazing how much the software industry has changed because that's not true anymore.
And I think there's something similar happening with incidents, right?
It's just not acceptable anymore to not have that, you know. And also, as part of getting your SOC 2, you have to describe your incident management program. So it's really not acceptable anymore to not have some practices in place that demonstrate to your
customers and also to your employees that you're taking reliability seriously and not just throwing
features over the fence and hoping that your customers are finding the issues.
It's that you're being proactive about them.
So going back to what you mentioned about Jeli, what is a human-first incident management process, right? Like, how does that differ from a technology-first process? What are the small things that your product needs to do, or even the larger things in terms of your product philosophy, that need to be different for it to be a human-first product versus a technology-first product?
Yeah, and I want to clarify a little bit. I wouldn't say that the human is more
important than technology or technology is more important than human. It's like, we really focus on how the human works
with the technology, which is not what a lot of platforms do in the software industry in general.
They're not treating the human as a part of the system; they're assuming that the technology
can take care of everything for you. And what we really do with the design of our software,
like from the get-go, is make it easy for the human that's going to be interacting with it every day.
So our users are not necessarily having incidents like all the time, right?
They have other things that are going on.
And so part of what our software does is actually reminds them
how to be in an incident.
It makes it easy for them to kind of remember
the things that are important for them to be doing.
And so I really think it's just making sure that the human is not ignored from the equation. You
know, I think there's a lot of talk right now in the industry about AI naturally, which is really
cool. But I think the most exciting articles I'm seeing about this are how it helps elevate the human
and not how it replaces the human.
And so I feel like that's kind of the biggest thing is like we are on the team of the human
rather than replacing them.
Mm-hmm.
So what is like the bare minimum, right?
Like I think blameless culture is like often repeated as like a keyword.
Like what is the bare minimum your organization should be doing
in an effective incident management procedure?
Yeah.
So one of the ways our incident response tool really differs
is we really focus on helping everyone know what's going on.
So I think that's a hard part in incidents for a lot of people is connecting
the dots between your go-to-market team, between your engineering team, between your executives,
because all of them are impacted by the incident and they all care about different things in the
incident. And so keeping them all connected in a low coordination way and also in a non-stressful way is something our tool really
focuses on. And so I think that is something that a lot of companies in the software industry
could start doing is just really focusing on all those connective tissues rather than focusing on
individual departments. And then after the incident takes place, I think some basic things,
you mentioned the word blameless. I like to use the word blame aware.
So it's like you're making sure you're not finger pointing, but making it safe enough
to name names like, oh, Jose was the one that pushed this code.
Jose, could you talk about it?
And making sure it's okay for Jose to talk about it, and not actually skirting around the fact that Jose pushed this code. Because if you're in an organization where that is really uncomfortable to talk about,
I think that is a big signal that there are other deeper things going on in your organization.
So that's what I would say some of the bare minimum is, is making it okay to have those
conversations. Kind of like I mentioned earlier, like how I was interviewed
when I was earlier in my career,
making it okay for folks to share those things
so that the org can learn from them.
And changing the culture of the org like that, especially if it's currently something else, is not trivial, is my guess.
It's like, let's say I'm an engineer
or like I'm in engineering leadership in an organization
and I want to introduce or I want to change our incident management culture to be slightly
more blame aware or blameless.
What steps can I take?
Yeah, I think some of the ways that folks will run into trouble with this is that when they're doing
an incident review or when they're talking about an incident, instead of looking at just
the raw facts, they're pontificating about it and people are sharing their opinions on
the experience rather than just looking diplomatically at what happened.
And when I say diplomatically, like it is impossible to be
completely diplomatic, especially if you participated in it. But at the very least,
you can start by actually pulling up the conversation. So if the conversation happened
in Slack or it happened in a Zoom recording, just showing exactly how people interacted,
what they said, what they did, actually takes people off of the defensive a little bit and has them feel more comfortable sharing with each other how they participated.
Because it's no one calling them out on what they said.
It's just showing what they said. And one of the most important parts is, when you're asking questions or facilitating a conversation
around an incident, making sure, and this is after the incident has taken place, making sure you're
not sharing your opinions on what happened, but trying to take a very objective approach and
letting others share what was going on for them and why whatever they did made sense to them at the time.
So there really has to be just like no judgment from the facilitator.
Yeah. I'm curious if, feel free to not answer this because this might be a little specific,
but I'm curious if that's the kind of motivation you see prospective buyers of Jeli have when they want to use a tool like Jeli. It's like, I want to change the culture of how we manage incidents at my company, and I think changing the culture along with the tool is going to help with that. Or that's how maybe I would think about it, but I don't know how true that is generally.
So, I mean, I know we're not getting into the product specifically here, but we try to make it really easy for you: we meet you where you are today. So if you're using templates today, you can use templates in Jeli. If you're a little bit of a blameful company, that's okay. You can still use us; we meet people wherever they're at today. And we just help them get like a little
bit better. I think of it like a video game, you're just going like half a level up or like
one level up. We sometimes get folks that want to change their culture. But I would say that is
rarer than most of the folks that come through our door. They're just interested in all the stuff we
can do. And like, a lot of the stuff we bubble up for folks are patterns around like
who is responding to incidents, what time, what on-call schedules. And so a lot of like what our
platform can do can actually help you improve your on-call schedules, improve your road mapping,
improve your service ownership. And so I think people really get interested in using our platform
to help have conversations and make better decisions
after incidents. But I wouldn't say like a huge number are approaching it, like wanting to do a
giant cultural change. We have seen some of that with our customers and it is really cool to watch.
How do I know that I'm actually learning from my incidents? Like, is there like a metric I could track? I know that there's probably metrics for, you know,
is my on-call schedule healthy or not? But how do I track that my organization's actually improving?
Is that even possible? Oh, that's a really good question. I mean,
there's a lot of things you can look at after incidents. I think there's a lot of things that
are not always helpful after incidents. I think if you're looking at any metric after an incident,
making sure you're adding context to it and adding a story behind it, rather than just like
taking the metric at face value, but really digging into what it means if you're making
a decision off of it. But I would say like metrics to help you understand if you're learning,
I would say are people participating in the incident review. So, you know, it's something
interesting you're doing right now is you're interviewing me for a podcast and you're
interviewing me for something that I have, you know, expertise in. And that's exactly how we
should be treating people after the incident too, is like essentially interviewing
someone after the incident for something they have expertise in. Like, I think you mentioned
to me at the beginning that I should be talking a certain percentage of the time and you should
be talking a certain percentage of the time. That's actually what I advise people after incidents too.
And so there's little things you can look for like that, like how many people were talking? Was it the same person talking the whole time?
Did we get diversity of perspectives in the room? Did we have people from all over the organization?
Did people say they learned something? Like how many people viewed this document?
And then in terms of like, so that's in terms of learning, but in terms of actually improving your incidents, I would actually look at your incidents over time and just see, oh, how
are our incidents doing?
Like, if we need Ariel in every incident that involves Consul, even if she's not on call, do we stop needing Ariel as much over time? Because that means that more people are learning about Consul and we don't need to rely on Ariel as a single point of failure for that service.
And so those are things that show that just we're improving as a whole over time.
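To make the Ariel example concrete, here is a small Python sketch that tracks, per quarter, how often one named expert is pulled into incidents involving a given service. The record shape, service, and names are assumptions for illustration, not Jeli's data model.

```python
from collections import defaultdict

# Assumed incident records: quarter, service involved, and the responders pulled in.
incidents = [
    {"quarter": "2023Q1", "service": "consul", "responders": {"ariel", "sam"}},
    {"quarter": "2023Q1", "service": "consul", "responders": {"ariel"}},
    {"quarter": "2023Q2", "service": "consul", "responders": {"sam", "priya"}},
    {"quarter": "2023Q2", "service": "consul", "responders": {"ariel", "priya"}},
]

def expert_dependence(incidents, service, expert):
    """Fraction of incidents for `service` that needed `expert`, per quarter."""
    totals, with_expert = defaultdict(int), defaultdict(int)
    for incident in incidents:
        if incident["service"] != service:
            continue
        totals[incident["quarter"]] += 1
        if expert in incident["responders"]:
            with_expert[incident["quarter"]] += 1
    return {quarter: with_expert[quarter] / totals[quarter] for quarter in sorted(totals)}

# A falling trend suggests knowledge is spreading and the expert is becoming
# less of a single point of failure for that service.
print(expert_dependence(incidents, "consul", "ariel"))  # {'2023Q1': 1.0, '2023Q2': 0.5}
```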
Because I think a lot of people look at the cost of incidents in terms of how long it was and when the customer got to a good experience, which is so important.
Don't get me wrong. But there's so much more that goes into it, right? Like what if 300 people from your org
participated? What if it interrupted everyone's roadmap that day? What if we had all these people
that were pulled in when it was the middle of the night for them? Like those things are paper cuts
that just add up over time and end up leading to bloat in organizations where
you're over hiring and overspending in areas that you don't need to be. And so, yeah, it's like,
I mean, I say all that ultimately it's reflection. It's like looking at various things that you want
to improve and seeing how you're improving over time and really looking at the
coordinate of pieces of that. Yeah, I think that response makes a lot of sense to me, especially
again, if I contrast it with something that I'm a little more familiar with, like developer
experience. There's no one incredible metric that's going to tell you what the developer experience is at your organization. Exactly. It's paper cuts, seeing how things add up, seeing whether everyone is working 24-7 to make
sure they ship stuff on time, how blocked people are. It's actually really similar where you have
a bunch of signals that are not extremely exact metrics, but they're kind of the same across
different organizations. There will be specific
things for each organization based on its context. Then you have to add all of that up and make a
judgment call. So that makes a lot of sense to me. Let's say that I'm on this journey of improving
incident management and my company is slightly larger now. And I feel like I've done a reasonable job as a senior engineering leader in my group.
How do I make sure like the rest of my organization comes along, you know, especially if I cannot be
in every single meeting, and I cannot make sure that there's enough people speaking in every
meeting as an example? How do I codify some of the stuff? And is that somewhere where something
like Jeli helps, where it can help you codify a set of principles, or a set of: this is how I need to do X, Y, Z, this is how I need to manage incidents in my organization, as an example?
Yeah, I'm really glad you asked that question.
I also wanted to touch on
the developer experience thing you brought up
because developer experience
should be brought into incident metrics.
Like, the responder experience and how it's impacting people, those worlds should a hundred percent
overlap. And then the second question or the question you asked around, like, how do I do
this in my organization when I have a lot of other stuff to do and I can't be in every room, right?
Like that's almost a full-time job. And I've been that person in a couple of organizations and you're right,
it's tiresome and it's a lot of work. And what I ended up doing in those organizations was I ended
up building allies, people that were really interested in doing this on their team. So I
would have allies like on the front end team, I'd have allies like
on the search team. But again, like, you know, quality wanes if you don't have someone managing
that experience. And what I really wanted was a tool like Jeli. And that was part of the reason why I left Slack: I wanted a tool like that. It was a lot of work
to maintain the quality of that without something
that was helping folks be consistent. So that's a lot of what it's meant for is to make it easy
for folks to develop a learning culture without taking a lot of time.
And then going into the technical aspect of it, you had an example where, you know, there are very few engineers who might understand how a piece of technology works, and part of understanding whether your organization is learning is to see if it's the same set of engineers who get pulled into a certain kind of incident, and if that's improving over time. The flip side of this is making sure your organization learns efficiently, right? But how could, let's say I'm a reliability leader or an engineering leader in my organization,
and I want to make sure that this one engineer can impart their knowledge to the rest of the organization.
What is the most effective way in your mind to do something like this?
Is it something like tech talks?
Oh my gosh, yeah. This happens all the time. And the mistake I see a lot
of orgs make is that they put that pressure on that person. They're like, you know, in order to
be an even more senior engineer, you have to teach people. But like when you're telling that person
to teach people while they're also expected to manage the load of all the work that no one else understands, they will 100%
burn out. It won't be tomorrow. It won't be in a week. It will be in a few months and their quality
is going to wane and they're going to eventually leave the organization. But the ways that you can
get around this is actually learning like kind of what we were talking about when I was talking
about what you were doing with podcasting with me, right? You're interviewing me. We need to interview experts internally,
right? We need to ask them what they're looking at when they're solving something.
So if they come into an incident channel and they're like,
got it. And they send this graph in the channel and they're like, okay, I'm doing this with these hosts, and everyone's like, yay, and they're applauding them, right? But are we ever actually asking them
how they knew to do that and what led them there? And the answer is no. And they're usually really
bad at actually explaining that too. And so I think the work needs to go on to others in the
organization to learn cognitive style interviewing.
And we have a free resource on our website. It's called the Jeli Howie Guide. If you just go to jeli.io/howie, you can see how you can do some cognitive-style
interviewing. So it's asking someone things about what they have expertise in and like
how they knew to do it. Someone drops a link to a graph.
If I was interviewing them cognitively afterwards, I would say,
hey, you know, you dropped this link to a graph.
Before I get to that, like, could you tell me a little bit about how you found out about this incident?
And, you know, I'll throw them a softball question at first, and they'll tell me about how they found out about it. And part of my job during this is to almost play dumb a little bit and get them to share everything with me. And the more that we do that, and the more people we share it with in our organization, the more experts we build. And it actually helps people feel really valued too. Like, experts are normally really happy
to like be interviewed about their expertise and share it.
But asking them to come up with a whole tech talk
and a whole documentation series
is going to be really overwhelming.
But then who is the person
who's supposed to interview the expert?
I think that's where like the confusion is.
Is it engineers on the team?
Is it the incident manager?
Is it the on-call of the team?
Yeah.
The point of contact for the incident? How do you handle the diffusion of responsibility over there?
Yeah, no, it's a great question. And there's a few different ways you could approach it.
And I'd advise people differently depending on the size of their organization, how far along their business is.
But if your business is a few hundred folks, that means that not everyone is in every single
incident. And so if you can get someone that is... We have on-call rotations. Why not have
sort of incident review rotations where if you're up next and you didn't participate in the
incident, you do a 30 minute one-on-one with like the expert in it. And you write up about like what
you learned about their expertise and just spending a little bit of time doing something like that
will just pay off dividends in the future. It's just, you know, it's like it's time we're spending
anyway. It's just time with like a different focus. So that's kind of how I would recommend it with a larger
organization. But, you know, when I was in office with folks, I would sometimes just ask them to go
for a walk with me or ask them to have lunch with me. And I would do the interview, write it up and
share it with other folks internally. Yeah, which takes a little bit of time, but the value pays off quite a bit.
Yeah, it almost sounds like you could even have a volunteer rotation of sorts. So, yeah, they're incident reviewers; they could help share knowledge outside of a more formal venue like a tech talk, because it's so easy, for example, to have an incident post-mortem that doesn't have any information at all. It could be a centralized group of people who are thinking about incidents, who are reviewing incidents, who are writing stuff down and sharing knowledge with the organization, and it doesn't fall to the set of experts in each area to, exactly, also take on this extra responsibility.
Yeah, I like that a lot. And yeah, you mentioned it: it's organizations that can come up with sustainable, useful
processes, evolve them as the organization changes.
Those end up having the best competitive differentiators and end up growing the quickest.
I completely agree with your earlier point.
And one thing that I just want to
bring up as well is like, whoever's interviewing the expert is naturally going to get better at
that piece of technology. So if I'm interviewing the expert in Consul, and I'm also an engineer at that company, guess what, I am all of a sudden going to have a lot more expertise about Consul,
right? And like, you know, we think about when we're early in our career and we're trying to figure out the kind of engineer we want to be,
what do we do? We look at who's in our organization and we look at who we admire
and we watch them, right? And so it's about continuing to do that, right? Like watching
what these experts do and recognizing them in different areas will help us build more experts.
And yeah, we'll get folks a lot better. And that's why it's so important to have that be
on a rotation as well, so that you're getting that expertise spread out. And it's also really
important to have that rotation be people that are writing code and participating in incidents
so that they can put that new knowledge to work.
Yeah. Another aspect of incident management is certainly, you know, action items, right? Making
sure you're trying to prevent the incident. You can detect it quicker. You can remediate
problems better. You can mitigate it quicker. What is the most effective way to manage your
incident action items or your follow-ups, right? Like some organizations have processes around, like you need to complete all P0 action items in 30 days or fewer or some other number like that.
What have you seen work in practice or like what are people doing wrong or how should we be thinking about this?
I see action items taking place no matter what after incidents. I also see learning taking place no matter what after incidents.
It's just the level of quality in those two worlds that I see us mess up on.
So if we're fixing without actually taking the time to talk about the thing that happened
and learn from it, we're going to put ourselves in a bad position later. We're maybe not doing
the best fixes for our organization. We're also maybe not sure why we're doing those fixes.
And they're going to come back to bite us later on. It won't be right now. Now, if we're learning,
you know, and not fixing, like it's still beneficial, you know, if it will help later
on, it will help in the code people write. It will help in how they interact in incidents. And so I think people kind of have the wrong idea when they're like, oh, all P0s must be completed in X days; you end up with easier action items that maybe are not higher quality. And so I think if you actually just, you know, allow for a little bit more time to talk about it, you'll get higher-quality
action items. If you really want to spend the time doing this in a really great way, I would separate
when you're doing action items versus when you're talking about the incident. So not talking about
how you're going to fix it, but just talking about how it unfolded and how it came to be.
And then coming up with your action items, you'll have like much higher quality action items.
And no one's going to need to be the action item police, where they're like, you know,
Slacking you to make sure you did it. No one likes that. And if it's not getting done, you know, there's probably a good reason that it's not getting done.
Yeah.
Yeah.
It almost sounds like you're saying, you know,
separate the problem discovery from the solution.
Exactly.
Yeah.
Right.
Which like makes so much sense, but you know,
we kind of forget about it sometimes when we're in it,
but when we stop and think about it, it's like, oh yeah, that is what we should do. Yeah.
Yeah, but on the other side, you kind of want to make sure this incident doesn't happen again, and, like, should you schedule time to talk about the fixes? Yeah, it's a tricky one to think about.
Yeah, it's always tricky, and I don't think every incident deserves the
same level of treatment. Sometimes an incident might involve no incident review prep and it's
just me and my colleagues hopping on a call for 30 minutes and reviewing the Slack transcript,
right? So that's a really low level. And then we come up with action items at the end. So boom,
we just spent 30 minutes on it. Sometimes it's a higher level where I'm interviewing a couple
people that had expertise. And I'd maybe take that approach if it was an incident that like
almost entirely relied on one person, that's probably worth digging into and just spending
a little bit more time on. But I don't think it needs to be an all or nothing thing.
I don't think you need to spend like several weeks
on a thing to get value out of it.
There's little things that you can do
in order to set your organization up for success.
You could even send people a little survey
after the incident.
Like if you really wanna get low level about it
and just be like, how did you feel about this incident?
Do you think our coordination went well? Do you think our communication went well? And just have
people spend like five minutes on it. And so those are little things that you can do to still develop
a learning culture that matches the level of effort you might have time for.
And formally, you can break down incidents by severity level and then a different set of people or a different set of policies might apply to each incident level.
And that's how you can kind of operationalize this.
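As a rough illustration of that operationalization, here is a hypothetical Python sketch of a severity-to-review-policy table. The severity names and review steps are invented for illustration, not a prescription from this conversation.

```python
# Illustrative policy table mapping severity to post-incident treatment.
REVIEW_POLICY = {
    "sev1": {"review": "facilitated review with responder interviews", "prep_days": 5, "survey": True},
    "sev2": {"review": "30-minute transcript walkthrough", "prep_days": 2, "survey": True},
    "sev3": {"review": "async five-minute survey only", "prep_days": 0, "survey": True},
}

def review_plan(severity: str) -> dict:
    """Look up the post-incident treatment for a severity level, defaulting to the lightest one."""
    return REVIEW_POLICY.get(severity, REVIEW_POLICY["sev3"])

print(review_plan("sev2"))
```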
A slightly more controversial question.
In what you said, I heard, you know, review the Slack transcript.
Like, does every incident need a postmortem doc?
It sounds like your opinion is maybe not; maybe some of them you just learn from on your own with your colleagues, without spending the time to write everything down in a structured format. I just want to clarify, is that the case? How do you think about, you know, policies like every incident should have X, Y, Z, at least one action item, versus a policy that says we need to learn from every incident, whatever the incident was?
Yeah, I like that question. I do think there does need to be some guardrails, otherwise people will just not really know how to do it, right? And I think people need boundaries when they're learning things, right? And so at a bare minimum, I would recommend reviewing how the conversation unfolded
and not cherry picking things from the conversation, but actually sitting down and reviewing the
Slack conversation.
And if the Slack conversation was a really long Slack conversation, it's probably worth
just taking a little bit more time to do that incident review.
And you'll thank me later when your organization is taking the time to learn how they worked together and how
they evaluated things afterwards. And also like how much folks talk, you know, what time of day
it was for folks, where people were looking, what services were impacted, what technologies were
helpful for us. And so at a bare minimum,
I would recommend reviewing the Slack transcript or the Zoom transcript,
marking it up a little bit akin to taking a highlighter, taking some notes,
but just sharing what did we learn from this? And I think there's a couple different sections
you could highlight too. You could highlight, here's when we detected the incident.
Here's when we repaired it.
Here's when we were diagnosing things.
And those might not be clear linear paths, but you could just mark different areas where
that kind of stuff was happening.
And I guarantee you'll gain a little out of it.
I think making an explicit decision, are we going to have action items here and sharing
how folks came up with that decision is going to be really important.
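Here is a small Python sketch of the transcript markup idea described above: annotating an exported chat transcript with detection, diagnosis, and repair markers, and deriving simple timings from them. The message format, timestamps, and names are invented for illustration.

```python
from datetime import datetime

# Assumed shape of an exported Slack transcript: (timestamp, author, message).
transcript = [
    ("2023-06-01T14:02", "alerts", "pager fired for checkout latency"),
    ("2023-06-01T14:05", "ana", "looking at the latency dashboards"),
    ("2023-06-01T14:21", "jose", "rolling back the 13:50 deploy"),
    ("2023-06-01T14:30", "ana", "latency back to normal"),
]

# "Highlighter" markers: where detection, diagnosis, and repair happened.
phase_markers = {
    "detected": "2023-06-01T14:02",
    "diagnosing": "2023-06-01T14:05",
    "repaired": "2023-06-01T14:30",
}

def minutes_between(start: str, end: str) -> float:
    fmt = "%Y-%m-%dT%H:%M"
    return (datetime.strptime(end, fmt) - datetime.strptime(start, fmt)).total_seconds() / 60

# The derived numbers are anchors for the review conversation, not the goal in themselves.
print("minutes to start diagnosing:", minutes_between(phase_markers["detected"], phase_markers["diagnosing"]))
print("minutes to repair:", minutes_between(phase_markers["detected"], phase_markers["repaired"]))
```

The narrative around each marker, who noticed what and why their actions made sense at the time, matters more than the timings themselves.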
Yeah, I like that much more than the very monolithic, like ensure every incident has at least one or whatever.
I like the approach of, you want to learn from every incident, rather than, you need to have some deliverable, because then it just feels like incidents are even more work, especially when there's an incident for something that seems frivolous or whatever.
Yeah, John Allspaw always says, you know, are your postmortems meant to be filed or are they meant to be read? And I feel like a lot of organizations over time, because of the guidelines that are put in place, end up creating this write-only culture for their incident reviews, which is useful to no one. It's just meant to help defend the person that's writing it. And so organizations aren't setting their folks up for success if that's what they're requiring.
Yeah. What are your favorite resources on learning how to manage incidents? Where can someone read up more on processes, tooling, thought process?
Yeah.
Oh, yeah, I love that. So a few years ago, like back in late 2018, I started a community called Learning from Incidents in Software, and it gathered people from all over the software industry to learn and share about how they learn from incidents
and how they enact these practices in their companies.
So there's a website that exists with some blog posts around that.
But we also held our first conference within that community a couple months ago.
And so all those videos are up on YouTube.
If you go under the Jeli channel on YouTube, you'll see all the
learning from incidents videos, which were sponsored by Great Circle and Brent Chapman.
But yeah, they have a lot of... There's folks from Salesforce, from Indeed, from Quizlet,
from startups, from IBM, all sharing how they've developed this kind of culture internally. And
it's really helpful to see because it's not just like the theory behind it.
It's how they're actually doing it.
Yeah, I think videos are just one of the best ways to learn, like the highest-bandwidth way of learning.
Totally.
Yeah, so I will link that in the show notes.
And what gets you excited about, you know,
the future of the software industry and your role in it
over the next few
years, at least? Yeah, I mean, I think of this as also just making the software industry a better
place to work. There's a reason why tenure is so low in the software industry, right? We get burned
out, we move to another place two years later. But like, you know, if we're job hopping every two
years, we're not actually taking the time to develop expertise in the particular company system we're in, which will benefit us as engineers.
It will benefit our organizations.
And it's ultimately going to benefit end users in society if folks are actually taking some time to put care into their software, into their work. So I'm really excited about the level of attention
incidents are getting in the industry right now
because I just think it's going to be really important
to all of our end products.
So I'm excited as a consumer and a creator.
Yeah, I'm excited too.
I hope the airline apps become less buggy.
I'm guessing you were impacted by the Southwest outage.
I've been impacted enough. And it's just one of my pet peeves. When I read a tweet somewhere that
said, you know, don't use an airline's website, but use an app, less likely that the app will
be buggy because there's a slower release cadence. That made me really sad because,
you know, that's not why an app should be quicker or more reliable. So I like to pick on that industry in particular, but maybe I'm being too harsh.
No, I mean, I think it makes sense too. I also think it all comes down to salaries in that industry and how employees are being treated. And developing learning organizations also helps people feel, I mean, people stay at companies where they're feeling respected and valued and compensated. And so this kind of helps take care of the respected and valued piece
for sure. Well, Nora, we should stay in touch. And yeah, it was a really good conversation. I
feel like I learned a lot and became a little less procedural in my thinking about incidents.
So thank you so much for being a
guest and I hope to have you again at some point. Yeah, thank you too. It was really great to be on
this.