Software at Scale 59 - Incident Management with Nora Jones
Episode Date: July 5, 2023

Nora is the CEO and co-founder of Jeli, an incident management platform.

Apple Podcasts | Spotify | Google Podcasts

Nora provides an in-depth look into incident management within the software industry and discusses the incident management platform Jeli.

Nora's fascination with risk and its influence on human behavior stems from her early career in hardware and her involvement with a home security company. These experiences revealed the high stakes associated with software failures, uncovering the importance of learning from incidents and fostering a blame-aware culture that prioritizes continuous improvement. In contrast to the traditional blameless approach, which seeks to eliminate blame entirely, a blame-aware culture acknowledges that mistakes happen and focuses on learning from them instead of assigning blame. This approach encourages open discussions about incidents, creating a sense of safety and driving superior long-term outcomes.

We also discuss chaos engineering - the practice of deliberately creating turbulent conditions in production to simulate real-world scenarios. This approach allows teams to experiment and acquire the necessary skills to effectively respond to incidents.

Nora then introduces Jeli, an incident management platform that places a high priority on the human aspects of incidents. Unlike other platforms that solely concentrate on technology, Jeli aims to bridge the gap between technology and people. By emphasizing coordination, communication, and learning, Jeli helps organizations reduce incident costs and cultivate a healthier incident management culture. We discuss how customer expectations in the software industry have evolved over time, with users becoming increasingly intolerant of low reliability, particularly in critical services (Dan Luu has an incredible blog on the incidence of bugs in day-to-day software). This shift in priorities has compelled organizations to place greater importance on reliability and invest in incident management practices. We conclude by discussing how incident management will further evolve and how leaders can set their organizations up for success.

This is a public episode. If you would like to discuss this with other subscribers or get access to bonus episodes, visit www.softwareatscale.dev
Transcript
Welcome to Software at Scale, a podcast where we discuss the technical stories behind large software applications.
I'm your host, Utsav Shah, and thank you for listening.
Hey, welcome to another episode of the Software at Scale podcast.
Joining me today is Nora Jones, the founder and CEO of Jeli, an incident management platform. Previously,
she's held senior technical leadership roles at Netflix and Slack and was a big part of the chaos
engineering movement. Welcome to the show. Thank you so much for having me. Excited to be here.
Of course. Chaos engineering is such an interesting term. Could you help listeners
understand like what is chaos engineering? Does it mean just breaking everything at your company?
What does that term mean?
Yeah, it's funny.
I don't directly work in chaos engineering anymore, but it's always like, I think chaos
engineering and incident management are all interrelated.
But the way I would define chaos engineering is deliberately creating turbulent conditions in production that could already occur in production. And doing this gives you the ability to experiment on it, control it a little bit, and see how the system reacts. So something Charity Majors always says is, you know, we're always testing in production, so we might as well embrace it a little bit. And so I really think chaos engineering is about embracing the turbulent conditions that can only occur in production.
And the way I see that relate to the whole incident life cycle is that it relates to learning from incidents and a blame-aware culture. Allowing folks to do stuff like that, allowing the safety and the experimental culture to do stuff like that, will actually benefit your business in the long run. So it's tooling,
it's philosophies, it's like a way of approaching how you build software. It's a lot of different
things. And one of the more famous tools that I remember, it was like Chaos Monkey. From what I
remember, there was this whole group of tools where like they take down instances,
they take down clusters.
Were you part of that work in any way?
I'm curious to know.
Yeah, so that set of tooling was created in like 2011,
I believe.
I joined Netflix in 2017.
So it was several years after they had created that tooling,
but the creation of that tooling
had spawned a chaos engineering team.
It was so valuable to think that, oh, this infrastructure we rely on could fail.
And that was actually pretty novel thinking back in 2011.
It was like, oh, if we're using other people's software, we expect it to be resilient.
But we all work at companies and we all manage
software at these companies. We know it's not totally resilient. And so it was kind of waking
up to that fact a little bit more. I worked on Chaos Monkey a little bit at Netflix, but I
primarily worked on a tool called ChAP, the Chaos Automation Platform, which kind of
basically set up like an A-B test. But as part of the experimental portion of the A-B test,
we would fail a particular part of Netflix intentionally
and make sure that the reactions were the same as the control portion
or that it didn't deviate from the customer experience too much.
So it actually allowed software engineers, I think, a way to understand
their end users a little bit more, which as you can imagine, at a company like Netflix,
it's really hard for software engineers to gain access and do user research, like formal user
research with end users. And so it was a nice way of understanding how different failure modes of
the service actually impacted the bottom line and the customer experience and driving software engineers closer to that.
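For readers who want a concrete picture of the control-versus-experiment comparison described above, here is a minimal Python sketch. The helper names (run_fault_injection_experiment, inject_fault, kpi) are hypothetical stand-ins, not Netflix's actual ChAP API.

```python
import random
import statistics

def run_fault_injection_experiment(requests, inject_fault, kpi, max_kpi_drop=0.05):
    """Split traffic into a control and an experiment group, inject a fault
    only into the experiment group, and compare a customer-facing KPI.
    All names here are illustrative, not the real ChAP interface."""
    control, experiment = [], []
    for request in requests:
        if random.random() < 0.5:
            control.append(kpi(request))                    # baseline behavior
        else:
            experiment.append(kpi(inject_fault(request)))   # same request, with the dependency failed

    control_kpi = statistics.mean(control)
    experiment_kpi = statistics.mean(experiment)
    degradation = (control_kpi - experiment_kpi) / control_kpi

    # Guardrail: flag (or abort) the experiment if the customer experience in
    # the experiment group deviates too far from the control group.
    return {
        "control_kpi": control_kpi,
        "experiment_kpi": experiment_kpi,
        "degradation": degradation,
        "within_tolerance": degradation <= max_kpi_drop,
    }
```

The guardrail mirrors the idea above: the injected failure is only considered safe to keep running if the experiment group's experience stays close to the control group's.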
And it sounds like a really novel tool, because not that many companies are investing so much in understanding how their own software breaks.
It almost feels like software is so buggy these days,
like especially when you go to book a flight
or something, things are always down
or there's some system that's off.
What got you interested in software reliability?
I'm wondering if there's like an origin story
of some sorts, like why focus on software reliability?
It's a great question.
I think, you know, my whole life,
I've actually always inherently been interested in risk and how it drives human behavior. And I think it's really fascinating; I'm drawn to sports that involve risk as well, because learning how other people make mistakes actually helps me be better at those sports.
And so I think I kind of naturally gravitated towards it in the software world. I started my
career in hardware. I have a degree in computer engineering, and I did some work with the Navy
for a little bit.
And then I went to a home automation and home security company.
And so as you can imagine, in those types of organizations, the stakes are really high if the software fails.
In a home security company, if the software fails, someone's house could get broken into,
right?
So the stakes are quite high.
And so I was hired to make sure that
those things didn't happen. But I was also, you know, things happen, like incidents still occur,
you can't be completely perfect there. And one of the most important things afterwards is to
ensure people feel comfortable sharing why what they did made sense to them at the time. When I was in that security job, I mean, I was pretty young.
I think I was like 23 years old.
And we didn't have a lot of folks that did what I did.
And I was in on a Saturday one day, testing a release to make sure it got out on time.
And I missed a test.
And it got released to customers and the test failed. And I remember
being interviewed by people in the C-suite at that company. And I was so young and I was so
nervous that I had really messed up, but they actually made me feel really comfortable about
the process. And the way they asked me questions was to just learn how they could improve as an
organization in the future. It's not some 23-year-old that was working on a Saturday to blame.
It's like, how did the organization enable that in ways that made sense?
And so I think it was just a culmination of things that led me to this way of thinking
and this way of doing.
And I've just over time seen that companies that embrace this sort of learning from incidents
and incident management aware culture actually have quite a big competitive advantage over
their competitors.
So we heard a little bit about the motivation behind starting Jeli. Tell us a little more about what Jeli does.
Sure.
Yeah, great question.
So Jeli is an end-to-end incident
management platform. We cover everything from when your incident starts to helping you get the right
folks in a room, to helping you keep the right folks updated, to integrating with all the tools
that you use and love, to helping you find patterns and understand from it afterwards. What really makes us stand out is that
we really focus on the human pieces behind this. So in a lot of incidents in the software industry,
if you look up any public postmortem or any public incident review, most of the time they don't mention what was hard for the people. They mention what failed about the technology. But the thing about our organizations is that our system does not just consist of technology. It consists of technology
and people. It consists of people that have to work alone, that have to work with each other,
that have to work with other pieces of technology. And none of that usually gets covered or even analyzed by a lot of companies.
And it's mainly because, you know, as software engineers, we're not really trained to do that.
We're trained to look at how technology failed. We're not really trained to look at how our
processes failed and how our documentation failed and how we can learn from it and why things made
sense to us at the time.
So with Jeli, while we do focus on the technology, we also focus on how the people intersect with the technology. A lot of the costs behind incidents are quite high because of
coordination and cognition. And we aim to help lower those costs to make it really easy for folks.
So we're not going to get rid of all your incidents for you, but we are going to help
you make them a normal part of your work.
Yeah, I think of a spectrum of companies on the software reliability maturity curve, almost, where maybe companies on one side of the spectrum don't track incidents or don't acknowledge incidents and have a fairly blameful culture, and the other side, which is slightly more utopian, where there are very few incidents, everyone learns from the incidents that do happen, and systems and processes improve. Where have you seen most companies lie? Like, of course, there's going to be like
a bell curve probably, but like, what do you think or what have you seen from customers or the market?
How are tech companies doing today? How have things changed from maybe five years ago?
In terms of what specifically, like how we look at incidents?
Or how companies think about reliability?
Yeah, I mean, how companies think about reliability? Wow, that's a really great question. I'm sure the company you're in now has had incidents, right? And all of us are customers of vendors too. I run a vendor, you work at a vendor, right? But I bet your expectations of the reliability of the vendors you use have probably gone
up in the last five years. And I really think that's how the software industry has changed: people aren't putting up with low-reliability vendors as much, especially for really important things. You know, I run an incident management company, and our reliability,
you know, even though we're helping other people with all their reliability,
our reliability becomes really important. And I know, you know, you talked about Datadog a little
bit last time, but like Datadog had an incident back in March and their reliability is of utmost
importance to folks. So I think a lot of our expectations in the software industry have
changed because we've learned so much about what it means to have good reliability that there's
just really no excuses anymore if you're not doing some of the bare minimum. Because frankly,
you're just not going to keep your customers. You're going to lose to a vendor that is a little bit more reliable than you.
And this can be maybe very similar to security and other aspects.
You need to be secure.
You also need to be reliable.
I guess your platform needs to work.
And yeah, I think with security or compliance, that's what we've seen, right? Like, how many companies even knew what something like a SOC 2 was a few years ago?
You know, I'll say, I think that's actually a really good comparison. When I was looking at getting our SOC 2 for Jeli a couple of years ago, someone who had started a company about 10 years earlier gave me the advice, oh, you don't really need to worry about that until you IPO. And it's amazing how much the software industry has changed because that's not true anymore.
And I think there's something similar happening with incidents, right?
It's just not acceptable anymore to not have that, you know. And also, as part of getting your SOC 2, you have to describe your incident management program. So it's really not acceptable anymore to not have some practices in place that demonstrate to your
customers and also to your employees that you're taking reliability seriously and not just throwing
features over the fence and hoping that your customers are finding the issues.
It's that you're being proactive about them.
So going back to what you mentioned about Jeli, what is a human-first incident management process, right? Like, how does that differ from a technology-first process? What are the small things that your product needs to do, or even the larger things in terms of your product philosophy, that need to be different for it to be a human-first product versus a technology-first product?
Yeah, and I want to clarify a little bit. I wouldn't say that the human is more
important than technology or technology is more important than human. It's like, we really focus on how the human works
with the technology, which is not what a lot of platforms do in the software industry in general.
They're not treating the human as a part of the system; they're assuming that the technology
can take care of everything for you. And what we really do with the design of our software,
like from the get-go, is make it easy for the human that's going to be interacting with it every day.
So our users are not necessarily having incidents like all the time, right?
They have other things that are going on.
And so part of what our software does is actually reminds them
how to be in an incident.
It makes it easy for them to kind of remember
the things that are important for them to be doing.
And so I really think it's just making sure that the human is not ignored from the equation. You
know, I think there's a lot of talk right now in the industry about AI naturally, which is really
cool. But I think the most exciting articles I'm seeing about this are how it helps elevate the human
and not how it replaces the human.
And so I feel like that's kind of the biggest thing is like we are on the team of the human
rather than replacing them.
Mm-hmm.
So what is like the bare minimum, right?
Like I think blameless culture is like often repeated as like a keyword.
Like what is the bare minimum your organization should be doing
in an effective incident management procedure?
Yeah.
So one of the ways our incident response tool really differs
is we really focus on helping everyone know what's going on.
So I think that's a hard part in incidents for a lot of people is connecting
the dots between your go-to-market team, between your engineering team, between your executives,
because all of them are impacted by the incident and they all care about different things in the
incident. And so keeping them all connected in a low coordination way and also in a non-stressful way is something our tool really
focuses on. And so I think that is something that a lot of companies in the software industry
could start doing is just really focusing on all those connective tissues rather than focusing on
individual departments. And then after the incident takes place, I think some basic things,
you mentioned the word blameless. I like to use the word blame aware.
So it's like you're making sure you're not finger pointing, but making it safe enough
to name names like, oh, Jose was the one that pushed this code.
Jose, could you talk about it?
And making sure it's okay for Jose to talk about it, and not actually skirting around the fact that Jose pushed this code. Because if you're in an organization where that is really uncomfortable to talk about,
I think that is a big signal that there are other deeper things going on in your organization.
So that's what I would say some of the bare minimum is, is making it okay to have those
conversations. Kind of like I mentioned earlier, like how I was interviewed
when I was earlier in my career,
making it okay for folks to share those things
so that the org can learn from them.
And changing the culture of the org like that, especially if it's currently something else, is not trivial, is my guess.
It's like, let's say I'm an engineer
or like I'm in engineering leadership in an organization
and I want to introduce or I want to change our incident management culture to be slightly
more blame aware or blameless.
What steps can I take?
Yeah, I think some of the ways that folks will run into trouble with this is that when they're doing
an incident review or when they're talking about an incident, instead of looking at just
the raw facts, they're pontificating about it and people are sharing their opinions on
the experience rather than just looking diplomatically at what happened.
And when I say diplomatically, like it is impossible to be
completely diplomatic, especially if you participated in it. But at the very least,
you can start by actually pulling up the conversation. So if the conversation happened
in Slack or it happened in a Zoom recording, just showing exactly how people interacted,
what they said, what they did, actually takes people off of the defensive a little bit and has them feel more comfortable sharing with each other how they participated.
Because it's no one calling them out on what they said.
It's just showing what they said. And one of the most important parts is, when you're asking questions or facilitating a conversation
around an incident, making sure, and this is after the incident has taken place, making sure you're
not sharing your opinions on what happened, but trying to take a very objective approach and
letting others share what was going on for them and why whatever they did made sense to them at the time.
So there really has to be just like no judgment from the facilitator.
Yeah. I'm curious if, feel free to not answer this because this might be a little specific,
but I'm curious if that's the kind of motivation you see prospective buyers of Jeli have when they want to use a tool like Jeli. It's like, I want to change the culture of how we manage incidents at my company, and I think changing the culture along with the tool is going to help with that. Or that's how maybe I would think about it, but I don't know how true that is generally.
So, I mean, I know we're not getting into the product specifically here, but we try to make it really easy for you: we meet you where you are today. So if you're using templates today, you can use templates in Jeli. If you're a little bit of a blameful company, that's okay. You can still use us; we meet people wherever they're at today. And we just help them get like a little
bit better. I think of it like a video game, you're just going like half a level up or like
one level up. We sometimes get folks that want to change their culture. But I would say that is
rarer than most of the folks that come through our door. They're just interested in all the stuff we
can do. And like, a lot of the stuff we bubble up for folks are patterns around like
who is responding to incidents, what time, what on-call schedules. And so a lot of like what our
platform can do can actually help you improve your on-call schedules, improve your road mapping,
improve your service ownership. And so I think people really get interested in using our platform
to help have conversations and make better decisions
after incidents. But I wouldn't say like a huge number are approaching it, like wanting to do a
giant cultural change. We have seen some of that with our customers and it is really cool to watch.
How do I know that I'm actually learning from my incidents? Like, is there like a metric I could track? I know that there's probably metrics for, you know,
is my on-call schedule healthy or not? But how do I track that my organization's actually improving?
Is that even possible? Oh, that's a really good question. I mean,
there's a lot of things you can look at after incidents. I think there's a lot of things that
are not always helpful after incidents. I think if you're looking at any metric after an incident,
making sure you're adding context to it and adding a story behind it, rather than just like
taking the metric at face value, but really digging into what it means if you're making
a decision off of it. But I would say like metrics to help you understand if you're learning,
I would say are people participating in the incident review. So, you know, it's something
interesting you're doing right now is you're interviewing me for a podcast and you're
interviewing me for something that I have, you know, expertise in. And that's exactly how we
should be treating people after the incident too, is like essentially interviewing
someone after the incident for something they have expertise in. Like, I think you mentioned
to me at the beginning that I should be talking a certain percentage of the time and you should
be talking a certain percentage of the time. That's actually what I advise people after incidents too.
And so there's little things you can look for like that, like how many people were talking? Was it the same person talking the whole time?
Did we get diversity of perspectives in the room? Did we have people from all over the organization?
Did people say they learned something? Like how many people viewed this document?
And then in terms of like, so that's in terms of learning, but in terms of actually improving your incidents, I would actually look at your incidents over time and just see, oh, how
are our incidents doing?
Like, if we need Ariel in every incident that involves Consul, even if she's not on call, do we stop needing Ariel as much over time? Because that means that more people are learning about Consul and we don't need to rely on Ariel as a single point of failure for that service.
And so those are things that show that just we're improving as a whole over time.
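To make the Ariel example concrete, here is a small Python sketch that tracks, per quarter, how often one named expert is pulled into incidents involving a given service. The record shape, service, and names are assumptions for illustration, not Jeli's data model.

```python
from collections import defaultdict

# Assumed incident records: quarter, service involved, and the responders pulled in.
incidents = [
    {"quarter": "2023Q1", "service": "consul", "responders": {"ariel", "sam"}},
    {"quarter": "2023Q1", "service": "consul", "responders": {"ariel"}},
    {"quarter": "2023Q2", "service": "consul", "responders": {"sam", "priya"}},
    {"quarter": "2023Q2", "service": "consul", "responders": {"ariel", "priya"}},
]

def expert_dependence(incidents, service, expert):
    """Fraction of incidents for `service` that needed `expert`, per quarter."""
    totals, with_expert = defaultdict(int), defaultdict(int)
    for incident in incidents:
        if incident["service"] != service:
            continue
        totals[incident["quarter"]] += 1
        if expert in incident["responders"]:
            with_expert[incident["quarter"]] += 1
    return {quarter: with_expert[quarter] / totals[quarter] for quarter in sorted(totals)}

# A falling trend suggests knowledge is spreading and the expert is becoming
# less of a single point of failure for that service.
print(expert_dependence(incidents, "consul", "ariel"))  # {'2023Q1': 1.0, '2023Q2': 0.5}
```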
Because I think a lot of people look at the cost of incidents in terms of how long it was and when the customer got to a good experience, which is so important.
Don't get me wrong. But there's so much more that goes into it, right? Like what if 300 people from your org
participated? What if it interrupted everyone's roadmap that day? What if we had all these people
that were pulled in when it was the middle of the night for them? Like those things are paper cuts
that just add up over time and end up leading to bloat in organizations where
you're over hiring and overspending in areas that you don't need to be. And so, yeah, it's like,
I mean, I say all that ultimately it's reflection. It's like looking at various things that you want
to improve and seeing how you're improving over time and really looking at the
coordinate of pieces of that. Yeah, I think that response makes a lot of sense to me, especially
again, if I contrast it with something that I'm a little more familiar with, like developer
experience. There's no one incredible metric that's going to tell you what the developer experience is at your organization. Exactly. It's paper cuts, seeing how things add up, seeing whether everyone is working 24-7 to make
sure they ship stuff on time, how blocked people are. It's actually really similar where you have
a bunch of signals that are not extremely exact metrics, but they're kind of the same across
different organizations. There will be specific
things for each organization based on its context. Then you have to add all of that up and make a
judgment call. So that makes a lot of sense to me. Let's say that I'm on this journey of improving
incident management and my company is slightly larger now. And I feel like I've done a reasonable job as a senior engineering leader in my group.
How do I make sure like the rest of my organization comes along, you know, especially if I cannot be
in every single meeting, and I cannot make sure that there's enough people speaking in every
meeting as an example? How do I codify some of the stuff? And is that somewhere where something
like Jeli helps, where it can help you codify a set of principles, or a set of: this is how I need to do X, Y, Z, this is how I need to manage incidents in my organization, as an example?
Yeah, I'm really glad you asked that question.
I also wanted to touch on
the developer experience thing you brought up
because developer experience
should be brought into incident metrics.
Like, the responder experience and how it's impacting people, those worlds should a hundred percent
overlap. And then the second question or the question you asked around, like, how do I do
this in my organization when I have a lot of other stuff to do and I can't be in every room, right?
Like that's almost a full-time job. And I've been that person in a couple of organizations and you're right,
it's tiresome and it's a lot of work. And what I ended up doing in those organizations was I ended
up building allies, people that were really interested in doing this on their team. So I
would have allies like on the front end team, I'd have allies like
on the search team. But again, like, you know, quality wanes if you don't have someone managing
that experience. And what I really wanted was a tool like Jeli. And that was part of the reason why I left Slack: I wanted a tool like that. It was a lot of work
to maintain the quality of that without something
that was helping folks be consistent. So that's a lot of what it's meant for is to make it easy
for folks to develop a learning culture without taking a lot of time.
And then going into the technical aspect of it, you had an example where, you know, there are very few engineers who might understand how a piece of technology works, and part of understanding whether your organization is learning is to see if it's the same set of engineers who get pulled into a certain kind of incident, and if that's improving over time. The flip side of this is making sure your organization learns efficiently, right? But how could, let's say I'm a reliability leader or an engineering leader in my organization,
and I want to make sure that this one engineer can impart their knowledge to the rest of the organization.
What is the most effective way in your mind to do something like this?
Is it something like tech talks?
Oh my gosh, yeah. This happens all the time. And the mistake I see a lot
of orgs make is that they put that pressure on that person. They're like, you know, in order to
be an even more senior engineer, you have to teach people. But like when you're telling that person
to teach people while they're also expected to manage the load of all the work that no one else understands, they will 100%
burn out. It won't be tomorrow. It won't be in a week. It will be in a few months and their quality
is going to wane and they're going to eventually leave the organization. But the ways that you can
get around this is actually learning like kind of what we were talking about when I was talking
about what you were doing with podcasting with me, right? You're interviewing me. We need to interview experts internally,
right? We need to ask them what they're looking at when they're solving something.
So if they come into an incident channel and they're like,
got it. And they send this graph in the channel and they're like, okay, I'm doing this with these hosts, and everyone's like, yay, and they're applauding them, right? But are we ever actually asking them
how they knew to do that and what led them there? And the answer is no. And they're usually really
bad at actually explaining that too. And so I think the work needs to go on to others in the
organization to learn cognitive style interviewing.
And we have a free resource on our website. It's called the Jeli Howie Guide. If you just go to jeli.io/howie, you can see how you can do some cognitive-style
interviewing. So it's asking someone things about what they have expertise in and like
how they knew to do it. Someone drops a link to a graph.
If I was interviewing them cognitively afterwards, I would say,
hey, you know, you dropped this link to a graph.
Before I get to that, like, could you tell me a little bit about how you found out about this incident?
And, you know, I'll throw them a softball question at first, and they'll tell me about how they found out about it. And part of my job during this is to almost play dumb a little bit and get them to share everything with me. And the more that we do that, and the more people we share it with in our organization, the more experts we build. And it actually helps people feel really valued too. Like, experts are normally really happy
to like be interviewed about their expertise and share it.
But asking them to come up with a whole tech talk
and a whole documentation series
is going to be really overwhelming.
But then who is the person
who's supposed to interview the expert?
I think that's where like the confusion is.
Is it engineers on the team?
Is it the incident manager?
Is it the on-call of the team?
Yeah.
The point of contact for the incident? How do you handle the diffusion of responsibility over there?
Yeah, no, it's a great question. And there's a few different ways you could approach it.
And I'd advise people differently depending on the size of their organization, how far along their business is.
But if your business is a few hundred folks, that means that not everyone is in every single
incident. And so if you can get someone that is... We have on-call rotations. Why not have
sort of incident review rotations where if you're up next and you didn't participate in the
incident, you do a 30 minute one-on-one with like the expert in it. And you write up about like what
you learned about their expertise and just spending a little bit of time doing something like that
will just pay off dividends in the future. It's just, you know, it's like it's time we're spending
anyway. It's just time with like a different focus. So that's kind of how I would recommend it with a larger
organization. But, you know, when I was in office with folks, I would sometimes just ask them to go
for a walk with me or ask them to have lunch with me. And I would do the interview, write it up and
share it with other folks internally. Yeah, which takes a little bit of time, but the value pays off quite a bit.
Yeah, it almost sounds like you could even have a volunteer rotation of sorts. So, yeah, they're incident reviewers; they could help share knowledge outside of a more formal venue like a tech talk, because it's so easy, for example, to have an incident post-mortem that doesn't have any information at all. It could be a centralized group of people who are thinking about incidents, who are reviewing incidents, who are writing stuff down and sharing knowledge with the organization, and it doesn't fall to the set of experts in each area to, exactly, also take on this extra responsibility.
Yeah, I like that a lot. And yeah, you mentioned it: it's organizations that can come up with sustainable, useful
processes, evolve them as the organization changes.
Those end up having the best competitive differentiators and end up growing the quickest.
I completely agree with your earlier point.
And one thing that I just want to
bring up as well is like, whoever's interviewing the expert is naturally going to get better at
that piece of technology. So if I'm interviewing the expert in Consul, and I'm also an engineer at that company, guess what, I am all of a sudden going to have a lot more expertise about Consul,
right? And like, you know, we think about when we're early in our career and we're trying to figure out the kind of engineer we want to be,
what do we do? We look at who's in our organization and we look at who we admire
and we watch them, right? And so it's about continuing to do that, right? Like watching
what these experts do and recognizing them in different areas will help us build more experts.
And yeah, we'll get folks a lot better. And that's why it's so important to have that be
on a rotation as well, so that you're getting that expertise spread out. And it's also really
important to have that rotation be people that are writing code and participating in incidents
so that they can put that new knowledge to work.
Yeah. Another aspect of incident management is certainly, you know, action items, right? Making
sure you're trying to prevent the incident. You can detect it quicker. You can remediate
problems better. You can mitigate it quicker. What is the most effective way to manage your
incident action items or your follow-ups, right? Like some organizations have processes around, like you need to complete all P0 action items in 30 days or fewer or some other number like that.
What have you seen work in practice or like what are people doing wrong or how should we be thinking about this?
I see action items taking place no matter what after incidents. I also see learning taking place no matter what after incidents.
It's just the level of quality in those two worlds that I see us mess up on.
So if we're fixing without actually taking the time to talk about the thing that happened
and learn from it, we're going to put ourselves in a bad position later. We're maybe not doing
the best fixes for our organization. We're also maybe not sure why we're doing those fixes.
And they're going to come back to bite us later on. It won't be right now. Now, if we're learning,
you know, and not fixing, like it's still beneficial, you know, if it will help later
on, it will help in the code people write. It will help in how they interact in incidents. And so I think people kind of have the wrong idea when they're like, oh, all P0s must be completed in X days; you end up with easier action items that maybe are not higher quality. And so I think if you actually just, you know, allow for a little bit more time to talk about it, you'll get higher-quality
action items. If you really want to spend the time doing this in a really great way, I would separate
when you're doing action items versus when you're talking about the incident. So not talking about
how you're going to fix it, but just talking about how it unfolded and how it came to be.
And then coming up with your action items, you'll have like much higher quality action items.
And no one's going to need to be the action item police, where they're like, you know,
Slacking you to make sure you did it. No one likes that. And if it's not getting done, you know, there's probably a good reason that it's not getting done.
Yeah.
Yeah.
It almost sounds like you're saying, you know,
separate the problem discovery from the solution.
Exactly.
Yeah.
Right.
Which like makes so much sense, but you know,
we kind of forget about it sometimes when we're in it,
but when we stop and think about it, it's like, oh yeah, that is what we should do. Yeah.
Yeah, but on the other side, you kind of want to make sure this incident doesn't happen again, and, like, should you schedule time to talk about the fixes? Yeah, it's a tricky one to think about.
Yeah, it's always tricky, and I don't think every incident deserves the
same level of treatment. Sometimes an incident might involve no incident review prep and it's
just me and my colleagues hopping on a call for 30 minutes and reviewing the Slack transcript,
right? So that's a really low level. And then we come up with action items at the end. So boom,
we just spent 30 minutes on it. Sometimes it's a higher level where I'm interviewing a couple
people that had expertise. And I'd maybe take that approach if it was an incident that like
almost entirely relied on one person, that's probably worth digging into and just spending
a little bit more time on. But I don't think it needs to be an all or nothing thing.
I don't think you need to spend like several weeks
on a thing to get value out of it.
There's little things that you can do
in order to set your organization up for success.
You could even send people a little survey
after the incident.
Like if you really wanna get low level about it
and just be like, how did you feel about this incident?
Do you think our coordination went well? Do you think our communication went well? And just have
people spend like five minutes on it. And so those are little things that you can do to still develop
a learning culture that matches the level of effort you might have time for.
And formally, you can break down incidents by severity level and then a different set of people or a different set of policies might apply to each incident level.
And that's how you can kind of operationalize this.
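As a rough illustration of that operationalization, here is a hypothetical Python sketch of a severity-to-review-policy table. The severity names and review steps are invented for illustration, not a prescription from this conversation.

```python
# Illustrative policy table mapping severity to post-incident treatment.
REVIEW_POLICY = {
    "sev1": {"review": "facilitated review with responder interviews", "prep_days": 5, "survey": True},
    "sev2": {"review": "30-minute transcript walkthrough", "prep_days": 2, "survey": True},
    "sev3": {"review": "async five-minute survey only", "prep_days": 0, "survey": True},
}

def review_plan(severity: str) -> dict:
    """Look up the post-incident treatment for a severity level, defaulting to the lightest one."""
    return REVIEW_POLICY.get(severity, REVIEW_POLICY["sev3"])

print(review_plan("sev2"))
```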
A slightly more controversial question.
In what you said, I heard, you know, review the Slack transcript.
Like, does every incident need a postmortem doc?
It sounds like your opinion is maybe not; maybe some of them you just learn from on your own with your colleagues, without spending the time to write everything down in a structured format. I just want to clarify, is that the case? How do you think about, you know, policies like every incident should have X, Y, Z, at least one action item, versus a policy that says we need to learn from every incident, whatever the incident was?
Yeah, I like that question. I do think there does need to be some guardrails, otherwise people will just not really know how to do it, right? And I think people need boundaries when they're learning things, right? And so at a bare minimum, I would recommend reviewing how the conversation unfolded
and not cherry picking things from the conversation, but actually sitting down and reviewing the
Slack conversation.
And if the Slack conversation was a really long Slack conversation, it's probably worth
just taking a little bit more time to do that incident review.
And you'll thank me later when your organization is taking the time to learn how they worked together and how
they evaluated things afterwards. And also like how much folks talk, you know, what time of day
it was for folks, where people were looking, what services were impacted, what technologies were
helpful for us. And so at a bare minimum,
I would recommend reviewing the Slack transcript or the Zoom transcript,
marking it up a little bit akin to taking a highlighter, taking some notes,
but just sharing what did we learn from this? And I think there's a couple different sections
you could highlight too. You could highlight, here's when we detected the incident.
Here's when we repaired it.
Here's when we were diagnosing things.
And those might not be clear linear paths, but you could just mark different areas where
that kind of stuff was happening.
And I guarantee you'll gain a little out of it.
I think making an explicit decision, are we going to have action items here and sharing
how folks came up with that decision is going to be really important.
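Here is a small Python sketch of the transcript markup idea described above: annotating an exported chat transcript with detection, diagnosis, and repair markers, and deriving simple timings from them. The message format, timestamps, and names are invented for illustration.

```python
from datetime import datetime

# Assumed shape of an exported Slack transcript: (timestamp, author, message).
transcript = [
    ("2023-06-01T14:02", "alerts", "pager fired for checkout latency"),
    ("2023-06-01T14:05", "ana", "looking at the latency dashboards"),
    ("2023-06-01T14:21", "jose", "rolling back the 13:50 deploy"),
    ("2023-06-01T14:30", "ana", "latency back to normal"),
]

# "Highlighter" markers: where detection, diagnosis, and repair happened.
phase_markers = {
    "detected": "2023-06-01T14:02",
    "diagnosing": "2023-06-01T14:05",
    "repaired": "2023-06-01T14:30",
}

def minutes_between(start: str, end: str) -> float:
    fmt = "%Y-%m-%dT%H:%M"
    return (datetime.strptime(end, fmt) - datetime.strptime(start, fmt)).total_seconds() / 60

# The derived numbers are anchors for the review conversation, not the goal in themselves.
print("minutes to start diagnosing:", minutes_between(phase_markers["detected"], phase_markers["diagnosing"]))
print("minutes to repair:", minutes_between(phase_markers["detected"], phase_markers["repaired"]))
```

The narrative around each marker, who noticed what and why their actions made sense at the time, matters more than the timings themselves.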
Yeah, I like that much more than the very monolithic, like ensure every incident has at least one or whatever.
I like the approach of, you want to learn from every incident, rather than, you need to have some deliverable, because then it just feels like incidents are even more work, especially when there's an incident for something that seems frivolous or whatever.
Yeah, John Allspaw always says, you know, are your postmortems meant to be filed or are they meant to be read? And I feel like a lot of organizations over time, because of the guidelines that are put in place, end up creating this write-only culture for their incident reviews, which is useful to no one. It's just meant to help defend the person that's writing it. And so organizations aren't setting their folks up for success if that's what they're requiring.
Yeah. What are your favorite resources on learning how to manage incidents? Where can someone read up more on processes, tooling, thought process?
Yeah.
Oh, yeah, I love that. So a few years ago, like back in late 2018, I started a community called Learning from Incidents in Software, and it gathered people from all over the software industry to learn and share about how they learn from incidents
and how they enact these practices in their companies.
So there's a website that exists with some blog posts around that.
But we also held our first conference within that community a couple months ago.
And so all those videos are up on YouTube.
If you go under the Jeli channel on YouTube, you'll see all the
learning from incidents videos, which were sponsored by Great Circle and Brent Chapman.
But yeah, they have a lot of... There's folks from Salesforce, from Indeed, from Quizlet,
from startups, from IBM, all sharing how they've developed this kind of culture internally. And
it's really helpful to see because it's not just like the theory behind it.
It's how they're actually doing it.
Yeah, I think videos are just one of the best ways to learn, like the highest-bandwidth way of learning.
Totally.
Yeah, so I will link that in the show notes.
And what gets you excited about, you know,
the future of the software industry and your role in it
over the next few
years, at least? Yeah, I mean, I think of this as also just making the software industry a better
place to work. There's a reason why tenure is so low in the software industry, right? We get burned
out, we move to another place two years later. But like, you know, if we're job hopping every two
years, we're not actually taking the time to develop expertise in the particular company system we're in, which will benefit us as engineers.
It will benefit our organizations.
And it's ultimately going to benefit end users in society if folks are actually taking some time to put care into their software, into their work. So I'm really excited about the level of attention
incidents are getting in the industry right now
because I just think it's going to be really important
to all of our end products.
So I'm excited as a consumer and a creator.
Yeah, I'm excited too.
I hope the airline apps become less buggy.
I'm guessing you were impacted by the Southwest outage.
I've been impacted enough. And it's just one of my pet peeves. When I read a tweet somewhere that
said, you know, don't use an airline's website, but use an app, less likely that the app will
be buggy because there's a slower release cadence. That made me really sad because,
you know, that's not why an app should be quicker or more reliable. So I like to pick on that industry in particular, but maybe I'm being too harsh.
No, I mean, I think it makes sense too. I also think it all comes down to salaries in that industry and how employees are being treated. And developing learning organizations also helps people feel, I mean, people stay at companies where they're feeling respected and valued and compensated. And so this kind of helps take care of the respected and valued piece
for sure. Well, Nora, we should stay in touch. And yeah, it was a really good conversation. I
feel like I learned a lot and became a little less procedural in my thinking about incidents.
So thank you so much for being a
guest and I hope to have you again at some point. Yeah, thank you too. It was really great to be on
this.