PurePerformance - Run Towards the Fire: Why we should love incidents with Lisa Karlin Curtis
Episode Date: April 28, 2025

Do you plan for incidents? Do you have a time / cost budget for them in your sprint or quarterly planning? Do you have engineers that are "interruptible"? We discussed those and more questions with Lisa Karlin Curtis, Founding Engineer at incident.io, who teaches us why we need to think differently about dealing with incidents!

In our discussion we learn why modern incident management embraces more incidents that are publicly shared within an organization to foster learning. We learn how to train more people to become incident responders, how to triage and categorize incidents, how to better plan for them, and how to best report on them. We also touch on AI, and how AI-generated code will eventually result in more incidents, which we should use as an opportunity to learn and improve our engineering process.

P.S.: This was our 10th-anniversary podcast episode!!

Here are the links we discussed in the podcast:
Lisa's LinkedIn: https://www.linkedin.com/in/lisa-karlin-curtis-a4563920/
Her talk at ELC Prague: https://docs.google.com/presentation/d/18536WBHBcPEppEeXXP7o5UQOX2XfWoGmfds2CHegHq4/edit?slide=id.g3434e0cba65_0_0#slide=id.g3434e0cba65_0_0
Incident Playbook: https://incident.io/guide
Transcript
It's time for Pure Performance!
Get your stopwatches ready, it's time for Pure Performance with Andy Grabner and Brian Wilson.
Hello everybody and welcome to another episode of Pure Performance. My name is Brian Wilson and with me I have a smiling and fire wielding Andy Grabner today.
How are you doing Andy?
I'm good.
I just don't want to play with the fire anymore because it looks funny, it makes funny noises,
it also makes fire, but it also smells really strange when I light it.
So I better put it away.
That's always a good rule: don't play with fire unless you want to get burned.
Well, in this case I want to run away from the fire and not towards the fire.
Sometimes you want to run towards the fire, right?
At least it seems there are people out there who do, and actually one of them happens
to be with us.
Oh yeah, there's s'mores there.
Oh yeah, that's a good point. Yeah. Not sure we'll talk about s'mores today, but we'll definitely talk about running towards the fire, embracing incidents because this was a talk I happened to see at a conference in Prague. I was just in Prague recently at the Engineering Leaders Conference.
And I was on stage and I think a little bit before me, Lisa, if I'm not mistaken, you were on stage
and talked about running towards the fire, embracing incidents. And now without further ado,
Lisa, thank you so much for being on the show. Can you please do me the favor and just introduce
yourself to our listeners? Like who you are, what's your background?
Yeah, of course. Thanks so much for having me.
So I'm Lisa Karlin Curtis.
I was one of the founding engineers at Incident.io.
So we build incident management software,
so that is software that helps you respond.
It also includes status pages and on-call,
so we'll have the joyful job of waking you up at 2 o'clock in the morning
when you don't want to be woken up,
which I'm sure will be delightful. And yeah, so I've
been here for about four years and, you know, thinking about
incidents for basically all of that time, talking to a lot of
customers about incidents and kind of, I don't know, I've
always been very interested in things like safety theory and
understanding how you build resilient teams, how do you get to a place where incident response is something that you feel comfortable with,
that you feel that you get value from as a company.
It's been super fun to be able to work on that problem for the last few years as we grow the team.
It was great to meet Andy in Prague and very happy to have a chance to chat about this stuff.
Yeah, big shout out to Marianne and his team for putting together this conference, the ELC, the Engineering Leadership Conference. For me it was the first time hearing about their community. It was fantastic having 300 engineering leaders in the room for one day. And I also really liked that they had a main track, right, where we both spoke, and I believe, did you also do a workshop?
Yeah, and I think you did as well, right? Where you get a chance to chat to like 25 people in a room.
It was just a really nice format for getting to have some more in-depth conversations and
do a Q&A that's a little bit less awkward than your sort of Q&A in front of 300 people where
nothing really comes up. Yeah, I also really liked their mentoring sessions. So because it was an
engineering leadership conference, you could sign up as a mentee and also as a mentor to have,
I'm not sure how long these sessions were,
maybe 15, 30 minutes, I don't know,
but I thought it was really, it was a great conference.
Sounds like a great conference, yeah.
Yeah.
I will actually be back next week in Prague as a follow-up on some of the conversations I had.
But now, back to your talk.
Lisa, what struck me in your talk, you were embracing incidents.
Kind of like, hey, you know, make them visible, make people
participate in incidents because we can learn a lot.
I don't want to tell it basically in your words,
but I would like you to give me or give us a little bit of an overview of what are some of the
maybe not so common approaches towards incident management that you were promoting.
Yeah, sure. So I think that for most people, they think of incidents as being like a cost or a tax.
So it's like, we have some code, it's in production, sometimes it goes wrong, that's sad.
So what we're going to try and do is we're going to try and reduce that tax as much as possible.
So that means we're going to have as few people as possible.
Probably we're going to have the same people because they know what they're doing and they can deal with it quickly.
And then everybody can kind of go back to doing what they're supposed to be doing.
And I think there are a few different problems with that approach.
So the first and obvious one is that you have a churn risk.
So if you're a small company, probably your first three engineers who look a bit like me, handle all the incidents.
Once you get to like, I've seen 100 person companies
where fundamentally, all the incidents are managed by maybe
three to five people. If one of those people is on holiday, or
God forbid, as we do, you know, those three people will go to
San Sebastian together on a holiday for a weekend, and
something goes wrong in that time, then you're in real
trouble. And so you also want to make sure that like you are
growing the number of people who are able to do that stuff
And the only way that's going to happen is if people practice. No one has ever got better at incident response by, like, sitting at their laptop and staring at it really, really intently and deciding they're going to get better. The way you get better is by doing it, and that's the same for everybody.
And so part of it is about this kind of organizational resilience, making sure that you can
handle like new things. The other problem is that as
your organization scales, you'll get more incidents,
and eventually those people who can handle incidents,
that becomes their entire job. And they stop being
able to add value to your business in other ways. And
because incident response is pretty highly correlated
with lots of other things, the likelihood is that
they're also very, very good elsewhere. So if those people are spending their entire lives in incidents, they're not adding that value anywhere else.
The other side of this is about the way that people write code and build software. So when you're building something and designing it, you make a load of choices
kind of explicitly and implicitly about what's your data model going to be?
How are you going to implement it? What's your architecture, if you like?
How are you going to add observability to it? What can you see? What can't you see?
What are your expected failure modes? And all of that, if you do that well,
you then reduce the burden of incidents.
If you do that badly, you create many more incidents.
And for me, the best way of getting really good at that
is also by being in incidents,
because that's what gives you the scars and the burned fingers,
so that you can sort of look at a design or look at an idea and say,
oh, I actually think that's going to go wrong in this way, because I saw that happen that one time, and I think we can de-risk it by doing x, y, z. And actually, I think it's one of the key
differentiators between more junior and more senior engineers
is like, is that first design kind of about right, it will
never be perfect, there will always be things that go wrong.
But there's a big difference between the sort of you know, v
zero that somebody without any experience would come up with
versus somebody who's kind of been there and done it and knows
the common ways in which these kind of systems fail.
And I think that second piece, I guess, is what maybe we talked about a little bit more in the talk: you can learn all of those skills by being in an incident.
But in order to do that, you probably have to bring people into incidents
who on day one are not going to be adding that much value to the response. So you'll bring people along for the ride. You're showing your working, you're doing that stuff in public, and by doing so, all the people around you are going to be leveling up. And all of a sudden, rather than these five people being sat there, you know, being the martyrs and desperately trying to keep everything running, instead they start to be able to lean on people, because they've seen it two or three times and they can actually say, oh, actually, could you roll out this fix? I'm comfortable. I can
now kind of get back to my dinner or get back to my planning meeting. And I think it makes a really
big mindset shift if you start thinking about something that you want to bring the team into
rather than something where you're constantly trying to protect the team.
So many questions.
I keep notes.
Even before questions, I would just say I would love for you to give that speech in
front of a bunch of people and see if there's any one person who would say, no, I disagree.
Because I think everything you said there was so extremely sensible.
It's like you can't even argue with what you said. But I'm sure
Andy will-
I think that was spicy.
Oh, no. I mean, it wasn't even spicy. It's just like it's so common sense and you see
it in so many other industries that it's like, of course, we probably don't think about it.
But when you lay it out like that, it would be amazing, amazing in a depressing way to find somebody
who would resist what you're saying there. I'm sure Andy will come up with some questions
that will. No, I know Andy.
My first question is what happened in San Sebastian?
We just ate loads of food. It was great. The food there is so good for anyone, particularly
if you're European based, I would so recommend spending a weekend there, particularly if you eat meat. The meat is just unbelievable.
And luckily we have great incident management software and practices at incident.io. That meant that we could all go to San Sebastian and it wasn't a terrifying idea that we were all on a plane at the same time.
So I got a couple of, I wrote a couple of notes.
You said it's a big problem if you end up having people whose sole role and purpose in the company becomes incident management or fighting incidents. I remember you said something like staring at incidents. It reminded me of the movie, guys, what was it called? The Men Who Stare at Goats. It's kind of like people that stare at incidents. And obviously they will just not go away if you stare at them. But for
me, the much bigger thing is if there's only a small number of people that are dealing with incidents and that's everything that they do, there's two things, right? A, they're not adding value to the organization, and B, they become your biggest risk in your organization, because if they are gone, then nobody can deal with these incidents anymore.
you would be my role as a host here on the podcast. I also shared the deck with him.
Lisa, if you're good with it, can we share the deck with the people here?
Folks, if you're interested in the talk, the slides are in the podcast notes.
But as you talk about embracing incidents: who are the right people to then really jump in? How do you prioritize and focus on the ones that are really important? Do
you have any recommendations on how to evaluate that?
Yeah, I think I'm going to steal an analogy that Brian was chatting about just before we
went on air, which was talking about like a medical context where you have like surgeries. And so obviously,
like if you have a hospital, and you have, you know, three
surgeons that can do a particular surgery, and then you
have 40 other surgeons, but they can't do the really hard stuff,
then you've obviously got a problem because at some point,
those three surgeons are probably going to retire or you
know, one of them might be on holiday. And so what you do is
you plan in making sure that the people that you've got
that are kind of next on the roster
to learn that particular specialized surgery,
you would say, right,
well, we want to make sure that you do like six watching,
then maybe you do six like participating with supervision,
and then you're sort of accredited
to go and do that surgery in future.
So you lay out like a plan.
The problem with incidents is that unlike surgeries
they are like unplanned and as you say you have these kind of big moments where
you have lots going on and then you might have kind of periods of time where
you have very little. But that is also true of surgery, right? Like in surgery you also have a roster of people and some surgeries are planned and some are
not. And so I think it's kind of the same approach where, first of all, you have to look at your incidents and go, you
know, what is the most important thing and what can wait and
that is like a triage process. And ideally, that probably
means you need to like categorize your incidents, you
know, if you've got 100, one person cannot load up into their
kind of RAM 100 incidents. So you have to come up with a way
of like segmenting them by product, category, severity, whatever it is. Once you've got those
segments, then you want you know, people to take each of the
segments, start to prioritize them by what's the impact,
whether that's impact on your customers impact internally. And
then you can start to look at your roster of like, who are we
onboarding? Like, who are the people that we're currently
investing in to try and teach them these skills? And maybe
that's a buddy system.
So maybe you just have like, cool, you know, Brian, you're on incidents.
So every time Lisa is in an incident, you jump in as well.
And you sidecar Lisa for a while.
Probably wouldn't recommend that long term because you'll probably learn more from working
with a few different people than from just one.
So then you start to say, right, well maybe I've got like a pool of people
and I pull someone from that pool
or I offer it into Slack and I'm like,
hey, I could do with an extra pair of hands here
as anybody who's onboarding coming in.
And I think you can treat it a bit similarly to that
or maybe the kind of interview training
or interview onboarding,
which is something that kind of everybody has a process for.
You have a pool of people that are kind of like ramping
and you try and actively pull them in.
There are going to be moments where the world is on fire
and training is just not your priority.
And that's fine.
As long as you're not in that mode
for months and months on end.
Like sometimes there will be a really bad day
where four or five different things have gone wrong
and you're just like, no, no,
this is not the day that I'm going to take any risks.
This is not the day that I'm going to be thinking
about someone else's growth. But in general, you can imagine that you have a checklist, in the same way, of what are the things that we want you to have experienced before we'd be comfortable with you getting onto that on-call rota.
Now, that might also include things like drills and game days.
So, there will be some scenarios that you can't just rely on happening organically
because hopefully they don't happen very often.
So, we don't have very many data leaks, thank goodness, touch wood, right?
But I would want every engineer who goes on call to know how to handle a data leak.
And that's maybe something that we would do a drill or a tabletop exercise to give them
that experience because they're not going to get it in real life.
The problem with a drill and a tabletop exercise is that you've got no adrenaline because there's
no risk.
And so you also need people to learn how do you deal with adrenaline?
How do you deal with being paged at two o'clock in the morning?
That's a very unnatural thing to ask a human to do, to be able to switch on that quickly
from deep sleep. And so those are all the kinds of things that you want to be practicing with people.
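To make the triage step Lisa describes concrete (segment incidents by category and severity, then prioritize each segment by impact before deciding who to pull in), here is a minimal sketch in Python. The category names, the severity scale, and the scoring weights are illustrative assumptions, not anything prescribed by incident.io or by Lisa.

```python
# Minimal sketch: segment a backlog of incidents by category, then rank each
# segment by impact, most urgent first. Categories, severities, and the impact
# scoring below are illustrative assumptions.
from collections import defaultdict
from dataclasses import dataclass

@dataclass
class Incident:
    id: str
    category: str           # e.g. "payments", "ci", "support-queue"
    severity: int            # 1 = critical ... 4 = minor (assumed scale)
    customers_affected: int

def impact_score(incident: Incident) -> float:
    """Higher score = deal with it sooner. Weights are arbitrary for the sketch."""
    severity_weight = {1: 100, 2: 30, 3: 10, 4: 1}[incident.severity]
    return severity_weight + incident.customers_affected

def triage(incidents: list[Incident]) -> dict[str, list[Incident]]:
    """Segment by category and sort each segment by impact."""
    segments: dict[str, list[Incident]] = defaultdict(list)
    for incident in incidents:
        segments[incident.category].append(incident)
    for category in segments:
        segments[category].sort(key=impact_score, reverse=True)
    return dict(segments)

if __name__ == "__main__":
    backlog = [
        Incident("INC-101", "payments", severity=1, customers_affected=40),
        Incident("INC-102", "ci", severity=3, customers_affected=0),
        Incident("INC-103", "payments", severity=2, customers_affected=5),
    ]
    for category, items in triage(backlog).items():
        print(category, [i.id for i in items])
```

The point of the sketch is only the shape of the process: once incidents are segmented and ranked, the per-segment lists are what you hand to the roster of responders and trainees Lisa talks about.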
Not to go back to promoting a TV show, but as you explain more of the analogy to the medical situation: in the show The Pit, which is on Max, HBO, whatever, the big climax is there's a mass casualty event and everyone's
going to the emergency room. So it's all people on staff and you see them doing exactly what you're
saying where the top ER docs are pairing with some of the lower people who would need to see some stuff
and explaining what they're doing.
So there's all different types of injuries coming in, like critical to medium to very
light.
And like, let's say the medium ones, which would be maybe the medium incidents, you have
the people who have experience already, but aren't top-notch, but like, I can take you
because you have enough experience and you can independently handle these middle-level
ones because I've trained you enough, I have faith in
you, and I'm going to take the people who have very little experience and let them watch me
do this crazy stuff because they haven't seen anything yet. I need to get them to the point
where I can let them go on the lower things. And it's that same cycle you're talking about
where, yeah, you're always wanting to get people involved because, you know, not just the fact of burnout, but you
always have to have that bench, right?
You know, and like if you've been working on an incident for eight hours, right, or
overnight, in the morning, you better have someone who can take over for you.
You know, I remember we used to have, you know, back when I worked on the other side,
you know, you'd have releases and you know, everyone's familiar with the release from hell that goes wrong and everyone's working all night and it's now 11 o'clock in the
morning the next day and you're still working and it's like who's going to come in and relieve you?
Right? And if you don't have, if you're not doing exactly what you're talking about,
you're going to have no one. So I don't know. Yeah. I wholeheartedly, all I can do is say I'm
wholeheartedly agreeing so hardcore with
everything you're saying. It's just... So I think what, when it gets kind of interesting,
and also to be blunt, like the place where the surgery analogy falls over is that in a surgery
context, your job is surgery. So there's not another thing that you're supposed to be doing.
Whereas that's not the case with incidents.
Incidents are primarily interrupts, they're not planned.
They're often not budgeted for.
Which is kind of fascinating to me because no one has ever had a quarter where nothing went wrong unless they didn't have a product that was out in the wild.
I've never seen a roadmap that had 10% incidents,
but it's probably about that, whatever your exact number is.
But usually you can probably extrapolate out
from past data if you collect that data
what it's gonna look like for you.
And the problem here is that these junior or mid-level,
maybe even senior engineers
who just haven't been at your company that long, those are also people who are shipping products, and they're probably shipping to deadlines.
And often the way in which that work is prioritized, which is through
like the product side, if you like, rather than the sort of resilience side,
that becomes a bit of a tussle of like, oh, can I pull this person off what they're supposed to be doing in order to help
this? And it's like, oh, well, it's not really a good time, because we've got a deadline on
Friday, so maybe next week. And so all of this stuff that we've
been talking about that sounds so simple and straightforward,
and like, it's really easy to buy into sort of when you hit
reality with it, it can become really, really challenging,
because you're suddenly saying, I want this group of people to be
interruptible. And the reason I'm interrupting them is to
teach them. And so it I'm interrupting them is to teach them.
And so it's not going to deliver direct business value today.
It's going to deliver us business value in a month, in a year.
And it's the classic kind of problem of like, you know, short term versus long term.
And that training can often, it can always wait because it's never going to be super urgent
that Andy gets one of his major incidents in today.
It just is not. Whereas if you are, particularly if you're working with customers,
if you've got frequent deadlines, or if you're a company where you care a lot about momentum,
it's really difficult to pay that tax.
The way that we do it at Incident is that we ring-fence people.
So we have a thing called Product Responder, where we basically pull people off projects proactively.
And so they're on smaller tickets that we know don't have deadlines,
that we know can be flexible.
And that means they can be interruptible
in a way that still feels safe
and doesn't derail projects.
But if you don't have that in place,
because it doesn't make sense for your team,
because you don't have enough of that kind of work,
I think it's really difficult for this
not to become a repeated pull and push
between probably realistically your tech lead
or your senior engineers and your product managers. And that's what I've seen go wrong in a lot of
places. Taking so many notes. Well, thank you, Lisa. I got another question in terms of,
there is the term, I think it's called a near miss.
It's basically a problem that in the end didn't have any negative impact.
And kind of this is the question that I have to you, when does an incident become an incident?
And when does an incident become something that you also need to report all the way up?
Because a lot of organizations are driven by also reporting, right?
How many critical incidents did you have in a day, in a week, in a month?
Do you have any thoughts on that topic?
So first of all, what defines a real incident?
When does an incident become an incident?
And also, are there categories of incidents where you say, these are the ones that you want or should report, and these are the ones where there's no need to involve your board, your C-level?
Yeah, I mean, so you kind of mentioned this at the beginning, but,
you know, my philosophy is that basically most people don't declare nearly enough incidents.
And what I mean by that is there are lots of things that happen in organizations
that require an urgent response, maybe even also a coordinated response
between multiple people, but they just happen in a slack thread.
And that kind of works, but the problem with that is that it means that
that is not visible to anybody. It's not visible in terms of reporting.
It's not visible in terms of helping other people learn from it.
And there is no opportunity for anybody to, like, share what they learned.
Whereas we would just declare an incident and treat it as one, even if it's one customer hitting one edge case.
And maybe we end up declining that incident
because it turned out it was user error.
But more often than not, our users are pretty smart
and it's probably something we've done,
so we need to deploy a fix
and we'd like to kind of continue that incident process.
But equally, we might create an incident for something
that hasn't got something to do directly with a customer.
So maybe our CI pipeline is down.
Whenever our CI pipeline goes red, we declare an incident.
And that means that the whole of engineering
can easily see what's going on.
They know whether they should merge or not merge.
Maybe they can like jump in to help out
because it's visible and it's announced to the team.
And then you get the same thing in other areas.
So when our support queue gets too long,
we declare an incident.
And that means that people can kind of mob
on support tickets.
Or if we have a really bad interaction with a customer
and we're kind of worried about that relationship,
we would declare like a customer success incident.
And the thing that all of these things have in common
is that they require an urgent and coordinated response.
And if you're going to put it down for a day or two days and it's just going to go on to
like your to-do list in your, you know, Slack DM or whatever, it's probably not an incident.
But if it's going to be your priority and you would potentially like consider moving
a meeting or skipping a meeting for it, probably it's an incident.
And if you've got incident tooling and you know, I don't want to like shill the software
too much, but like there are lots of, you know, other providers exist.
There's lots of people making really great incident management software.
If you have tooling that helps you do this, that will spin you up a channel,
which is your dedicated space, that will announce it to the people who said
they're interested in incidents of that category.
You then start to be able to just build a repeatable machine where you can deal
with these low level issues in public in a way that people can learn from.
And it also means that everyone's just super familiar with your processes. So when we're talking about like an incident process, maybe you have a list of severities,
but like what does P1 mean? What does critical mean? Like nobody knows.
But if you're using those definitions a lot, if everybody knows that you have an incident lead,
and this is what an incident lead does, and maybe you have like a particular way in which you bring
other people in. So you have escalation paths where you can
escalate to like legal if you have a compliance issue. If
you're practicing all of those all the time for these smaller
issues, when the really big ones come along, you're not trying to
learn the tooling and find the piece of paper in a drawer that
tells you what you're supposed to do in incidents, you're just
like, cool, this is second nature to me, this is just a bit more
serious. And now I've got to deal with this very complicated problem. But at
least I'm not also dealing with like a whole set of toolings or processes that I've never
seen before. And so when you start doing that, you have more incidents. And then what happens
is that you get questions from above being like, wait, has everything got really bad?
Like, is our product suddenly getting worse? And the answer is that you have to separate out the costs. Some incidents impact customers, and those might cost us trust and potentially, like, churn if we're in kind of a B2B environment particularly, and they might cost us in dollars if we're B2C; if we're e-commerce it's literally going to be dollars on the table. But there are other costs as well, which is around interruptions and, like,
how much time did we spend on incidents that we could have spent building or supporting our
customers. And you need to think of those two sets of costs as very separate and report on them very separately.
But for that second group, it's really useful to track lots of things as incidents, because
it means you get an understanding of how often your team is getting interrupted and what's
the range of people who are being impacted by that.
But on the impact side, maybe lots of those incidents had no external impact at all.
You know, if our CI pipeline is down, our customers don't care.
But we care.
And so it's important that you then report them in a way that like it is clear what are the things
that are impacting customers and in what way are they impacting customers? And which are the things
that we're tracking internally, but from the point of view of our like customer churn risk, trust,
resilience, we probably don't need to be worrying about.
I like your examples that you brought about, when to declare an incident, and you brought customer issues,
CI pipeline is down, support queue is too long.
For me, this sounds like you could define even SLOs on it.
You can say, my SLO for my CI pipeline is that the pipelines have a 99% success rate
and they need to finish within five minutes because otherwise we cannot deploy fixes fast enough.
If that SLO is breached for whatever reason, the tooling is down,
somebody messed up the pipeline, then you're breaching your SLO and therefore it could trigger an incident.
Now if this then becomes really customer impacting, as you said, is a different deal.
This is something that you need to find out in the triaging process.
Or also like the support queue is too long. That's a great SLO, right?
Maybe not the support queue, but how long does a customer need to wait until they get a first response.
And if that keeps growing, obviously then the queue also goes up and things like that.
Then you could declare an incident.
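As a rough illustration of Andy's SLO example, here is a minimal sketch that checks recent CI pipeline runs against the two targets he mentions (99% success rate, runs finishing within five minutes) and declares a low-severity incident when either is breached. The webhook endpoint, the payload shape, and the function names are hypothetical assumptions for the sketch, not a real incident.io API.

```python
# Minimal sketch: turn a CI-pipeline SLO breach into an incident declaration.
# SLO targets (99% success, five-minute runs) follow the discussion above;
# the endpoint and payload below are hypothetical, not a real incident API.
from dataclasses import dataclass
from statistics import quantiles
import requests

@dataclass
class PipelineRun:
    succeeded: bool
    duration_seconds: float

SUCCESS_RATE_TARGET = 0.99       # 99% of recent pipeline runs should pass
DURATION_TARGET_SECONDS = 300    # runs should finish within five minutes

def slo_breaches(runs: list[PipelineRun]) -> list[str]:
    """Return human-readable descriptions of any SLO breaches in the recent runs."""
    breaches = []
    success_rate = sum(r.succeeded for r in runs) / len(runs)
    if success_rate < SUCCESS_RATE_TARGET:
        breaches.append(f"success rate {success_rate:.1%} below {SUCCESS_RATE_TARGET:.0%}")
    p95 = quantiles([r.duration_seconds for r in runs], n=20)[18]  # ~95th percentile
    if p95 > DURATION_TARGET_SECONDS:
        breaches.append(f"p95 duration {p95:.0f}s above {DURATION_TARGET_SECONDS}s")
    return breaches

def declare_incident(breaches: list[str]) -> None:
    """Post a low-severity incident so the whole team can see the pipeline is unhealthy."""
    requests.post(
        "https://example.internal/api/incidents",   # hypothetical internal endpoint
        json={
            "name": "CI pipeline SLO breach",
            "severity": "minor",
            "summary": "; ".join(breaches),
        },
        timeout=10,
    )

if __name__ == "__main__":
    recent_runs = [PipelineRun(True, 240.0)] * 95 + [PipelineRun(False, 900.0)] * 5
    if problems := slo_breaches(recent_runs):
        declare_incident(problems)
```

Whether that declaration goes straight to a full incident or lands in a triage state is a policy choice; the sketch only shows the breach check feeding the declaration.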
I got another question. What I see, at least we are more and more promoting, it seems.
And I don't want to take this literally, but testing production.
Meaning, we push out into production, we hide things behind a feature flag, which is great, because feature flags are perfect for experimentation.
So I believe if we do that, and more and more people do this, we will eventually have more incidents by default, because we can assume that these things are
not maybe as well tested because it's still an experimentation phase.
So do you see this already also as a trend that we get more incidents because of kind
of the new modern way of releasing, not only deploying, but releasing software?
And if you see this, is there kind of like also a quote unquote new normal? What is the new normal versus the old normal? Were we supposed to see maybe one incident per quarter, and now in the cloud native companies, they see 10 incidents per week?
So I think that I don't think that's a shift that we have seen mainly because I think that we've turned up kind of once that shift has mostly happened. So most of the companies that we are working with are companies that are releasing into production multiple times a day.
And so that means they're already in that kind of experiment test in production type world where they tend to have quite a lot of incidents.
That's unsurprising. You're not going to buy incident management software if you do one release a quarter and have one incident a quarter.
What I do think is that AI-generated code can be much harder to debug, because it's often not as structured as a human would write it.
And I think that AI allows you to build a lot more stuff, and that's very exciting. I use AI in my day-to-day when I'm coding.
However, it also allows you to build a lot more stuff.
And if you build a lot more stuff, there will be more bugs and more things will go wrong. If you think of bugs as being really a factor of the amount of stuff that you ship,
which is not quite fair,
but is a useful enough analogy,
then what you end up with is start to say,
well, okay, we're going to take a risk here
where we're comfortable as a business to ship a lot more
and accept that we're going to have to then
do a little bit more cleanup on the other side.
So in a traditional environment, you'd have maybe a human write some code
and then a second human reviews that code. If we're moving to a world in which AI is writing code
and a first human reviews that code, all of a sudden we've gone from four eyes to two.
And that is going to be, I think, probably, and we're seeing it internally,
that there is a bit of a reduction in quality that's going to happen there. I think that is largely the right call for lots of people, but it means that we'll continue to see a trend where there are more of these smaller bugs that maybe only impact, as you say,
a handful of customers that have this thing feature flagged on, but that handful of customers are still going to be upset and you still care about it, you still want to fix it.
So yeah, I think that we will see that trend continue.
But I also think that even in that world where someone goes, oh, I only have one incident a month or one incident a quarter, if I'm being honest, my response is, I don't think that's true.
I think there are lots of things that interrupt your team
that you are not treating as incidents right now.
That probably you would benefit from treating as incidents.
Which basically comes back to what you said earlier.
You need to make things visible.
Maybe currently these things are hidden,
whether consciously or unconsciously,
that certain problems are just, I guess, pushed under the rug,
and were for whatever reason, right, not tracked.
Well, I mean, 100%. And obviously, you know, we're humans, we feel shame very
acutely, we want to be respected and liked by our peers. And what that means
is if I screw something up, and I think I can fix it in like 10 minutes, I'm not going to tell anybody. By default, my default is still going to be I'm just going to sort it and no one needs to know. And I
need to be sort of trained almost and like buy into the idea that I should put that in
public even though I'm going to look a bit silly. And if I as like a leader in my organization
do that, then that's what will empower other people to be able to do the same
But if you're not in a company which has that culture and you're the only person who's doing everything in public, then what it looks like is that you're really bad at your job and your peers are much better at their jobs. And it's very difficult to change the tide on a culture like that, where everything does happen in private and in DMs: oh, can you just help me? I've just hit the wrong button and I've made this configuration error.
Like that is happening all the time at most companies.
It's quite hard to push that out into the open.
It takes a really concerted effort from your management,
from your leadership to make that a thing
that is rewarded by your company
rather than just blaming people or mocking people
for the errors that they've made.
I had one additional thought on your comment earlier on the, with the advent,
with the rise of AI being used to generate code, we will need to expect more
problems in the end.
But I like the, like from four eyes to two eyes, it makes a lot of sense.
And we need to get better in debugging that type of code
that was not initially created by us.
And I know this is not a product pitch at all, but we have a way to debug in production where you can set non-breaking breakpoints, where you can basically say,
I want to debug like I would debug locally, but I can debug in a production environment and I get all of my variables, my stack traces and everything.
And now that she mentioned this,
I think this would be even more useful
because even more code gets generated
and committed to source code repositories
that have not been written by the person that debugs it.
And I think this is a pretty fascinating thought.
I mean, imagine a world where
whenever you're trying to debug an issue,
you can never tap someone on the shoulder
who wrote the code.
You go to the git blame,
and the git blame is an LLM that has no memory.
So yeah, you can ask the LLM for help,
but the LLM is not the same as a person who wrote the code.
Whereas often, that's the first thing you do, right?
You look at the git blame,
and you walk over to somebody's desk,
and you're like, can you just walk me through,
because I think this has gone wrong in this way,
but this kind of looks intentional in the code,
and all of that process is gone.
And so we need to replace it with something.
And I think that something probably is unfortunately
more AI, not unfortunately, but it's complicated.
But I think that we can also use AI to help us with that
because we can use AI to help us debug.
We can say, here's a load of details from my error,
here is all of what the variables were in production.
Please, can you help me aggregate across these seven different errors that I've got and tell me what is the pattern here?
And we're building products to really help you with that stuff, to help you find,
oh, well, in the last incident, Lisa ran this command, so maybe you want to do that.
Building those products is hard. We're hoping that we're making good progress. I think we've
got something that's very exciting. But I think that that does become the counter foil that you
need to deal with the fact that you can't just tap someone on the shoulder and be like, what were you thinking here? What about using one AI to write the code and
another AI to be the second pair of eyes? I mean, I think that's where we're going.
I don't think that's insane. No, it's not in a way because it's going to be a different model
looking for the same thing. But I did have a more serious question, though, going back to the idea of testing in production and
doing all this stuff in production. When you're talking about incident management tracking and
tools like yours and others, does that also track what the blast radius of incidents was?
Because I think just thinking about it on the spot, if you're going more and more to
releasing in production and having quicker releases in production, a sign of maturity
would be having a very narrow blast radius.
So is that part of the standard practices for incident management to be tracking that blast
radius or is that something that still has to get baked into it at some point?
Yeah, 100%.
When we're talking about the two different ways that incidents impact your business in
terms of impact on your users and your effort expended, if you like, I think that if you're
doing incident management well, you definitely want to know what the impact was: how many customers were affected, were they enterprise customers, how bad was it? And you capture that in the post-mortem, whatever it is, and roll that up into a "how much do I care", I guess, from a customer impact point of view.
So what you want, ideally, you want to be able to pull that report
that tells you the thing that your tech lead on the ground probably already knew,
but couldn't really prove that they really need to do some investment in workflows because it's burning a load of customer trust. But actually, Scribe maybe is fine.
There was one high profile incident,
but it only affected one customer.
And actually workflows is the thing
that is impacting your customers much more.
And having data to back that up and make those decisions
about where you invest is really powerful.
You can only get that data if you collect it,
and you can only collect it
if you track the work and the issues.
And so again, that's why I think having robust incident management,
having lots of things going through that robust process
is what allows you to kind of see that data
and then advocate for particular bits of investment.
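As a sketch of the roll-up Lisa describes, the snippet below aggregates closed incidents by product area, counting incidents, customers affected, and responder hours, so you can compare where customer trust and engineering time are actually going. The field names and the "workflows"/"scribe" area names echo her example, but the data shapes are assumptions for illustration.

```python
# Minimal sketch: roll closed incidents up by product area so you can see
# which areas are burning customer trust and responder time.
# Field names and sample data are illustrative assumptions.
from collections import defaultdict
from dataclasses import dataclass

@dataclass
class ClosedIncident:
    product_area: str
    customers_affected: int
    responder_hours: float

def impact_rollup(incidents: list[ClosedIncident]) -> dict[str, dict[str, float]]:
    rollup: dict[str, dict[str, float]] = defaultdict(
        lambda: {"incidents": 0, "customers_affected": 0, "responder_hours": 0.0}
    )
    for inc in incidents:
        area = rollup[inc.product_area]
        area["incidents"] += 1
        area["customers_affected"] += inc.customers_affected
        area["responder_hours"] += inc.responder_hours
    return dict(rollup)

if __name__ == "__main__":
    history = [
        ClosedIncident("workflows", customers_affected=12, responder_hours=6.5),
        ClosedIncident("workflows", customers_affected=30, responder_hours=3.0),
        ClosedIncident("scribe", customers_affected=1, responder_hours=9.0),
    ]
    for area, stats in sorted(
        impact_rollup(history).items(),
        key=lambda kv: kv[1]["customers_affected"],
        reverse=True,
    ):
        print(area, stats)
```

A report like this is what lets the tech lead's hunch ("Workflows is burning trust") be backed with numbers when arguing for investment.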
I think it's also a great indicator on how mature you are
at doing these production releases,
because again, if most of your, if almost all of your incidents
are only impacting a very few users,
that means however you're doing your feature flags and all,
you're catching them on the first round,
you're catching them as soon as they go in as opposed to, Oh,
we missed it until it went public. Right. So it can really give you,
that data can be used to help identify your maturity level in that,
which I think is another great thing to track overall. Like, yeah, we're doing great.
You know, of course we're going to have problems in production because of what we're doing
here.
But the way we're limiting them is fantastic, right?
Versus terrible.
All right.
Yeah, we have.
Sorry, go ahead.
No, I was just saying that that's where I was hoping it was going to go, because I think that's really important for an organization.
Yeah, 100%. One of the things that we also track internally, which I think is really interesting, is how many of your incidents are caused by things you just shipped? And
how many of your incidents are caused by latent bugs? So sometimes there'll be a bug, there's
been around for two years, but somebody finds the right incantation to hit some edge case,
and that's the thing that
causes an issue. And the way that you want to react to that
as an engineering organization is very different to if it's
something that you shipped two weeks ago. Because if it's if
most of your incidents are stuff that you're shipping very
recently, that probably means you need to change your bar
about when you ship, how much testing you do internally, how
you do your rollouts. If stuff is a latent bug, honestly,
if a bug wasn't discovered for two years,
probably it's not your fault for writing it.
And that's not something that I would be necessarily
wanting to dig into very hard,
unless there was a huge spike in one area of your code base.
And so the way that you want to respond to like,
oh, we've had quite a lot of bugs
around this area of the product,
changes a lot depending on when it was introduced. Cool. Hey, Lisa, I was also just looking at the slides
that you had, and at the very end you were highlighting a guide that you wrote. It's called Modern Incident Management, the Tactical Playbook, the incident.io guide. We will also add the
link to the podcast description. Anything else that is in that guide that will be interesting
to look into for our readers? Well, listeners actually, not readers, listeners, the listeners.
Yeah. They will read.
I think we both.
Yeah.
I'm trying to think.
I think a lot of what I've just said is very much in that guide,
particularly the response chunk.
I was very involved in us writing it back in the day.
I genuinely think it's a really good resource,
and it has most of the stuff that I would want to say in it.
I think that some of the other key points in there that we haven't discussed,
it talks a lot about on-call, how do you onboard people onto on-call, how do you set up your rotations,
how do you think about alerts and alert fatigue.
It has a really great section that my colleague Lawrence wrote on like practice and game days
and drills and how to run a good game day and what that looks like.
And then it has quite a lot about the post-incident flow, which we haven't really discussed here. So that is, how
do you learn from incidents? How do you share those learnings?
So if you imagine a V1 of an incident management world, where
basically every individual in your organization is going on
their own little feedback loop where they go have an incident,
they learn that something bad happened, and they learn not to
do it again, and then they know, but nobody else does. The point
of having like a post-incident process
and having an incident management tool
that helps you push that information out
is so that you can turn those many, many, many feedback loops
into a single, larger feedback loop
where one person gets really, really stuck
because they hit a strange behavior of Postgres locking
that they weren't expecting.
They can then share that knowledge
with all the other people who are working with Postgres, and all of a sudden,
your organization has all got better,
rather than just like each person discovering it
as they make that individual mistake.
And that sounds kind of silly,
but actually I think that's how most organizations
operate by default for most things.
So yeah, looking at that post-incident stuff,
thinking about how do you choose the right level of process?
Because again, if we're talking about trying to encourage people to declare more incidents,
a surefire way of making that not happen is to give them a 10 page form they have to fill
out at the end of every incident, because then they're just not going to do it. And
then like the final bit of the guide is all focusing on kind of insights and how do you
like understand your incidents at an aggregate level. So I guess that was some of the stuff
we were talking about.
I just particularly draw attention to workload.
So that is how much time are my teams spending in incidents?
And what features are those incidents on?
And particularly for us as a growth stage company,
that means that the opportunity cost of our engineering teams is really high.
I live in that dashboard with my team.
That's the thing that I look at every single week to make sure that we're in a good spot and that there isn't investment being left on the table. So yeah, those are some highlights, but also the guide will do a much better job than
my memory of it. Well, if you take the offer, I would like to invite you back for another episode.
I think the whole post-incident process and what an organization can learn from incidents,
I'm sure with a couple of good examples from your career, would be very much appreciated.
Yeah, I'd absolutely love to.
That stuff is super interesting and gets a very bad reputation.
Cool.
I think with this, Brian, time to close this episode.
We covered a lot of ground.
We have a lot of links for people in the description
of the podcast and we have the verbal agreement that we get Lisa back at a future episode.
Brian, anything else?
And we ended on the idea of looking at incidents, what you can learn from incidents as an organization, and analyzing
and looking for those anti-patterns, which is pretty much how we started this podcast
10 years ago now.
So I forgot to mention at the beginning, this episode is airing right around our 10 year
anniversary of when our first episode aired.
So Lisa, thank you for being part of history for us, at least. Andy and
I probably had no clue we'd be still doing this 10 years later, but it's been 10 years of amazing
guests like you, amazing information sharing. As we started, similar to the incident stuff,
we started looking at a lot of performance anti-patterns and how we can learn from them
and how to stop them propagating from one technology to the next, which is very similar and in the exact same realm as what
you want to do with the incidents when you have them, learn from them, share them so other people
can avoid them. So we're coming full circle, I think, on a topic here. But again, thank you
to all of the listeners who've, if anyone's been with us since the
beginning, thank you.
But especially Lisa, thank you.
It's an honor to have you on as our 10-year anniversary guest, especially for such a fun
podcast on such a...
It's hard not to be excited about this topic because it hits Andy and I exactly where we've
always had our cares and passions of analyzing
this stuff and sharing the data and making sure people are learning from it and growing.
And it's awesome to see the concept of incident management tools helping propagate the idea of sharing: people, share what went wrong, don't hide what went wrong, because it's a learning for everybody else. So that's my spiel. Thanks so much for having me. It's been really fun.
Thank you. Appreciate it. All right. Thank you everybody. Bye-bye.