Screaming in the Cloud - Replay - Finding a Common Language for Incidents with John Allspaw
Episode Date: November 26, 2024

On this Screaming in the Cloud Replay, Corey is joined by John Allspaw, Founder/Principal at Adaptive Capacity Labs. John was foundational in the DevOps movement, but he's continued to bring much more to the table. He's written multiple books and seems to always be at the forefront, which is why he is now at Adaptive Capacity Labs. John tells us what exactly Adaptive Capacity Labs does, how it works, and how he convinced some of his heroes to get behind it. John brings much-needed insight into how to get multiple people in an organization, engineers and non-engineers alike, on the same level when it comes to dealing with incidents. John points out the issues surrounding public vs. private write-ups and the roadblocks they may prop up. Adaptive Capacity Labs is working toward bringing those roadblocks down; tune in for how!

Show Highlights
(0:00) Introduction
(0:59) The Duckbill Group sponsor read
(1:33) What is Adaptive Capacity Labs and the work that they do?
(3:00) How to effectively learn from incidents
(7:33) What is the root of confusion in incident analysis
(13:20) Identifying if an organization has truly learned from their incidents
(18:23) Gitpod sponsor read
(19:35) Adaptive Capacity Labs' reputation for positively shifting company culture
(24:22) What the tech industry is missing when it comes to learning effectively from incidents
(28:44) Where you can find more from John and Adaptive Capacity Labs

About John Allspaw
John Allspaw has worked in software systems engineering and operations for over twenty years in many different environments. John's publications include the books The Art of Capacity Planning (2009) and Web Operations (2010), as well as the foreword to The DevOps Handbook. His 2009 Velocity talk with Paul Hammond, "10+ Deploys Per Day: Dev and Ops Cooperation," helped start the DevOps movement. John served as CTO at Etsy, and holds an MSc in Human Factors and Systems Safety from Lund University.

Links
The Art of Capacity Planning: https://www.amazon.com/Art-Capacity-Planning-Scaling-Resources/dp/1491939206/
Web Operations: https://www.amazon.com/Web-Operations-Keeping-Data-Time/dp/1449377440/
The DevOps Handbook: https://www.amazon.com/DevOps-Handbook-World-Class-Reliability-Organizations/dp/1942788002/
Adaptive Capacity Labs: https://www.adaptivecapacitylabs.com
John Allspaw Twitter: https://twitter.com/allspaw
Richard Cook Twitter: https://twitter.com/ri_cook
Dave Woods Twitter: https://twitter.com/ddwoods2

Original Episode
https://www.lastweekinaws.com/podcast/screaming-in-the-cloud/finding-a-common-language-for-incidents-with-john-allspaw/

Sponsors
The Duckbill Group: duckbillgroup.com
Gitpod: http://www.gitpod.io/
Transcript
But we're used to talking about networks and applications and code.
We're not used to talking about and even have vocabulary for what makes something confusing,
what makes something ambiguous. And that is what makes for effective incident analysis.
Welcome to Screaming in the Cloud. I'm Corey Quinn. I'm joined this week by John Allspaw,
who's, well, he's done a lot of things. He was one of the founders of the DevOps movement,
although I'm sure someone's going to argue with that. He's also written a couple of books,
The Art of Capacity Planning and Web Operations, and the foreword to The DevOps Handbook. But he's
also been the CTO at Etsy and has gotten his master's in human
factors and system safety from Lund University before it was the cool thing to do. And these
days, he is the founder and principal at Adaptive Capacity Labs. John, thanks for joining me.
Thanks for having me. I'm excited to talk with you, Corey.
This episode is sponsored in part by my day job, the Duckbill Group. Do you have a horrifying AWS bill?
That can mean a lot of things.
Predicting what it's going to be, determining what it should be, negotiating your next long-term contract with AWS,
or just figuring out why it increasingly resembles a phone number, but nobody seems to quite know why that is.
To learn more, visit duckbillgroup.com. Remember,
you can't duck the Duckbill bill. And my CEO informs me that is absolutely not our slogan.
So let's start at the beginning here. So what is Adaptive Capacity Labs? It sounds like an
experiment in autoscaling, as is every use of autoscaling, but that's neither here nor there,
and I'm guessing it goes deeper. Yeah, yeah. So I managed to trick or let's say convince some of my heroes, Dr. Richard Cook
and Dr. David Woods. These folks are what you would call heavies, right? In the human factors,
system safety and resilience engineering world, Dave Woods is credited with creating the field of resilience
engineering. So what we've been doing for the past, since I left Etsy, is bringing perspectives,
techniques, approaches to the software world that are, I guess, some of the most progressive practices that other safety-critical domains,
aviation and power plants and all of the stuff that makes news. And the way we've been doing that
is largely through the lens of incidents. And so we do a whole bunch of different things, but
that's the sort of the core of what we do is activities and projects for clients that
have a concern around incidents. Both, are we learning well? Can you tell us that? Or can you
tell us how do you understand incidents and analyze them in such a way that we can learn from them
effectively? Generally speaking, my naive guess, based upon the time I spent working in various operations roles, has been, great, so how do we learn from incidents?
Like, well, if you're like most of the industry, you really don't.
You wind up blaming someone in a meeting that's called blameless.
So instead of using the person's name, you use a team or a role name.
And then you wind up effectively doing a whole bunch of reactive process work that, over a long enough timeline and enough incidents, ossifies you into a pile of process and procedure that is just horrible. And then how do you learn from
this? Well, by the time it actually becomes a problem, you've rotated CIOs four times and
there's no real institutional memory here. Great. That's my cynical approach. And I suspect it's not
entirely yours because if it were, you wouldn't be doing a business in this because otherwise it
would be this wonderful choreographed song and dance number of, doesn't it suck to be you, dot, dot, dot,
and that's it. I suspect you do more as a consultant than that. So what is my lived
experience of terrible companies differing in what respects from the folks you talk to?
Oh, well, I mean, just to be blunt, you're absolutely spot on. The industry's terrible.
Well, crap. I mean, look, the good news
is there are inklings. There are signals for some organizations that have been doing the things that
they've been told to do by some book or website that they read. And they're doing all of the
things and they realize, all right, well, whatever we're doing doesn't seem to be working. We're doing all the things, checking the boxes,
but we're having incidents. And even more disturbing to them is we're having incidents
that seem as if it'd be one thing to have incidents that were really difficult, hairy,
complicated and complex. And certainly
those happen, but there is a view that they're just simply not getting as much out of these
sometimes pretty traumatic events as they could be. And that's all that's needed. Yeah.
In most companies, it seems like on some level, you're dealing with every incident that looks a
lot like that. It's, sure, it was a certificate
expired, but then you wind up tying it into all the relevant things that are touching that. It
seems like it's an easy, logical conclusion. Oh, wow, it turns out in big enterprises, nothing is
straightforward or simple. Everything becomes complicated. And issues like that happen
frequently enough that it seems like the entire career can be spent in pure firefighting reactive mode.
Yeah, absolutely.
And again, I would say that just like these other domains that I mentioned earlier,
there's a lot of sort of intuitive perspectives that are, let's just say, sort of unproductive. And so in software, we write software.
It makes sense if all of our discussions after an incident,
trying to make sense of it, is entirely focused on the software did this. And Postgres has this
weird thing. And Kafka has this tricky bit here. But the fact of the matter is people and engineers and non-engineers are struggling when an incident arises, both in terms of what the hell is happening and generating hypotheses and working through whether that hypothesis is valid or not, adjusting it if signals show up that it's not.
And what can we do?
What are some options?
If we do feel like we're on a good, productive thread about what's happening,
what are some options that we can take? That opens up a doorway for a whole variation of other
questions. But the fact of the matter is handling incidents, understanding really effectively time-pressured problem-solving,
almost always amongst multiple people with different views, different expertise, and
piecing together across that group what's happening, what to do about it, and what are
the ramifications of doing this thing versus that thing. This is all what we would call above-the-line work. This is expertise. It shows up in how people weigh ambiguities and uncertainties.
And that lived experience people have doesn't get captured.
We're used to talking about networks and applications and code.
We're not used to talking about and even have vocabulary for what makes something confusing,
what makes something ambiguous. And that is what makes for effective incident analysis.
Do you find that most of the people who are confused about these things tend to be more
aligned with being individual-contributor-type engineers who are effectively boots on the ground, for lack of a better term? Is it high-level executives who are trying to understand why it seems like they're constantly getting paraded in the press? Or is it often folks somewhere between the two?
Yes. Right? Like there is something that you point out, which is this contrast between
boots on the ground, hands on keyboard, folks who are resolving incidents, who are
wrestling with these problems, and leadership. And sometimes leadership who remember their glory days
of being an individual contributor, you know, are a bit miscalibrated. They still believe
they have a sufficient understanding of
all the messy details when they don't. And so, I mean, the fact of the matter is there's, you know,
the age-old story of Timmy stuck in a well, right? There's the people trying to get Timmy out of the
well, and then there's what to do about all of the news reporters surrounding the well, asking for
updates and questions of how did Timmy get in the well.
These are two different activities. And I'll tell you pretty confidently: if you can set situations up where the people who would ostensibly get Timmy out of the well are better prepared, anticipating that Timmy is going to end up in the well and understanding all the various options and tools for getting Timmy out, then the more you can have those conditions in place, the more a whole host of other problems simply go away. And so these things kind of get a bit muddled. When we say learning from incidents, I would separate that
very much from what you tell the world externally from your company about the incident, because
they're not at all the same. Public write-ups about an incident are not the results of an analysis.
They're not the same as an internal review, or a review that's meant to be effective. Why? Well, the first thing is you
never see apologies on internal post-incident reviews, right? Because who are you going to
apologize to? It's always fun watching the certain level of escalating transparency as you go
through the spectrum of the public explanation of an outage to ones you put internal to customers,
to ones you show under NDA to
special customers, to the ones who are basically partners who are going to fire you contractually
if you don't, to the actual internal discussion about it. And watching that play out is really
interesting. As you wind up seeing the things that are buried deeper and deeper, they wind up
with this flowery language on the outside, and it gets more and more transparent to the end. It's
someone tripped and hit the emergency power switch in a data center.
And it's this great list of how this stuff works.
Yeah, and to be honest,
it would be strange and shocking if they weren't different, right?
Because like I said,
the purpose of a public write-up
is entirely different than an internal write-up.
And the audience is entirely different.
And so that's why they're cherry-picked. There's a whole bunch of things that aren't included in a public write-up because
the purpose is I want a customer or a potential customer to read this and feel at least a little
bit better. I want them to at least get this notion that we've got a handle on it. Wow,
that was really bad, but nothing to see here, folks. It's all been taken care of.
But again, this is very different. The people inside the organization, even if it's just sort
of tacit, right? They've got a knowledge, tenured people who have been there for some time,
see connections, even if they're not made explicit between one incident to another incident,
to that one that happened. Remember that one that happened three years ago, that big one? Oh, sorry, you're new. Oh, let me tell you the story. Oh, it's about
this and blah, blah, blah. And who knew that Unix pipes only pass 4K at a time, blah, blah, blah,
something, some weird esoteric thing. And so our focus largely, although we have done projects
with companies about trying to be
better about their external language about it, the vast majority of what we do and where
our focus is, is to capture the richest understanding of an incident for the broadest audience.
And like I said at the very beginning, the bar is real low.
There's a lot of, I don't want to say falsehoods, but certainly a lot of myths that just don't play out in the data about whether people are learning.
Whenever we have a call with a potential client, we always ask the same question,
ask them about what their post-incident activities look like. And they tell us and
throw in some clichés, never let a crisis go to waste. And, oh, yes, we always try to
capture the learnings and we put them in a document. And we always ask the same question,
which is, oh, so you put these documents, these sort of write-ups in an area. Oh, yes,
we want that to be shared as much as possible. And then we say, who reads them? And that tends to put a bit of a pause
because most people have no idea whether they're being read or not. And the fact is, when we look,
very few of these write-ups are being read. Why? I'll be blunt, because they're terrible.
There's not much to learn from there because they're not written to be read. They're written
to be filed. And so we're
looking to change that. And there's a whole bunch of other things that are sort of unintuitive,
but just like all of the perspective shifts, DevOps and continuous deployment,
they sound obvious, but only in hindsight, after you get it. That's a characterization of our work.
It's easy to wind up from the outside seeing a scenario where things go super well in an
environment like that, where, okay, we brought you in as a consultant.
Suddenly we have better understanding about our outages.
Awesome.
But outages still happen.
And it's easy to take a cynical view of, okay, so other than talking to you a lot, we say
the right things, but how do we know that companies are actually learning from what
happened as opposed to just being able to tell better stories about pretending to
learn?
Yeah, yeah.
And this is, I think, where the world of software has some advantages over other domains.
The fact is software engineers don't pay attention to anything they don't think warrants it, right? Or anything they're not being judged or scored or rewarded on.
And so there's no single signal that a company is learning from
incidents. It's more like a constellation, like a bunch of smaller signals. So for example,
if more people are reading the write-ups, if more people are attending group review meetings,
in organizations that do this really well, engineers who start attending
meetings, we ask them, well, why are you going to this meeting? And they'll report, well, because
I can learn stuff here that I can't learn anywhere else. Can't read about it in a run book, can't
read about it on the wiki, can't read about it in an email or hear about it in all hands. And
that they can see a connection between even incidents handled in some distant group,
they can see a connection to their own work. Those are the sorts of signals; we've written about this on our blog. Those are the signals that tell us progress is building momentum. But a big part of that is capturing, again, this experience. Usually you'll see there's a timeline, and this
is when Memcache did X and this alert happened and then blah, blah, blah, right? But very rarely
are captured the things that when you ask an engineer, tell me your favorite incident story.
People who will even describe themselves, oh, I'm not really a storyteller, but listen to this.
And they'll include parts that make for a good story, right?
The social construct is, if you're going to tell a story, you've got the attention of other people.
You're going to include the stuff that was not usually kept
or captured in write-ups.
For example, like what was confusing?
A story that tells about what was
confusing. Well, and then we looked and it said zero tests failed. This is an actual case that
we looked at. It says zero tests failed. And so, okay. So then I deployed. Well, the site went down.
Okay. Well, so what's the story there? Well, listen to this. As it turns out, in a fixed-width font, zeros, like in Courier or whatever, have a slash through them.
And at a small enough font, a zero with a slash through it looks a lot like an eight.
There were eight tests failed, not zero, right?
So that's about sort of the display.
And so those are the types of things that make a good story.
We all know stories like this, right?
The Norway problem with YAML.
You ever heard of the Norway problem?
Not exactly.
I'm hoping you'll tell me.
Well, it's excellent.
And of course, it works out that the spec for YAML will evaluate the value no, N-O,
to false, as if it were a Boolean, and yes to true.
But if your YAML contains a list of abbreviations for countries,
then you might have Ireland, Great Britain, Spain, US, false, instead of Norway. And so that's just
an unintuitive surprise. And so those are the types of things that don't typically get captured
in incident write-ups. There might be a sentence like,
there was a lack of understanding. Well, that's unhelpful at best. Don't tell me what wasn't
there. Tell me what was there. There was confusion. Great. What made it confusing? Oh yeah,
NO is both no and the abbreviation for Norway. Red herrings is another great example. Red herrings
happen a lot. They tend to stick in people's memories, and yet they never really get captured.
But it's like one of the most salient aspects of a case that ought to be captured, right? People
don't follow red herrings because they know they're a red herring. They follow red herrings because
they think it's going to be productive. So therefore, you better describe for all your colleagues what brought you
to believe that this was productive, right? Turns out later, you find out later that it wasn't
productive. Those are sort of some of the examples. And so if you can capture what's difficult,
what's ambiguous, what's uncertain, and what made it difficult,
ambiguous, uncertain.
That makes for good stories.
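As an aside, the Norway problem John describes is easy to reproduce. Here's a minimal sketch using the third-party PyYAML library, which follows YAML 1.1's resolution rules; the country list is just an illustration:

```python
# A sketch of the "Norway problem": YAML 1.1 resolvers treat unquoted
# no/No/NO (and yes/on/off) as booleans, so a country-code list mangles "NO".
import yaml  # third-party PyYAML, which defaults to YAML 1.1 resolution

doc = "countries: [IE, GB, ES, US, NO]"
parsed = yaml.safe_load(doc)
print(parsed["countries"])  # ['IE', 'GB', 'ES', 'US', False] -- Norway became False

# Quoting the scalar keeps it a string:
fixed = yaml.safe_load('countries: [IE, GB, ES, US, "NO"]')
print(fixed["countries"])   # ['IE', 'GB', 'ES', 'US', 'NO']
```

YAML 1.2 dropped the yes/no/on/off boolean forms, but many parsers, PyYAML included, still default to the 1.1 behavior, so quoting ambiguous scalars remains the safe habit.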
If you can enrich these documents, it means people who maybe don't even work there yet,
when they start working there, they'll be interested.
They have a set expectation.
They'll learn something by reading these things.
This episode is brought to you by Gitpod. Do you ever feel like you spend more time fighting your dev environment than actually coding? Describe your dev environment as code and streamline development workflows with automations.
With the click of a button,
you get a perfectly configured environment
and you can automate tasks and services
like seeding your database,
provisioning infrastructure,
running security or scanning tools,
or any other development workflows.
You can self-host Gitpod in your cloud account
for free in under three minutes
or run Gitpod Desktop locally on your computer.
Gitpod's automated, standardized development environments are the fastest and most secure
way to develop software.
They're trusted by over 1.5 million developers, including some of the largest financial institutions
in the world.
Visit gitpod.io and try it for free with your whole team.
And also, let me know what
you think about it. I've honestly been looking for this for a while, and it's been on my list
of things to try. I'll be trying it this week. Please reach out. Let me know what you think.
There's an inherent cynicism around, well, at least from my side of the world,
around any third party that claims to fundamentally shift significant aspects of company culture. And the counterargument to that is that you and DORA and a whole bunch of other folks have had
significant success with doing it. It's just very hard to see that from the outside. So I'm curious
as to how you wind up telling stories about that, because the problem is inherently whenever you
have an outsider coming into an enterprise-style environment, oh, cool, what are they going to be able to change?
And it's hard to articulate that value and not, well, given what you do, to be direct,
come across as an engineering apologist, where it's, well, engineers are just misunderstood,
so they need empathy and psychological safety and blameless postmortems.
And it sounds a lot, to crappy executives, if I'm being direct, like, oh, in other words, I just can't ever do anything negative to engineers who, from my perspective,
have just failed me or are invisible and there's nothing else in my relationship with them.
Or am I oversimplifying? No, no, I actually think you're spot on. I mean, that's the thing is that
if you're talking with leaders, aka people who, even though they're tasked with providing the resources and setting conditions for practitioners, the hands-on folks, to get their work done, are quite happy to talk about these sorts of abstract concepts like psychological safety and other hand-waving stuff. What is actually pretty magical about incidents is that these are grounded,
concrete, messy phenomena that practitioners have and will remember their sometimes visceral
experiences, right? And so that's why we don't do theory at Adaptive Capacity Labs. We understand the theory,
happy to talk to you about it, but it doesn't mean as much without sort of the practicality.
And the fact of the matter is, to the engineer-apologist point: if you didn't have the engineers, would you have a business? That's the flip side. This is the core unintuitive part of the field of resilience engineering, which
is that Murphy's law is wrong.
What could go wrong almost never does, but we don't pay much attention to that.
And the reason why you're not having nearly as many incidents as you could be is because
despite the fact that you make it hard to learn from incidents, people are actually
learning, but they're just learning out of view from leaders. When we go to an organization and
we see that most of the people who are attending post-incident review meetings are managers,
that is a very particular signal that tells me that the real post-incident review is happening
outside that meeting. It probably happened before that meeting. And
those people are there to make sure that whatever group that they represent in their organization
isn't unnecessarily thrown under the bus. And so it's political due diligence.
But the notion that you shouldn't punish or be harsh on engineers for making mistakes completely
misses the point.
The point is to set up the conditions so that engineers can understand the work that they
do.
And if you can amplify that, as Andrew Clay Shafer has said, you're either building a learning organization or you're losing to someone who is.
And a big part of that is you need people. You have to set up conditions for people to give a detailed story about their work, about what's hard. This part of the code base is really scary,
right? All engineers have these notions. This part is really scary. This part is really not that big
of a deal. This part is somewhere in between, but there's no place for that outside of the informal
discussions. But I would assert that if you can capture that, the organization will be better
prepared. The thing that I would end on that is that it's a bit of a rhetorical device to get this across.
But one of the questions we'll ask is, how can you tell the difference between a difficult case,
a difficult incident handled well, or a straightforward incident handled poorly?
And from the outside, it's very hard to tell the difference.
Oh, yeah. Well, certainly, if what you're doing is averaging how long these things
take. But the fact of the matter is that all the people who were involved in that, they know the
difference between a difficult case handled well and a straightforward one handled poorly. They
know it, but there's nowhere, there's no place to give voice to that lived experience. So on the
whole, what is the tech industry missing when it comes
to learning effectively from the incidents that we all experience at what feels like far too frequent a rate? They're missing what is captured in that age-old parable of the blind
men and the elephant, right? And I would assert that these blind men that the king sends out,
go find an elephant and come back and tell
me about the elephant. They come back, and they all have their own valid perspectives. And they
argue about, no, an elephant's this big flexible thing. And everyone's, oh, no, an elephant is this
big wall. And no, an elephant is this big flappy thing. If you were to sort of make a synthesis
of their different perspectives, you'd have a richer picture and understanding of an elephant. You cannot legislate this, and this is where what you brought up comes in: okay, we need to have some sort of root cause analysis done within 72 hours of an event.
Well, if your goal is to find gaps and come up with remediation items, that's what you're
going to get.
Remediation items might actually not be that good because you've basically contained the
analysis time.
Which does sort of feel, on some level, very much aligned with the viewpoint of: yeah, remediation items may not be useful as far as driving lasting change,
but without remediation items, good luck explaining to your customers that will never,
ever, ever happen again. Yeah, of course. Well, you notice something about those public write-ups.
You'll notice that they don't tend to link to previous incidents that have similarities
to them, right?
Because that would undermine the whole purpose, which is to provide confidence.
And a reader might actually follow a hyperlink to say, wait a minute, you said this wouldn't
happen again.
It turns out it would.
Of course, that's horseshit.
But you're right.
And there's nothing wrong with remediation items. But if that's the goal, then that goal is, you know, what you look for is what you find.
And what you find is what you fix. If I said, here's this really complicated problem,
and I'm only giving you an hour to describe it, and it took you eight hours to figure out the
solution, well, then what you come up with in an hour is not
actually going to be all that good. So then the question is, how good are the remediation items?
And quite often what we see is, and I'm sure you've had this experience, an incident's been
resolved and you and your colleagues are like, wow, that was a huge pain in the ass. Oh, dude,
I didn't see that coming. That was weird. Yeah. And, you know, one of you might
say, you know what, I'm just going to make this change because I don't want to be woken up tonight
or I know that making this change is going to help things. Right. I'm not waiting for the
postmortem. We're just going to do that. Is that good? Yeah. Okay. Yeah. Please do
it. Quite frequently, those things, those actions, those aren't listed as action
items. And yet it was a thing so important that it couldn't wait for the postmortem,
arguably the most important action item, and it doesn't get captured that way. We've seen this
take place. And so again, in the end, it's about those sort of the lived experience.
The lived experience is what fuels how reliable you are
today, right? You don't go to your senior technical people and say, hey, listen, we got to do this
project. We don't know how; we want you to figure it out. Let's say we're going to move away from this legacy thing. So I want you to get in a room, come up with two or three
options, gather a group of folks who know what they're talking about,
get some options, and then show me what the options are. Oh, and by the way, I'm prohibiting
you from taking into account any experience you've ever had with incidents. It sounds ridiculous when
you would say that. And yet, that is what fuels it. So you have to fuel people's memory; you can't say you've learned something if you can't remember it. At least that's what my kids' teachers tell me. And so, yeah, you have to capture the lived
experience and including what was hard for people to understand. And those make for good stories.
That makes for people reading them. That makes for people to have better questions about it.
That's what learning looks like. If people want to learn more about what you have to say and how you view these things,
where can they find you?
You can find me and my colleagues at adaptivecapacitylabs.com, where we talk all about this stuff on our blog.
And myself and Richard Cook and Dave Woods are also on Twitter as well.
And we'll, of course, include links to that in the show notes.
John, thank you so much for taking the time to speak with me today. I really appreciate it. Yeah, thanks. Thanks for
having me. I'm honored. John Allspaw, co-founder and principal at Adaptive Capacity Labs. I'm cloud
economist Corey Quinn, and this is Screaming in the Cloud. If you've enjoyed this podcast, please
leave a five-star review on your podcast platform of choice. Whereas if you've hated this podcast,
please leave a five-star review on your podcast platform of choice, along with a comment giving
me a list of suggested remediation actions that I can take to make sure it never happens again.