Screaming in the Cloud - How to Investigate the Post-Incident Fallout with Laura Maguire, PhD
Episode Date: February 8, 2022
About Laura
Laura leads the research program at Jeli.io. She has a Master's degree in Human Factors & Systems Safety and a PhD in Cognitive Systems Engineering. Her doctoral work focused on distributed incident response practices in DevOps teams responsible for critical digital services. She was a researcher with the SNAFU Catchers Consortium from 2017-2020 and her research interests lie in resilience engineering, coordination design, and enabling adaptive capacity across distributed work teams. As a backcountry skier and alpine climber, she also studies cognition & resilient performance in high-risk, high-consequence mountain environments.
Links:
Howie: The Post-Incident Guide: https://www.jeli.io/howie-the-post-incident-guide/
Jeli: https://www.jeli.io
Twitter: https://twitter.com/lauramdmaguire
Transcript
Hello, and welcome to Screaming in the Cloud, with your host, Chief Cloud Economist at the
Duckbill Group, Corey Quinn.
This weekly show features conversations with people doing interesting work in the world
of cloud, thoughtful commentary on the state of the technical world, and ridiculous titles
for which Corey refuses to apologize.
This is Screaming in the Cloud.
Today's episode is brought to you in part by our friends at Minio,
the high-performance Kubernetes native object store that's built for the multi-cloud,
creating a consistent data storage layer for your public cloud instances, your private cloud instances,
and even your edge instances, depending upon what the heck you're defining those as, which depends
probably on where you work. Getting that unified is one of the greatest challenges facing developers
and architects today.
It requires S3 compatibility, enterprise-grade security and resiliency, the speed to run any workload,
and the footprint to run anywhere. And that's exactly what Minio offers. With superb read
speeds in excess of 360 gigs and a 100 megabyte binary that doesn't eat all the data you've got
on the system, it's exactly what you've been looking for. Check it out today at
min.io slash download and see for yourself. That's min.io slash download, and be sure to tell them that I sent you.
This episode is sponsored in part by our friends at Sysdig. Sysdig is the solution for securing
DevOps. They have a blog post that went up recently about how an insecure AWS Lambda function could be
used as a pivot point
to get access into your environment. They've also gone in-depth with a bunch of other
approaches to how DevOps and security are inextricably linked. To learn more, visit
sysdig.com and tell them I sent you. That's S-Y-S-D-I-G dot com. My thanks to them for their
continued support of this ridiculous nonsense.
Welcome to Screaming in the Cloud.
I'm Corey Quinn.
One of the things that's always been a treasure and a joy in working in production environments is things breaking.
What do you do after the fact?
How do you respond to that incident?
Now, very often in my experience, you dive directly
into the next incident, because no one has time to actually fix the problems; they just spend their
entire careers firefighting. It turns out that there are apparently alternate ways. My guest
today is Laura Maguire, who leads the research program at Jeli, and her doctoral work focused on distributed incident response in DevOps teams
responsible for critical digital services. Laura, thank you for joining me.
Happy to be here, Corey. Thanks for having me.
I'm still just trying to wrap my head around the idea of there being a critical digital service,
as someone whose primary output is, let's be honest, shitposting. But that's right. People
do use the internet for things that are a bit more serious than making jokes that are at least
funny only to me. So what got you down this path? How did you get to be the person that you are in
the industry and standing in the position you hold? Yeah, I have had a long, circuitous route
to get to where I am today. But one of the common threads is about safety and risk and how do people manage safety and risk. I started off in natural resource industries in mountain safety, trying to understand how do we stop things from crashing, from breaking, from exploding, from catching fire? And how do we
help support the people in those environments? And when I went back to do my PhD, I was tossed
into the world of software engineers. And at first I thought, now, what do firefighters, pilots,
you know, emergency room physicians have to do with software engineers
and risk in software engineering. And it turns out there's actually a lot. There's a lot in
common between the types of people who handle real-time failures that have widespread consequences
and the folks who run continuous deployment environments. And so one of the things that
the pandemic did for us is it made it immediately apparent that digital service delivery is a
critical function in society. Initially, we'd been thinking about these kinds of things as being
financial markets, as being availability of electronic health records, communication
systems for disaster recovery. And now we're seeing things like communication and collaboration
systems for schools, for businesses. This helps keep society functioning.
What makes this field so interesting is the evolution in the space. Back when I first started my
career about a decade and a half ago, there was a very real concern in my first Linux admin gig
when I accidentally deleted some of the data from the data warehouse that, oh, I don't have a job
anymore. And I remember being surprised and grateful that I still did because, oh, you just
learned something. You're going to do it again? No. Well, not like that exactly, but probably some other way. Yeah. And we have
evolved so far beyond that now to the point where when that doesn't happen after an incident,
it becomes almost noteworthy in its own right and it blows up on social media.
So the Overton window of what is acceptable disaster response and incident management, and
how we learn from those things, has dramatically shifted, even in the relatively brief window of 15 years.
And we're starting to see now almost a next generation approach to this.
One thing that you were, I believe, the principal author behind is Howie, the post-incident guide, which is a thing that you have up on jeli.io.
That's j-e-l-i.io, talking about how to run post-incident investigations. What made you decide to
write something like this? Yeah, so what you described at the beginning there about this
kind of shift from blameful to blameless approaches to incident response, and to thinking more broadly
about the system of work, thinking about what does it mean to operate in continuous deployment
environments is really fundamental. Because working in these kinds of worlds, we don't have
an established knowledge base about how these systems work, about how they break, because
they're continuously changing.
The knowledge, the expertise required to manage them is continuously changing.
And so that shift towards a blameless or blame-aware post-incident review is really important
because it creates this environment where we can actually share knowledge, share expertise,
and distribute more of our understandings
of how these systems work and how they break. So that kind of led us to create the Howie Guide,
the "How We Got Here" post-incident guide. And it was largely because companies were kind of coming
from this position of, we find the person who did the thing that broke the system,
and then we can all rest easy and move forward. And so it was really a way to provide some
foundation. We introduced some ideas from the resilience engineering literature, which has
been around for the last 30 or 40 years. It's kind of amazing on some level how tech as an industry has always
tried to reinvent things from first principles. We figured out long before we started caring about
computers in the way we do that when there was an incident, the right response to get the learnings
from it for things like airline crashes, always a perennial favorite topic in this space for
conference talks, is to make sure that everyone can report what happened in a safe way that's non-accusatory. But even in the early 2010s, I was still working in environments
where the last person to break production or break the build had the shame trophy hanging out on their
desk, and it would stay there until the next person broke it. And it was just a weird, perverse
incentive where it's, oh, if I broke something, I should hide it. That is absolutely
the most dangerous approach because when things have broken, yes, it's generally a bad thing.
So you may as well find the silver lining in it from my point of view and figure out, okay,
what have we learned about our systems as a result of the way that these things break?
And sometimes the things that we learn are, in fact, not that deep, or there's not a whole lot of learnings about it, such as when the entire county loses power, computers don't work so well.
Oh, okay, great, we have learned that.
More often, though, there seem to be deeper learnings.
And I guess what I'm trying to understand is I have a relatively naive approach on what the idea of incident response should look like.
But it's basically based on the last time I touched things that were production-looking,
which was six or seven years ago. What is the current state of the art that the
advanced leaders in the space, as they start to really look at how to dive into this? Because I'm
reasonably certain it's not still the, oh, you know, you can learn things when your computers
break. What is pushing the envelope these days? Yeah. So it's kind of interesting. You brought up
incident response because incident response and incident analysis or the sort of like,
what do we learn from those things are very tightly coupled. What we can see when we look
at someone responding in real time to a failure is it's difficult to detect all of the signals.
They don't pop up and wave a little flag and say like, I am what's broken. There's multiple
compounding and interacting factors. So there's difficulty in the detection phase. Diagnosis is
always challenging because of how the systems are interrelated. And then the repair is never straightforward.
But when we stop and look at these kinds of things after the fact, a really common theme emerges,
and that it's not necessarily about a specific technical skill set or understanding about the system. It's about the shared, distributed understanding of that. And so to put that
in plain speak, it's what do
you know that's important to the problem? What do I know that's important to the problem? And then
how do we collectively work together to extract that specific knowledge and expertise and put
that into practice when we're under time pressure, when there's a lot of uncertainty, when we've got
the VP DMing us and being like,
when's the system going to be back up? And Twitter's exploding with unhappy customers.
So when we think about the cutting edge of what's really interesting and relevant, I think
organizations are starting to understand that it's how do we coordinate and we collaborate
effectively. And so using incident analysis as a way to recognize not only the
technical aspects of what went wrong, but the social aspects of that as well, and the teamwork
aspects of that is really driving some innovation in this space. It seems to me on some level that
the increasing sophistication of what environments look like is also potentially
driving some of these things. I mean, again, when you have three web servers and one of them is
broken, okay, it's a problem. We should definitely jump on that and fix it. But now you have
thousands of containers running hundreds of microservices for some godforsaken reason,
because we decided that the thing that solves the problem of 500 engineers working on the same
repository, which is really a political problem, is microservices. So now we're going to use microservices for everything because, you know, people. Great. But then it becomes this really
difficult-to-identify problem of what is actually broken. And past a certain point of scale,
it's no longer a question of, is it broken, so much as how broken is it at any given point in
time? And getting real-time observability into what's going on does pose more than a little
bit of a challenge. Yeah, absolutely. So the more complexity that you have in the system,
the more diversity of knowledge and skill sets that you have. One person is never going to know
everything about the system, obviously. And so you need kind of variability in what people know, how current that knowledge is.
You need some people who have legacy knowledge. You have some people who have bleeding edge.
My fingers were on the keyboard just moments ago. I did the last deploy. That kind of variability
in whose knowledge and skill sets you have to be able to bring to bear to the problem in front of you.
One of the really interesting aspects when you step back and you start to look really carefully about how people work in these kinds of incidents is you have folks that are jumping, get things
done, probe a lot of things. They look at a lot of different areas, trying to gather information
about what's happening. And then you have people who sit back and they kind of take a bit of a
broader view and they're trying to understand where are people trying to find information?
Where might our systems not be showing us what's going on? And so it takes this combination of
people working in the problem directly and people
working on the problem more broadly to be able to get a better sense of how it's broken,
how widespread is that problem, what are the implications, what might repair actually look
like in this specific context.
Do you suspect that this might be what gives rise sometimes to, it seems,
middle management's perennial quest to build the single pane of glass dashboard of, wow,
it looks like you're poking around through 15 disparate systems trying to figure out what's
going on. Why don't we put that all on one page? It's like, great, let's go tilt at that windmill
some more. It feels like it's very aligned with what you're saying. And I just, I don't know where
the pattern comes from. I just know I see it all the time and it drives me up a wall.
Yeah, I would call that pattern pretty common across many different domains that work in very
complex adaptive environments. And that is like, it's an oversimplification. We want the world to
be less messy, less unstructured, less ad hoc than it often is when you're working at
the cutting edge of whatever kind of technology or whatever kind of operating environment you're in.
There are things that we can know about the problems that we are going to face,
and we can defend against those kinds of failure modes effectively. But to your point, these are very
largely unstructured problem spaces when you start to have multiple interacting failures happening
concurrently. And so Ashby, who back in 1956 started talking about sort of control systems,
really hammered this point home when he was
talking about how, if you have a world where there's a lot of variability (in this case, in how things are
going to break), you need a lot of variability in how you're going to cope with those potential
types of failures. And so part of it is, yes, trying to find the right dashboard or the right set of metrics that are going to tell us about the system performance.
But part of it is also giving the responders the ability to, in real time, figure out what kinds of things they're going to need to address the problem. So there's this tension between wanting to structure unstructured problems,
put those all in a single pane of glass. And what most folks who work at the front lines of these
kinds of worlds know is it's actually my ability to be flexible and to be able to adapt and to be
able to search very quickly to gather the information and the people that I need that are what's really
going to help me to address those hard problems. Something I've noticed for my entire career,
and I don't know if it's just unfounded arrogance and I'm very much on the wrong
side of the Dunning-Kruger curve here, but it always has struck me that the corporate response
to any form of outage is generally trending toward, oh,
we need a process around this, where it seems like the entire idea is that every time a thing
happens, there should be a documented process and a runbook on how to perform every given task.
With the ultimate milestone on the hill that everyone's striving for being: ah, with enough
process and enough runbooks, we can then eventually get rid of all the people
who know how all this stuff works and basically staff it up with people who just know how to follow a
script and push the button when told to by the instruction manual. And that's always rankled
as someone who got into this space because I enjoy creative thinking. I enjoy looking at the
relationships between things. Cost and architecture are the same thing. That's how I got into this.
It's not due to an undying love of spreadsheets on my part. That's my business partner's problem. But it's this idea
of being able to play with the puzzle. And the more you document things with process, the more
you become reliant on those things, on some level, it feels like it ossifies things to a point where
change is no longer easily attainable. Is that actually what happens, or am I just wildly
overstating the case? Either
is possible, or a third option, too. You're the expert. I'm just here asking ridiculous questions.
Yeah, well, I think it's a balance between needing some structure, needing some guidelines
around expected actions to take place. This is for a number of reasons. One, we talked about earlier
about how we need multiple diverse perspectives. So
you're going to have people from different teams, from different roles in the organization,
from different levels of knowledge participating in an incident response. And so because of that,
you need some form of script, some kind of process that creates some predictability,
creates some common
ground around how is this thing going to go? What kinds of tools do we have at our disposal to be
able to either find out what's going on, fix what's going on, get the right kinds of authority to be
able to take certain kinds of actions? So you need some degree of process around that. But I agree with you that too much process, and the idea that we can actually apply operational procedures to these kinds of environments, is completely counterproductive. It ends up kind of saying, well, you didn't follow those rules and that's why the incident went the way
it did, as opposed to saying, oh, these rules actually didn't apply in ways that really matter
given the problem that was faced. And there was no latitude to be able to adapt in real time or
to be able to improvise, to be creative in how you're thinking about the problem. And so you've
really kind of put the responders into a bit of a box and not given them productive avenues to kind of move forward from.
So having worked in a lot of very highly regulated environments, I recognize there's value in having prescription,
but it's also about enabling performance, and enabling adaptive performance in real time, when you're
working at the speeds and the scales that we are in this kind of world.
This episode is sponsored by our friends at Oracle HeatWave, a new high-performance query
accelerator for the Oracle MySQL database service, although I insist on calling it MySquirrel.
While MySquirrel has long been the world's most popular open source database,
shifting from transacting to analytics required way too much overhead and, you know, work. With
HeatWave, you can run your OLAP and OLTP, don't ask me to pronounce those acronyms ever again,
workloads directly from your MySquirrel database and eliminate the time-consuming data movement and integration work, while also performing 1,100
times faster than Amazon Aurora and two and a half times faster than Amazon Redshift at a third
the cost. My thanks again to Oracle Cloud for sponsoring this ridiculous nonsense.
And let's be fair here, I am setting up something of a false dichotomy. I'm not suggesting that the
answer is, oh, you either are mired in process or it is the complete Wild West.
If you start a new role and, oh, great, how do I get started? What's the onboarding process?
Like step one, write those docs for us. Or how many times have we seen the pattern
where day one onboarding is, well, here's the GitHub repo and there are some docs there,
and update it as you go because this stuff is constantly in motion. That's a terrible first-time experience for a lot of folks. So there has to be
something that starts people off in the right direction, sort of a quick guide to this is
what's going on in the environment, and here are some directions for exploration. But also,
you aren't going to be able to get that to a level of granularity where it's going to be
anything other than woefully out of date in most environments without resorting to draconian
measures. I feel like the answer is somewhere in the middle, and where that lives depends upon
whether you're running Twitter for pets or a nuclear reactor control system.
Yeah, and it brings us to a really important point of organizational life, which is that we are always operating under constraints.
We are always managing tradeoffs in this space.
It's very acute when you're in an incident and you're like, do I bring the system back up, but I still don't know what's wrong?
Or do I leave it down a little bit longer and I can collect more information about the nature of the problem that I'm facing.
But more chronic is the fact that organizations are always facing this need to build the next thing, not focus on what just happened.
You talked about the next incident starting and jumping in before we can actually really
digest what just happened with the last incident.
These kinds of pressures and constraints
are a very normal part of organizational life. And we are balancing those trade-offs between
time spent on one thing versus another as being innovating, learning, creating change within our
environment. The reason why it's important to surface that is that it helps
change the conversation when we're doing any kind of post-incident learning session. It's like, oh,
it allows us to surface things that we typically can't say in a meeting. Well, I wasn't able to
do that because I know that that team has a code freeze going on right now, or we don't have the right
type of service agreement to get our vendor on the phone, so we had to sit and wait for
the ticket to get dealt with. Those kinds of things are very real limiters to how people
can act during incidents and yet don't typically get brought up because they're just kind of
chronic, everyday things that people deal with.
As you look across the industry, what do you think that organizations are getting,
I guess, the most wrong when it comes to these things today? Because most people are no longer
in the era of, all right, who's the last person to touch it? Well, they're fired.
But I also don't think that they're necessarily living the envisioned reality that you describe in the Howie Guide, as well as the areas of research
you're exploring. What's the most common failure mode? I'm going to tweak that a little bit to make
it less about the failure mode and more about the challenges that I see organizations facing,
because there are many failure modes. But some common issues that we
see companies facing is they're like, okay, we buy into this idea that we should start looking at the
system, that we should start looking beyond the technical thing that broke and more broadly at
how did different aspects of our system interact. And I mean both people as a part of the system,
I mean process as part of the system, as well as the software itself. And so that's a big part of
why we wrote the Howie Guide is because companies are struggling with that gap between, okay,
we're not entirely sure what this means to our organization, but we're willing to take steps to
get there. But there's a big gap between
recognizing that and jumping into the academic literature that's been around for many,
many years from other kinds of high-risk, high-consequence type domains. So I think
some of the challenges they face is actually operationalizing some of these ideas,
particularly when they already have processes and practices in place.
There are ideas that are very common throughout an organization, and it takes a long time to shift
people's thinking around the implicit biases or orientations towards a problem that we as
individuals have. All of those kinds of things take time. You mentioned the Overton window, and that's a great example: it is intolerable in some organizations to have a discussion about what people know and don't know about different aspects of the system, because there's an assumption that if you're the engineer responsible for that, you should know everything. So those challenges, I think, are quite limiting to helping organizations move forward.
Unfortunately, we see not a lot of time being put into really understanding how an incident
was handled.
And so typically, reviews get done on the side of the desk.
They get done with a minimal amount of effort, and then the learnings that come out of them are quite shallow.
Is there a maturity model where it makes sense to begin investing in this, whereas if you do it too
quickly, you're not really going to be able to ship your MVP and see what happens? If you go too
late, you have a globe-spanning service that winds up being
down all the time, so no one trusts it. What is the sweet spot for really starting to care about
incident response? In other words, how do people know that it's time to start taking this stuff
more seriously? Ah, well, you have kids? Oh, yes. One and four. Oh, yeah. Demons. Little demons,
who I love very much.
They look angelic, or I don't know what you're talking about. Would you not teach them how to learn or not teach them about the world until they started school?
No, but it would also be considered child abuse at this age to teach them about the AWS bill. So
there is a spectrum as far as what is appropriate learnings at what stage.
Yeah, absolutely. So that's a really good point is that depending on where you are at in your operation, you might not have the resources to be able to launch full-scale investigations. You may
not have the complexity within your system, within your teams, and you don't have the legacy to sort of draw through, to pull through, that requires
large-scale investigations with multiple investigators. That's really why we were
trying to make the Howie Guide very applicable to a broad range of organizations: here are the
tools, here are the techniques that we know can help you understand more about the environment
that you're operating in, the people that you're working with, so that you can level up over time.
You can draw more and more techniques and resources to be able to go deeper on those
kinds of things over time. It might be appropriate in an early stage to say, hey, let's do these
really informally. Let's pull the team together,
talk about how things got set up, why choices were made to use the kinds of components that we use,
and talk a little bit more about why someone made a decision they did. That might be low risk when
you're small because you all know each other. Largely, you know the decisions. Those conversations can be more frank
as you get larger, as more people you don't know are on those types of calls. You might need to
handle them differently so that people have psychological safety to be able to share what
they knew and what they didn't know at the time. It can be a graduated process over time, but we've
also seen very small early-stage companies really treat this seriously right from the get-go. At Jeli, I mean, one of our core fundamentals is learning, right? And so we do, we spend time on sharing with each other, oh, my mental model about this was X. Is that the same as what you have? No. And then we can kind of parse what's
going on between those kinds of things. So I think it really is an orientation towards learning that
is appropriate at any size or scale. I really want to thank you for taking the time to speak
with me today. If people want to learn more about what you're up to, how you view these things,
and possibly improve their own position on these areas, where can they find you?
So we have a lot of content on jeli.io. I am also on Twitter at... Oh, that's always a mistake.
LauraMDMaguire. And I love to talk about this stuff. I love to hear how people
are interpreting kind of some of the ideas that are in the resilience engineering space.
Should I say, tweet at me? Or is that dangerous, Corey?
It depends. I find that listeners to this show are all far more attractive than the average and
good people through and through. At least that's what I tell the sponsors. So yeah,
it should be just fine. And we will, of course, include links to those in the show notes.
Sounds good.
Thank you so much for your time. I really appreciate it. Thank you. It's been a pleasure. Laura Maguire, researcher at Jeli.
I'm cloud economist Corey Quinn, and this is Screaming in the Cloud. If you've enjoyed this
podcast, please leave a five-star review on your podcast platform of choice. Whereas if you've
hated this podcast, please leave a five-star review on your podcast platform of choice,
along with an
angry, insulting comment that I will read just as soon as I get them all to display on my single
pane of glass dashboard. If your AWS bill keeps rising and your blood pressure is doing the same,
then you need the Duckbill Group. We help companies fix their AWS bill by making it smaller and less
horrifying. The Duckbill Group works for you, not AWS. We tailor recommendations to your business
and we get to the point. Visit duckbillgroup.com to get started.
This has been a HumblePod production.
Stay humble.