Screaming in the Cloud - How to Investigate the Post-Incident Fallout with Laura Maguire, PhD
Episode Date: February 8, 2022
About Laura
Laura leads the research program at Jeli.io. She has a Master's degree in Human Factors & Systems Safety and a PhD in Cognitive Systems Engineering. Her doctoral work focused on distributed incident response practices in DevOps teams responsible for critical digital services. She was a researcher with the SNAFU Catchers Consortium from 2017-2020 and her research interests lie in resilience engineering, coordination design, and enabling adaptive capacity across distributed work teams. As a backcountry skier and alpine climber, she also studies cognition & resilient performance in high-risk, high-consequence mountain environments.
Links:
Howie: The Post-Incident Guide: https://www.jeli.io/howie-the-post-incident-guide/
Jeli: https://www.jeli.io
Twitter: https://twitter.com/lauramdmaguire
Transcript
Hello, and welcome to Screaming in the Cloud, with your host, Chief Cloud Economist at the
Duckbill Group, Corey Quinn.
This weekly show features conversations with people doing interesting work in the world
of cloud, thoughtful commentary on the state of the technical world, and ridiculous titles
for which Corey refuses to apologize.
This is Screaming in the Cloud.
Today's episode is brought to you in part by our friends at Minio,
the high-performance Kubernetes native object store that's built for the multi-cloud,
creating a consistent data storage layer for your public cloud instances, your private cloud instances,
and even your edge instances, depending upon what the heck you're defining those as, which depends
probably on where you work. Getting that unified is one of the greatest challenges facing developers
and architects today.
It requires S3 compatibility, enterprise-grade security and resiliency, the speed to run any workload,
and the footprint to run anywhere. And that's exactly what Minio offers. With superb read
speeds in excess of 360 gigs and a 100 megabyte binary that doesn't eat all the data you've got
on the system, it's exactly what you've been looking for. Check it out today at
min.io slash download and see for yourself. That's min.io slash download, and be sure to tell them that I sent you.
This episode is sponsored in part by our friends at Sysdig. Sysdig is the solution for securing
DevOps. They have a blog post that went up recently about how an insecure AWS Lambda function could be
used as a pivot point
to get access into your environment. They've also gone in-depth with a bunch of other
approaches to how DevOps and security are inextricably linked. To learn more, visit
sysdig.com and tell them I sent you. That's S-Y-S-D-I-G dot com. My thanks to them for their
continued support of this ridiculous nonsense.
Welcome to Screaming in the Cloud.
I'm Corey Quinn.
One of the things that's always been a treasure and a joy in working in production environments is things breaking.
What do you do after the fact?
How do you respond to that incident?
Now, very often in my experience, you dive directly
into the next incident, because no one has time to actually fix the problems; they just spend their
entire careers firefighting. It turns out that there are apparently alternate ways. My guest
today is Laura Maguire, who leads the research program at Jeli, and her doctoral work focused on distributed incident response in DevOps teams
responsible for critical digital services. Laura, thank you for joining me.
Happy to be here, Corey. Thanks for having me.
I'm still just trying to wrap my head around the idea of there being a critical digital service,
as someone whose primary output is, let's be honest, shitposting. But that's right. People
do use the internet for things that are a bit more serious than making jokes that are at least
funny only to me. So what got you down this path? How did you get to be the person that you are in
the industry and standing in the position you hold? Yeah, I have had a long, circuitous route
to get to where I am today. But one of the common threads is about safety and risk and how do people manage safety and risk. I started off in natural resource industries in mountain safety, trying to understand how do we stop things from crashing, from breaking, from exploding, from catching fire? And how do we
help support the people in those environments? And when I went back to do my PhD, I was tossed
into the world of software engineers. And at first I thought, now, what do firefighters, pilots,
you know, emergency room physicians have to do with software engineers
and risk in software engineering. And it turns out there's actually a lot. There's a lot in
common between the types of people who handle real-time failures that have widespread consequences
and the folks who run continuous deployment environments. And so one of the things that
the pandemic did for us is it made it immediately apparent that digital service delivery is a
critical function in society. Initially, we'd been thinking about these kinds of things as being
financial markets, as being availability of electronic health records, communication
systems for disaster recovery. And now we're seeing things like communication and collaboration
systems for schools, for businesses. This helps keep society functioning.
What makes this field so interesting is the evolution in the space. Back when I first started my
career about a decade and a half ago, there was a very real concern in my first Linux admin gig
when I accidentally deleted some of the data from the data warehouse that, oh, I don't have a job
anymore. And I remember being surprised and grateful that I still did because, oh, you just
learned something. You're going to do it again? No. Well, not like that exactly, but probably some other way. Yeah. And we have
evolved so far beyond that now to the point where when that doesn't happen after an incident,
it becomes almost noteworthy in its own right and it blows up on social media.
So the Overton window of what is acceptable disaster response and incident management, and
how we learn from those things, has dramatically shifted, even in the relatively brief window of 15 years.
And we're starting to see now almost a next generation approach to this.
One thing that you were, I believe, the principal author behind is Howie, the post-incident guide, which is a thing that you have up on jeli.io.
That's j-e-l-i.io, talking about how to run post-incident investigations. What made you decide to
write something like this? Yeah, so what you described at the beginning there about this
kind of shift from blameful to blameless approaches to incident response, and to thinking more broadly
about the system of work, thinking about what does it mean to operate in continuous deployment
environments is really fundamental. Because working in these kinds of worlds, we don't have
an established knowledge base about how these systems work, about how they break, because
they're continuously changing.
The knowledge, the expertise required to manage them is continuously changing.
And so that shift towards a blameless or blame-aware post-incident review is really important
because it creates this environment where we can actually share knowledge, share expertise,
and distribute more of our understandings
of how these systems work and how they break. So that kind of led us to create the Howie Guide,
the "How We Got Here" post-incident guide. And it was largely because companies were kind of coming
from this position of, we find the person who did the thing that broke the system,
and then we can all rest easy and move forward. And so it was really a way to provide some
foundation. We introduced some ideas from the resilience engineering literature, which has
been around for the last 30 or 40 years. It's kind of amazing on some level how tech as an industry has always
tried to reinvent things from first principles. We figured out long before we started caring about
computers in the way we do that when there was an incident, the right response to get the learnings
from it for things like airline crashes, always a perennial favorite topic in this space for
conference talks, is to make sure that everyone can report what happened in a safe way that's non-accusatory. But even in the early 2010s, I was still working in environments
where the last person to break production or break the build had the shame trophy hanging out on their
desk, and it would stay there until the next person broke it. And it was just a weird, perverse
incentive where it's, oh, if I broke something, I should hide it. That is absolutely
the most dangerous approach because when things have broken, yes, it's generally a bad thing.
So you may as well find the silver lining in it from my point of view and figure out, okay,
what have we learned about our systems as a result of the way that these things break?
And sometimes the things that we learn are, in fact, not that deep, or there's not a whole lot of learnings about it, such as when the entire county loses power, computers don't work so well.
Oh, okay, great, we have learned that.
More often, though, there seem to be deeper learnings.
And I guess what I'm trying to understand is I have a relatively naive approach on what the idea of incident response should look like.
But it's basically based on the last time I touched things that were production-looking,
which was six or seven years ago. What is the current state of the art that the
advanced leaders in the space, as they start to really look at how to dive into this? Because I'm
reasonably certain it's not still the, oh, you know, you can learn things when your computers
break. What is pushing the envelope these days? Yeah. So it's kind of interesting. You brought up
incident response because incident response and incident analysis or the sort of like,
what do we learn from those things are very tightly coupled. What we can see when we look
at someone responding in real time to a failure is it's difficult to detect all of the signals.
They don't pop up and wave a little flag and say like, I am what's broken. There's multiple
compounding and interacting factors. So there's difficulty in the detection phase. Diagnosis is
always challenging because of how the systems are interrelated. And then the repair is never straightforward.
But when we stop and look at these kinds of things after the fact, a really common theme emerges,
and that it's not necessarily about a specific technical skill set or understanding about the system. It's about the shared, distributed understanding of that. And so to put that
in plain speak, it's what do
you know that's important to the problem? What do I know that's important to the problem? And then
how do we collectively work together to extract that specific knowledge and expertise and put
that into practice when we're under time pressure, when there's a lot of uncertainty, when we've got
the VP DMing us and being like,
when's the system going to be back up? And Twitter's exploding with unhappy customers.
So when we think about the cutting edge of what's really interesting and relevant, I think
organizations are starting to understand that it's how do we coordinate and we collaborate
effectively. And so using incident analysis as a way to recognize not only the
technical aspects of what went wrong, but the social aspects of that as well, and the teamwork
aspects of that is really driving some innovation in this space. It seems to me on some level that
the increasing sophistication of what environments look like is also potentially
driving some of these things. I mean, again, when you have three web servers and one of them is
broken, okay, it's a problem. We should definitely jump on that and fix it. But now you have
thousands of containers running hundreds of microservices for some godforsaken reason,
because we decided that the thing that solves the problem of 500 engineers working on the same
repository, which is really a political problem, is microservices. So now we're going to use microservices for everything because, you know, people. Great. But then it becomes this really
difficult-to-identify problem of what is actually broken. And past a certain point of scale,
it's no longer a question of, is it broken, so much as how broken is it at any given point in
time? And getting real-time observability into what's going on does pose more than a little
bit of a challenge. Yeah, absolutely. So the more complexity that you have in the system,
the more diversity of knowledge and skill sets that you have. One person is never going to know
everything about the system, obviously. And so you need kind of variability in what people know, how current that knowledge is.
You need some people who have legacy knowledge. You have some people who have bleeding edge.
My fingers were on the keyboard just moments ago. I did the last deploy. That kind of variability
in whose knowledge and skill sets you have to be able to bring to bear to the problem in front of you.
One of the really interesting aspects when you step back and you start to look really carefully about how people work in these kinds of incidents is you have folks that are jumping, get things
done, probe a lot of things. They look at a lot of different areas, trying to gather information
about what's happening. And then you have people who sit back and they kind of take a bit of a
broader view and they're trying to understand where are people trying to find information?
Where might our systems not be showing us what's going on? And so it takes this combination of
people working in the problem directly and people
working on the problem more broadly to be able to get a better sense of how it's broken,
how widespread is that problem, what are the implications, what might repair actually look
like in this specific context.
Do you suspect that this might be what gives rise sometimes to, it seems,
middle management's perennial quest to build the single pane of glass dashboard of, wow,
it looks like you're poking around through 15 disparate systems trying to figure out what's
going on. Why don't we put that all on one page? It's like, great, let's go tilt at that windmill
some more. It feels like it's very aligned with what you're saying. And I just, I don't know where
the pattern comes from. I just know I see it all the time and it drives me up a wall.
Yeah, I would call that pattern pretty common across many different domains that work in very
complex adaptive environments. And that is like, it's an oversimplification. We want the world to
be less messy, less unstructured, less ad hoc than it often is when you're working at
the cutting edge of whatever kind of technology or whatever kind of operating environment you're in.
There are things that we can know about the problems that we are going to face,
and we can defend against those kinds of failure modes effectively. But to your point, these are very
largely unstructured problem spaces when you start to have multiple interacting failures happening
concurrently. And so Ashby, who back in 1956 started talking about sort of control systems,
really hammered this point home when he was
talking about how, if you have a world where there's a lot of variability (in this case, in how things are
going to break), you need a lot of variability in how you're going to cope with those potential
types of failures. And so part of it is, yes, trying to find the right dashboard or the right set of metrics that are going to tell us about the system performance.
But part of it is also giving the responders the ability to, in real time, figure out what kinds of things they're going to need to address the problem. So there's this tension between wanting to structure unstructured problems,
put those all in a single pane of glass. And what most folks who work at the front lines of these
kinds of worlds know is it's actually my ability to be flexible and to be able to adapt and to be
able to search very quickly to gather the information and the people that I need that are what's really
going to help me to address those hard problems. Something I've noticed for my entire career,
and I don't know if it's just unfounded arrogance and I'm very much on the wrong
side of the Dunning-Kruger curve here, but it always has struck me that the corporate response
to any form of outage is generally trending toward, oh,
we need a process around this, where it seems like the entire idea is that every time a thing
happens, there should be a documented process and a runbook on how to perform every given task.
With the ultimate milestone on the hill that everyone's striving for being: ah, with enough
process and enough runbooks, we can then eventually get rid of all the people
who know how all this stuff works and basically staff it up with people who just know how to follow a
script and push the button when told to by the instruction manual. And that's always rankled
as someone who got into this space because I enjoy creative thinking. I enjoy looking at the
relationships between things. Cost and architecture are the same thing. That's how I got into this.
It's not due to an undying love of spreadsheets on my part. That's my business partner's problem. But it's this idea
of being able to play with the puzzle. And the more you document things with process, the more
you become reliant on those things, on some level, it feels like it ossifies things to a point where
change is no longer easily attainable. Is that actually what happens, or am I just wildly
overstating the case? Either
is possible, or a third option, too. You're the expert. I'm just here asking ridiculous questions.
Yeah, well, I think it's a balance between needing some structure, needing some guidelines
around expected actions to take place. This is for a number of reasons. One, we talked about earlier
about how we need multiple diverse perspectives. So
you're going to have people from different teams, from different roles in the organization,
from different levels of knowledge participating in an incident response. And so because of that,
you need some form of script, some kind of process that creates some predictability,
creates some common
ground around how is this thing going to go? What kinds of tools do we have at our disposal to be
able to either find out what's going on, fix what's going on, get the right kinds of authority to be
able to take certain kinds of actions? So you need some degree of process around that. But I agree with you that too much process, and the idea that we can actually apply operational procedures to these kinds of environments, is completely counterproductive. It ends up kind of saying, well, you didn't follow those rules and that's why the incident went the way
it did, as opposed to saying, oh, these rules actually didn't apply in ways that really matter
given the problem that was faced. And there was no latitude to be able to adapt in real time or
to be able to improvise, to be creative in how you're thinking about the problem. And so you've
really kind of put the responders into a bit of a box and not given them productive avenues to kind of move forward from.
So having worked in a lot of very highly regulated environments, I recognize there's value in having prescription,
but it's also about enabling performance, and enabling adaptive performance in real time, when you're
working at the speeds and the scales that we are in this kind of world.
This episode is sponsored by our friends at Oracle HeatWave, a new high-performance query
accelerator for the Oracle MySQL database service, although I insist on calling it MySquirrel.
While MySquirrel has long been the world's most popular open source database,
shifting from transacting to analytics required way too much overhead and, you know, work. With
HeatWave, you can run your OLAP and OLTP, don't ask me to pronounce those acronyms ever again,
workloads directly from your MySquirrel database and eliminate the time-consuming data movement and integration work, while also performing 1,100
times faster than Amazon Aurora and two and a half times faster than Amazon Redshift at a third
the cost. My thanks again to Oracle Cloud for sponsoring this ridiculous nonsense.
And let's be fair here, I am setting up something of a false dichotomy. I'm not suggesting that the
answer is, oh, you either are mired in process or it is the complete Wild West.
If you start a new role and, oh, great, how do I get started? What's the onboarding process?
Like step one, write those docs for us. Or how many times have we seen the pattern
where day one onboarding is, well, here's the GitHub repo and there are some docs there,
and update it as you go because this stuff is constantly in motion. That's a terrible first-time experience for a lot of folks. So there has to be
something that starts people off in the right direction, sort of a quick guide to this is
what's going on in the environment, and here are some directions for exploration. But also,
you aren't going to be able to get that to a level of granularity where it's going to be
anything other than woefully out of date in most environments without resorting to draconian
measures. I feel like the answer is somewhere in the middle, and where that lives depends upon
whether you're running Twitter for pets or a nuclear reactor control system.
Yeah, and it brings us to a really important point of organizational life, which is that we are always operating under constraints.
We are always managing tradeoffs in this space.
It's very acute when you're in an incident and you're like, do I bring the system back up, but I still don't know what's wrong?
Or do I leave it down a little bit longer and I can collect more information about the nature of the problem that I'm facing.
But more chronic is the fact that organizations are always facing this need to build the next thing, not focus on what just happened.
You talked about the next incident starting and jumping in before we can actually really
digest what just happened with the last incident.
These kinds of pressures and constraints
are a very normal part of organizational life. And we are balancing those trade-offs between
time spent on one thing versus another as being innovating, learning, creating change within our
environment. The reason why it's important to surface that is that it helps
change the conversation when we're doing any kind of post-incident learning session. It's like, oh,
it allows us to surface things that we typically can't say in a meeting. Well, I wasn't able to
do that because I know that that team has a code freeze going on right now, or we don't have the right
type of service agreement to get our vendor on the phone, so we had to sit and wait for
the ticket to get dealt with. Those kinds of things are very real limiters to how people
can act during incidents and yet don't typically get brought up because they're just kind of
chronic, everyday things that people deal with.
As you look across the industry, what do you think that organizations are getting,
I guess, the most wrong when it comes to these things today? Because most people are no longer
in the era of, all right, who's the last person to touch it? Well, they're fired.
But I also don't think that they're necessarily living the envisioned reality that you describe in the Howie Guide, as well as the areas of research
you're exploring. What's the most common failure mode? I'm going to tweak that a little bit to make
it less about the failure mode and more about the challenges that I see organizations facing,
because there are many failure modes. But some common issues that we
see companies facing is they're like, okay, we buy into this idea that we should start looking at the
system, that we should start looking beyond the technical thing that broke and more broadly at
how did different aspects of our system interact. And I mean both people as a part of the system,
I mean process as part of the system, as well as the software itself. And so that's a big part of
why we wrote the Howie Guide is because companies are struggling with that gap between, okay,
we're not entirely sure what this means to our organization, but we're willing to take steps to
get there. But there's a big gap between
recognizing that and jumping into the academic literature that's been around for many,
many years from other kinds of high-risk, high-consequence type domains. So I think
some of the challenges they face is actually operationalizing some of these ideas,
particularly when they already have processes and practices in place.
There are ideas that are very common throughout an organization, and it takes a long time to shift
people's thinking around the implicit biases or orientations towards a problem that we as
individuals have. All of those kinds of things take time. You mentioned the Overton window, and that's a great example: it is intolerable in some organizations to have a discussion about what people know and don't know about different aspects of the system, because there's an assumption that if you're the engineer responsible for that, you should know everything. So those challenges, I think, are quite limiting to helping organizations move forward.
Unfortunately, we see not a lot of time being put into really understanding how an incident
was handled.
And so typically, reviews get done on the side of the desk.
They get done with a minimal amount of effort, and then the learnings that come out of them are quite shallow.
Is there a maturity model where it makes sense to begin investing in this, whereas if you do it too
quickly, you're not really going to be able to ship your MVP and see what happens? If you go too
late, you have a globe-spanning service that winds up being
down all the time, so no one trusts it. What is the sweet spot for really starting to care about
incident response? In other words, how do people know that it's time to start taking this stuff
more seriously? Ah, well, you have kids? Oh, yes. One and four. Oh, yeah. Demons. Little demons,
who I love very much.
They look angelic, or I don't know what you're talking about. Would you not teach them how to learn or not teach them about the world until they started school?
No, but it would also be considered child abuse at this age to teach them about the AWS bill. So
there is a spectrum as far as what is appropriate learnings at what stage.
Yeah, absolutely. So that's a really good point is that depending on where you are at in your operation, you might not have the resources to be able to launch full-scale investigations. You may
not have the complexity within your system, within your teams, and you don't have the legacy to sort of draw through, to pull through, that requires
large-scale investigations with multiple investigators. That's really why we were
trying to make the Howie Guide very applicable to a broad range of organizations: here are the
tools, here are the techniques that we know can help you understand more about the environment
that you're operating in, the people that you're working with, so that you can level up over time.
You can draw more and more techniques and resources to be able to go deeper on those
kinds of things over time. It might be appropriate in an early stage to say, hey, let's do these
really informally. Let's pull the team together,
talk about how things got set up, why choices were made to use the kinds of components that we use,
and talk a little bit more about why someone made a decision they did. That might be low risk when
you're small because you all know each other. Largely, you know the decisions. Those conversations can be more frank
as you get larger, as more people you don't know are on those types of calls. You might need to
handle them differently so that people have psychological safety to be able to share what
they knew and what they didn't know at the time. It can be a graduated process over time, but we've
also seen very small early-stage companies really treat this seriously right from the get-go. At Jeli, I mean, one of our core fundamentals is learning, right? And so we do, we spend time on sharing with each other, oh, my mental model about this was X. Is that the same as what you have? No. And then we can kind of parse what's
going on between those kinds of things. So I think it really is an orientation towards learning that
is appropriate at any size or scale. I really want to thank you for taking the time to speak
with me today. If people want to learn more about what you're up to, how you view these things,
and possibly improve their own position on these areas, where can they find you?
So we have a lot of content on jeli.io. I am also on Twitter at... Oh, that's always a mistake.
LauraMDMaguire. And I love to talk about this stuff. I love to hear how people
are interpreting kind of some of the ideas that are in the resilience engineering space.
Should I say, tweet at me? Or is that dangerous, Corey?
It depends. I find that listeners to this show are all far more attractive than the average and
good people through and through. At least that's what I tell the sponsors. So yeah,
it should be just fine. And we will, of course, include links to those in the show notes.
Sounds good.
Thank you so much for your time. I really appreciate it. Thank you. It's been a pleasure. Laura Maguire, researcher at Jeli.
I'm cloud economist Corey Quinn, and this is Screaming in the Cloud. If you've enjoyed this
podcast, please leave a five-star review on your podcast platform of choice. Whereas if you've
hated this podcast, please leave a five-star review on your podcast platform of choice,
along with an
angry, insulting comment that I will read just as soon as I get them all to display on my single
pane of glass dashboard. If your AWS bill keeps rising and your blood pressure is doing the same,
then you need the Duckbill Group. We help companies fix their AWS bill by making it smaller and less
horrifying. The Duckbill Group works for you, not AWS. We tailor recommendations to your business
and we get to the point. Visit duckbillgroup.com to get started.
This has been a HumblePod production.
Stay humble.