Screaming in the Cloud - The Art of Effective Incident Response with Emily Ruppe

Episode Date: January 31, 2023

About Emily

Emily Ruppe is a Solutions Engineer at Jeli.io whose greatest accomplishment was once being referred to as “the Bob Ross of incident reviews.” Previously Emily has written hundreds of status posts, incident timelines and analyses at SendGrid, and was a founding member of the Incident Command team at Twilio. She’s written on human-centered incident management and facilitating incident reviews. Emily believes the most important thing in both life and incidents is having enough snacks.

Links Referenced:
Jeli.io: https://jeli.io
Twitter: https://twitter.com/themortalemily
Howie Guide: https://www.jeli.io/howie/welcome

Transcript
Starting point is 00:00:00 Hello, and welcome to Screaming in the Cloud, with your host, Chief Cloud Economist at the Duckbill Group, Corey Quinn. This weekly show features conversations with people doing interesting work in the world of cloud, thoughtful commentary on the state of the technical world, and ridiculous titles for which Corey refuses to apologize. This is Screaming in the Cloud. This episode is sponsored in part by our friends at Logicworks. Getting to the cloud is challenging enough for many places,
Starting point is 00:00:36 especially maintaining security, resiliency, cost control, agility, etc., etc., etc. Things break, configurations drift, technology advances, and organizations, frankly, need to evolve. How can you get to the cloud faster and ensure you have the right team in place to maintain success over time? Day two matters. Work with a partner who gets it. Logicworks combines the cloud expertise and platform automation to customize solutions to meet your unique requirements. Get started by chatting with a cloud specialist today at snark.cloud slash logicworks. That's snark.cloud slash logicworks. And my thanks to them for sponsoring this ridiculous podcast. Cloud native just means you've got more components or microservices than anyone,
Starting point is 00:01:28 even a mythical 10x engineer, can keep track of. With OpsLevel, you can build a catalog in minutes and forget needing that mythical 10x engineer. Now, you'll have a 10x service catalog to accompany your 10x service count. Visit OpsLevel.com to learn how easy it is to build and manage your service catalog. Connect to your Git provider and you're off to the races with service import, repo ownership, tech docs, and more. Welcome to Screaming in the Cloud. I'm Corey Quinn. My guest today is Emily Ruppe, who's a solutions engineer over at Jeli.io, but her entire career has generally focused around incident management. So I sort of view her as
Starting point is 00:02:14 being my eternal nemesis, just because I like to cause problems by and large, and then I make incidents for other people to wind up solving. Emily, thank you for joining me and agreeing to suffer my slings and arrows here. Yeah. Hey, I like causing problems too. I'm a solutions engineer, but sometimes we like to call ourselves problems engineers. I'm a problems architect is generally how I tend to view it, but doing the work, ah, one wonders. So you are at Jeli, where, as of this recording, you've been for a year now. And before that, you spent some time over at Twilio slash SendGrid. Spoiler, it's kind of the same company given the way acquisitions tend to work and all. Now it is. Oh, yeah.
Starting point is 00:02:56 You were there during the acquisition. Yes, they acquired me. That's why they bought SendGrid. Indeed. It's a good reason to acquire a company. That one person I want to bring in. Absolutely. So you started with email and then effectively continued in that general direction, given that Twilio now has eaten that business whole. And that's where I started my career. The one thing I've learned about email systems is that they love to cause problems
Starting point is 00:03:19 because it's either completely invisible and no one knows, or suddenly an email didn't go through and everyone's screaming at you. And there's no upside, only down. So let me ask the obvious question I suspect I know the answer to here. What made you decide to get into incident management? Well, I joined SendGrid. Actually, I love mess. I run towards problems. I'm someone who really enjoys that. My ADHD, I hyper-focus. Incidents are like that perfect environment of just like all of the problems are laying themselves out right in front of you. The distraction is the focus. It's kind of a wonderful place where I really enjoy the flow of that. But I started in customer support. I've been in technical support and customer service. I used to work at the Apple Store.
Starting point is 00:04:08 I worked at the Genius Bar for a long time, moved into technical support over the phone. And whenever things broke really bad, I really enjoyed that process and kind of getting involved in incidents. And I came, I was one of two weekend support people at SendGrid, came in during a time of change and growth. And everyone knows that growth, usually exponential growth, usually happens very smoothly and nothing breaks during that time. So no, there were a lot of incidents. And because I was on the weekend, one of the only people on the weekend, I kind of had
Starting point is 00:04:41 to very quickly find my way and learn, when do I escalate this? How do I make the determination that this is something that is an incident? And, you know, is this worth paging engineers that are on their weekend? And getting involved in incidents and being kind of that core communication between our customers and the engineers. For listeners who might not have been involved in sufficiently scaled-out environments, that sounds counterintuitive, but one of the things that you learn, very often the hard way, is that as you continue down the path
Starting point is 00:05:13 of building a site out and scaling it, it stops being an issue relatively quickly of is the site up or down, and instead becomes a question of how up is it? So it doesn't sound obvious until you've lived it, but declaring what is an incident versus what isn't an incident is incredibly nuanced, and it's not the sort of thing that lends itself to casual solutions. Because every time a customer gets an error, we should open an incident on that. Well, I've worked at companies that throw
Starting point is 00:05:41 dozens of 500 errors every second at their scale. You will never hire enough people to solve that if you do an incident process on even 10% of them. Yeah. So, I mean, it actually became something that I, when you join Twilio, they have you create a project using Twilio's API to earn your track jacket, essentially. It's kind of like an onboarding thing. And as they absorbed SendGrid, we all did that onboarding process. And mine was a number for support people to text and it would ask them six questions.
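For readers trying to picture that onboarding project, here is a minimal sketch of what a checklist bot along those lines could look like. It is not Emily's actual project: the wording of the questions, the webhook route, the in-memory session handling, and the choice of Flask plus Twilio's Python helper library are all illustrative assumptions, and the "answered yes to more than two" threshold is the rule she describes next.

```python
# A minimal sketch of the kind of six-question escalation helper described here,
# not Emily's actual project. The questions, the "/sms" route, the phone-keyed
# session storage, and the use of Flask with Twilio's Python helper library are
# illustrative assumptions.
from flask import Flask, request
from twilio.twiml.messaging_response import MessagingResponse

app = Flask(__name__)

# Hypothetical yes/no checklist, paraphrased from the conversation.
QUESTIONS = [
    "Are customers unable to send email?",
    "Are customers unable to log in?",
    "Is a key part of the website unreachable?",
    "Are error rates well above normal?",
    "Is a major customer reporting impact?",
    "Has this been going on for more than a few minutes?",
]
ESCALATE_THRESHOLD = 2  # suggest escalating when "yes" answers exceed this

# In-memory conversation state keyed by the texter's phone number.
# Fine for a sketch; a real deployment would want persistent storage.
sessions = {}

@app.route("/sms", methods=["POST"])
def sms_webhook():
    sender = request.form.get("From", "")
    body = request.form.get("Body", "").strip().lower()
    state = sessions.setdefault(sender, {"asked": 0, "yes_count": 0})
    reply = MessagingResponse()

    # Count the answer to the previous question, if one was asked.
    if state["asked"] > 0 and body.startswith("y"):
        state["yes_count"] += 1

    if state["asked"] < len(QUESTIONS):
        # Ask the next question in the checklist.
        reply.message(QUESTIONS[state["asked"]] + " (yes/no)")
        state["asked"] += 1
    else:
        # Checklist finished: tally the answers and reset this conversation.
        if state["yes_count"] > ESCALATE_THRESHOLD:
            reply.message("Okay, maybe you should escalate this and page the on-call engineer.")
        else:
            reply.message("Probably not incident-worthy yet. Keep watching and text me again if it changes.")
        sessions.pop(sender, None)

    # Twilio expects TwiML XML back from the webhook.
    return str(reply), 200, {"Content-Type": "application/xml"}

if __name__ == "__main__":
    app.run(port=5000)
```

Pointing a Twilio number's incoming-message webhook at an endpoint like this would let any first text start the checklist; each reply advances it, and the final tally decides whether the bot suggests escalating.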
Starting point is 00:06:15 And if they answered yes to more than two of them, it would text back, okay, maybe you should escalate this. And the questions were pretty simple of, can emails be sent? Can customers log into their website? Are you able to view this particular part of the website? Because it is with email in particular, at SendGrid in particular, the bulk of it is the email API. So like the site being up or down was the easiest type of incident, the easiest thing to flex on, because that's so much easier to see. Being able to determine what percentage or what level, how many emails are
Starting point is 00:06:51 not processing? Are they getting stuck? Or is this the correct amount of things that should be bouncing because of IP reputation? There's a thousand different things. We had this visualization of this mail pipeline that was just a mess of all of these different pipes kind of connected together. And mail could get stuck in a lot of different places. So it was a lot of spending time trying to find that and segued into project management. I was a QA for a little while doing QA work, became a project manager and learned a lot about imposing process because you're supposed to and that sometimes imposing process on teams that are working well can actually destroy them. So I learned a lot of interesting things about process the hard way and during all of that time that I
Starting point is 00:07:37 was doing project management I kind of accidentally started owning the incident response process because a lot of people left. I had been a part of the incident analysis group as well. And so I kind of became the sole owner of that when Twilio purchased SendGrid. I found out they were creating an incident commander team and I just reached out and said, here's all of SendGrid's incident response stuff. We just created a new Slackbot. I just retrained the entire team on how to talk to each other and recognize when something might be an incident. Please don't rewrite all of this to be Twilio's response process. And Terry, the person who was putting together that team said, excellent, you're going to be welcome to Twilio incident command. This is your problem. And it's a lot worse than you
Starting point is 00:08:23 thought because here's all the rest of it. So yeah, it was a really interesting experience coming into technically the same company, but an entirely different company and finding out, like really trying to learn and understand all of the differences and, you know, the different problems, the different organizational history, the like fascia that has been built up between some of these parts of the organization to understand why things are the way that they are within processes. It's very interesting, and I kind of get to do it now as my job. I get to learn about the full organizational subtext of all of these different companies to understand how incident response
Starting point is 00:09:02 works, how incident analysis works, and maybe some of the whys, like what are the places where there was a very bad incident? So we put in very specific, very strange process pieces in order to navigate that, or teams that are difficult to work with. So we've built up interesting process around them. It feels like that can almost become ossified if you're not careful, because you wind up with a release process that's 2,000 steps long and each one of them is there to wind up avoiding a specific type of failure that had happened previously. And this gets into a world where in so many cases there needs to be a level of dynamism to how you wind up going about your work. It feels almost like companies have this idealized vision of the
Starting point is 00:09:46 future where if they can distill every task that happens within the company down to a series of inputs and responses of scripts almost, you can either wind up replacing your staff with a bunch of folks who just work from a runbook and cost way less, or computers in the ultimate sense of things. But that's been teased for generations now. And I have a very hard time seeing a path where you're ever going to be able to replace the contextually informed level of human judgment that honestly has fixed every incident I've ever seen. Yeah. The problem comes down to, in my opinion, the fact that humans wrote this code. People with specific context and specific understanding of how the thing needs to work in a specific way and the shortcomings and limitations they have for the libraries they're using or the different things they're trying to integrate in, a human being is who's writing the code. Code is not being written by computers. It's being written by people who have understanding in subtext. And so when you have that code written, and then maybe that person leaves,
Starting point is 00:10:51 or that person joins a different team, and their focus and priority is on something else, there is still human subtext that exists within the services that have been written. We have it call in this specific way and timeout in this specific amount of time because when we were writing it, there was this ancient service that we had to integrate with. There's always just these little pieces of we had to do things because we were people trying to make connections with lines of code. We're trying to connect a bunch of things to do some sort of task. And we have a human understanding of how to get from A to B. And probably if a computer wrote this code, it would work in an entirely different way. So in order to debug a problem, the humans usually need some sort of context. Like,
Starting point is 00:11:37 why did we do this the way that we did this? And I think it's a really interesting thing that we're finding that it is very hard to replace humans around computers, even though intellectually we think like this is all computers, but it's not. It's people convincing computers to do things that maybe they shouldn't necessarily be doing. Sometimes they're things that computers should be doing maybe, but a lot of the times it's kind of a miracle that any of these things continue to work on a given basis. And I think that it's very interesting when I think we think that we can take people out of it. The problem I keep running into, though, the more I think about this and the more I see it out there
Starting point is 00:12:17 is I don't think that it necessarily did incident management any favors when it was originally cast as the idea of blamelessness and blameless postmortems. Just because it seems an awful lot to me, like the people who were the most ardent champions of approaching things from a blameless perspective and having a blameless culture are the people who would otherwise
Starting point is 00:12:38 have been blamed themselves. So it really kind of feels on some broader level, like, oh, is this entire movement really just about being self-serving so that people don't themselves get in trouble? Because if you're not going to blame no one, you're going to blame me instead. I think that that on some level set up a framing that was not hugely helpful for folks with only a limited understanding of what the incident life cycle looks like. Yeah, I think we've evolved, right?
Starting point is 00:13:05 I think from the blameless, I think there was good intentions there, but I think that we actually missed the really big part of that boat that a lot of folks glossed over because then as it is now, it's a little bit harder to sell. When we're talking about being blameless,
Starting point is 00:13:22 we have to talk about circumventing blame in order to get people to talk candidly about their experiences. And really it's, it's less about blaming someone and what they've done. Because we, as humans, blame. There's a great Brené Brown talk that she gives, I think it's a TED talk, about blame and how we as humans cannot physically avoid blaming, placing blame on things. It's about understanding where that's coming from and working through it. That is actually how we grow. And I think that there's, we're starting to kind of shift into this more blame-aware culture, but I think the
Starting point is 00:13:55 hard pill to swallow about blamelessness is that we actually need to talk about the way that this stuff makes us feel as people, like feelings, like emotions. Talking about emotions during a technical incident review is not really an easy thing to get some tech executives to swallow or even engineers. There's a lot of engineers who are just kind of like, why do you care about how I felt about this problem? But in reality, you can't measure emotions as easily as you can measure mean time to resolution.
Starting point is 00:14:29 But mean time to resolution is impacted really heavily by, like, were we freaking out? Did we feel like we had absolutely no idea what we were trying to solve? Or did we understand this problem and we were confident that we could solve it? We just didn't, we couldn't find the specific place where this bug was happening. All of that is really interesting and important context about how we work together and how our processes work for us. But it's hard because we talk about our feelings. I think that you're onto something here because I look back at the key outages that really defined my perspective on things over the course of my career. And most of the early ones were beset by a sense of panic of, am I going to get fired for this? Because at the time I was firmly convinced
Starting point is 00:15:15 that, well, the root cause is me. I am the person that did the thing that blew up production. And while I am certainly not blameless in some of those things, I was never setting out with an intent to wind up tearing things down. So it was not that I was a bad actor subverting internal controls, because in many companies, you don't need that level of rigor to wind up tearing things down when you did not mean to. Right. So there were absolutely systemic issues there. But I still remember that rising tide of panic. Like, should I be focusing on getting the site back up or updating my resume? Which of these is going to be the better longer-term outcome? And now that I've been in this industry long enough and have seen enough of these, you almost don't feel the blood pressure rise anymore when something gets panicky, but it takes time and nuance to get there. Yeah. Well, and it's also in order to best understand how you got in that situation,
Starting point is 00:16:16 like, were you willing to tell people that you were absolutely panicked? Would you have felt comfortable? Like if someone was saying like, okay, so what happened? Walk me through what you were experiencing. Would you have said like, I was scared out of my goddamn mind? Were you absolutely panicking? Or did you feel like you were grasping at some straws? Like, where were you? Because uncovering that for the person who is experiencing that in the incident can help understand what resources did they feel like they knew where to go to or where did they go to? Like what resource did they decide in the middle of this panic haze to grasp for? Is that something that we should start using as, hey, if it's your first time on call, this is a great thing to pull into because that's where instinctively you went.
Starting point is 00:17:03 Like there's so much that we can learn from the people who are experiencing this massive amount of panic during the incident. But sometimes, if we're being quote-unquote blameless, we will gloss over your involvement in that entirely, because we don't want to blame Corey for this thing happening. Instead, we'll say an engineer made a decision, and that's fine. We'll move past that. But there's so much wealth of information there. Well, I wound up in postmortems later when I ran teams. I said, okay, so an engineer made a mistake. It's like, well, hang on. There's always more to it than that, because we don't hire malicious people and the people we have are competent for their role. So that goes a bit beyond that. We
Starting point is 00:17:46 will never get into a scenario where people do not make mistakes in a variety of different ways. So that's not a helpful framing. It's a question of what, if they made a mistake, sure, what was it that brought them to that place? Because that's where it gets really interesting. The problem is when you're trying to figure out in a business context, why a customer is super upset if they're a major partner, for example, and there's a sense of, all right, we're looking for a sacrificial lamb or someone that we can blame for this because we tend to think in relatively straight lines. And in those scenarios, often a nuanced understanding of the systemic failure modes within your organization that might wind up being useful in the mid to long term are not helpful for
Starting point is 00:18:24 the crisis there. So trying to stuff too much into a given incident response might be a symptom there. I'm thinking of one or two incidents in the course of my later career that really had that stink to them, for lack of a better term. What's your take on the idea? I've been in a lot of incidents where the desire to be able to point and say a person made this mistake is high. It's definitely something that the organization, and I put the organization in quotes there and say technical leadership or maybe PR or the comms team said, like, we're going to say like a person made this mistake when in reality, I mean, nine times out of 10,
Starting point is 00:19:03 calling it a mistake is hindsight, right? Usually people, sometimes we know that we make a mistake and it's the recovery from that that is response. But a lot of times we are making an informed decision. You know, an engineer has the information that they have available to them at the time and they're making an informed decision. And, oh no, it does not go as we planned. Things in the system that we didn't fully understand are coexisting; it's a perfect storm of these events in order to lead to
Starting point is 00:19:33 impact to this important customer. For me, I've been customer-facing for a very long time, and I feel like, from my observation, customers tend to, like, if you say this person did something wrong, versus we learned more about how the system works together, and we understand now these kind of different pieces and mechanisms within our system are not necessarily single points of failure, but points at which they interact that we didn't understand could cause impact before, and now we have a better understanding of how our system works and we're making some changes to some pieces. I feel like personally, as someone who has had to say that kind of stuff to customers a thousand times, saying it was a person who did this thing shows so much less understanding of the event and understanding of the system than actually
Starting point is 00:20:21 talking through the different components and the different kinds of contributing factors to what went wrong. So I feel like there's a lot of growth for us as an industry to go from blaming things on an intern to actually saying, no, we invested time in understanding how a single person could perform these actions that would lead to this impact.
Starting point is 00:20:40 And now we have a deeper understanding of our system, which, in my opinion, builds a little bit more confidence from the customer side. This episode is sponsored in part by Honeycomb. I'm not going to dance around the problem. Your engineers are burned out. They're tired from pagers waking them up at 2 a.m.
Starting point is 00:20:59 for something that could have waited until after their morning coffee. Ring, ring, who's there? It's Nagios, the original Call of Duty. They're fed up with relying on two or three different monitoring tools that still require them to manually trudge through logs to decipher what might be wrong. Simply put, there is a better way.
Starting point is 00:21:18 Observability tools like Honeycomb, and very little else because they do admittedly set the bar, show you the patterns and outliers of how users experience your code in complex and unpredictable environments so you can spend less time firefighting and more time innovating. It's great for your business, great for your engineers, and most importantly, great for your customers. Try free today at honeycomb.io slash screaming in the cloud.
Starting point is 00:21:44 That's honeycomb.io slash screaming in the cloud. I think so much of this is, I mean, it gets back to your question to me that I sort of dodged. Was I willing to talk about my emotional state in these moments? And yeah, I was visibly sweating and very nervous. And I've always been relatively okay with calling out the fact that I'm not in a great place at the moment and I'm panicking. And it wasn't helped in some cases by, in those early days, the CEO of the company standing over my shoulder, coming down from the upstairs building to know what was going on when everything had broken. And in that case, I was only coming in to do mop-up. I wasn't one of the factors contributing to this, at least not by a primary or secondary degree.
Starting point is 00:22:27 And it still was incredibly stress-inducing. So from that perspective, it feels odd. But you also talk about we in the sense of as an industry, as a culture and the rest. I'm going to push back on that a little bit because there are still companies today in the closing days of 2022 that are extraordinarily far behind where many of us are at the companies we work for. And they're still stuck in the
Starting point is 00:22:54 relative dark ages, technically, where, well, are VMs okay or should we stay on bare metal is still the era that they're in, let alone cloud, let alone containerization, let alone infrastructure as code, et cetera, et cetera. I'm unconvinced that they have meaningfully progressed on the interpersonal aspects of incident management when they've been effectively frozen in amber from a technical basis. I don't think that's fair. No, excellent. Let's talk about that.
Starting point is 00:23:23 I think just because an organization is still, like, maybe in DCs and using hardware and maybe hasn't advanced so thoroughly within the technical aspect of things, that doesn't necessarily mean that they haven't adopted new... A point of clarification then on this, because what I'm talking about here is the fact there are companies who are that far behind on a technical basis. They are not necessarily one and the same, that because you're using older technology, that means your processes are stuck in the past too. But rather, just as there are companies that are ancient on the technology basis, there are also companies that will be 20 years behind in learnings compared to how the more progressive folks have already internalized some of these things ages ago. Blamelessness is still in the future for them. They haven't gotten there yet.
Starting point is 00:24:21 that's a cultural change. A lot of the actual change in approach of incident analysis and incident response is a cultural change. And I can speak from firsthand experience that that's really hard to do, especially from the inside. It's very hard to do. So luckily, with the role that I'm in now at Jeli.io, I get to kind of support those folks who are trying to champion a change like that internally. And right now, my perspective is just trying to generate as much material for those folks to send internally to say like, hey, there's a better way. Hey, there's a different approach for this that can maybe get us around these things that are difficult. I do think that there's this tendency, and I've used this analogy before, for us to think that our junk drawers are better than
Starting point is 00:25:12 somebody else's junk drawers. I see an organization as just a junk drawer, a drawer full of weird odds and ends and spilled glue and like a broken box of tacks. And when you pull out somebody else's junk drawer, you're like, this is a mess. This is an absolute mess. How can anyone live like this? But when you pull out your own junk drawer, like I know there are 17 rubber bands in this drawer somehow. I am going to just completely rifle through this drawer until I find those things that I know are in here. Just the difference of knowing where our mess is, knowing where the bodies are buried or the skeletons are in each closet, whatever analogy works best. But I think that some organizations have this thought process, and by organizations, I mean executive leadership, because organizations are not an entity with an opinion. They're made up of a bunch of individuals doing the work that they need to do.
Starting point is 00:26:03 But they think that their problems are harder or more unique than at other organizations. And so it's a lot harder to kind of help them see that, yes, there is a very unique situation. The way that your people work together with their technology is unique to every single different organization, but it's not that those problems cannot be solved in new different ways. Just because we've always done something in this way does not mean that is the way that is serving us the best in this moment. So we can experiment and we can make some changes, especially with process, especially with the human aspect of things of how we talk to
Starting point is 00:26:39 each other during incidents and how we communicate externally during incidents. Those aren't hard coded. We don't have to do a bunch of code reviews and make sure it's working with existing integrations to be able to make those changes. We can experiment with that kind of stuff. And I really would like to try to encourage folks to do that, even though it seems scary, because incidents are... I think people think they're scary.
Starting point is 00:27:01 They're not. They're kind of fun. They seem to be. For a lot of folks, they are. Let's not be too dismissive on that. We were both talking about panic and the panic that we have felt during incidents. And I don't want to dismiss that and say that it's not real. But I also think that we feel that way because we're worried about how we're going to be judged for our involvement in them. We're panicking because, oh no, we have contributed to this in some way. And the fact that I don't know what to do, or the fact that I did something is going to reflect poorly on me,
Starting point is 00:27:31 or maybe I'm going to get fired. And I think that the panic associated with incidents also very often has to do with the environment in which you were experiencing that incident and how that is going to be accepted and discussed. Are you going to be blamed, regardless of how, quote unquote, blameless your organization is? I wish there was a better awareness of a lot of these things, but I don't think that we
Starting point is 00:27:53 are at a point yet where we're there. No. How does this map to what you do day to day over at Jeli.io? It is what I do every single day. So, I mean, I do a ton of different things. We're a very small startup, so I'm doing a lot. But the main thing that I'm doing is working with our customers to tackle these hurdles within each of their organizations. Our customers vary from very small organizations to very, very large organizations, and working with them to find how to make movement, how to sell this internally, sell this idea of
Starting point is 00:28:27 let's talk about our incidents a little bit differently. Let's maybe dial back some of the hard-coded automation that we're doing around response and change that to speaking to each other, as opposed to we need 11 emails sent automatically upon the creation of an incident that will automatically map to these three PagerDuty schedules. And a lot more of it can be us working through the issue together and then talking about it afterwards, not just in reference to the root cause, but in how we interfaced, how did it go? How did response work, as well as how did we solve the technical problem that occurred
Starting point is 00:29:05 So I kind of pinch myself. I feel very lucky that I get to work with a lot of different companies to understand these human aspects and the technical aspects of how to do these experiments and make some change within organizations to help make incidents easier. That's the whole feeling, right? We were talking about the panic; it doesn't need to be as hard as it feels sometimes. And I think that it can be easier than we let ourselves think. That's a good way of framing it. It just feels on so many levels, this is one of the hardest areas to build a company in because you're not really talking about fixing technical broken systems out there. You're talking about solving people problems. And I have some software that solves your people problems. I'm not sure if that's ever been true. Yeah, it's not the software that's
Starting point is 00:29:57 going to solve the people problems. It's building the skills. A lot of what we do is we have software that helps you immensely in the analysis process and build out a story, as opposed to just building on a timeline, trying to tell kind of the narrative of the incident, because that's what works. Like, anthropologically, we've been conveying information through folklore, through tales. Telling tales of things that happened in order to help teach people lessons is kind of how oral history has worked for thousands of years. And we aren't better than that just because we have technology. So it's really about helping people uncover those things by using the technology that we have, pulling in Slack transcripts and PagerDuty alerts and Zoom transcripts and all of this different information that we have available to us, and help people tell that story and convey that story to the folks that
Starting point is 00:30:49 were involved in it, as well as other people within your organization who might have similar things come up in the future. And that's how we learn. That's what we teach, but that's what we learn. I feel like, I'm understanding, there's a big difference between being taught something and learning something, because you usually have to earn that knowledge when you learn it. You can be taught something a thousand times and then you learn that once. And so we're trying to use those moments that we actually learn it, where we earn that hard-earned information through an incident, and tell those stories and convey that. And our team, the solutions team, is in there helping people build these skills, teaching people
Starting point is 00:31:25 how to talk to each other and really find out this information during incidents and then after them. I really want to thank you for being as generous with your time as you have been. If people want to learn more, where's the best place to find you? Oh, I was going to say Twitter, but... Yeah, that's a big open question these days, isn't it? Assuming it's still there at the time this episode airs, it might be a few days between now and then. Where should they find you on Twitter with a big asterisk next to it?
Starting point is 00:31:57 It's at TheMortalEmily, which I started this by saying I like mess, and I'm someone who loves incidents, so I'll be on Twitter. We're there to watch it all burn. Oh, I feel terrible saying that. Actually, if any Twitter engineers are listening to this, someone has found that the TLS certificate is going to expire at the end of this year. Please check Twitter for where that TLS certificate lives so that you all can renew that.
Starting point is 00:32:26 Also, Jeli.io, we have a blog that a lot of us write. Our solutions team, and honestly, a lot of us, we tend to hire folks who have a lot of experience in incident response and analysis. I've never been a solutions engineer before in my life, but I've done a lot of incident response. So we put up a lot of stuff and our goal is to build resources that are available to folks who are trying to make these changes happen,
Starting point is 00:32:50 who are in those organizations where they're still doing five whys and RCAs and are trying to convince people to experiment and change. We have our Howie Guide, which is available for free. It's How We Got Here, which is like a full, free incident analysis guide
Starting point is 00:33:03 and a lot of cool blogs and stuff there. So if I'm on Twitter, we're writing things there. We will, of course, put links to all of that in the show notes. Thank you so much for your time today. It's appreciated. Thank you, Corey. This was great. Emily Ruppe, Solutions Engineer at Jeli.io. I'm cloud economist Corey Quinn, and this is Screaming in the Cloud. If you've enjoyed this podcast, please leave a five-star review on your podcast platform of choice. Whereas if you've hated this episode, please leave a five-star review on your podcast platform of choice, along with an angry comment talking about how we've gotten it wrong, and it is always someone's fault. The Duckbill Group works for you, not AWS. We tailor recommendations to your business, and we get to the point.
Starting point is 00:34:09 Visit duckbillgroup.com to get started.
