Screaming in the Cloud - Non-Incidentally Keeping Tabs on the Internet with Courtney Nash

Episode Date: October 5, 2021

About Courtney

Courtney Nash is a researcher focused on system safety and failures in complex sociotechnical systems. An erstwhile cognitive neuroscientist, she has always been fascinated by how people learn, and the ways memory influences how they solve problems. Over the past two decades, she’s held a variety of editorial, program management, research, and management roles at Holloway, Fastly, O’Reilly Media, Microsoft, and Amazon. She lives in the mountains where she skis, rides bikes, and herds dogs and kids.

Links:
Verica: https://www.verica.io
Twitter: https://twitter.com/courtneynash
Email: courtney@verica.io

Transcript
Starting point is 00:00:00 Hello, and welcome to Screaming in the Cloud, with your host, Chief Cloud Economist at the Duckbill Group, Corey Quinn. This weekly show features conversations with people doing interesting work in the world of cloud, thoughtful commentary on the state of the technical world, and ridiculous titles for which Corey refuses to apologize. This is Screaming in the Cloud. This episode is sponsored in part by our friends at Jellyfish. So you're sitting in your office chair, bleary-eyed,
Starting point is 00:00:35 parked in front of a PowerPoint, and oh, my sweet feathery Jesus, it's the night before the board meeting, because of course it is. As you slot that crappy screenshot of traffic light colored Excel tables into your deck or sift through endless spreadsheets looking for just the right data set, have you ever wondered why is it that sales and marketing get all this shiny,
Starting point is 00:00:55 awesome analytics and insight tools, whereas engineering basically gets left with the dregs? Well, the founders of Jellyfish certainly did. That's why they created the Jellyfish Engineering Management Platform, but don't you dare call it JEMP. Designed to make it simple to analyze your engineering organization, Jellyfish ingests signals from your tech stack, including JIRA, Git, and collaborative tools. Yes, depressing to think of those things as your tech stack, but this is 2021. And they use that to create a model that accurately reflects just how the breakdown of engineering work aligns with your wider business objectives. In other words, it translates from code into spreadsheet. When you have to explain what you're doing
Starting point is 00:01:40 from an engineering perspective to people whose primary IDE is Microsoft PowerPoint, consider Jellyfish. That's jellyfish.co and tell them Corey sent you. Watch for the wince. That's my favorite part. This episode is sponsored in part by our friends at VMware. Let's be honest, the past year has been far from easy due to, well, everything. It caused us to rush cloud migrations and digital transformation, which of course means long hours refactoring your apps, surprises on your cloud bill, misconfigurations, and headaches for everyone trying to manage disparate and fractured cloud environments. VMware has an answer for this. With VMware's multi-cloud solutions, organizations have the choice, speed, and control
Starting point is 00:02:26 to migrate and optimize applications seamlessly without recoding, take the fastest path to modern infrastructure, and operate consistently across the data center, the edge, and any cloud. I urge you to take a look at vmware.com slash go slash multi-cloud. You know my opinions on multi-cloud by now, but there's a
Starting point is 00:02:46 lot of stuff in here that works on any cloud. But don't take it from me. That's vmware.com slash go slash multi-cloud, all one word. And my thanks to them again for their sponsorship of my ridiculous nonsense. Welcome to Screaming in the Cloud. I'm Corey Quinn. Periodically, websites like to fall into the sea and explode. And it's just sort of a thing that we've accepted happens. Well, most of us have. My guest today is Courtney Nash, Internet Incident Librarian at Verica. Courtney, thank you for joining me. Hi, Corey. Thanks so much for having me.
Starting point is 00:03:23 So I'm going to assume that my intro is somewhat accurate, that we sort of accepted that sites will crash into the sea, the internet will break, and then everyone tears their hair out and complains on Twitter, assuming that's not the thing that fell over this time. But what does an internet incident librarian do? Yeah, I'll come back to the first part about how some people have accepted it and some people haven't, I think is the interesting part. So technically, I think my official real title is like research analyst or something really boring. But I have a background in the cognitive sciences and also in technology. And I really have always been fascinated by how these socio-technical systems work. And so as an internet incident librarian, I am doing a number of things to try to better understand both for myself and obviously the company I work for, but for the industry as a whole, what do we really know about how incidents happen, why they happen, when they happen, and what do we do when they happen? And how do we
Starting point is 00:04:25 learn from that? So one of the first things that I'm doing along those lines is actually collecting a database of all of the public write-ups of incidents that happen at companies that are software related. So there's already bodies of work of people who collect airline incidents and other kinds of things. And we don't have that as an industry, which I think is I want to solve that problem. Because I think other industries that have spent some time introspecting about why things fall down or when things fall down and how they fall down. Take the airline industry, for example. Planes don't really fall out of the sky very often. No, when it does, it makes news and everyone's scared about flying. But at the same time,
Starting point is 00:05:10 it's yeah. Do you have any idea how many people die in car crashes in a given hour? Yeah, yeah. And we'll come back to how the media covers things in a minute, because that is definitely something I have opinions about. But, you know, I'm not trying to say like, I want to create the NTSB of the internet. I don't think that's quite the same thing. And I really want something in the spirit of software and the internet and open source that's more collaborative. And it's very open to all of us. So the first step is to just get them in one place. There is no single place where you could go and say, oh, where are all of the X incident reports? Where are all the ones that Microsoft's written?
Starting point is 00:05:46 And also Amazon or Google or, you know, whoever. They have them, but they hide them so thoroughly. It turns out that they don't really put that in big letters on their corporate blog with links to it. And when you look at one incident report, they don't say, here, look at our previous incident reports. They really should, but no one does.
Starting point is 00:06:06 And I think that's fascinating, right? Because there's a precedent. So there's two precedents. And I just gave you basically kind of one side of the two, which is the airline industry has done this, and it's not like people don't fly, right? So a lot of internet companies, a lot of software-based companies seem to be afraid of what their customers or what the stock market or what folks will think. Mind you, these are publicly traded airline companies, right? People aren't going to stop using Amazon just because you give more of this information
Starting point is 00:06:39 out, right? And so I think that piece is, I would love to see that stop being the case. Because the flip side of the coin is that this is a rising tide lifts all boats kind of thing, which granted, not all companies agree on, especially really big ones, because their boats are already mowing all the little ones out of the ocean. But that's another story. Sure. But it's also, it's easy to hide an outage.
Starting point is 00:07:02 Our site is down for, you can say three days. Great. If a customer didn't try to access the site at all during those three days, was the site really down in the first place? Oh, the tree in the forest of internet outages. Yes, it's true. Although I think the companies are, they know that people go complain on social media, right? I think there's more and more of that happening now. It's not like you can hide it as easily as you could have before Twitter or Instagram. Right, whereas a plane falls out of the sky, generally it's one of those things that people notice.
Starting point is 00:07:34 Yeah, even if you weren't interested in that flight at all. Right, when it lands in your garden, you sort of have a comment on this. Yeah, the pieces fall out of the sky. That has happened. But I think the other flip side of that coin I already mentioned is the sort of safety of airline industry has increased so significantly over the past, you know, whatever, 30, 40 years because of this concerted effort. And the other piece of it then is as an industry, as technologists, as people who use software to run their businesses, some of those things are now safety critical. This comes
Starting point is 00:08:07 back to the whole software is sort of running the world now, right? Planes now actually could fall out of the sky because of software, not just because of hardware failures. And nuclear power plants are run by software and your electronic grid and your, you know, healthcare systems, heart rate monitors, insulin pumps. There are a lot of really critical things. And, you know, now our phone services and our internet stuff is so entwined in our lives that people can't be on their Zoom calls. People can't run their businesses. So this stuff has a massive impact on people's lives. It's no longer just pictures of cats on the internet, which admittedly, we've really honed the machine for that. No, but now when software goes down, the biggest arguments people make, the stories people
Starting point is 00:08:50 tell is, oh, well, it meant that the company lost this much money during that time frame. And great, maybe we can argue about, is that really true or is it not? It depends entirely on the company's business model, but I don't like to tend to accept those things at face value. But yeah, that's sort of the small scale thing, especially when you start getting to these massive platform providers. There are a lot of second and third order effects that are a lot more interesting slash important to people's lives than, well, we couldn't show ads to people for an hour and a half.
Starting point is 00:09:23 Right. Yes, absolutely. So T-Mobile had this outage. What is it? How is time? Time's still not working very well for me. I'm trying to remember if it was earlier this year, if it was last year. I think it was 2020. And you're like, T-Mobile, okay, whatever, you know, like cell phones, yada, yada. 911 stopped working. And it was a fascinating outage because these are now actually regulated industries that are heavily software backed. There was a government investigation into that the same way we have, you know, NTSB investigations into airline accidents. And they looked at all of those kind of second or third order effects of like people who, you know, a grandma who was stranded on the road, people who couldn't call 911, like those kinds of things that are really
Starting point is 00:10:10 significant impacts on people's lives. And the second order effect is like, oh yeah, AWS goes down, like you said, and Amazon or people like to say Jeff Bezos, I guess now are they going to complain about how much money Andy loses? I guess so. But what lives on AWS? That's crazy to think about, right? The more I learn the answer to that question, the more disturbed I become. Well, you'd probably know a better answer to that question than a lot of people. They have the big companies they can talk about.
Starting point is 00:10:37 What's really interesting is the companies that they don't and can't. An easy example, financial services is an industry that is notorious for never granting logo rights. Like at some point, they'll begrudgingly admit, yes, our multinational bank does use computers. But it's always like pulling teeth. And I get it in some level. The entire philosophy of a lot of these companies is risk mitigation rather than growth and advancing the current awareness of knowledge.
Starting point is 00:11:04 But it does become a problem. Yeah, it's interesting. I need more data, which we'll get to. Help me, people. But I am able to start seeing some of those interesting graphs of kind of these cascading effects of these kinds of outages. And so I strongly believe that we need to talk about them more, that more companies need to
Starting point is 00:11:27 write them up and publish them and be a lot more transparent about it. And I think there's a number of companies that are sort of showing the way there that, and it has to do with your first question, which is we've all sort of accepted this, right? But I disagree with that. I think those of us who are super close to these kinds of complex, dynamic, distributed systems totally know that they're going to fail. And that's not shocking, nor the case of incompetence. We are building systems that are so big and so complex, no one person, no 10x engineer out there could possibly model or hold the whole thing in their head, especially because it's not even just your systems we were just talking about, right?
Starting point is 00:12:08 Like your stuff's on GitHub, it's on AWS, there's like three other upstream providers, there's this API from over there. These systems are too intricate, too complex. They're going to fail. So back to why all these things failed simultaneously, and it comes out, it's a northern woods, middle of nowhere, backhoe incident. That's right. If we look at the natural food chain of things, fiber optic cable has a natural predator in the form of a backhoe, to the point where if I'm ever lost in the woods, I will drop a length of fiber, kick some dirt over it, wait a few minutes, a backhoe will be along to sever it, then I can follow the backhoe back to civilization. They don't teach that one in the Boy Scout manual, but they really should. Yeah. Yeah. Oh my gosh. There was a beaver outage in Canada, which is, God, that's the most Canadian thing ever.
Starting point is 00:12:56 Can you come up with a more Canadian story than that? I would posit you could not, but give it a shot. No, probably not. Anywho. So I think, like I was saying, those of us close to it accept that, understand it, and are trying to now think about like, okay, well, how do we change our approach and our philosophy about this, knowing that things will fall down? But I think if you look at a lot of the rest of the world, people are still like, what are those idiots doing over there? Why did their site fall down?
Starting point is 00:13:22 Oh my God. Right? The general population is the worst on stuff like this. The absolute worst. The media is the worst. It's how did they wind up going down? Yeah, because this stuff is complicated. Back when I was getting started in tech,
Starting point is 00:13:38 I thought the whole thing worked on magic. So I started figuring out how different pieces of it worked. And now I'm convinced it runs on magic. The most amazing thing is this all works together because spit and duct tape and baling wire holding this stuff together would be an upgrade from a lot of the stuff that currently exists in the real world. And it's amazing. You want to know the secret, Corey? You know what holds it all together? Hit me with it. Hope? Tears? People. Technology is Soylent Green, Corey.
Starting point is 00:14:09 It's Soylent Green. It's made of people. And that's the thing that always bugs me on Twitter. The whole HugOps movement has it right. When you see a big provider taking an outage, all their competitors are immediately there with, man, hope things get back together soon. Best of luck. Let us know if we can help.
Starting point is 00:14:28 And that's super reassuring because today it's their outage, tomorrow it's yours. And once in a blue moon, you see someone who's relatively new to the industry starting to try and market their stuff based on someone else's outage. And they basically get their butts fed to them just because it's not what you do and it's not how we operate. And it's one of the few moments where I look at this and realize that maybe people's inherent nature isn't all terrible. Oh, I would hope that that would be something that comes out of all of this. Yeah. No one goes to work at their day job doing what we do to suck, right? To do a bad job.
Starting point is 00:15:03 Unless you're in Facebook's ethics department, I completely agree with you. Yes. All right. There are a few caveats to that probably, but you know, we all want to show up and like do good stuff. Like no, like, so nobody's going in trying to take the site down, barring bad actor stuff that's not relevant. When Azure takes an outage, AWS is not sitting there going, ah, we're going to win more cloud deals because of this, because they're smarter than that. It's no, people are going to look at this and say, we, A, think about that, and B, how we react to them. And we will develop much more useful models of our safety boundaries, right? That's really it. You don't know. No one at any of these companies hardly knows.
Starting point is 00:16:02 If you're five steps from the cliff, five feet, driving a Ferrari 90 miles an hour towards the edge of it. We don't know. It's amazing to me just how much in the dark we are as an industry and how much of the world we're running. So I think this is one tiny first little step in what could be, you know, sort of a sea change about how all of this works. So that's a big part of why I'm doing what I'm doing. Well, let's talk about something else you're doing. So tell me a little bit about Void. Yeah. So that's the first iteration of this. So it's the Verica Open Incident Database. I feel like I have to say this almost every time. John Allspaw would like me to say that it's the Verica Open Incident Report Database,
Starting point is 00:16:50 but VOID is way cooler than... VOIRD? Yeah, that sounds like you're trying to make fun of someone ineffectively. Yeah, and there's a reason why he's not in marketing. But what this is, is a collection of all of the publicly available incident reports in one place, easily searchable. You can search by company, you can search by technology, you can filter things by the types of sort of kinds of failure modes that we're seeing. And it's, I hope, valuable to a wide swath of folks, both technologists and otherwise, researchers, media and press types, analysts and whatnot. And my biggest desire is that people will look at it, realize how incomplete it is, and then help me fill it. Help me fill the void, people. I think I have right now at the time we're talking about 1,700, maybe 1,800 of these. And they run the gamut. And I know some people who like
Starting point is 00:17:45 to quibble about language. And I am one of those people having been an editor in various flavors of my life. Not all of these are what a lot of people directly related to sort of incident management and whatnot would call incident reports. I wanted to collect a corpus that reflects all of the public information about software related incidents. So it's anything from tweets, either from a company or just from people, to a status page, to a media article, a news article, an online article, to a full-blown deep dive retrospective or postmortem from a company that really does go into detail. It's the whole gamut. It's all of those things. I have no opinionated take on that. I want that all to be available to people. And we've collected some
Starting point is 00:18:31 metadata on all of the incidents as well. So we're collecting the obvious things like when did it happen? What date was it? If we can figure it out or if it's explicit, how long was it? And so those kinds of things. And then we collect some metadata. Like I said, we add some tags. Was this a complete production outage? Was it a partial outage? Those kinds of things. And this is all directly just taken from the language of the report. And we're not trying, like I said, we're trying not to have any sort of really subjective takes on any of that, but it's a bit of metadata that helps people spelunk some of this stuff. So if it is the kind of report, these are usually from like a status page
Starting point is 00:19:06 or a company post about it, what kinds of things were involved in this outage? So sometimes you'll get lucky and the company will tell you it was DNS because, you know, it's always DNS. On some level, it always is. That's why DNS is my database. It's a database problem.
Starting point is 00:19:21 It's a database problem. And sometimes you get even more detail. And so we will put as much of that that's in the report into a set of metadata about these things. So I think there's some fascinating, really easy things that I've already seen from some of these data. And we kind of hit on one of these, which is the way that companies themselves talk about these outages versus the way that press and media and other types of organizations
Starting point is 00:19:43 talk about these things. So I think there's a whole bunch of really fascinating analysis that's going to be available to nerdy, research-minded type folks like myself. I think it's a place, though, that where technologists can also kind of go and spelunk things that they're interested in, looking for patterns. And I think it's really, there's an opportunity for experts in the field to add insights to what we can discern from these public, you know, sort of incident reports. There are like two orders abstracted from what happened internally, but I think there still is a lot that we can learn from those. So the first iteration of the void
Starting point is 00:20:15 will allow people to get a first look at some of the data and to help me hopefully add to it, grow that corpus over time, and we'll see where that goes. This episode is sponsored by our friends at Oracle Cloud. Counting the pennies, but still dreaming of deploying apps instead of Hello World demos? Allow me to introduce you to Oracle's Always Free tier. It provides over 20 free services and infrastructure, networking, databases, observability, management, and security. And let me be clear here, it's actually free. There's no surprise billing until you intentionally and proactively upgrade your account. This means you can provision a virtual machine instance or
Starting point is 00:20:57 spin up an autonomous database that manages itself, all while gaining the networking, load balancing, and storage resources that somehow never quite make it into most free tiers needed to support the application that you want to build. With Always Free, you can do things like run small-scale applications or do proof-of-concept testing without spending a dime. You know that I always like to put asterisk next to the word free. This is actually free. No asterisk. Start now. Visit snark.cloud slash oci-free. That's snark.cloud slash oci-free. I love the idea of having a centralized place where outages, postmortems, root cause analyses, I'll let you tear into that in a minute, and other things that
Starting point is 00:21:43 are all tied to where can I find a list of outages? Because companies list these on their websites. They put them in blog posts. And it's always very begrudging. They don't link them from any other place. You have to know the magic incantation to find the buried link on their site. Having something that is easily searchable for outages is really something that's kind of valuable. Yeah.
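As a rough illustration of the kind of metadata described for the VOID earlier in the conversation (company, date, duration, type of report, tags such as complete or partial outage, technologies involved), here is a minimal Python sketch. Everything in it is an assumption made for illustration: the field names, the `search` helper, and the sample records are hypothetical and are not the VOID's actual schema or data.

```python
from dataclasses import dataclass, field
from datetime import date
from typing import Optional


@dataclass
class IncidentReport:
    """One public write-up of a software incident (hypothetical schema)."""
    company: str
    report_type: str                        # e.g. "tweet", "status page", "news article", "postmortem"
    occurred_on: Optional[date] = None      # only when the report states or implies it
    duration_minutes: Optional[int] = None  # only when the report states it
    tags: set = field(default_factory=set)          # e.g. {"partial outage"}
    technologies: set = field(default_factory=set)  # e.g. {"DNS"}


def search(reports, company=None, technology=None, tag=None):
    """Return the reports matching every filter that was supplied."""
    results = []
    for report in reports:
        if company is not None and report.company.lower() != company.lower():
            continue
        if technology is not None and technology not in report.technologies:
            continue
        if tag is not None and tag not in report.tags:
            continue
        results.append(report)
    return results


# Made-up sample records, purely for demonstration.
REPORTS = [
    IncidentReport(
        company="ExampleCo",
        report_type="status page",
        occurred_on=date(2021, 1, 15),
        duration_minutes=90,
        tags={"complete production outage"},
        technologies={"DNS"},
    ),
    IncidentReport(
        company="ExampleCo",
        report_type="postmortem",
        tags={"partial outage"},
        technologies={"storage"},
    ),
    IncidentReport(company="SampleCorp", report_type="tweet"),
]

if __name__ == "__main__":
    for report in search(REPORTS, company="exampleco"):
        print(report.company, report.report_type, sorted(report.tags))
```

Optional fields default to empty because, as described in the conversation, the metadata is taken only from what a report itself says; a tweet may carry nothing beyond the company name.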
Starting point is 00:22:05 And I mean, some of them are like, I'm looking at you, Microsoft. I like you for a lot of reasons, but hey, I have to scroll your status page. I can't link directly to their write-ups. And this is Azure. And please stop. Make it easier.
Starting point is 00:22:21 You're driving me crazy. I don't even have a data model to figure out how to make this work for people other than like taking screenshots of them. So yeah, so there's shades of gray and black and how much they'll share or how easy it is to find these things. So it'll be interesting to see if there's any less than positive reactions to all of this being available in one place. I'm anticipating at least a little bit of that. There is one other type of metadata that we collect for the void, and that is the type of analysis that is conducted if it is clear what that type of analysis is. And some companies explicitly say, or call it an RCA. We did a root cause analysis. There's a few other types. Some people talk about having a contributing factors analysis. Most people don't consider a formal analysis type, but I am trying to collect and categorize
Starting point is 00:23:10 these because I do think there are some fascinating implications buried therein. And I would like to see if I can keep track of whether or not those change over time. And yes, you've hit on one of my favorite hot take soapbox things, which is root cause. Please take it away. Yeah, well, and anyone who's close to these systems and has watched these things fall down has the inherent sense that there is no root cause, right? Like, let's great. One of my favorite ones, human error. We don't have enough hours for this, Corey.
Starting point is 00:23:44 I'm sorry. That's one of my favorite other ones. But let's say somebody fat fingers a config change, right? Which happens. That was fundamentally the S3 service disruption back in 2017 that took down S3 for hours on end. And took down so many other people that relied on S3. Everything was tied to that. And that's an interesting question. When something like that hits, does that mean that everything it takes down gets its own entry in void? I hope so. If everybody writes them up, then yes. So if S3 goes down and you go down and you write it up, you put it in the void, then we can see those things, which would be so cool. But let's go
Starting point is 00:24:20 back to the fat fingered config file, which if you haven't ever done, you're lying, first of all. Or you haven't been allowed to touch anything large and breakable yet, which either way, you're lying on some level. So please. Yeah, I mean, I took down Holloway's homepage when it was on Hacker News because I'm a yaml. So anywho, even if you fat-finger a config change, that's not the root cause because you have this system wherein a fat-fingered config
Starting point is 00:24:46 change can take down S3. That is a very big, complex, and I might add, socio-technical system. There are decisions that were made long ago about why it was structured that way or why this happens that way or what kinds of checks and balances you have. It's just, get over it, people. There is no root cause. These are complex, highly dynamic systems that when they fail, they fail in unpredictable and weird ways because we've built them that way. They're complex because you're successful at pushing the envelope and your safety boundaries. So if we could get past the root cause thing as an industry, I mean, I could probably just retire happy, honestly.
Starting point is 00:25:27 I'm a simple woman. Can we just get one thing, people? First of all, then it gives non-technologists, people outside of our bubble, the media, you can't hang it on these things anymore. You know, we all have to then sort of grapple with the complexity, which admittedly, humans, not big fans of, but... People want simple stories, simple narratives.
Starting point is 00:25:50 And people say, oh, remember the S3 outage? They don't want to sit there and have to recount 50,000 different details. They want to say, oh, yeah, it took down a few big sites like Instagram, United Airlines, and it was a real mess. The end. They want something that fits in a tweet, not something that fits in a thesis. Well, and if you have a single root cause, then you can fix the root cause and it will never happen again. Right? That's the theory. If we're just a little bit more careful, we're never going to have outages anymore. Yeah. If we could just train those humans to not try to make the best possible
Starting point is 00:26:21 high quality decision they could possibly make in that situation, given the information they have at the time, then we'll do better. But I mean, that's why your systems stay up most of the time, if you think about it. It's shocking how well these things actually work the vast majority of the time. And that's what we could learn from this too. We could, you know, oh, if we would write near misses up, please. I mean, if I could have one more wish, I think one of the coolest things the airline industry and the government side of that did was start writing up near misses. Wow, what do we learn from when we're successful versus trying to, you know, like, spelunk and nitpick the failures? Most of us aren't so good at the whole introspection part. We need failures. We
Starting point is 00:27:02 need painful outages to really force us to make difficult, introspective, soul-searching decisions and learn from them. Yeah. And I don't disagree with that. I just wish one of the things we would learn is that we should study our successes too. There's more to be mined from our successes if we can figure out how to do that than there is from our failures. So I have a metadata category in the void called near miss. And oh man, I really wish people would write those up more. I mean, I think there's like five things in there that I found so far because the humans hold these systems together, right? We make these things work the vast majority of the time. That's why there is
Starting point is 00:27:42 no root cause. And even when we're involved in these things, we're also involved in preventing them, or solving them, or remediating them. So yeah, there's no root cause. Humans aren't the problem. Those are my big hot button ones. I really wish more places would embrace that. Even Amazon uses the root cause terminology internally,
Starting point is 00:28:01 and I'm not gonna sit here and tell them how to run large things at scale. That's what I pay them to figure out for me. But I can't shake the feeling that by using that somewhat reductive terminology, that they're glossing over an awful lot of things the rest of us could really benefit from. Well, so the question then, one of the other things that I look at is personally, when I read and analyze these incident reports, these public ones a lot, I always ask myself, who's the audience for this? And there are different audiences for different types of incident reports and different things.
Starting point is 00:28:32 You know, the vast majority of them are for customers, partners, investors. The stock market. Yes, yes. You know, they're not actually for the organization. There's usually an internal one that we don't get to see, maybe that's for the organization, but a lot of places feel that if you have a process and a template and a checklist
Starting point is 00:28:54 and a list of action items at the end, then you've done the right thing, right? You've had your incident, you've talked about it, you've got your action items, move on. Right, and it always seems with companies that as you get further into the company, the more honest and transparent the actual analysis is. Like at some point you wind up with the, like they're very public and very cagey and under NDA, they open up a little bit
Starting point is 00:29:16 more and a little bit more. And finally, when you work there on their executive team, it turns out the actual thing was, well, Dewey was carrying an armful of boxes in the data center, tripped, and went cascading face-first into the EPO cutoff switch that cut power to the entire facility. The cagier they get, the, I guess, not to be unkind here, but the more ridiculous, whatever the actual answer is. It's one of those things where really someone tripped and hit a button. You didn't have a plan for that? Well, not really. We sort of assumed that people would- Why would you have a plan for that, right?
Starting point is 00:29:47 Right. Why would you have a plan for that the first time? Yeah. I mean, so imagine that. Imagine this exercise, sitting down in a room with a bunch of people and going, what are all the things that could go wrong? I mean, ain't nobody got time for that. That's not how it works. You all have other jobs to do too and systems to build and pressures and customers and partners and features to build. So admit that and acknowledge that you just won't know all of the antecedents and how do you respond when things happen,
Starting point is 00:30:19 which is a whole other, you know. I know you told me you recorded an episode with Dr. Christina Maslach on burnout, which I'm so happy you did. And there's, there's a whole nother piece of sort of incidents and incident response and burning people out and blaming people and all that stuff. That's a whole nother part. It sounds like you might, you know, probably not incidents with her, but still these things take a toll on people and people who, like I said, show up every day, really hoping to do their best job and go up a ladder and get a promotion and, you know, whatever. So I think not just treating those things as checklists has broader implications as
Starting point is 00:30:56 well, just for the well-being of your organization. On some level, the biggest problem that I think we've run into is that, as you said, it all comes down to people. Unfortunately, legally, we can't patch those yet. liberal arts degree. Come on, help me out, people. Because there's so much of these socio-technical systems where the socio part of it is more relevant than the actual technical part. I believe you're right. For better or worse, there's no way around it. Thank you so much for taking the time to speak with me. If people want to learn more about what you're up to, where can they find you? And we will, of course, throw a link to Void in the show notes. Yeah, I also like to talk on Twitter like you do.
Starting point is 00:31:49 I'm not as good at it as you are, but I try. So yeah, I'm Courtney Nash on Twitter. And at Verica, you can find me at Verica as well, courtney@verica.io. And those are the best ways to find me, I would say. And yeah, please, people, write up your incidents, send them to the void, and let's all learn and get better together, please. Thank you so much for taking the time to speak with me today. I really do appreciate it.
Starting point is 00:32:17 Thank you for having me on. I know, do people say this? I'm like, yeah, big fan, but I am. I'm a big fan. Oh, dear Lord, find better things to listen to. My God. But it's been a treat. Thank you. Courtney Nash, Internet Incident Librarian at Verica. I'm cloud economist, Corey Quinn, and this is Screaming in the Cloud. If you've enjoyed this podcast, please leave a five-star review on your podcast platform of choice. Whereas if you've hated this podcast, please leave a five-star review on your podcast platform of choice, along with a comment making it very clear that for whatever reason the website is down, it is most certainly not your fault. If your AWS bill keeps rising and your blood pressure is doing the same, then you need the Duckbill Group.
Starting point is 00:33:08 We help companies fix their AWS bill by making it smaller and less horrifying. The Duckbill Group works for you, not AWS. We tailor recommendations to your business, and we get to the point. Visit duckbillgroup.com to get started.
