Screaming in the Cloud - Non-Incidentally Keeping Tabs on the Internet with Courtney Nash
Episode Date: October 5, 2021
About Courtney: Courtney Nash is a researcher focused on system safety and failures in complex sociotechnical systems. An erstwhile cognitive neuroscientist, she has always been fascinated by how people learn, and the ways memory influences how they solve problems. Over the past two decades, she's held a variety of editorial, program management, research, and management roles at Holloway, Fastly, O'Reilly Media, Microsoft, and Amazon. She lives in the mountains where she skis, rides bikes, and herds dogs and kids.
Links:
Verica: https://www.verica.io
Twitter: https://twitter.com/courtneynash
Email: courtney@verica.io
Transcript
Hello, and welcome to Screaming in the Cloud, with your host, Chief Cloud Economist at the
Duckbill Group, Corey Quinn.
This weekly show features conversations with people doing interesting work in the world
of cloud, thoughtful commentary on the state of the technical world, and ridiculous titles
for which Corey refuses to apologize.
This is Screaming in the Cloud.
This episode is sponsored in part by our friends at Jellyfish.
So you're sitting in your office chair, bleary-eyed,
parked in front of a PowerPoint,
and oh, my sweet feathery Jesus,
it's the night before the board meeting,
because of course it is.
As you slot that crappy screenshot
of traffic
light colored Excel tables into your deck or sift through endless spreadsheets looking for just the
right data set, have you ever wondered why is it that sales and marketing get all this shiny,
awesome analytics and insight tools, whereas engineering basically gets left with the dregs?
Well, the founders of Jellyfish certainly did. That's why they created
the Jellyfish Engineering Management Platform, but don't you dare call it JEMP. Designed to make it
simple to analyze your engineering organization, Jellyfish ingests signals from your tech stack,
including JIRA, Git, and collaborative tools. Yes, depressing to think of those things as your
tech stack, but this is 2021. And they use that to create a model that accurately reflects just how the
breakdown of engineering work aligns with your wider business objectives. In other words,
it translates from code into spreadsheet. When you have to explain what you're doing
from an engineering perspective to people whose primary IDE is Microsoft PowerPoint,
consider Jellyfish. That's jellyfish.co and tell them Corey sent you. Watch for the wince.
That's my favorite part. This episode is sponsored in part by our friends at VMware.
Let's be honest, the past year has been far from easy due to, well, everything. It caused us to rush cloud migrations and digital
transformation, which of course means long hours refactoring your apps, surprises on your cloud
bill, misconfigurations, and headaches for everyone trying to manage disparate and fractured
cloud environments. VMware has an answer for this. With VMware's multi-cloud solutions,
organizations have the choice, speed, and control
to migrate and optimize applications seamlessly
without recoding,
take the fastest path to modern infrastructure,
and operate consistently across the data center,
the edge, and any cloud.
I urge you to take a look at
vmware.com slash go slash multi-cloud.
You know my opinions on multi-cloud by now, but there's a
lot of stuff in here that works on any cloud. But don't take it from me. That's vmware.com slash go
slash multi-cloud, all one word. And my thanks to them again for their sponsorship of my ridiculous
nonsense. Welcome to Screaming in the Cloud. I'm Corey Quinn. Periodically, websites like to fall into the sea and explode.
And it's just sort of a thing that we've accepted happens.
Well, most of us have.
My guest today is Courtney Nash, Internet Incident Librarian at Verica.
Courtney, thank you for joining me.
Hi, Corey. Thanks so much for having me.
So I'm going to assume that my intro is
somewhat accurate, that we sort of accepted that sites will crash into the sea, the internet will
break, and then everyone tears their hair out and complains on Twitter, assuming that's not
the thing that fell over this time. But what does an internet incident librarian do? Yeah,
I'll come back to the first part about how some people have accepted it and some people haven't, which I think is the interesting part. So technically, I think my official real title is like research analyst or something really boring. But I have a background in the cognitive sciences and also in technology. And I really have always been fascinated by how these socio-technical systems work. And so as an
internet incident librarian, I am doing a number of things to try to better understand both for
myself and obviously the company I work for, but for the industry as a whole, what do we really
know about how incidents happen, why they happen, when they happen, and what do we do when they happen? And how do we
learn from that? So one of the first things that I'm doing along those lines is actually collecting
a database of all of the public write-ups of incidents that happen at companies that are
software related. So there's already bodies of work of people who collect airline incidents and other kinds of things.
And we don't have that as an industry, which I think is I want to solve that problem.
Because I think other industries that have spent some time introspecting about why things fall down or when things fall down and how they fall down.
Take the airline industry, for example.
Planes don't really fall out of the sky very often.
No, when it does, it makes news and everyone's scared about flying. But at the same time,
yeah. Do you have any idea how many people die in car crashes in a given hour?
Yeah, yeah. And we'll come back to how the media covers things in a minute,
because that is definitely something I have opinions about. But, you know, I'm not trying
to say like, I want to create the NTSB of the internet. I don't think that's quite the same thing. And I really want something in the spirit
of software and the internet and open source that's more collaborative. And it's very open
to all of us. So the first step is to just get them in one place. There is no single place
where you could go and say, oh, where are all of the X incident reports?
Where are all the ones that Microsoft's written?
And also Amazon or Google or, you know, whoever.
They have them, but they hide them so thoroughly.
It turns out that they don't really put that
in big letters on their corporate blog with links to it.
And when you look at one incident report,
they don't say, here, look at our previous incident reports.
They really should, but no one does.
And I think that's fascinating, right?
Because there's a precedent.
So there's two precedents.
And I just gave you basically kind of one side of the two, which is the airline industry has done this, and it's not like people don't fly, right?
So a lot of internet companies, a lot of software-based companies seem to be afraid of what their
customers or what the stock market or what folks will think.
Mind you, these are publicly traded airline companies, right?
People aren't going to stop using Amazon just because you give more of this information
out, right?
And so I think that piece is, I would love to see that stop being the case.
Because the flip side of the coin is that this is a rising tide lifts all boats kind of thing,
which granted, not all companies agree on, especially really big ones,
because their boats are already mowing all the little ones out of the ocean.
But that's another story.
Sure.
But it's also easy to hide an outage.
Our site is down for, say, three days.
Great. If a customer didn't try to access the site at all during those three days,
was the site really down in the first place?
Oh, the tree in the forest of internet outages. Yes, it's true. Although I think the companies
are, they know that people go complain on social media, right? I think there's more and more of
that happening now. It's not like you can hide it as easily as you could have before Twitter or Instagram.
Right, whereas a plane falls out of the sky,
generally it's one of those things that people notice.
Yeah, even if you weren't interested in that flight at all.
Right, when it lands in your garden, you sort of have a comment on this.
Yeah, the pieces fall out of the sky. That has happened.
But I think the other flip side of that coin I already mentioned is that the safety of the airline
industry has increased so significantly over the past, you know, whatever, 30, 40 years because
of this concerted effort. And the other piece of it then is as an industry, as technologists,
as people who use software to run their businesses, some of those things are now
safety critical. This comes
back to the whole software is sort of running the world now, right? Planes now actually could fall
out of the sky because of software, not just because of hardware failures. And nuclear power
plants are run by software and your electronic grid and your, you know, healthcare systems, heart rate monitors, insulin pumps.
There are a lot of really critical things. And, you know, now our phone services and our internet
stuff is so entwined in our lives that people can't be on their Zoom calls. People can't run
their businesses. So this stuff has a massive impact on people's lives. It's no longer just
pictures of cats on the internet, which admittedly, we've really honed the machine for that.
No, but now when software goes down, the biggest arguments people make, the stories people
tell is, oh, well, it meant that the company lost this much money during that time frame.
And great, maybe we can argue about, is that really true or is it not?
It depends entirely on the company's business model, but I don't tend to accept those things at face value.
But yeah, that's sort of the small scale thing,
especially when you start getting to these massive platform providers.
There are a lot of second and third order effects
that are a lot more interesting slash important to people's lives
than, well, we couldn't show ads to people for an hour and a half.
Right. Yes, absolutely. So T-Mobile had this outage. What is it? How is time? Time's still
not working very well for me. I'm trying to remember if it was earlier this year, if it
was last year. I think it was 2020. And you're like, T-Mobile, okay, whatever, you know, like
cell phones, yada, yada. 911 stopped working. And it was a fascinating outage because these are now
actually regulated industries that are heavily software backed. There was a government
investigation into that the same way we have, you know, NTSB investigations into airline accidents.
And they looked at all of those kind of second or third order effects of like people who,
you know, a grandma who was stranded on the road, people who couldn't call 911, like those kinds of things that are really
significant impacts on people's lives. And the second order effect is like, oh yeah, AWS goes
down, like you said, and Amazon or people like to say Jeff Bezos, I guess now are they going to
complain about how much money Andy loses? I guess so. But what lives on AWS?
That's crazy to think about, right?
The more I learn the answer to that question,
the more disturbed I become.
Well, you'd probably know a better answer to that question than a lot of people.
They have the big companies they can talk about.
What's really interesting is the companies
that they don't and can't.
An easy example, financial services is an industry
that is notorious for never granting logo rights.
Like at some point, they'll begrudgingly admit, yes, our multinational bank does use computers.
But it's always like pulling teeth.
And I get it in some level.
The entire philosophy of a lot of these companies is risk mitigation rather than growth and advancing the current awareness of knowledge.
But it does become a problem.
Yeah, it's interesting.
I need more data, which we'll get to.
Help me, people.
But I am able to start seeing some of those interesting graphs of kind of these cascading effects of these kinds of outages.
And so I strongly believe that we need to talk about them more,
that more companies need to
write them up and publish them and be a lot more transparent about it. And I think there are a number
of companies that are sort of showing the way there. And it has to do with your first
question, which is that we've all sort of accepted this, right? But I disagree with that. I think
those of us who are super close to these kinds
of complex, dynamic, distributed systems totally know that they're going to fail. And that's not
shocking, nor is it a case of incompetence. We are building systems that are so big and so complex,
no one person, no 10x engineer out there could possibly model or hold the whole thing in their
head, especially because it's not even just your systems we were just talking about, right?
Like your stuff's on GitHub, it's on AWS, there's like three other upstream providers,
there's this API from over there. These systems are too intricate, too complex. They're going to
fail. So back to why all these things failed simultaneously, and it comes out, it's a
northern woods, middle of nowhere, backhoe incident.
That's right.
If we look at the natural food chain of things, fiber optic cable has a natural predator in the form of a backhoe, to the point where if I'm ever lost in the woods, I will drop a length of fiber, kick some dirt over it, wait a few minutes, a backhoe will be along to sever it, then I can follow the backhoe back to civilization. They don't teach that one in the Boy Scout manual, but they really should.
Yeah. Yeah. Oh my gosh. There was a beaver outage in Canada, which is,
God, that's the most Canadian thing ever.
Can you come up with a more Canadian story than that? I would posit you could not,
but give it a shot.
No, probably not. Anywho. So I think, like I was saying, those of us close to it accept that, understand it,
and are trying to now think about like, okay, well, how do we change our approach and our
philosophy about this, knowing that things will fall down?
But I think if you look at a lot of the rest of the world, people are still like,
what are those idiots doing over there?
Why did their site fall down?
Oh my God.
Right?
The general population is the worst on stuff like this.
The absolute worst.
The media is the worst.
It's how did they wind up going down?
Yeah, because this stuff is complicated.
Back when I was getting started in tech,
I thought the whole thing worked on magic.
So I started figuring out how different pieces of it worked.
And now I'm convinced it runs on magic.
The most amazing thing is this all works together because spit and duct tape and
baling wire holding this stuff together would be an upgrade from a lot of the stuff that currently
exists in the real world. And it's amazing. You want to know the secret, Corey? You know
what holds it all together? Hit me with it. Hope? Tears? People.
Technology is soylent green, Corey.
It's soylent green.
It's made of people.
And that's the thing that always bugs me on Twitter.
The whole hug ops movement has it right.
When you see a big provider taking an outage, all their competitors are immediately there with, man, hope things get back together soon.
Best of luck.
Let us know if we can help.
And that's super reassuring because today it's their outage, tomorrow it's yours.
Yep.
And once in a blue moon, you see someone who's relatively new to the industry starting to try and market their stuff based on someone else's outage. And they basically get
their butts fed to them just because it's not what you do and it's not how we operate.
And it's one of the few moments
where I look at this and realize that maybe people's inherent nature isn't all terrible.
Oh, I would hope that that would be something that comes out of all of this.
Yeah.
No one goes to work at their day job doing what we do to suck, right? To do a bad job.
Unless you're in Facebook's ethics department,
I completely agree with you. Yes. All right. There are a few caveats to that probably, but
you know, we all want to show up and like do good stuff. Like no, like, so nobody's going in trying
to take the site down, barring bad actor stuff that's not relevant. When Azure takes an outage,
AWS is not sitting there going, ah, we're going to win more cloud deals because of this, because
they're smarter than that. It's no, people are going to look at this and say, we, A, think about that,
and B, how we react to them. And we will develop much more useful models of our safety boundaries,
right? That's really it. You don't know. Hardly anyone at any of these companies knows
if you're five steps from the cliff, or five feet,
driving a Ferrari 90 miles an hour towards the edge of it. We don't know. It's amazing to me
just how much in the dark we are as an industry and how much of the world we're running.
So I think this is one tiny first little step in what could be, you know, sort of a sea change about how
all of this works. So that's a big part of why I'm doing what I'm doing. Well, let's talk about
something else you're doing. So tell me a little bit about Void. Yeah. So that's the first iteration
of this. So it's the Verica Open Incident Database. I feel like I have to say this almost
every time. John Allspaw would like me to say that it's the Verica Open Incident Report Database,
but Void is way cooler than... Voird? Yeah, that sounds like you're trying to make fun of someone
ineffectively. Yeah, and there's a reason why he's not in marketing. But what this is, is a
collection of all of the publicly available incident reports in one place, easily searchable.
You can search by company, you can search by technology, you can filter things by the sorts of failure modes that we're seeing.
And it's, I hope, valuable to a wide swath of folks, both technologists and otherwise, researchers, media and press types, analysts and whatnot.
And my biggest desire is that people will look at it, realize how incomplete it is,
and then help me fill it. Help me fill the void, people. I think I have right now at the time we're
talking about 1,700, maybe 1,800 of these. And they run the gamut. And I know some people who like
to quibble about language. And I am one of those people having been an editor in various flavors
of my life. Not all of these are what a lot of people directly related to sort of incident
management and whatnot would call incident reports. I wanted to collect a corpus that
reflects all of the public information about software related incidents. So it's anything from tweets, either from a company
or just from people, to a status page, to a media article, a news article, an online article,
to a full-blown deep dive retrospective or postmortem from a company that really does
go into detail. It's the whole gamut. It's all of those things. I have no
opinionated take on that. I want that all to be available to people. And we've collected some
metadata on all of the incidents as well. So we're collecting the obvious things like when did it
happen? What date was it? If we can figure it out or if it's explicit, how long was it? And so those
kinds of things. And then we collect some metadata. Like I said,
we add some tags. Was this a complete production outage? Was it a partial outage?
Those kinds of things. And this is all directly just taken from the language of the report.
And we're not trying, like I said, we're trying not to have any sort of really subjective
takes on any of that, but it's a bit of metadata that helps people spelunk some of this stuff.
So if it is the kind of report, these are usually from like a status page
or a company post about it,
what kinds of things were involved in this outage?
So sometimes you'll get lucky
and the company will tell you it was DNS
because, you know, it's always DNS.
On some level, it always is.
That's why DNS is my database.
It's a database problem.
It's a database problem.
And sometimes you get even more detail.
And so we will put as much of that that's in the report into a set of metadata about
these things.
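[For illustration only: a single entry in an incident database like the VOID, carrying the kinds of metadata Courtney describes here, when it happened, how long it lasted, whether it was a complete or partial outage, what the report itself says was involved, and what type of analysis was done, might look roughly like the sketch below. The field names and values are hypothetical stand-ins, not VOID's actual schema.]

# Hypothetical sketch of one incident record; field names are illustrative,
# not the VOID's real schema.
incident = {
    "organization": "Example Cloud Co.",              # who published the write-up
    "report_url": "https://example.com/postmortem",   # where the public report lives
    "report_type": "retrospective",                    # tweet, status page, news article, retrospective, ...
    "date": "2021-06-08",                              # when it happened, if stated
    "duration_minutes": 94,                            # only if explicit in the report
    "severity": "partial outage",                      # complete vs. partial production outage
    "technologies": ["DNS", "CDN"],                    # taken directly from the report's own language
    "analysis_type": "contributing factors",           # RCA, contributing factors, or unspecified
    "near_miss": False,                                # the rarely written-up but valuable category
}

# Simple spelunking over a collection of such records: find everything
# involving DNS, because it's always DNS.
incidents = [incident]
dns_incidents = [i for i in incidents if "DNS" in i["technologies"]]
print(f"DNS-related incidents: {len(dns_incidents)}")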
So I think there's some fascinating, really easy things that I've already seen from some
of these data.
And we kind of hit on one of these, which is the way that companies themselves talk
about these outages versus the way that press and media and other types of organizations
talk about these things. So I
think there's a whole bunch of really fascinating analysis that's going to be available to nerdy,
research-minded type folks like myself. I think it's a place, though, that where technologists
can also kind of go and spelunk things that they're interested in, looking for patterns.
And I think it's really, there's an opportunity for experts in the field to add insights to what
we can discern from these public, you know,
sort of incident reports. They are, like, two orders abstracted from what happened internally,
but I think there still is a lot that we can learn from those. So the first iteration of the void
will allow people to get a first look at some of the data and to help me hopefully add to it,
grow that corpus over time, and we'll see where that goes. This episode is sponsored by our friends at Oracle Cloud.
Counting the pennies, but still dreaming of deploying apps instead of Hello World demos?
Allow me to introduce you to Oracle's Always Free tier.
It provides over 20 free services and infrastructure, networking, databases, observability, management,
and security.
And let me be clear here, it's actually free. There's no surprise billing until you intentionally
and proactively upgrade your account. This means you can provision a virtual machine instance or
spin up an autonomous database that manages itself, all while gaining the networking,
load balancing, and storage resources that somehow
never quite make it into most free tiers needed to support the application that you want to build.
With Always Free, you can do things like run small-scale applications or do proof-of-concept
testing without spending a dime. You know that I always like to put asterisk next to the word free.
This is actually free. No asterisk. Start now. Visit snark.cloud slash oci-free.
That's snark.cloud slash oci-free. I love the idea of having a centralized place where outages,
postmortems, root cause analyses, I'll let you tear into that in a minute, and other things that
are all tied to where can I find a list of outages?
Because companies list these on their websites.
They put them in blog posts.
And it's always very begrudging.
They don't link them from any other place.
You have to know the magic incantation to find the buried link on their site.
Having something that is easily searchable for outages is really something that's kind of valuable.
Yeah.
And I mean, some of them are like,
I'm looking at you, Microsoft.
I like you for a lot of reasons,
but hey, I have to scroll your status page.
I can't link directly to their write-ups.
And this is Azure.
And please stop.
Make it easier.
You're driving me crazy.
I don't even have a data model to figure out
how to make this work for people
other than like taking screenshots of them. So yeah, so there are shades of gray and black in how much they'll share or how easy it is to find these things. So it'll be interesting to see if there are any less-than-positive reactions to all of this being available in one place. I'm anticipating at least a little bit of that. There is one other type of metadata that
we collect for the void, and that is the type of analysis that is conducted if it is clear what
that type of analysis is. Some companies explicitly say it, or call it an RCA: we did a
root cause analysis. There are a few other types. Some people talk about having a contributing
factors analysis. Most don't specify a formal analysis type, but I am trying to collect and categorize
these because I do think there are some fascinating implications buried therein.
And I would like to see if I can keep track of whether or not those change over time.
And yes, you've hit on one of my favorite hot take soapbox things, which is root cause.
Please take it away.
Yeah, well, and anyone who's close to these systems and has watched these things fall down has the inherent sense that there is no root cause, right?
Like, let's...
Great. One of my favorite ones: human error.
We don't have enough hours for this, Corey.
I'm sorry.
That's one of my favorite other ones. But let's say somebody fat fingers a config
change, right? Which happens. That was fundamentally the S3 service disruption
back in 2017 that took down S3 for hours on end. And took down so many other people that relied on
S3. Everything was tied to that. And that's an interesting question. When something like that
hits, does that mean that everything it takes down gets its own entry in void? I hope so.
If everybody writes them up, then yes. So if S3 goes down and you go down and you write it up,
you put it in the void, then we can see those things, which would be so cool. But let's go
back to the fat fingered config file, which if you haven't ever done, you're lying, first of all. Or you haven't been allowed to touch anything large and breakable yet,
which either way, you're lying on some level.
So please.
Yeah, I mean, I took down Holloway's homepage
when it was on Hacker News because of a YAML error. So anywho, even if you fat-finger a config change,
So anywho, even if you fat-finger a config change,
that's not the root cause because you have this system
wherein a fat-fingered config
change can take down S3. That is a very big, complex, and I might add, socio-technical system.
There are decisions that were made long ago about why it was structured that way or why this happens
that way or what kinds of checks and balances you have. It's just, get over it, people. There is no
root cause. These are complex,
highly dynamic systems that when they fail, they fail in unpredictable and weird ways
because we've built them that way. They're complex because you're successful at pushing
the envelope and your safety boundaries. So if we could get past the root cause
thing as an industry, I mean, I could probably just retire happy, honestly.
I'm a simple woman.
Can we just get one thing, people?
First of all, then non-technologists,
people outside of our bubble, the media,
can't hang it on those things anymore.
You know, we all have to then sort of grapple
with the complexity, which admittedly, humans, not big fans of, but...
People want simple stories, simple narratives.
And people say, oh, remember the S3 outage?
They don't want to sit there and have to recount 50,000 different details.
They want to say, oh, yeah, it took down a few big sites like Instagram, United Airlines, and it was a real mess.
The end.
They want something that fits in a tweet, not something that fits in a thesis. Well, and if you have a single root cause,
then you can fix the root cause and it will never happen again. Right?
That's the theory. If we're just a little bit more careful, we're never going to have outages
anymore. Yeah. If we could just train those humans to not try to make the best possible
high quality decision they could possibly make in that situation, given the information they have at the time, then we'll do better. But I mean,
that's why your systems stay up most of the time, if you think about it. It's shocking how well
these things actually work the vast majority of the time. And that's what we could learn from this
too. We could, you know, oh, if we would write near misses up, please. I mean, if I could have
one more wish, I think one of the coolest things the airline industry and the government side of
that did was start writing up near misses. Wow, what do we learn from when we're successful
versus trying to, you know, like, spelunk and nitpick the failures?
Most of us aren't so good at the whole introspection part. We need failures. We
need painful outages to really force us to make
difficult, introspective, soul-searching decisions and learn from them.
Yeah. And I don't disagree with that. I just wish one of the things we would learn is that we should
study our successes too. There's more to be mined from our successes if we can figure out how to do
that than there is from our failures. So I have a metadata category
in the void called near miss. And oh man, I really wish people would write those up more. I mean,
I think there's like five things in there that I found so far because the humans hold these systems
together, right? We make these things work the vast majority of the time. That's why there is
no root cause. And even when we're involved in these things,
we're also involved in preventing them,
or solving them, or remediating them.
So yeah, there's no root cause.
Humans aren't the problem.
Those are my big hot button ones.
I really wish more places would embrace that.
Even Amazon uses the root cause terminology internally,
and I'm not gonna sit here and tell them
how to run large things at scale.
That's what I pay them to figure out for me. But I can't shake the feeling that by using
that somewhat reductive terminology, that they're glossing over an awful lot of things the rest of
us could really benefit from. Well, so the question then, one of the other things that I look at is
personally, when I read and analyze these incident reports, these public ones a lot,
I always ask myself, who's the audience for this?
And there are different audiences for different types of incident reports and different things.
You know, the vast majority of them are for customers, partners, investors.
The stock market.
Yes, yes.
You know, they're not actually for the organization.
There's usually an internal one that we don't get to see,
maybe that's for the organization,
but a lot of places feel that if you have a process
and a template and a checklist
and a list of action items at the end,
then you've done the right thing, right?
You've had your incident, you've talked about it,
you've got your action items, move on.
Right, and it always seems with companies
that as you get further into the
company, the more honest and transparent the actual analysis is. Like, at some point you wind
up with: in public they're very cagey; under NDA, they open up a little bit
more and a little bit more. And finally, when you work there on their executive team, it turns out
the actual thing was, well, Dewey was carrying an armful of boxes in the data center, tripped,
and went cascading face-first into the EPO cutoff switch that cut power to the entire facility.
And the cagier they are, I guess, not to be unkind here, the more ridiculous whatever the actual answer is turns out to be.
It's one of those things where really someone tripped and hit a button.
You didn't have a plan for that.
Well, not really. We sort of assumed that people would-
Why would you have a plan for that, right?
Right. Why would you have a plan for that the first time?
Yeah. I mean, so imagine that. Imagine this exercise, sitting down in a room with a bunch
of people and going, what are all the things that could go wrong? I mean, ain't nobody got time for
that. That's not how it works. You all have other jobs to do too and systems to build and pressures
and customers and partners and features to build.
So admit that and acknowledge
that you just won't know all of the antecedents
and how do you respond when things happen,
which is a whole other, you know.
I know you told me you recorded an episode
with Dr. Christine Maslach on burnout, which I'm so happy you did. And there's, there's a whole nother piece
of sort of incidents and incident response and burning people out and blaming people and all
that stuff. That's a whole nother part. It sounds like you might, you know, probably not incidents
with her, but still these things take a toll on people and people who, like I said, show up every
day, really hoping to do their best job and go up a ladder and get a promotion and, you know, whatever.
So I think not just treating those things as checklists has broader implications as
well, just for the well-being of your organization.
On some level, the biggest problem that I think we've run into is that, as you said,
it all comes down to people. Unfortunately, legally, we can't patch those yet. Liberal arts degree. Come on, help me out, people. Because in so much of these socio-technical
systems, the socio part is more relevant than the actual technical part.
I believe you're right. For better or worse, there's no way around it.
Thank you so much for taking the time to speak with me. If people want to learn more about what
you're up to, where can they find you? And we will, of course, throw a link to Void in the show notes.
Yeah, I also like to talk on Twitter like you do.
I'm not as good at it as you are, but I try.
So yeah, I'm Courtney Nash on Twitter.
And at Verica, you can find me at Verica as well,
Courtney at Verica.io.
And those are the best ways to find me, I would say.
And yeah, please, people,
write up your incidents, send them to the void, and let's all learn and get better together,
please. Thank you so much for taking the time to speak with me today. I really do appreciate it.
Thank you for having me on. I know, do people say this? I'm like, yeah, big fan, but I am.
I'm a big fan. Oh, dear Lord, find better things to
listen to. My God. But it's been a treat. Thank you. Courtney Nash, Internet Incident Librarian
at Verica. I'm cloud economist, Corey Quinn, and this is Screaming in the Cloud. If you've enjoyed
this podcast, please leave a five-star review on your podcast platform of choice. Whereas if you've
hated this podcast, please leave a five-star review on your podcast platform of choice, along with a comment
making it very clear that for whatever reason the website is down, it is most certainly not your fault.
If your AWS bill keeps rising and your blood pressure is doing the same, then you need the Duckbill Group.
We help companies fix their AWS bill by making it smaller and less horrifying.
The Duckbill Group works for you, not AWS.
We tailor recommendations to your business,
and we get to the point.
Visit duckbillgroup.com to get started.