PurePerformance - Learning from Incidents is what good SREs do with Laura Nolan
Episode Date: January 16, 2023
Incidents happen! And when you ask Laura Nolan, who was an SRE at Google and Slack, healthy organizations should take proper time to analyze and learn from them. This will improve future incident response as well as overall system resiliency.
Tune in to this episode and hear Laura's tips & tricks on what makes a good SRE organization. It starts with doing good write-ups of incidents and doing your research on incident reports of software and services that you are looking into using. We also spent a good amount of time discussing root cause analysis, where she highlighted an incident that happened during her time at Google and what she learned about outdated alerting.
Thanks Laura for a great discussion and lots of insights.
Here are the additional links we discussed during the podcast:
Laura on LinkedIn: https://www.linkedin.com/in/laura-nolan-bb7429/
Laura on Twitter: https://twitter.com/lauralifts
Incident Template talk @ SRECon: https://www.usenix.org/conference/srecon22emea/presentation/nolan-break
What SRE could be talk @ SRECon: https://www.usenix.org/conference/srecon22emea/presentation/nolan-sre
Howie Post-Incident Guide: https://www.jeli.io/howie/welcome
My Philosophy on Alerting article: https://docs.google.com/document/d/199PqyG3UsyXlwieHaqbGiWVa8eMWi8zzAn0YfcApr8Q/edit
Transcript
It's time for Pure Performance!
Get your stopwatches ready, it's time for Pure Performance with Andy Grabner and Brian Wilson.
Hello everybody and welcome to another episode of Pure Performance.
My name is Brian Wilson.
And as always, I have my predictable co-host, Andy Grabner,
who's trying to throw me off as I say these words to you today, our dear listeners,
because he has a very low opinion of you all.
And he wants me to mess it up so that you have to suffer through my stumbling around. And I care about you though. I know Andy, Andy can be a jerk sometimes. It's the side of Andy you probably are not aware of how much of a jerk he is. He once
yelled at me for petting a puppy. I just want to let everybody know, okay? So anyhow, Andy,
mean guy, how are you doing today?
I'm really, you offended me when you said that I'm predictable.
Well, you did the same thing again.
Yeah, because then if I am predictable,
If you didn't do that today, then... So what
he does is he mocks me as I'm doing my intro, for the people who can't visualize
because it's audio only. So he was in a bind because if he mocked me, then he was being predictable.
And if he didn't mock me, he was caving into my challenge to not mock me.
So it was a catch 22, as they say.
Yeah, I tried to find a nice segue now to our topic today because predictability
has something to do actually with what we're going to talk about today because I want to make sure that system becomes, I think, more predictable
as they start failing. Is that a good segue?
Yeah, yeah, yeah. And you want to be able to understand what the past is showing
you about the future, hence the predictions, yes.
Yeah, exactly. And talking about predicting the future, learning from the past, learning from incidents, I think is one of the topics, is the main topic of today.
And we have a lot of incidents that happen on a regular basis.
But today we want to talk about experiences that Laura Nolan has.
And Laura, I think we've never met in person, unfortunately.
But Laura, I saw your work online.
I found you when I did some research on site reliability engineering. I saw you were speaking at SREcon and I'm pretty sure at many other conferences and online venues and also
on-site venues, back in the days when we were speaking at places in person, before this strange incident happened over the last two years. But Laura, instead of me trying to figure out how to best introduce you, please do us the favor and introduce yourself to our audience.
Hello! Well, yes, I'm Laura. I'm a software engineer, and I guess for a long time I've been very interested in failure, what it can tell
us about systems, interested in what we can learn about systems, I guess, more generally, right?
I mean, every system that we work with is different and unique in its own ways, but what can we learn
from one system and how it fails about other systems? And it turns out that there's actually quite a lot.
There's a whole system science that's been out there for the last number of years.
So I think we see things.
When we look at our software systems, our production systems,
we very often see repeated types of failures, sometimes within the same
system and sometimes things that look similar that happen in other people's systems. So I think
there's a lot that we can learn. This is not a new insight in software. This is something that
has been done in other industries, most famously, I guess, the airline industry. I mean, they spend
a lot of time learning from their incidents.
Incidents can teach us a lot.
So, yeah, that's something I'm very interested in.
It's not the only thing I do,
but it's certainly something I'm very interested in.
And, Laura, I think you don't do yourself justice
when you talk a little bit about your history.
I just have your LinkedIn profile open, and it's fascinating.
If I see who you worked for and what you did at these companies.
You worked for Google as a site reliability engineer, as a staff site reliability
engineer. You worked at a company that Brian and I
rely on every day, 24-7. You worked at Slack.
And I'm pretty sure you have your fair share
of stories.
And I'm really looking forward to now have some conversations on what we can learn from incidents and kind of what kind of practices you had and what kind of culture you had back in these organizations.
Right now, who do you work for right now?
I think you switched jobs recently.
I did.
In the last few months, I've switched to work for a company called Stanza.
Stanza is, I guess we're still fairly stealth mode in our startup.
We're in early days.
We are baking our first product.
Unsurprisingly, that product has definitely things to do with reliability
and helping you prevent and recover from incidents.
We're not an incident management software,
but it's something in that space.
All shall be unveiled in the coming months.
So that's very interesting work.
It gives me a chance to, I guess,
apply things that I've learned through the rest of my career into a product that we hope will be good
for everybody in the industry.
Awesome.
So, yes.
And you're correct.
I have many stories.
One of the things I used to do at Slack,
it wasn't my main job,
but I fairly often used to write engineering blog posts
about some of our incidents.
I think I've got three or four that are up there
on the Slack engineering blog.
They're quite long and detailed.
So those are worth a look
if you're interested in incident stories.
I'm not sure if there's a good timing,
but I think it kind of is a good timing because we at Dynatrace, we just
had an incident last week that also forced us.
We wrote a blog about it. We had, after
an update of our single sign-on, our
customers were no longer able to
log into our systems, which obviously
is an
incident we're not proud of, but at least
we could fix the problem
in almost no time.
And just as you just said,
on the engineering blog, it's like
you were covering these incidents
back then to really, A, I guess, show the world that you are open about this because failures can happen.
I think that's also part of kind of a blameless culture that you can just admit that failures happen.
And I think the more we share about what problems can happen in complex systems, the more we hopefully contribute to making the world a more resilient place
because others can learn from us.
And therefore, it's not just from our own failures that we learn.
My question to you would be, Laura,
how can you convince an organization to actually go down that route?
How can you actually, I mean, how can you be open about things
that people typically are not that proud of if something bad happens?
Yeah. Yeah, that's a great question.
I guess when we're looking at incidents, there are two kind of strands.
Companies will do an internal incident review, something that's just done internally.
Very often a lot of depth.
You might think about kind of organizational factors as well as technical factors, how people work together, all of these things. And then there's
the public blog post. So this is a really interesting beast. John Allspaw has talked
about this quite a lot. So he says that external and internal write-ups serve different purposes.
So he thinks that the internal write-up is generally,
hopefully geared towards learning and actually improvement.
Whereas very often the external blog post is an attempt to save face
or sort of make an extended apology
or sort of convince your customers in some way
that you're a responsible company trying to do the right things.
I think John isn't wrong, but weirdly enough, I think that the most impactful
external incident write-ups are actually the ones where people are not
trying to make it an apology or trying to make themselves look good, but actually the ones where
people are as honest as they possibly can be in that context.
And of course, there's all sorts of complications. I mean, organizations are interested in protecting
their trade secrets. Organizations don't want to create legal trouble for themselves.
So it's not easy to be completely sort of honest with your external incident reports. But I do think that it's possible to be open enough
and detailed enough to actually, you know,
to disseminate information through the industry.
And I think there's a lot of times when we do see that.
There's a lot of public incident reports that are out there
and some of them are amazingly detailed.
Some of them are amazingly useful.
And, you know, there are some times, particularly when you're dealing with open source software, where you see the same things again and again. So I think that there is a lot that we can learn from doing these public posts, and I think that if they're done well, they can be really good for a company's image, particularly where you're dealing with customers who are likely to be technical, you know, to be in software, to sort of understand the fact that failure happens. There's no organization that's so perfect that you don't sometimes have a failure or have an incident. You know, as you said with your SSO, it's how fast can you respond, how fast can you actually recover from that, and can you learn from it and improve? These are more important metrics than do you have incidents or not.
The absolute worst thing I think that you can do is come across as self-congratulatory.
A lot of people were, for example, looking at the big Atlassian outage. They had that incident,
it was last year. They had a data migration that went wrong and they had a very extended recovery period. But a lot of people thought that their blog posts that they did about that when they
described that incident and the recovery was
sort of an extended advertisement. So one thing that I saw that rubbed people up really the wrong way was when Atlassian said, and we used our own tracker products to help us recover faster.
People were upset about that because it had been such a protracted outage. So that's definitely something I think should be avoided.
But it also reminds me a little bit, Brian, remember the days years ago? And Laura, I'm not sure if you have followed this, but I would say 10 years ago, when we were analyzing website performance and availability, especially around Thanksgiving and Black Friday and Cyber Monday.
Super Bowl, yeah.
So websites always went down.
And I remember a time when companies were kind of proud of their website crashing
because kind of like, hey, we did such a great job in marketing
so that
so many people tried to visit our site and our site crashed.
So they kind of turned it in a, I think it's similar what you just said, right?
Hey, we messed up, but we try to give it a positive spin.
And I don't think that's necessarily a bad thing about trying to give it a positive spin
as long as
you're still honest in the end and it's genuine.
It's genuine, exactly. Yeah.
And big shout out to, you know, our friend James Pulley. He was always leading the vanguard of destroying people who'd be like, oh yeah, our site was so popular, our advertising was so popular, our site went down. He would, like, flip out on that, you know, he'd just go apeshit over that.
But the other thing I think you mentioned, which is really, really important, is one
thing, just even you being on the show, Andy and I doing the show, all the guests, all
the talks people do, is the IT community is so keen on sharing information and putting
stuff out there to help others.
Whether it's lessons learned, ways we had success,
ways we learned from our failures,
and the specific way you're talking about
when you're talking about increasing your credibility.
So long as you got yourself covered legally, right?
Because that's always the public post is,
the legal part's always going to be a big thing
because there's people waiting to find, you know,
find one little door to put their foot in to sue a company. But as long as you can cover that by additionally
putting as much information and explanation of what happened, how it was figured out,
how it was remediated, you're then sharing with the community once again for this really
pretty embarrassing incident. That, hey, we're
this big company or whatever, and we fell down, but we're going to share it all. So
not only did you fix the problem, get it back up and running, but you gave back to the community
as a result, which is just going to turn it into such a positive for everybody. And there's
no, except for the legal side, there's no good reason
to not do that because it's just going to increase your credibility. Whether you're
looking at company reputation or even big picture of recruiting, people are going to
look at that company and be like, hey, that was really cool. I think next time I'm on
the market, I might check that company out. I can't see a net bad about it, as you say,
especially the way our industry works.
And I think there are very tangible good things that come out of it as well.
For example, if I'm picking up a new tool or a new SaaS service to use, and it's going to be
part of my critical production systems, I search for other people's incident reports that mention
that particular product, because that gives you insight into the ways it fails and the way it works.
That's very hard to get out of manuals.
So you can get that information, you know, pitfalls to avoid good and bad ways to use particular tools and products. So that's directly valuable,
particularly in a world where, you know,
software is converging on using, I think,
a smaller and smaller number of fairly popular products and services and open source projects over the years.
I mean, look at the standardization on things like Kubernetes,
for example, and that's a community that's been very,
very open about incidents and patterns that are evolving.
And then that information can make itself into the product.
The product can become hardened against frequently seen failure modes.
It can make its way into training and really tangibly actually
sort of elevate all of us as a community.
You know, the key thing you said there too that struck me is when you're reading the details of it you understand how it was used or misused or whatever
that just goes back to a direct correlation to the idea of reading
product reviews as opposed to just looking at the stars you know or if
you're taking a go on on a vacation and you take,
oh, I got one star. Why? Oh, because I didn't like the way they made my bed. Okay, that's a
dumb one star review. I'm going to ignore that one. As opposed to I went in and the sheets were
soiled when I checked in. Okay, that's legitimate. Or someone talking about a product that,
I bought a lighter and I tried to burn down the building next to me
and it didn't light a big enough fire.
All right, well, that's not quite what you'd want to use
to start a fire like that.
So maybe you bought the wrong product to begin with.
So just the point being, I think context is key, and that's what you were really getting at. It's like, what was the context of those uses? And when people have a bad experience or a good experience, what's the context of that? Because that may or may not apply to what your plans are, and that's going to give you that additional information. And then the company, obviously, as you said, can turn around and be like, hey, actually we do want to fit into that context. Let's take that feedback and work with it.
That's a really, really great point.
Yeah, rambling now, as I do.
If podcasts aren't for rambling, then what are podcasts?
Oh, sorry, go on.
I was going to say, just an observation about context.
I think you're exactly right.
And that's something that we can get really richly
from a good incident write-up.
You know, a lot of context, you know,
for as long as I can go back and think of,
we've had public bug reports and things like that
on open source products.
But a well-written incident review
gives you a lot more of the story, you know,
not just, you know, we did this and we saw this bug,
but, you know, what's the architecture?
Why are you using it in this way?
And what changed to sort of trigger a problem?
You know, it's much richer than a bare bug report.
Laura, speaking about good incident reports,
are you aware of any templates, any best practices?
Is there anything out there?
Because I assume maybe some of our listeners, they would like to start with this, especially in their company, kind of start first internal, then maybe public incident reviews.
Is there any place to get some guidance around it?
There absolutely are many places to get guidance.
I don't agree with them all, actually, weirdly enough.
So a really good one is this company called Jeli.io,
and they have a really good guide called the Howie Incident Review Guide,
so H-O-W-I-E.
And that won't lead you wrong.
That one's pretty good.
And this is something I gave a talk about at SREcon 2022, just on this topic.
I think when a lot of people sit down to write an incident report, they tend to sit down with a very, very detailed template.
So you'll get standard incident review templates that will have a lot of sections; it's like a big form.
You've got a big document and you'll have, you know,
start time of impact, end time of impact.
What was the impact?
Executive summary, you know, root cause.
And we should talk about root cause in a moment.
Then, you know, what was the resolution?
All these sorts of little snippets of information.
And the problem here is people will
sit down and they'll sort of approach it like you're filling in your tax form, like it's a chore,
like it's something you want to get through, like you just need to get all those boxes filled in
with something and get done with it. And I think that that's a real anti-pattern, because it leads to documents coming out the other end that have almost no value. I think a good, well-written incident review is something that you can put aside
and maybe somebody joining your team in a year or two years,
if those systems are still there and still similar,
they should be able to read those documents and get some value out of it.
And that only happens if there's enough context, enough richness.
And that is something
that having a form with 20 or 30 sections that you have to fill in sort of mitigates against it.
So I think that people should actually sit down and first, you know, think about it as a story.
You know, where does the story start? Maybe it starts at the time that the incident starts,
or maybe it starts three years before when you're making a design decision that turned out to be pivotal in that.
Or maybe it's six months beforehand when you decided to add a new feature, or maybe three months beforehand when you decided you were going to have a giant flash sale on this particular day and your system gets overwhelmed. So, you know, thinking about it as a story
makes you think about the, you know,
where to start, how much context to give.
And a good incident report, I think,
does need a lot of context about system architecture
and the reasons why things are how they are.
You know, these are all the things that,
you know, somebody picking that up
without a lot of context in a year or two years
are going
to need to make sense of it. And it's a big investment to write those kinds of reports. And it's not only writing; you also need to go and talk to the people who are involved, get their stories, weave those into the narrative. Because if you have six different people involved in an incident,
you're going to have six different perspectives,
six different understandings of what happened.
And, you know, they're not all going to be the same.
People have deeper understandings of particular subsets
or particular parts of the system,
particular parts of the thread of what happened in that incident.
And you've got to put them all together like a jigsaw puzzle.
What's more, I mean, sometimes you get people who have different ideas about what actually happened.
That one is challenging.
I mean, whether or not there's one truth, I mean, there probably is one truth, but it
isn't always possible to go back and actually tell in detail who was right about a particular sequence of actions because you may not have records of what happened.
And frankly, a lot of incident response happens in people's brains.
What's pivotal is what do people understand about the incident as it happened?
What do people understand about the system?
What do they know and what do they not know? What made them think that something might have
been the cause and take a particular action? And, you know, we're humans, we're very complicated
bags of meat and we have all sorts of cognitive biases. You know, if I'm responding to an incident
that's happening in a particular system, I might be biased by the fact that six
months ago, I dealt with an incident that looked similar and had a particular cause.
So I might be predisposed to go looking at that particular part of the system that caused the
incident six months ago. But today, that might be something else entirely triggering these
similar symptoms. So my experience might lead me astray. So people are going to have
very different accounts of the incidents because of these different cognitive biases.
So you've got to put all this together into some form that is useful, but also rich,
also has a lot of context. And you and your organization are going to have to try and learn from that
because that's the most important thing that comes out of an incident.
I mean, it's not wrong to take out action items,
alerts to change, feature changes,
or sort of resilience changes to make in the code base.
That's not wrong.
But the big thing that organizations miss more often than not
is figuring out what can we actually take away from this
about the strengths, weaknesses of our systems
and also our processes, how we deal with things,
how we interact as an organization, how we share what we know.
Fundamentally, when we work in technology organizations
dealing with complex systems,
one of the big difficulties that we have
is that we have a lot of complexity
and everyone knows their own different slice of it.
And when things are so tightly coupled
as they can be in software,
in production software systems,
everything has an
impact on everything else. One of the key things is bringing together enough knowledge to actually
make sense of things, to understand problems that may arise or are arising and sort of get that
human side working optimally.
Now, people cannot see me, but I was just staring at you, listening and taking it all in. While you were talking about this, I also found the talk that you mentioned. It was called Break Free of the Template: Incident Write-ups They Want to Read. I think that's the talk, Laura, correct?
Yeah.
So for everyone that is listening, we will add the link to this talk as well. It was SREcon EMEA in 2022. Laura, you mentioned earlier that we probably will talk about root cause as well. And I think, as you were explaining in your last explanation
about how important writing good stories and good incident reports,
you mentioned root cause.
We need to figure out, especially in complex systems,
what is the real root cause.
And the root cause for a problem that seems the same as one six months ago
might be completely different because the environment has changed,
the people, how they react to it has changed.
We are, Brian and I, are big on root cause.
I mean, we're working for an observability platform where one of our goals is
to make root cause detection easier.
Now, do we do an always great job?
I think we can always do better and we can learn.
From your perspective, what you have seen,
what do people typically get wrong when it comes to root cause
or where do they get lost?
Or just some tips on figuring out the root cause.
What are the things that you can tell us?
What I will say is that a lot of people in the industry
have moved away from thinking about a singular root cause to thinking about having multiple contributing factors that play a role in an incident.
So I think something that a lot of people say about complex systems is that they tend to be heavily defended.
And by that, I mean, we build a lot of stuff
to try and make our systems reliable.
So we build in redundancy
so that we can send traffic
to different machines if one fails.
We build in testing
so that we have automated tests
that run before we put
a new release of code out.
We have canarying
so that we can see if the metrics for our new release
seem weird. We have load shedding. We have all these things. So we've got heavy defenses.
So when something goes wrong, I guess the point about a single root cause is it's not normally
just that one thing that went wrong. I'll give you an example. Years and years and years ago,
while I was working at Google, I worked with this very, very big distributed database.
And the distributed database was, you know, a typical sort of versioned database, you know,
new pieces, new little packets of data were being pushed onto this thing all the time,
and they were sort of being merged with the existing data set.
And there was a periodic kind of merge process that would take two smaller chunks of data and sort of smoosh them together.
So trying to optimize the reads and the writes and all this kind of stuff. So for the client consistency part of that, what clients would do is they would make queries at a particular version of that data.
So if you're a client and you are doing some, you know, large, complicated queries or doing like trying to do some point in time snapshot of something, we had a lot of clients like that.
This was sort of a statistics database.
You would take the current kind of data version at the start, and then you would do your big complicated query looking at that data version throughout. Okay, so that's fine. One day,
we started getting complaints from people that they were getting stale data in their queries.
They weren't seeing what they believed to be the most up to date data that should be
in the system, and this took way too long to diagnose.
But it turned out that there was basically a place where the database was publishing
the most recent data version so that clients could read it and do their query
against the most recent, and this process that pushed that out had gotten broken somehow. And okay, that's
a root cause, right? But there are more root causes than that. So first off, why didn't we
get an alert about that? We had built an alert years beforehand, but that alert somehow broke, and so we didn't get an alert. That's the second root cause, or contributing factor, as most people would say. Thirdly, why didn't any of us think to go and look at this particular mechanism? Because none of us had heard of it. I mean, logically this thing had to exist, and when you sat down to think about how the whole thing worked, yes.
But this wasn't in any of the team's training that we had.
We had never done a Wheel of Misfortune exercise about this thing
or any practical DiRT tests.
This thing had been there for years, and we had just forgotten about it.
And then one day, it silently broke,
and the alert that we'd put in years before failed, and we had to figure this out from first principles.
So there's a whole bunch more contributing factors that went into this incident.
So if you're going to talk about, you know, what was the sort of proximate technical cause of something not working?
Sure, maybe there is one root cause, but when we think about the broader systems of,
you know, how do we, how are we enforcing the invariants in our systems, which is monitoring
and alerting, right? We alert when we think that some property of our system is not as it should
be. Most incidents involve some aspects of, you know, alerting not being the way it should be.
Extended incidents like that one typically involve some case, some reason where the responder's mental model is not actually in sync with the real system.
And people having trouble figuring out what is actually happening and what should be happening and what is the difference between those two things.
So there's a whole bunch of things that go into any sufficiently complicated or serious incident that it's very, very hard to reduce it down to just,
well, this line of code here or this permission here.
And I think what most people who argue against a single root cause say is when you stop at one root cause, you're losing information.
There's all this other stuff around that single root cause that contributed into the incident as well.
And if we stop because we found the root cause, we don't think about, well, you know, why is it that that wasn't in our team training?
And why is it that we never did a Wheel of Misfortune exercise about this thing?
And those are questions that we should be asking.
So it's the same with the five whys.
I mean, you know, why only five?
Why not 10 whys?
Why not three whys?
So, and, you know, even with the sort of the broader contributing factors approach, you can certainly miss things.
But it just encourages you to take that sort of broader look at what happened.
And, you know, the sort of broader socio-technical systems, which means that the meaty humans that look after the computers are equally important.
Again, a fascinating thought for me, because you mentioned the alert that you set up years ago,
but it was never, I guess, updated, tested, validated, and therefore, you know, it contributed
to the alert obviously not going off and alerting correctly. How can we fix this? Because I've heard this a lot from people we talked to over the years: they're setting up alerts, an alert here and an alert there, and they get forgotten, they never really get tested and re-evaluated. Brian, I think we had a conversation a year or two ago with Ana Medina. She was back then working for Gremlin as a chaos engineer.
And we talked about test-driven operations.
Basically, really using kind of chaos engineering
also to continuously validate
that the alerts that you've set up are still working, right?
By bringing the system artificially into a chaotic situation.
And basically these alerts, whether it's alerts, thresholds,
SLOs, remediation books, they should all be considered
part of your code.
Because essentially most of them, they are code,
especially now as we're doing everything as code anyway,
also your monitoring configuration, your alerting. So why don't we get better
in also including this in our testing?
Is there a way how we should also measure
kind of test coverage of our alerts
if our rules are still working,
if this even makes sense, right?
Because if you configure the alerts over years and years,
maybe 50% of them don't make sense anymore
because these measures, these metrics that you defined them on,
they are no longer either producing data
or completely different value ranges.
Absolutely.
I was just going to add to that, though.
You talk about configuring all these different alerts and everything,
and I guess keeping in the spirit of SRE,
shouldn't we instead be focusing on the end goal instead of alerting on subsystems?
So if your alerts, let's call them,
are based on availability, based on response time,
based on responsiveness and how much they're failing, basically based on whoever the consumer
is, whether it's another system or an actual user, a meatbag, as you say, if that's what your alerts
are based on, those are going to be a lot more consistent
and valid over time, unless you want to change, instead of three seconds, we're down to two
seconds now.
But as opposed to monitoring CPU or memory consumption or all these other things which
are going to be flipping around, if you have an end-user, end-consumer focused set of alerts,
if you have observability set up in the entire back end then,
so that when one of these do trigger,
you can go through and look at everything else
and find out what's going on,
wouldn't that be a safer, more bulletproof, if you will,
way to set that up?
But I'm asking you specifically, Laura,
because you come from this SRE background, right?
Because this is not in my wheelhouse.
This is just my understanding of this.
I'm really interested on your take
on all those old system alerts
versus more of an SRE approach,
at least the SRE way I understand it,
which could be wrong.
You are not even slightly wrong about that.
So what this is,
is the difference between symptom-based
or SLO-based alerting,
if you're using SLOs, same idea,
versus your traditional sort of alerts that might be based on causes
for system problems, like, as you say, high CPU, that sort of thing.
There's a link.
If you're putting links on your podcast, there's one I'll send you.
Rob Ewaschuk is a person who's written a really good document on this.
It's very long.
It's very detailed.
What I would say is symptom-based alerts are good,
but you have to take a broad view
of what your users are expecting from your systems.
So in this case,
that particular story I was talking about,
having a fresh, recently published, up-to-date data version
published in this particular place
was actually part of the system's contract with its users.
So it's a valid thing to monitor about.
It's not the same as monitoring for CPU and so forth.
Other things that people typically want to monitor on,
things like the golden signals,
you know, are you serving requests without an excessive percentage of errors,
without excessive latency, that kind of thing. So those are golden signal type alerts,
as you say, as opposed to things like CPU and so forth. The reason that we don't want to typically monitor on CPU memory, that kind
of thing, is because, as you say, thresholds change over time. Those alerts can get very stale.
You can end up with either alerts that don't work, that don't reflect user experience, or you can end
up with alerts that are noisy, that are constantly paging your on-calls. And both of those are bad.
And the worst case is you can have both of those at once,
which is really, really bad.
So most people, I mean, there's definitely a strong trend
towards the symptom-based alerting model.
The trick here is you have to be clear about what your system is providing.
It's not always just RPC errors and latency.
Freshness is something that people do frequently overlook, but it can be very important for a lot
of systems. Correctness more generally, if that makes sense. Yeah. And I think, Brian, in some of the last episodes,
we talked exactly about this, like using SLOs on kind of the system boundaries,
like the system boundaries, meaning as close as possible to the end consumer,
or if you have critical core systems, like an authentication system,
a storage system, and then on these system boundaries,
defining your SLOs and then alerting on those, but not on each individual CPU metric.
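To make the distinction concrete, here is a hedged Python sketch of a symptom-based, SLO-style paging condition in the spirit of the golden signals discussed above, contrasted with a cause-based CPU threshold. The SLO target, burn-rate threshold, and function names are illustrative assumptions only.

SLO_TARGET = 0.999   # assume 99.9% of requests should succeed over the window

def error_budget_burn(total_requests: int, failed_requests: int) -> float:
    # How fast the error budget is being consumed: 1.0 means errors arrive exactly
    # at the rate the SLO allows; much higher means the budget runs out early.
    if total_requests == 0:
        return 0.0
    observed_error_ratio = failed_requests / total_requests
    allowed_error_ratio = 1.0 - SLO_TARGET
    return observed_error_ratio / allowed_error_ratio

def should_page(total_requests: int, failed_requests: int) -> bool:
    # Symptom-based: page on user-visible failure. A cause-based rule such as
    # "cpu_utilization > 0.8" says nothing about whether users are affected,
    # and its threshold goes stale as the system changes.
    return error_budget_burn(total_requests, failed_requests) > 10.0

if __name__ == "__main__":
    print(should_page(120_000, 30))     # False: ~0.025% errors, well within budget
    print(should_page(120_000, 2_400))  # True: 2% errors, burning ~20x the budget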
But when you explained this, Brian, I remember the conversation we had.
But then I thought, still, when we're building resilient systems, we try to become resilient to a problem of a downstream system we're depending on. The problem is, though, when we never re-evaluate that, how do we measure the resiliency of that downstream system, right? If we never re-evaluate the thresholds we set on when we flip over the traffic, when we scale up, when we scale down, when we maybe, you know, fall back to a cache versus going all the way through the database.
I think if we then never re-evaluate it, then we could actually end up in a situation where we should have been alerted much earlier, where the kind of self-healing resiliency
should have kicked in automatically, but it didn't because we never revisited all of these
settings, these self-healing things.
And therefore, all of a sudden, boom, everything fails because we just never re-evaluated them.
That becomes tricky too, right?
Because where do you store all that information?
How do you track it all?
How do you even know what it's for?
Like the complexity of code, the complexity of the interdependent systems.
If you think about when you went from monoliths to microservices,
trying to keep track of what service is talking to what service
to what service to what service,
same thing goes into these thresholds and everything else.
Like, keeping track of that stuff is going to be very, very difficult.
Sorry, Laura, you were going to contribute something there too.
I didn't mean to.
I've got a couple of comments.
So first off, you're right.
There is a lot of information to track there.
There's a whole sort of emerging set of tools for doing that. Spotify has a service catalog product, which
I have entirely forgotten the name of. Backstage. Yes,
Backstage. Thank you very much. And there's a whole bunch of, I mean,
nearly every sort of observability tool that you see is starting to bring in
ideas now of showing your graph of service calls, all this kind
of stuff.
So the tooling is starting to creep in to make this easier for you.
But regarding SLOs and boundaries, there's an interesting thing there.
When you divide up an organization really rigidly and you say, okay, you own backend service X and you own a different backend service Y and X calls Y. There's a tendency for service X, depending on the organization and its culture
and everything like that. But I've seen situations where the people running service X say, well,
service Y has a five nines SLO. So we expect it to give five nines and to never be down effectively. And what
happens then is they stop thinking about, well, what can we do to be more reliable if service Y
goes down? Is there a way that we can use cache data or is there a way that we can do graceful
degradation and still do useful work? And sometimes when they say, okay, well, it has a five nines SLO, we're good, they skip over that part. So when you think about the system overall being
resilient or robust, you can't think about it just in terms of SLO math, like X calls Y, and,
you know, these have high SLOs, so the overall system will have a high SLO. You've got to think
about how can each service be more forgiving
to SLO transgressions from other services because they will happen.
And, you know, having thought about that and having thought about
how we can be more resilient in the face of failure is the way to get
to the most reliable possible overall system, not just the SLO.
So I think that's a pitfall that a lot of organizations do fall into and you can do better to...
SLOs are a great tool, but they shouldn't be
this rigid organizational boundary where it's just
this expectation and no sort of further thinking
about resilience.
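As a back-of-the-envelope illustration of the naive "SLO math" Laura cautions against relying on, here is a small Python sketch; the availability figures are made up and the point is only that chained hard dependencies erode reliability unless services degrade gracefully.

def chained_availability(*availabilities: float) -> float:
    # If every hop must succeed, the best case is the product of the parts.
    result = 1.0
    for a in availabilities:
        result *= a
    return result

if __name__ == "__main__":
    # Five services at 99.9% each, called in series with no fallback or degradation:
    print(f"{chained_availability(*([0.999] * 5)):.4%}")   # ~99.50%, noticeably worse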
It kind of reminded me of an analogy, right?
We are taking electricity for granted.
Why would I even start buying candles?
Other than for romantic reasons, obviously.
But I'm just saying, right?
But now we all of a sudden see that,
at least here in Europe, we don't know.
We need to get prepared for a situation where maybe electricity is not there
for a couple of hours or even longer. But I guess we never thought about it
because it was just always there. It can turn on the light
by clicking a switch.
I think, Laura, that the point you made, it sounds slightly familiar.
I don't know if it came up in the past or not, but if it did, I think we completely forgot about it.
So I think it's a fantastic reminder.
And what you talked about was, yes, you have your SLO.
But the level of forgiveness that the different services can have for other downstream services, right? If you have your first
service below the browser or below the web server, right, how forgiving can that
be if some downstream service is starting to possibly cause an impact to the SLO? Is there some sort of A/B, some canary, some sort of switch that can go into effect that can then make up for it? Or if you know something else can be a problem, what can you do on these upstream ones to be more forgiving of it?
And I think the danger people might fall into with SLOs is they convert over, they get a
bunch of SLOs set up, and they pat themselves on the back.
But the SLOs are just a measure.
Still not doing anything for your system.
You're monitoring it better.
You're looking at it from a better context, but you haven't made any improvements.
Yeah, exactly.
And that's the part that's important.
You can have very, very shallow implementations of SLO where you just say,
okay, well, here's some numbers, here's some metrics, job's good. But you haven't actually done that work to think about, well, what actually do
my users expect from this system? So you can miss things like my client version data file from the
previous example. You can miss things like freshness requirements. You can miss a whole
host of things. And as I said, you lose that opportunity to be more forgiving. Because sometimes services do have options rather than
just failing because one of their downstream services is failing. You may have the option to
serve something static, serve something stale, omit a piece of data from your results.
In the worst case, you should at least fail fast
and let your user know that you have failed.
Whereas it's not uncommon to see services
that will try and call that failing backend service,
wait indefinitely, and then suddenly, you know,
your service has used up some resource,
be it memory, be it connections, be it threads,
be it something else.
And, you know, now your whole system is down and needs a reboot.
That's the worst of all worlds.
And you can put in all the SLOs in the world,
but if you don't do that work to avoid those scenarios,
you're not making your system more robust.
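Here is a minimal, hypothetical Python sketch of the degradation options described above: bound the call to a flaky dependency with a timeout, fall back to stale cached data when possible, and otherwise fail fast instead of hanging and exhausting threads or connections. The function names, timeout value, and cache are illustrative assumptions, not from any system discussed in the episode.

import time
from concurrent.futures import ThreadPoolExecutor, TimeoutError as FutureTimeout

_cache = {}                           # last known-good responses, possibly stale
_executor = ThreadPoolExecutor(max_workers=4)

def call_backend(key: str) -> str:
    # Stand-in for an RPC to a downstream service that may be slow or down.
    time.sleep(2)                     # simulate a hung dependency
    return f"fresh value for {key}"

def get_value(key: str, timeout_s: float = 0.5) -> str:
    future = _executor.submit(call_backend, key)
    try:
        fresh = future.result(timeout=timeout_s)   # bounded wait, never indefinite
        _cache[key] = fresh
        return fresh
    except FutureTimeout:
        if key in _cache:
            return _cache[key]        # graceful degradation: serve stale data
        # Otherwise fail fast and tell the caller, rather than hanging and
        # quietly consuming memory, connections, or threads upstream.
        raise RuntimeError(f"backend unavailable and no cached value for {key!r}")

if __name__ == "__main__":
    _cache["greeting"] = "hello (stale, from cache)"
    print(get_value("greeting"))      # backend times out, falls back to the cached value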
Hey, Laura, I hate to say we're getting close to the end of our recording,
and it's just fascinating to listen to you and hear from your experience.
I took a lot of notes that I will convert into a little description of the blog.
You also mentioned that you have a couple of links maybe that you want to share.
I tried to do some Googling on the site with some of the names you gave me,
but it would be great if you could send them over so everybody that is listening,
if you want to follow up on some of the blogs and some of the names you gave me, but it would be great if you could send them over. So everybody that is listening,
if you want to follow up on some of the blogs and some of the presentations that Laura has done,
find them in the summary of the podcast.
But I don't want to just cut you short and say that's it.
I want to still give you, especially,
oh, thank you for that.
Yeah, perfect.
So we'll add that link.
She was just sending something over to me.
Laura, is there
a final thought, anything we want
to make sure that our listeners
take away
from this podcast?
I think the big
thing that I think people should take away is that
in a healthy organization, an organization
that's learning,
that's growing,
it should be okay to take the time to properly analyze incidents,
particularly unusual ones, ones that are not well understood,
ones that were complicated, that involved a lot of different systems factors coming together.
A healthy organization should be able to take that time to do that analysis
and really do that learning. If your organization is pressuring you to, you know, come up with a detailed incident
review in three days, you know, that's not healthy. So if you're an engineering leader,
you know, think about giving people that time and space and think about finding ways to reward
people who are, you know, doing that work to keep your organization healthy and learning.
And, you know, if you're an engineer, you know,
be curious and, you know, try and learn.
That's all we can do every day.
Awesome.
And the link is called My Philosophy on Alerting, from Rob Ewaschuk.
I hope I kind of pronounced his name correctly.
Anyway, the link is a link to a Google Doc.
It's called My Philosophy on Alerting.
We'll share this as well.
Brian, anything from you?
No, I would just say, Laura,
I know, well, first of all, thank you.
This has been tremendous.
As always, Andy and I love this
because we get to learn so much as well.
But I am curious when your new company decides to be more public about what it is they're doing.
I'd love to maybe convene again because obviously you all see a problem out there and you're all coming up with a possible solution.
And based on this conversation, I think that would be really fascinating to hear more details about what the problem is
and what you guys are looking to do
to solve that in the industry.
So definitely keep in touch with us
because that might be another great show to follow up on.
Not so much for product placement really,
but really discussing what it is you're looking to solve.
Like what have you all identified?
Because I'm sure everybody can benefit from it,
hopefully from your company, but others as well. But really, that's it. I think this has been fantastic
and thank you so much for taking the time.
Thank you for having me. I'm glad that all of our electricity stayed on for the hour.
Yeah, that's right. All right. Then, Brian, I want to say thanks to you. This was show number one in 2023, after five years.
Oh yeah, that's right.
Well, the first recording in 2023.
That's right, because we recorded the first show in 2022.
Very much looking forward to many more episodes with you, and getting close to 200.
Closer, I think. Getting closer to 200. I think in the upper 170s.
Anyway, it's been awesome.
Thank you so much.
Thanks for everyone for listening.
And thank you again, Laura, for coming on.
Bye-bye, everyone.
Bye.