PurePerformance - From Postmortems to true SRE Culture with Steve McGhee
Episode Date: July 6, 2020
Steve McGhee (@stevemcghee) is an expert in postmortems and SRE. He learned the craft at Google, applied it at MindBody, and is now sharing his experiences with the larger SRE community while back at Google. Listen to this episode and learn how postmortem analysis can be the starting point of your SRE transformation, and how it can help reliability engineering build and engineer systems that fail gracefully instead of causing full crashes or outages. Steve also went into monitoring what matters and only defining alerts on leading indicators with an expiration date, a fascinating concept to avoid a flood of custom alerting in production! If you want to learn more from Steve or about SRE, check out these additional resources he mentioned in the podcast: The SRE I aspire to be (SRECon19) and his two-part blog series on blameless.com.
https://twitter.com/stevemcghee
https://www.youtube.com/watch?v=K7kD_JfRUY0
https://www.blameless.com/blog/improve-postmortem-with-sre-steve-mcghee
Transcript
It's time for Pure Performance!
Get your stopwatches ready, it's time for Pure Performance with Andy Grabner and Brian Wilson.
Hello everybody and welcome to another episode of Pure Performance.
My name is Brian Wilson and as always my wonderful co-host Andy Grabner.
How are you doing today Andy?
Brian, hello. Very good.
Sunshine outside, and not as high in temperature as I know you told me it is on your end.
But at least the rain is gone for one day,
even though we've been hoping for rain
because we had a too dry period.
And now I think the farmers are happy again.
Now it's time for some sun and some warmth.
You know, it's funny.
I do a lot with music.
I've really listened to music a lot.
And in that, I've been noticing a lot more lately
when people talk, their words trigger songs in my head. And I think there are about three or four that went off in my head as
you were just saying that. So that was just really disturbing to me because I don't know what's wrong
with me. So fortunately for everybody listening, I'm not going to start singing any of them.
I'll leave the singing to you, Andy, because I'm sure you have a better singing voice than I.
I'm glad your weather's fine. I was telling you before we started, we have this whole Miller moth problem in Denver
and it's really, really bad this year.
It's horrible.
So anybody who knows about these moths,
yeah, it's really, really, really rough right now
because not only are we stuck with kind of being
pseudo-quarantined and all,
but now you can't even turn on your lights at night
because they all come in your house.
So real fun times right now,
but it'll pass.
This too shall pass, as we say,
but we have better things going on.
Well, yeah. And maybe also better weather and no moth problems wherever Steve is from.
Yes. Steve... Steve who? Steve Jobs on the show today?
Nah, I don't think so.
We can't get him to show up here, so we'd have to bring him onto the show from beyond.
Yeah. So, Steve McGhee, hopefully I pronounced this correctly.
Yep.
Cloud solutions architect, at least that's what I read; I'm looking at just one of your blogs on blameless.com. And Steve, you can probably do a better job of explaining and telling the world, or at least the listeners, who you are, where you're from, and also how the weather is where you are right now.
Yes, I'm Steve McGhee.
You got it right.
First try.
Congratulations.
There's a secret little H in my name, which is completely silent.
You can just ignore it. So I live in San Luis Obispo, California, which is also known as SLO, or "slow," and I'm an SRE. And I find it hilarious that I'm an SRE who lives in SLO. Get that joke? I think it's great.
I actually visited that place in sixth grade.
Yeah, it's got a mission right downtown.
I'm about 200 feet from the mission right now, actually, sitting in my car.
That's pretty awesome.
It's a beautiful day in Central California.
My background is I worked for Google for 11 or 12 years.
I was an SRE for about 10 of those, and I worked on a bunch of stuff: Android and Play and Compute Engine and YouTube, all over the place. And then I left. I came here to beautiful central California to work for a company called MindBody Online, which is an app that helps you find gyms and salons and haircuts, studios and stuff like that. And I helped them do the cloud thing, so I became a cloud customer. And then I'm back at Google to basically help other customers, other companies, do exactly that: figure out what this whole cloud thing is and how to use it properly. My specific angle on that is how to develop reliable systems on public clouds. So how to do SRE on the cloud, basically, is my goal. I help a bunch of companies do that all day, and it's pretty fun.
Yeah, so awesome.
That's pretty cool. So you said you were at Google for 11 years before you did the other gig and now you're back.
So you've been doing SRE for what, 15 years?
Yeah. So when I first started at Google, I worked on the partner team, but I was working on reliability stuff then.
There wasn't SRE back then, but looking back, it was awfully SRE-shaped.
So I developed a monitoring system that let us understand when other companies weren't showing our ads or search results properly.
So one of my favorite stories from that is when MySpace, if you remember MySpace, it was a big deal at the time. They showed ads, Google ads, and we would find out when MySpace was down before they would.
And I remember calling their NOC saying, hey, your site is down.
And they said, no, it's not.
I said, yes, it is.
And they went, hang on a second.
And then they were, oh, you're right.
Hang on.
And they would go reboot something.
So we were kind of their pager.
Yeah, because you saw a drop, obviously, of requests coming in from them.
Yeah, yeah. Because if our ads weren't showing on their site, we were losing money and they were losing money. So it was our responsibility to make sure that they kept their stuff running, which was kind of a funny way of doing reliability. I wouldn't recommend doing it that way ever again, but it worked.
But that's a great motivation: if you are not only a cloud vendor but also one of the largest ad vendors in the world, and you want your ads to be shown on websites that run on your cloud, you'd better make sure that these sites are reliable. And yeah, one thing that has always been interesting: you didn't call Tom?
No, I considered it, like, hey, is this Tom? But no, it was the operations center. But one thing I've always tried to think about at Google is that, forever, Google was an advertising company, but they also made Chrome. What's the deal there? Why would an advertising company make a web browser? The idea was simply: if you make the whole internet better and more reliable, people are going to use it more to do whatever it is they want to do on the internet, and that's just going to naturally drive up ad clicks, this background radiation of a way of making money on the internet. So I try to think that in GCP, the background radiation is not ad clicks, but it's VM hours, essentially.
And so if we make it easier for people to run businesses successfully on the cloud, we're just going to drive up more of that background radiation.
We're going to have more VM hours happen as a consequence of making the system better. If you can have an e-commerce business on the cloud
that is more successful than it would have been on-prem, it's better for you, but it's also better
for the cloud, and it's better for your customers, and it's a win-win-win
kind of situation. So that's the way I try to take it.
It's interesting you mention that, because I don't want to mention the competition by name, but there's
this monolithic, very old-school company
that's also a big competitor in the cloud.
We'll leave it at that.
But their model used to always be operating systems.
And as you said, the idea is if you make things easy
and usable, people will use it
and you can put new revenue models.
So I think you nailed it. You really just nailed it with what you said there, in terms of: make everything easy, and people can use it without thinking, and the money will flow in. In fact, we see that all the time, right,
Andy, where people just spin things up, don't even realize they're still running and like,
oh, that's right. I have a VM running because it's too easy until the finance team comes
screaming at you. Yeah. We don't want that to happen either, too, because
then they get mad and
the pendulum swings the other way
and no one likes that.
Making sure that doesn't happen, too.
And Google Cloud actually has a great model to say: you can save money on this VM by reducing its size, because you're not using it. I think that was really, really innovative of you all to do.
Yeah, that was a fun one.
I was on the Compute Engine SRE team when that came out.
That came out of the Poland office,
and it was just one person in the Poland office
who just kind of had this idea.
We can just show these little pop-ups and say,
you know, you're not really using this VM.
What if you did something else?
And the original idea was to automatically change it
and be like all dynamic.
And we said, well, what if we just pop up a little sticky note that says, hey? And so we did that, and that was a big deal. People loved it. They still do.
Hey Steve, so with all your experience on SRE, even though you live in SLO, you are an SRE... Reading a blog, I mean, there's a couple of things you put out there, but what I found interesting is your approach of how people can get started with SRE, by starting with something they should do anyway, hopefully, in case something is not reliable, which is post-mortems. And it would be great, from your perspective,
because we've been, Brian, right?
I think in the last couple of episodes,
we focused a lot on SLIs and SLOs and SRE.
We had a couple of your colleagues on
and some other folks talking about
site reliability engineering.
And the question then always comes up,
okay, but how do we get started?
What's a good model?
Because we're not born in the cloud, we're not cloud native yet. I mean, give me something, right? And it would be interesting to hear from you, because I believe you did a great job in the blog explaining the approach. So I want to hear from you: how do people get started? What do you advise?
Thanks. Yeah, I'm glad you liked that article. That was kind of a partnership with
Blameless. So this was while I was at MindBody actually. And it's entirely just out of the
experience at MindBody. So this wasn't a Google sanctioned official way to do anything. This was
just like my experience at the time. It happened to be published later once I was back at Google,
but it was actually, I believe the interview happened while I was unemployed actually between the two things.
But anyway, the idea was essentially that, you know, incidents happen everywhere and everyone is always dealing with them. And the idea of a postmortem is pretty understandable. You might call them retrospectives or learnings; there are all kinds of words for them. In Google SRE we called them postmortems for a long time, which is maybe a little bit dark, but I think the idea comes across reasonably well.
But I found that in this one company that I was working for, MindBody, they were doing something that was pretty close to postmortems, but they were doing it inconsistently. And they were, you know, kind of everyone would have their own way of writing up the document, and then they wouldn't necessarily
always follow through. And some people would, and some people would kind of forget about it,
and blah, blah, blah. So basically, I just kind of introduced the idea of, well, we're already
doing something like this, why don't we just add a little more rigor to it. So I introduced like a
postmortem template, which was really, really simple.
And it's, you know, it's only like, one page long, and it's a couple headings, just sort of like to give you some sense of what to write, because I found a lot of people would just ask me like,
hey, you wrote this postmortem and it looked pretty good; how do I do that? And I just gave them a halfway filled-in document, and that helped a lot. So being able to start with something you're already doing and just improve it slightly is a really good way to start improving a practice across the board.
The other thing you can do, if you have already seen good postmortems in practice, like I had at Google for 10 years or so, is to just write postmortems for incidents that you're observing. I would say, hey, I wasn't really involved in the outage, but I'd like to write it down, or help write it down for you. So I did write a few postmortems for incidents that I was just kind of sitting on the sidelines of, and I made sure they were more well written, or at least that they were complete. But the most important thing I've found, or the most helpful thing in terms of outcomes from doing this, was making sure that you write at least three types of
what I would call bugs, but, you know, just tickets maybe, or action items. And those would be: detection, how can we detect this better or sooner or with more precision; prevention, how can we prevent this from ever happening again; and mitigation, if this does happen again, how can we survive it better? If you can write at least one of each of these, you're in good shape. Generally there's more of some than another, and often the prevention one is very difficult. But you should still be able to write it down and put it into a persistent ticket queue or bug filing system of some kind.
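To make that concrete, here is a minimal sketch in Python of the three action-item types Steve describes attached to a postmortem; the class and field names are invented for illustration, not an actual Google or Blameless template.

```python
from dataclasses import dataclass, field
from enum import Enum

class ActionType(Enum):
    DETECTION = "detection"    # detect this class of failure sooner or more precisely
    PREVENTION = "prevention"  # keep it from ever happening again
    MITIGATION = "mitigation"  # survive it better if it does happen again

@dataclass
class ActionItem:
    summary: str
    action_type: ActionType
    ticket_id: str  # lives in the normal defect tracker, not in a side system

@dataclass
class Postmortem:
    title: str
    impact: str
    action_items: list[ActionItem] = field(default_factory=list)

    def is_in_good_shape(self) -> bool:
        # "If you can write at least one of each of these, you're in good shape."
        kinds = {item.action_type for item in self.action_items}
        return kinds == set(ActionType)
```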
I have heard from some companies,
they would do this and they just kind of say,
well, but we don't have a bug filing system
or a ticket queue that's like this.
And I say, well, okay, this is a good reason to get one.
Sometimes just doing this process sort of exposes other capabilities that are missing as well.
So it just starts to kick off a few things.
That excuse is like a kid saying, I can't eat my peas because I don't have a spoon.
Oh, here's a spoon.
There you go. Exactly.
I think it's interesting too, because of what you're saying about having those plans and having that edict in there. I've been in sales engineering since 2011, but before that I was on the performance side, and we started doing some post-mortems; we were actually calling them that back then. But it was just a meeting where we'd all get in a room and discuss what happened. There was no organization. There was no focus. It was just a bunch of people, kind of partially a blame game, but I think they called it a post-mortem because we were trying to figure out what happened. It was very, very loose, no focus, no goals. And I think the key, as you're saying, is to have some sort of goal, even if it's something as simple as that.
Yeah. The other thing that's important to do is,
once you have the basics of, oh, we should be writing these down in a consistent way, and having a template that at least has all of the fields you care about, and you start getting some practice writing these, the next step is to have a forum, almost, or a meeting. So the suggestion that I tend to put out there, and I hate meetings, is to just put something on a regular cadence, like once a week, once a month, whatever you think makes sense, depending on how many outages you have, I guess. And say, every Thursday at 10 o'clock we're going to have the postmortem review session. And there is a standing agenda document. And all it is, is a one-hour meeting with three 20-minute slots, and you just sign up for a slot. That's it. So if there are already three scheduled, then just wait for next week. But what you do is you show up with your postmortem that you wrote,
you send it in ahead of time, and the curators of the meeting will read it and get a sense of what happened. And then during those 20 minutes, all you do is have a discussion on either the document itself, where they'll say, well, you forgot to add these types of outages, or the resolutions, or we don't really understand the impact; they'll sort of help you write a better document. But even better, they'll give you an opportunity to explain what happened, what you think the resolution is, and what the prevention steps are, to a semi third party who has almost no interest in the actual outcome, aside from making sure that it's a good learning experience for everyone and that the system becomes better over time. So this sort of forum, it's still employees within your same organization, but they're not the authors of the system that broke, and they're not the people who got paged during the incident. They're semi uninterested.
They can be pretty
neutral parties, I guess.
Having a forum like that is super
helpful for driving up the quality of
these postmortems as well as
making sure that the outcomes
make sense. They're not just
super inside baseball
and are super, super deep.
They'll help you kind of focus on the
things that will actually help the company. That's another kind of step you can take.
Yeah, one quick question. I guess maybe that's part of that forum or that outside body, but if you start writing post-mortems for every problem that you see, and if multiple people start writing them, how do you make sure that you're not just creating a large list of things that in the end nobody really cares about? How can you make sure that you are catching duplicates? How can you say, hey, this happened before? Are there any good systems for that, or is this part of the best practice, part of the body you put in place?
There are a couple of problems with this that are really easy to fall into, and you have to be careful of them. One is: I don't want to spend that much time writing documents. I don't want to make the perfect document every single time; we had an outage, what's the big deal, why do I have to write this? People feel like they're being punished because they have to write this big document. And that is definitely not the case, and that's not the point you're making, but it's a similar point. Having a good template helps a lot here, and then you can say, all you need to do is fill in these 10 fields and you can be done in 10 minutes, in the simplest case. But further along, if it was more complicated, sure, you should spend more time and explain it a little bit more. But what you're asking about is more like: the same event keeps happening over and over again, are we catching it? Or how do we know that the outcomes, the resolutions or the bugs that we're filing, aren't just being ignored?
How do we prioritize them?
Or how do we make sure that they're going into the right bucket?
And really, what we're doing here is we've moved from the realm of operations into software engineering. We're talking about raising essentially feature requests, or actual bugs, in our feature tracking system or our bug tracking system.
So the real question is, is your company good at prioritizing work?
And is your company able to take bugs from customers or from product managers or, you know, from internal
engineering or QA? How do you how do you do that process today? Like, how do you
determine if something is as a duplicate bug or overlapping or something like that?
And you just use the exact same system. So really, all we're talking about here is
just another form of defect. So this is just generic defect tracking. And again,
this is the spoon problem, right? So if you don't have a good way of tracking defects
and you try to do the system, it's not going to work. And it's not because this process
is broken. It's because you need a good defect tracking system.
Now, what if, and hopefully this isn't jumping ahead at all, what if the proper fix requires not a fix in the code, but a fix in your monitoring coverage, or in what might be acceptable limits? Jumping a little bit into the SRE side: what is an acceptable limit for the number of database calls that we can handle, and we weren't monitoring those, and blah, blah, blah. So that thing goes into fixing your monitoring. Is that something that you would handle through, like, let's open a ticket to change the way we monitor, so that we can avoid doubles and all that, too?
Totally. I mean, I don't distinguish between the monitoring and the software, because it turns out monitoring is made out of software, and what you really care about is the entire system, right? Your CEO doesn't care if it was the Ruby that broke or the Prometheus rule that doesn't exist; he wants to know why the system itself isn't working. So the idea is: don't treat configuration and operations and monitoring as something that isn't software,
because it is software. It should be tracked in a
similar kind of way. There's the phrase
configuration as code or configuration is code. I'm a
firm believer in that. Even better is when you have code that
outputs configuration that goes into other code.
There's code everywhere.
So this is, these are all just forms of defects.
And I think trying to introduce a different process for operations
is just going to be confusing.
So if you have like a consistent way of handling defects through your company,
whether you're writing YAML or Java is,
I think it's super important that you treat all these system defects similarly.
That way you can prioritize work in a consistent way.
Right. That's really interesting.
Never thought of...
So yeah, everything's a defect.
It just gets categorized to different teams.
Yep, totally.
Because you also find, if you're changing a system, often you're introducing a new piece of functionality into, like, the Java, for example. And then you say, well, actually, at the same time we want to introduce some monitoring. So you want to be able to introduce these things in a consistent way and say, here's the new feature set, and here's the monitoring that goes with it. And if you have to span multiple systems to do this, people are going to be less likely to do it, just because it's annoying, but it's also going to be a lot harder to enforce or to track that this is happening. So if you have one system for doing both things, you can say: here's the new feature, and in the same change set, or whatever your system calls it, here are also the tests that go with it, here are the monitoring rules that go with it, and here's the on-call team alias who should be in charge of it. And it's all in one package.
So that way when something goes down,
you can track it back to that one set of changes and say,
ah, we introduced this feature,
but we forgot to add some bit of observability or something like that.
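As a rough sketch of what that "one package" could look like, here is a hypothetical change set expressed in Python; the file names, metric, and alert expression are made up for illustration, not a real deployment format.

```python
# One logical change set: the feature, its tests, its monitoring rule, and its
# owner all travel together, so an incident can be traced back to one change.
change_set = {
    "feature": "checkout/apply_gift_card",                # hypothetical feature
    "code": ["services/checkout/gift_card.py"],
    "tests": ["services/checkout/gift_card_test.py"],
    "monitoring": {
        "metric": "checkout_gift_card_errors_total",       # counter the new code emits
        "alert": "rate(checkout_gift_card_errors_total[5m]) > 0.05",
    },
    "oncall_alias": "checkout-oncall@example.com",
}

def validate(change: dict) -> None:
    """Refuse a change that ships code without tests, monitoring, or an owner."""
    for required in ("code", "tests", "monitoring", "oncall_alias"):
        if not change.get(required):
            raise ValueError(f"change set is missing '{required}'")

validate(change_set)
```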
And I think another great advantage of that is you'll get to test your monitoring or your system change early on to make sure that, number one, it works.
Number two, that it doesn't introduce any other kind of problem.
If you suddenly say, hey, we're going to monitor heavily on something, that could cause an issue.
But if you put that in early in the system, that's part of what Andy and I talk about, the whole idea of shifting left for not only your code changes, but I shouldn't say Andy and I, everyone talks about it, right?
But also, your monitoring is code, your system, your code is code, your deployment is code. It gets tested throughout the entire cycle before you get to production to make sure that it's doing what it's supposed to do.
It's giving you the results or the information that you need to act upon.
And you're not just guessing and throwing in production and then the defect happens again because you picked the wrong metric to look at.
Yeah.
So one thing that I think you referred to before, it's kind of around the cause tree a little bit... Oh no, this is from the article that you guys sent me, about Dynatrace's auto-healing capabilities and things like that. I thought that was a neat article. One thing it referred to was sort of automated remediations. And like I said in the email chain, I'm really leery about that concept of automated remediation, because it sounds too good to be true; you can't rely on it to work all the time for all the things. But one thing that you can do, which I think is kind of what your system does and is good at, or at least the one that was described in that blog post, I would describe as short-term remediations.
So like if you have an on-call team
who is in one time zone
and something goes off in the middle of the night,
it's a total hassle to wake someone up
to do something dumb, right? Like, to free up memory or restart a process. So when you do have an outage, being able to distinguish a remediation from a quote-unquote real fix, like a prevention mechanism or an actual bug fix, is super, super important. I think automated remediation, when it comes to these short-term fixes, is great. It's super helpful; it helps people sleep at night, and it gives you signal into the system about what's going wrong. And maybe it's not even middle-of-the-night things; sometimes it's just during high load or something like that. But the really, really important thing that doesn't happen sometimes is this: people think that when the remediation happens, oh, we have a remediation, great, we're done. No way. You need to have basically a due date on these things. This is a short-term fix, and it will actually expire in two weeks. This remediation will no longer work two weeks from now.
So software engineering,
you have two weeks to fix this properly.
So in Google, we had some things like this
where if you put one of these short-term fixes in place,
it has a required expiration field
and you can't put infinity in that field.
There was a maximum of two weeks or something like that.
I think it's super important to emphasize that these short-term fixes need to be enforced not just by good intentions, but also by the automation of the system itself: a required field for when this remediation gets lifted. If you can do that, I think it's a really good incentive to actually fix it properly, which is the only really righteous path forward.
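A minimal sketch of that "required expiration, no infinity allowed" idea, assuming the two-week cap Steve mentions; the class, ticket, and mitigation names are hypothetical.

```python
from datetime import datetime, timedelta, timezone

MAX_MITIGATION_LIFETIME = timedelta(days=14)  # the "you can't put infinity in that field" rule

class ShortTermMitigation:
    def __init__(self, name: str, tracking_ticket: str, expires_at: datetime):
        lifetime = expires_at - datetime.now(timezone.utc)
        if lifetime > MAX_MITIGATION_LIFETIME:
            raise ValueError(
                f"{name}: mitigations must expire within "
                f"{MAX_MITIGATION_LIFETIME.days} days; fix the real problem instead"
            )
        self.name = name
        self.tracking_ticket = tracking_ticket  # where the proper fix is tracked
        self.expires_at = expires_at

    def is_active(self) -> bool:
        # Once this returns False, the band-aid stops being applied and the
        # underlying bug has to be fixed for real.
        return datetime.now(timezone.utc) < self.expires_at

# Example: a restart-on-high-memory band-aid is allowed, but only for two weeks.
restart_hack = ShortTermMitigation(
    name="restart app server at 80% memory",
    tracking_ticket="BUG-1234",  # hypothetical ticket for the memory leak itself
    expires_at=datetime.now(timezone.utc) + timedelta(days=13),
)
```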
Yeah. It also reminds me a little bit, Brian, of, I think, the first episode we had with Goranka Bjedov from Facebook, when she was still working at Facebook, and she talked about the fixit ticket. She basically said it's allowed to deploy something not perfect into production, but then you have, you know, like a due date until you have to optimize it in a way so that it's truly fit for production.
Totally.
And I thought that was also interesting content.
Yeah, thanks for reading the blog. And I totally agree with you. You know, English is not my first language, so whether you call it remediation or maybe mitigation, I think what you're talking about is mitigating the impact of a problem long enough that people get some time to truly look at the problem, find its root cause, and then fix it. Like, one mitigation would be: we just restart your application server every time it hits 80% memory because we have a memory leak, but that's obviously not a solution to the real problem. Or, as you said earlier, scaling up in case of peak load, but maybe we need to come up with a different solution to deal with this; maybe we need to optimize the system to be able to scale better, or something like that.
Yeah, I agree with you.
Yeah, I think it's really interesting, because I can see people abusing that, right? When you talk about it that way, Steve... One thing I think a lot of people encourage, and we encourage: it's great to automate your playbooks for production outages. Whatever your standard operating procedures are, automate that and get things back up and running. And that can then be triggered based on metrics coming out of your observability tool, aka Dynatrace. But I think you bring up a good point that people, especially if you have it really well locked down on the automation side, might just rely on that and keep letting these things get auto-fixed without ever really fixing them. So having some sort of expiration, or a way to put that into those playbooks where it does apply, so people can leverage it... I could just see this becoming a real liability otherwise, with these scripts running all the time, band-aiding, plugging all the leaks.
Yeah. The problem is that it looks like an asset, and it's actually a risk. When you have a dynamic system, especially a dynamic distributed system, when you get out of monolith land and you have this system that is always slightly broken and always slightly changing, every band-aid that you apply is very specially shaped for the problem at hand during that moment in time. And so when anything changes, which it undoubtedly will, that band-aid is going to become a liability, and you're not even going to remember that it's there. So being able to keep a running tally of how many band-aids are in production at any given time, and making sure that they get burned away with time, is super critical, 100%.
So, I mean, this is not hard stuff in terms of reporting and coming up with goals, being able to keep track of how many are out there and when they expire, and telling your team, okay, next week let's try to get two or three of these down. You just have to know that that's the idea. Because otherwise the good intention is, oh, we have a mitigation system in place, let's use the heck out of it, let's use it so much, when really you kind of want to use it as little as possible.
Hey, got a question for you.
So you mentioned monitoring being obviously, it has to be part of the, let's say, software
delivery, because monitoring is just software as well.
And whoever builds it and configures it in obviously needs to take care of it.
Do you have a strategy for testing monitoring? Because on the one side, we write unit tests and functional tests to test the functionality of the software. Is there anything in the software engineering practice that also includes testing whether the monitoring data itself is actually what we expect? And then, going further, is there anything you have as a best practice for how we test the remediation? Because it would be kind of risky to say, well, we have remediation in place in production, but we've never been able to test it, so we just assume it works. So these two things: is there anything that kind of extends test-driven development to test-driven monitoring? I'm not sure that's the right term, but I think you get the idea, right? Testing the monitoring and testing the remediation as part of the pipeline.
So inside of Google there was a system, and I'm sure it still exists, that was exactly for this. It was like a monitoring rule test framework, and it worked fine. You give it input data, like the counter goes from zero to one to two to three over the course of time, and if it exceeds this rate, then fire this alert. And yeah, it worked.
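A toy version of that kind of rule test, assuming a rule that fires when a counter grows faster than some rate; this is only illustrative, not the actual internal framework.

```python
def rate_exceeds(counter_samples, threshold_per_second):
    """Return True if the counter grew faster than the threshold between the
    first and last sample. `counter_samples` is a list of (timestamp, value)."""
    (t0, v0), (t1, v1) = counter_samples[0], counter_samples[-1]
    return (v1 - v0) / (t1 - t0) > threshold_per_second

def test_alert_fires_on_fast_growth():
    samples = [(0, 0), (60, 120)]                 # 2 errors/sec over one minute
    assert rate_exceeds(samples, threshold_per_second=1.0)

def test_alert_stays_quiet_on_slow_growth():
    samples = [(0, 0), (60, 30)]                  # 0.5 errors/sec
    assert not rate_exceeds(samples, threshold_per_second=1.0)
```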
It was notoriously hard to use, but people did it. However, the idea of doing tests like test-driven monitoring sort of flies in the face of the slightly newer, at least, SRE type of methodology, which is that you don't actually want to focus on every possible cause; instead, you want to focus on symptoms. The false goal is to be able to enumerate every possible cause, define an alert for it, and then maybe write a test to make sure the alert works or that the monitoring exists for each possible cause, and then proceed from there. That is totally intractable, especially, again, for a distributed system that is constantly changing, because you're always moving those causes. You're introducing new causes and getting rid of old ones. And so you're just going to be on this treadmill of adding and removing alerts, as well as the tests which enforce the validity of those alerts. We found this to be the case many years ago,
that this was just like this impossible task.
So it took a long time to realize what was going on, but essentially we realized this is kind of what we would call praying to the wrong god. What you do instead is look at symptoms. This is where SLOs come in, right? This is looking at what the end user sees: is it fast enough, correct enough, available enough, and all these things like that. So this is kind of a rehash of things you've probably already talked about, but just because you add SLOs and availability and latency alerts doesn't mean you get rid of all the other monitoring; it just means that you don't react to it directly.
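A minimal sketch of alerting on the symptom rather than the cause: compute an availability SLI from good versus total requests and compare it to the SLO; the target and numbers here are made up.

```python
SLO = 0.999  # hypothetical target: 99.9% of requests succeed

def availability_sli(good_requests: int, total_requests: int) -> float:
    return 1.0 if total_requests == 0 else good_requests / total_requests

def should_page(good_requests: int, total_requests: int) -> bool:
    # Page on the user-visible symptom (the SLO is at risk), not on CPU,
    # memory, or any other individual cause further down the tree.
    return availability_sli(good_requests, total_requests) < SLO

# 120 failed requests out of 100,000 -> 99.88% availability -> page.
assert should_page(good_requests=99_880, total_requests=100_000)
```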
So this is actually a pretty good parlay into the cause tree we were talking about. If you think about when an SLO violation occurs: something got slow, or something became slightly unavailable, it started issuing more errors than we like. That's the top of a tree, and I mean a computer science type of tree, right? There's this huge graph of nodes that may have caused it. Why did it get slow? Well, I don't know. Was it the load balancer? Did the container run out of memory? There's a million things that could have possibly caused it. What you want is to be alerted by the root of the tree, which is: the system is slow or having errors. And then you want to be able to run down this potentially huge cause tree, and you want to be able to prune that tree, right? You want to say, well, this entire right branch of potential causes is not the case, we've ruled that out, so you don't have to go down any of it. And then you go down the tree a little bit further and perform some tests and say, well, we've ruled out these branches as well. If you're into computer science and stuff, you know about tree branch pruning. Pretty quickly, you've reduced your search space to something totally achievable. And you can actually then go to each of the leaves and test them individually and say, was it this? Was it this? Was it this? Nope. Okay, there's the one. Finally we figured it out. We found the cause. And it might be a novel cause, one you've never even considered before.
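Here is a small sketch of that pruning idea in code: walk the tree from the symptom down and skip any branch a cheap check has already ruled out. The node names and checks are invented for illustration.

```python
# Each node is (name, check, children). `check` returns True if this part of
# the system could still be implicated; False lets us prune the whole branch.
def find_causes(node, suspects=None):
    if suspects is None:
        suspects = []
    name, check, children = node
    if not check():              # branch ruled out: never descend into it
        return suspects
    if not children:             # a leaf that survived every check is a suspect
        suspects.append(name)
    for child in children:
        find_causes(child, suspects)
    return suspects

# Hypothetical tree for "checkout latency SLO violated".
cause_tree = (
    "checkout is slow", lambda: True, [
        ("load balancer", lambda: False, []),            # ruled out by its metrics
        ("backend", lambda: True, [
            ("container out of memory", lambda: False, []),
            ("slow database query", lambda: True, []),   # still a suspect
        ]),
    ],
)

print(find_causes(cause_tree))   # ['slow database query']
```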
So this is what I mean when I talk about pruning the cause tree. I used to call it pruning the blame tree, but the word blame has, you know, personal consequences, which is not what I mean. I just mean: what part of the system is the cause, or is to blame? Being able to do this generally isn't a matter of having perfect monitoring rules in place that you've written tests for. What it really is, I think, is that you have a fully deployed, fully usable observability suite, which isn't something that you have to keep writing new rules for. It's something that lets you introduce new code which has been instrumented by the developers of that code, saying, yeah, these are important counters, these are important distributions. And then you just sort of step back. And when the moment comes that you need to look something up in the observability suite, like, what is the distribution of this bucket over time, you can just call for it from the observability suite. I think that's the righteous path forward as well. It sounds weird, but I think spending a lot of time on making the perfect alerts and the perfect dashboard is kind of the old, invested way. What you really want is an observability suite that allows you to make sort of just-in-time dashboards for interrogating the system, based on symptoms, based on SLO alerts. If you can build that, I think you're miles ahead. I think you're in good shape.
Yeah, I mean, it sounds like music to our ears, what you're saying, because basically, you know, we work for Dynatrace, so this is exactly the approach that we are taking: alert where it matters most, and that's either the end user, because we also include end user monitoring data in our alerting at the top, or the service level, right, your SLOs. And in case we see an anomaly, and I think that's what you guys are doing as well, and others, we have the combination of, let's say, hard-coded SLOs, a certain threshold, or a baseline that basically learns from the past, and then we alert on anomalies. And then we basically walk our dependency tree in all directions, because we are combining distributed tracing with full stack monitoring information. We walk that tree and figure out in which path we really have the problem, then we can leave out the other things, and we can also see, okay, how did this develop historically, where did it start, and what's the true root cause? So it's good to hear from somebody like you that this really solves the problem, and we've known this from our customers, that they also like it. But I also have to say, a lot of people that have been living in the monitoring space for too long and are very proud of their beautiful dashboards, a lot of them still start with: okay, what can I put on the dashboard so that I feel like I'm in control if there's a problem? So this is a great example of really good intentions, but not really focusing on the right problem.
I actually just had a customer the other day who was asking me,
they were performing a load test,
and they were using a system that we have at Google.
It's a managed Redis, basically.
And they were saying, the CPU is too high.
Why?
And I said, okay, well, that's weird, let me check it out. And they said, why is the CPU pegged? And I kind of went, well, I don't know, what are you doing with it? And basically I said, well, what's the problem? Well, the problem is the CPU is too high. Well, that's not a problem; that means you're using your computer well. You're using exactly as much of the computer as you need. What's the actual problem? And they said, well, there's nothing.
We're just worried.
And I understand that.
Like, there's plenty of good intentions there, because they've seen many times in the past that elevated CPU is a leading indicator of an actual problem later.
And that's no fun.
However, high CPU doesn't really mean anything by itself
until you actually have a user-facing indicator
that actually says, yes, this is a problem.
I mean, again, Brian and I,
we come from a performance engineering background.
And one of the aspects of performance engineering
is obviously performance optimization.
So if you see something like this,
you may wonder, is this normal?
And if not, how can we optimize it?
But then you have to compare it either to historical data.
So was it always at 100% CPU
or did it just jump from 80% to 100%?
Still not having an impact to the end user,
but still, why did it jump from 80% to 100%
because of the last build, right?
So these are obviously fair questions, but I understand what you're saying: people too often are just freaking out because there's a metric that looks strange, so let's alert, let's ring the alarm bells, without knowing whether it's really impacting the bottom line, which is: are we meeting our SLOs? I think that's what it is in the end, right?
Right. And again, it's always based on good intentions.
Sometimes it's due to past experiences; they've seen this type of pattern before. Sometimes, sadly, it's just kind of cargo culting: they've heard that this is bad, so we should totally change it. But really, the great thing about having something like this in place is that no matter what their intentions are, or what their misunderstandings are, you can point back to this one number and say, is this number okay? And they'll go, oh yeah, that number is okay. And then you can say, okay, so stand down, everyone, take a deep breath, we're good. So it's really a matter of practice. If people have these habits, you kind of have to give them a way out. You can't just tell them to stop; you have to give them something that's better.
Yeah.
Hey, let me ask a question there, though, because it almost sounds like we're talking in extremes, and I'm wondering about the middle ground here. Obviously, if you're just pointing out that CPU's higher, you don't always want to jump on that. But if you're taking an extreme end-user view of it and say, alright, CPU's higher, but there's no impact to the end user, there is the potential to be glossing over problems that are going to arise. If the CPU went from 20% normally at the same load to 80% suddenly with the new code release, but still there's no impact to the end user, you've just pushed your system to a place much further than it was, which could be bringing it closer to the edge if you suddenly have an increase in load. So what is that middle ground, and how do you define that?
Because in the last discussion we had about SLOs and SLIs, a lot of it came down to, again,
what's the impact on the end user and defining your performance or your reliability from
that perspective.
But I feel like there's been this unspoken whitewashing of what the system's doing underneath.
Yeah, so you have to have this first discussion that we just had at first,
and then you can get to exactly this point, which is, yeah, so what happens,
like a great example of this is quota issues, right?
So you can be humming along just fine. And then
you're going to hit some quota somewhere. That's, you know, 200 somethings per second, whatever it
is. And it's like somewhat arbitrary. It's just, it's just whatever the quota was, right? And then
your system is just going to fall off a cliff, right? And then you're going to, you're going to
have this big fire drill and you're going to freak out. And you're going to say, why didn't we know
about this? You know, how come we didn't have a leading indicator, which is kind of what you're getting
at, is that some of these non-SLO metrics could be used as leading indicators, and so we should pay attention to them, right? My response to that is: that's true. However, there are always going to be leading indicators that you're not watching, and there are always going to be indicators that occasionally lead, right? So do you want to
have a lot of false positives and false negatives? Probably not. So the solution to this is not easy.
But the solution to this is, you know, stay the path with SLOs, except you need to be able to now start investing
in what we kind of call reliability engineering
or resilience engineering,
which is basically,
how do we let our system fail gracefully?
So when we do hit one of these thresholds,
whether it's a quota issue or CPU capping out
or some other form of plateau,
how do we let our system not completely collapse when that happens?
Because that's really the higher level concept that you're referring to here is
when something funny happens and we didn't have a leading indicator that we were tracking,
how do we make sure the whole system doesn't collapse?
Or how do we make sure we don't even have a pretty big outage due to it or,
or, you know, have some badness?
What, you know, the ideal scenario is something unexpected happens, you know, so you hit some,
either it's a quota or it's a scaling limitation that you didn't even know you had in your
system and you had no indicator for it, but you hit it because, you know, you're doing
great in your business and you have a lot of customers and you're making a lot of money.
And one day you hit some number.
Ideally, what happens is graceful degradation happens.
You know, 99% of users still continue to send you checks every second or whatever it is, you know, make you money.
But like some small number of users can't.
Then you can notice that because it's showing up in your error budget and then you can find the scaling limit and adjust it in due course without it being a complete
panic, right? That would be ideal. But I think the path forward, the way to get there, is essentially, sadly, by finding those limits the hard way a few times, and having postmortems and saying, oh, we keep following the same pattern, we keep having a bunch of client retries; maybe we should introduce exponential backoff with jitter, and that will prevent this type of failure. Or maybe we need to have our load balancer send excess traffic to /dev/null, right? And that way we can keep sending real traffic to the backend systems and not completely crash the entire system.
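For reference, a common shape for the retry pattern Steve mentions, exponential backoff with full jitter; the constants are arbitrary and the function being retried is hypothetical.

```python
import random
import time

def retry_with_backoff(call, max_attempts=5, base_delay=0.1, max_delay=10.0):
    """Retry `call`, sleeping for a randomly jittered, exponentially growing
    delay between attempts so clients don't all retry in lockstep."""
    for attempt in range(max_attempts):
        try:
            return call()
        except Exception:
            if attempt == max_attempts - 1:
                raise
            # "Full jitter": pick a random delay up to the exponential cap.
            delay = random.uniform(0, min(max_delay, base_delay * 2 ** attempt))
            time.sleep(delay)
```

The jitter matters because it spreads retries out in time, so a fleet of clients does not hammer a struggling backend in synchronized waves.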
So essentially, what I'm getting at is: it's hard, I understand that. But the answer isn't to monitor more things. It's to understand the system better, keep monitoring the end user's perspective, and find a way to allow your system to adapt. Which is not easy, but it is possible. This is exactly resilience engineering.
So, I'm just making some notes here.
If I hear this correctly, if I can kind of repeat what the perfect system should be, or what a system can be that makes this all possible: we obviously start with our SLOs, and once we have problems, we use the cause tree analysis to figure out what's the real root cause, because we get faster with it. And then if we detect, hey, the last three times we had this outage or slowness, it was always the 90% full disk on this particular machine, or it was always the connection pool on this particular one, then this is new knowledge. We convert it into a leading indicator and say, hey, we've observed that once this indicator crosses a certain threshold, 10 minutes later there will be a problem. So let's include this not as an SLO, but as a leading indicator for pre-alerting, or for preventive alerting.
I actually wouldn't go that direction.
I would say instead, how do we make this thing, which is currently a leading indicator to outage,
how do we prevent this from causing an outage?
I wouldn't say let's alert.
Because the point of an alert is to have a human do something, or have a piece of code do some sort of mitigative step, right? Perform some change. Why not have the system just do that change? And if you look at the big picture of the system, maybe this is what you're describing: you just want the system to do its own maintenance, I guess. But describing that as an alert can be taken the wrong way by some operational teams, and what they'll do is assign a ticket to an on-call human and expect them to deal with it. The phrase we use inside Google, which is kind of silly, is that this is called feeding the machine human blood. You don't want to do that. You don't want the system to persist, or subsist, entirely on human effort. In fact, you want to do as little of that as possible. So any time you say, let's add an alert, you should pinch yourself. If instead you say, let's have a mitigation, that's fine. So having a cloud function or something like that, or a Lambda, that steps in and performs some mitigative action, that's a good mitigation. But
again, like what you said before, that should be a, there should be an expiry on that,
that should give you time, that should give you two weeks to solve what the actual problem is.
And if the actual problem is that every time you run out of disk we're unable to write the buffer out, then the real solution is to stop writing the buffer to a single disk. The real solution is to start writing to a distributed disk abstraction of some kind, which has an auto-scaling backend, or maybe to change the disk it's writing to so it has a cleanup function, so that we always have plenty of space. These are all preventions, not mitigations. But this again goes back to the post-mortem issue: how do you take a problem that happened in production, even a kind of not-a-big-deal one, not a real outage but an almost outage, and get the monitoring, the prevention, the mitigation figured out?
Yeah.
Perfect. They're all related.
Yeah, related. Thanks for challenging me on that thought,
because I think I just...
Again, I'm taking notes here.
It's just my opinion.
I need to put on...
I need to forward this list to the product team here
because I really like the thing that you just said.
The number of alerts is a reverse indicator
to your maturity.
Well, I'm not sure if reverse is the right word,
but it basically is not a good indicator if your maturity. Well, I'm not sure if reverse is the right, but it basically is
not a good indicator if that grows.
However, obviously alerts
might be necessary for a certain
period of time in case you
have not yet found the right
solution, long-term solution to fix
so you can define an alert on a
leading indicator, but it should have an expiry
date because otherwise they're lingering
around and then you're just freaking out certain people
because more and more alerts are coming in
and it's also not manageable anymore.
So that's why this makes a lot of sense.
Yep.
And you know what I really love about that last example
just goes to show you the level of
or the lack of maturity in my thought process on this,
I guess, is that a lot of times when I see what we would call maybe a performance anti-pattern, my initial suggestion is, well, you should stop doing that then, right? But in your example, it's like, well, don't just change your code so it doesn't do that, so you're not filling up the disk. Why are you filling up the disk? Well, there are probably reasons why you need to write something to disk. And I like the idea of: don't think of the solution as
stopping the behavior that you're doing, because it might be a required behavior. Think of the
solution outside of that: well, you're writing to a single disk, so write to a distributed one. There are new options out there, especially with the cloud providers.
So it's not necessarily that there's a problem with the code
or the problem with the way the thing is operating.
The problem is with the way it's just being handled
or the system that it's running on.
And that can be improved.
Another good way of thinking about the same problem
is that really what you want to do is think about what kind of trade-offs you're making, and maybe you need to make different trade-offs. So, for example, with this same example... This actually comes from a colleague of mine, Yaniv Aknin. He gave a talk at SREcon a couple of years ago, I think, called The SRE I Aspire To Be. Actually, maybe it was one year ago, SREcon Europe, I think. He basically said: do you remember this little thing called RAID that we all know about?
Basically, whenever we would write to
a disk, it was fine
until it wasn't. There was
occasionally a problem where the whole disk just
died. How did we solve it?
We added a second disk.
Oh boy. Or maybe four disks
or something like that.
And so what we're doing here is we're still writing to the disk. You know, we're not really changing the code that says like, I need to open a file and write to a file and everything,
but we're changing the system underneath. And what we're really doing is we're trading off
reliability for something. And in the base case, you're just adding an extra disk. And really,
you're just paying twice as much for your disk, so you're trading off dollars, or euros, for reliability, which is a great trade-off we didn't even know was an option before. And then RAID showed up and said, well, if you just spend twice as much on your disks and put them together in this particular way, you never have that type of failure ever again. And everyone with a checkbook says, no problem, let's do it. This is
totally worth the trade-off. So finding more trade-offs like that, where you're saying,
I want more reliability and I'm happy to pay with either money or time or compute process
costs, essentially. Finding those trade-offs is really where reliability engineering
lives. And I highly recommend
looking up that talk. If you can put
it in the show notes or something, I'll send you the link.
I just found it, so it's easy
when we Google it.
The SRE I Aspire To Be, we should add it to the show notes.
SREcon 19.
There you go.
Oh, so it was a couple of years ago?
Last year, SREcon 19, I think that was, as far as I remember. If I look at my... oh geez, it's only 2020. It's still 2020. That's less than... oh man.
I know. Nobody knows what time it is. Time travels differently.
we all want to forget 2020
because of the whole thing
that is going on
but it's still 2020
yeah we're past it right
yeah nope
you know it's funny, this whole conversation,
so much of this conversation, I was saying earlier
that I was thinking of a lot of songs when Andy was talking
and so much of the conversation about, you know,
people just doing things because that's the way they've always done them.
It's come up a lot, and I just have to mention,
it keeps making me think of the Monty Python sketch,
the operating room one where the guy's like,
bring me the machine that goes ping. You know, they wheel in this machine, it goes ping,
and it's like, there, it made the ping noise. Because you're doing what you've been told to do and
what you've historically done, and you're not doing anything different, just marching on with the
same old way, which we know is a path to failure. Yeah.
That's cargo culting. A little dramatic there.
Yeah.
What is that cargo culting?
Oh yeah. Have you guys heard of this? This is a great little silly story, which is the idea that during World War Two, there were, you know, Pacific Island tribes that were basically invaded by... you know, this war showed up on their doorstep.
Yeah.
And so all these planes started landing on these islands and setting up sort of like refueling stations. And so they'd set up an airstrip, and they would come in with all these planes with all this stuff in them, and the people who were indigenous would, you know, kind of benefit; they would give them food and stuff. And then all of a sudden the war was over, and these planes stopped showing up, and the airstrips were still there. So the indigenous people didn't really understand what the war was or what planes
were or anything. And so they said, well, if we make these control towers, and we put our arms
up like this and we move our arms back and forth like the people did before, the planes will show up.
And so this is what they would do, and there are these pictures of
these people who had built control towers, small control
towers, but control towers out of essentially wood and sticks and stuff, sitting on them.
And they look like air traffic controllers.
And they're trying to get planes to show up because they want the planes to come back
with tons of food and people and supplies.
And they called this the cargo cult, because they wanted these cargo planes to show up. And they were just trying to sort of redo the actions that drew the planes in before, without really understanding what was happening underneath. I'm actually not even sure how true the story really is. But it lends itself to this concept of, often, I hate to say it, but often the operation centers
within these enterprises are filled with teams who are not well informed, and they're not well
paid, and they're not experts at the system they're operating. So they just do, like I
said, best intentions: they do what they've seen works. But at the end of the day, often
the entire system has been entirely changed under their feet and no one told them. And so they do something which is, you know, basically
the wrong solution, because they don't know any better, right? No one told them, you know,
we don't even use that database anymore, so why are you spending all this time
making sure that its RAM is high, or something like that. So this is a really common pattern,
which is totally preventable. And this is part of the problem of cause-based thinking,
as well as throwing code over the wall. These are all bad practices that lead to this kind of
unfortunate position that some of our colleagues are in, where they're just desperately trying to keep the lights on by, you know, throwing RAM at whatever server, you know,
has a red dot next to it on the monitoring system, which is, you know, no good. The real solution is
to have the people who understand the system be able to interrogate it and make
changes directly to the system that they wrote, you know, and make improvements directly. That works much better.
It's a lot more rewarding,
frankly, as well.
Don't forget the key is to
change everyone's title to include the word
DevOps.
Please don't persist that. We know that's a bad
idea. I don't even want to put that out as a joke.
Please do not rename
your NOC the SRE team
or the DevOps team. This is a bad plan, all of you managers out there. This is not a good idea.
All right.
Hey, it's amazing, we've been talking almost an hour, I think. And Steve, the way we typically do this, and Brian, I know you probably wonder what he's doing today: is he doing the summary or not?
Do it now.
Yeah.
No, I think so.
Let me summarize quickly because, Steve, at the end, I always try to summarize what I learned today in a short fashion.
Hopefully, I got it right. Otherwise, you can scold me later on or correct me. But I really liked, you know, in the beginning when you told your story about how important
it is to start with a post-mortem culture and you have to make it easy for people to
fill it in.
And it shouldn't be a burden, or feel like a burden, but make sure people can figure
out and write down how to detect the problem, how to resolve it, how to prevent it, and how
to mitigate it, and then have kind of a culture around it to, you know, constantly review it and then learn from it.
I think another thing I learned, which Brian and I have also been talking
about in the past, is the importance of monitoring as part of your code. Because in the end, we
need to make sure that everything we do can be monitored. So it should not just be a siloed exercise where somebody else may or may not provide some monitoring data;
monitoring should be part of the package you deliver.
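As a small illustration of monitoring shipping with the code rather than being bolted on later, a handler can emit its own success and failure counts as it runs. The plain in-process counter below stands in for whatever metrics library or observability agent a team actually uses; the function and metric names are made up for this sketch.

```python
# Sketch: the code that does the work also emits the signal used to monitor it.
# The in-process counter stands in for a real metrics library; names are illustrative.

from collections import Counter

REQUEST_COUNTS: Counter = Counter()  # e.g. {"checkout.success": 120, "checkout.failure": 3}


def handle_checkout(order: dict) -> bool:
    """Process an order and record the outcome as part of the same code path."""
    try:
        if not order.get("items"):
            raise ValueError("empty order")
        # ... real business logic would go here ...
        REQUEST_COUNTS["checkout.success"] += 1
        return True
    except Exception:
        REQUEST_COUNTS["checkout.failure"] += 1
        return False


handle_checkout({"items": ["book"]})
handle_checkout({"items": []})
print(dict(REQUEST_COUNTS))  # the raw data an SLI/SLO can later be computed from
```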
Now, the other thing that I really liked from you is the cause tree analysis.
So instead of defining many, many different alerts on many different metrics
that we may think are good indicators for problems, start at the top, start at your
SLOs, and they are typically close to the end user. And in case there is a problem, then follow the
path of figuring out where the true problem is coming from by following your
distributed tracing tree, or whatever monitoring tool you have in place, your observability platform, that can then get you to the root cause.
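To make that top-down, SLO-first approach a bit more concrete, here is a minimal sketch of what an SLO check might look like in Python. The service name, the 99.9% target, the window, and the request counts are assumptions for the example, not anything Steve or Andy specified; a real setup would usually lean on an observability platform's built-in burn-rate alerting.

```python
# A minimal, illustrative sketch of "alert at the SLO, then drill down".
# The numbers and service names are assumptions for the example only.

from dataclasses import dataclass


@dataclass
class SLO:
    name: str
    target: float          # e.g. 0.999 means 99.9% of requests should succeed
    window_seconds: int     # evaluation window, e.g. 30 days


def availability_sli(good_requests: int, total_requests: int) -> float:
    """Fraction of requests that met expectations within the window."""
    if total_requests == 0:
        return 1.0  # no traffic: treat as healthy rather than alerting
    return good_requests / total_requests


def error_budget_remaining(slo: SLO, sli: float) -> float:
    """How much of the allowed unreliability is left (1.0 = untouched, < 0 = blown)."""
    allowed_errors = 1.0 - slo.target
    actual_errors = 1.0 - sli
    if allowed_errors == 0:
        return 0.0
    return 1.0 - (actual_errors / allowed_errors)


# Only the user-facing SLO pages a human; everything below it is for drill-down.
checkout_slo = SLO(name="checkout-availability", target=0.999, window_seconds=30 * 24 * 3600)
sli = availability_sli(good_requests=998_700, total_requests=1_000_000)

if error_budget_remaining(checkout_slo, sli) < 0.0:
    print(f"PAGE: {checkout_slo.name} has exhausted its error budget (SLI={sli:.4%})")
else:
    print(f"OK: {checkout_slo.name} within budget (SLI={sli:.4%})")
```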
And I also liked your definition of what
reliability engineering in the end really is.
It's really about how can we learn from these things and how can we make the
system fail gracefully in case there is a true problem?
How can we mitigate problems so we have more time
to fix them for good? I think that's what I liked. So reliability engineering, for me,
the note I took is: how can we help build a system that fails gracefully in uncertain
or unknown conditions and doesn't fully crash? And that's also very, very important. Last thing, this is really what
I'm going to take back to the product team on our end. While custom alerts on certain metrics might
be a good idea for a short period of time, it's in the end a bad indicator for your maturity because
in the end, custom alerts just tell you that you have a system
you're not actually investing in
to make it more reliable
and more resilient;
you're still fighting fires in the end
because you get a lot of alerts.
And yeah, really.
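As a hedged illustration of the expiration-date idea mentioned earlier in the episode, a custom alert could be registered with a mandatory expiry, after which it is pruned rather than kept forever. The schema and field names below are hypothetical, not from any particular monitoring product.

```python
# A rough sketch of custom alerts that carry an expiration date, so they get
# revisited instead of accumulating forever. The schema is hypothetical.

from dataclasses import dataclass
from datetime import date


@dataclass
class CustomAlert:
    name: str
    metric: str
    threshold: float
    owner: str
    expires_on: date    # every custom alert must state when it should be reviewed or removed


def active_alerts(alerts: list[CustomAlert], today: date) -> list[CustomAlert]:
    """Keep only alerts that have not expired; expired ones need an explicit renewal."""
    return [a for a in alerts if a.expires_on >= today]


alerts = [
    CustomAlert("db-ram-high", "db.memory.used_percent", 90.0, "ops-team", date(2020, 9, 1)),
    CustomAlert("queue-backlog", "orders.queue.depth", 10_000, "checkout-team", date(2020, 7, 1)),
]

for alert in active_alerts(alerts, today=date(2020, 8, 1)):
    print(f"still active: {alert.name} (expires {alert.expires_on})")
# "queue-backlog" has expired and is dropped, forcing its owner to decide whether
# the underlying issue has been fixed for good or the alert really must be renewed.
```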
So did I get it kind of right?
Yeah, I think so.
One small point I would add is
when you talk about SLOs
being customer and user facing, that's totally accurate.
However, as you become more mature, you're going to have many layers, right?
And then some people say or they ask the question, like, what if I have many layers?
Do I have to have SLOs at every layer?
Or how does this work?
And what if I'm a back end team?
I don't have customers, front end customers.
Your customer can be someone who works at your company. So if you're in charge of the document indexing service
or something like that, your customer is all of the front-ends who call you.
They don't need to know what your SLOs are and how you're tracking your own
service. It's up to you. So when I say
customer-facing, sometimes that means internal customers.
So that's just a minor caveat, especially for business systems as well. If you have an entirely
internal-facing system for, you know, big data analysis for your marketing
team or something like that, then they're your customers; the marketing team
is your customer.
Well, I guess that's also why they call them SLOs, service level objectives, and not site reliability objectives,
because this is also a point I want to make.
Why do we call it SRE, site reliability engineering,
and not service reliability engineering?
The intent is that you're looking at the holistic giant system, right?
Calling it systems reliability engineering is a little, I don't know, dated
feeling. And then service reliability engineering is a little bit too specific, like it's a little
too focused onto one service potentially. Also, it's just a historical artifact. So it's the site.
So Google was referred to as the site. So all of Google, whether it's Gmail or search or ads
or cloud, this was all, you know, the site that we're trying to make more reliable.
I think it's just history.
Awesome.
Well, Brian, anything else from your end?
No, I'm good.
This was excellent, Steve.
I really appreciate you taking the time to come on with us.
Sure thing.
It was really fun.
We've been having a lot of fun SRE conversations lately,
and this one was kind of a new direction.
So it was, yeah, really fun for our side.
Glad you were able to do it all successfully from the car.
Sorry if there were little background pings as, you know, people would drive by me and stuff. So I hope there wasn't too much background noise.
No, no, everything was great.
Are there any final thoughts you want to add before we sign off, or you got everything?
Um, I think one thing that's just worth noting is that you can't buy SRE off the shelf. You can barely even hire them.
It's really something you have to build over time. It's specific to your
company and your team. It's an investment. You have to take it slowly.
You have to build it up.
And one of the most important things you can do is you have to have what we call an executive sponsor.
Someone has to say, yes, as a company, we're going to do this.
We're going to staff it.
We're not going to cancel it in one year.
It's not a project that has a timeline.
It's not going to be done in 16 months. It is just
a new job role that will be there forever. So as long as you take it that way, you will make
continuous improvements, your systems will become more reliable, and you'll be a more profitable
business, if that's your goal. All right, appreciate it. And what do you do, like LinkedIn, Twitter, anything like that, where people can follow you?
like LinkedIn, Twitter, anything like that, that people want to follow you?
Yeah, I'm Steve McGee on all the things.
S-T-E-V-E-M-C-G-H-E-E on the Twitters and LinkedIn and GitHub and probably all the stuff.
Google Plus?
Google Plus, Wave, all of them.
Yep.
I think I had a, I'm pretty sure Google Reader
had a username functionality
and I had Steve McGee on that too.
But, you know, RIP.
Pour one out for Reader.
I heard that one was popular.
I never played with that one.
Anyway, thank you very much
for being on the show.
If anybody has any questions, comments,
you can, you know,
reach out to Steve, obviously,
or just follow him.
If you have any questions, comments for us, you can get us at pure underscore DT on Twitter
or send us an old-fashioned email at pureperformance at dynatrace.com.
I think that's it, right, Andy?
It's been a while.
Anyway, thanks for listening, everybody.
And thank you, Steve, for being on again.
Talk to you all soon.
Bye.
Thanks for having me.
Bye.