PurePerformance - Getting Started with Chaos Engineering through Game Days with Mandi Walls
Episode Date: May 30, 2022

How do you plan for unplanned work, such as fixing systems when they unexpectedly break in production? Just like firefighters, the best approach is to practice those situations so that you are better prepared when they happen.

In this episode we have Mandi Walls, DevOps Advocate at PagerDuty, explain why she loves game days, where she is "practicing for the weird things that might happen." Prior to her current role she worked for Chef and AOL, picking up a lot of the things she is now advocating for. In our conversation Mandi (@lnxchk) gives us insights into how to best prepare and run game days, shares her thoughts on what makes good chaos scenarios (unreliable backend, slow DNS, ...), and which health metrics (team health, number of incidents out of hours, ...) to look at in your current incident response to figure out what a good game day scenario actually is.

Mandi on LinkedIn: https://www.linkedin.com/in/mandiwalls/

In our talk we mentioned a couple of resources - here they are:
Mandi's talk at DevOpsDays Raleigh: https://devopsdays.org/events/2022-raleigh/program/mandi-walls
Ops Guides: https://www.pagerduty.com/ops-guides/
Transcript
It's time for Pure Performance!
Get your stopwatches ready, it's time for Pure Performance with Andy Grabner and Brian Wilson.
Hello everybody and welcome to another episode of Pure Performance. My name is Brian Wilson and as always I have with me my lovely co-host Andy Grabner.
Andy, how are you doing today?
I'm good. I was trying to make faces to kind of get you off your spiel, but I guess I didn't succeed.
I saw you.
You kept your cool.
Yeah, I kept my cool. And maybe it just also means you weren't being funny enough, I don't know.
Yeah, I guess so, right? And you need to try better, you need to improve that comedy performance of yours.
All right, well, the thing is, I'm not getting paid for being funny, right?
Yeah, because you'd be poor.
I know, so I'd make a lot of money in bad dad jokes, though. But see, I'm not even going to try. See, so my favorite thing now lately between you and I is when we start talking, and in the back of my head I think I'm gonna think of a segue, and it never happens. It doesn't come, Andy.
So yeah, I was about to say... oh yeah, I don't know what to say. That was funny. I always think of a segue because it was the first company I worked for that was related to performance, Segue Software, back then.
Yeah, I think they already... what was the name of that? What was it... Segue Software. Was this... Segue?
But the actual, the instance,
it wasn't like... you had Mercury,
you had LoadRunner, WinRunner and all.
Segue.
Oh, Silk Performer.
Silk Performer was the tool.
Yeah, yeah.
Or Silk Test, Silk Performer.
And yeah.
But you know what?
I don't think our listeners like to keep listening to us.
Scroll down memory lane.
Exactly.
From Andy to Mandi.
Mandi Walls, everyone.
I knew you were going to use that.
I peppered it on you.
At least I'll take a little credit.
I know.
It was your idea.
But now, Mandi, welcome to the show.
Hey, thanks, guys.
It's nice to be here.
Thanks for not running away after listening to us for three or four minutes.
I'm still there.
Mandi, for those listeners who have never heard of you, you may want to introduce yourself: who you are, what you do, why you think you're on the show.
Yeah.
So thanks for having me.
I am Mandy Walls.
I'm a DevOps advocate at PagerDuty.
And I'm a recovering systems administrator.
That was my original role in technology. And then I spent a long time at
Chef Software doing like automation and infrastructure as code and cool stuff like that
in the DevOps world. And yeah, like most recently for PagerDuty, we've been working on like a series
of talks about helping our customers deal with all the stuff that comes up when you put software
into production, and what happens, and how to manage it, and how to get better at it. So yeah.
So isn't the solution just to not put anything in production? Don't attach anything to the internet. Just don't, right? Just run it all local, because it always works on my machine.
So there was a really... I don't know if people are familiar with
Tim and Eric, the comedy duo, who do this really weird
stuff, but they had a fun early internet type
of video, and the internet came on a CD-ROM,
and you could shop for clothes on the CD-ROM.
And then when you wanted to buy one, you would print it out and
mail it in to the company. But that was like
the internet.
No, it was the internet. But anyway, to your point,
if you keep it all turned off and it's self-contained,
what's going to go wrong? Nothing. No problems, nothing to talk about anymore.
But fortunately there are problems out there. But there are solutions to problems, or at least approaches for how to make problems less painful, right? And, the way we actually got to know... which, I think I've followed your work in the past, I'm pretty sure I've actually seen you prior to the last time I saw you, which was like four weeks ago, when we were both speaking at DevOpsDays in Raleigh. And you were, I think, presenting in the morning, and your talk was "Plan for Unplanned Work: Game Days with Chaos Engineering."
Yeah.
And that was really cool.
Now, can you enlighten the listeners on what this all means? Because planning for unplanned work seems really cool because it kind of sounds contradictory.
Yeah.
And then also the concept of game days with chaos engineering. We had people and guests in the past to talk about chaos engineering, but I would love to hear from you. What's the big thing that people need to know about? Yeah, awesome. So we refer to unplanned work as like, obviously the opposite of planned work. So
your planned work is like the sprints that you're working on, the things you're putting into your
products, the features that you're developing sort of intentionally. And unplanned work is like
all the weird stuff that happens when you put software into production and it like meets real users and they do strange things or you get like whatever weird requests come in off the Internet just because.
And you're just not sure what's going to happen. And all the other stuff that can sort of break or wiggle around, be kind of weird in distributed environments. So thinking about
unplanned work as your incidents and alerts and what happens when something goes wrong and like
the system breaks in some way and we kind of lump all that stuff together. And planning for that
is part of the practice that we stress with our customers
in that to get good at responding
to those weird things that happen,
you got to practice,
but like practicing for weird things
requires more weird things to happen
and we don't want that.
So we started leveraging more chaos engineering practice,
which is sort of fault injection,
intentionally putting weird things into the system in a way that we know they're going
to happen.
We know where they're going to go so that we can test, hey, did our monitors work?
Did our alerts go off appropriately?
Did we page the right people when we poked the system in this way?
Did all the stuff that we expected to happen happen?
Or was there something weird and wonky that didn't do what we expected it to do?
So we can improve our response.
We can improve our monitors.
We can improve the overall reliability of all these systems.
So it all sort of hangs together in a bigger practice.
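To make that loop concrete, here is a minimal sketch, in Python, of what "did the things we expected to happen actually happen" can look like when it is scripted: inject a fault, then check that the expected alert fired. The chaos and monitoring endpoints and the alert name are hypothetical placeholders, not the API of PagerDuty or any particular chaos tool.

```python
import time
import requests

# Hypothetical endpoints -- substitute your chaos tool's and monitoring
# system's real APIs; these are placeholders for illustration only.
CHAOS_API = "https://chaos.example.internal/experiments"
ALERTS_API = "https://monitoring.example.internal/api/alerts"

def run_experiment() -> None:
    # 1. State the hypothesis: black-holing the payments backend for 5 minutes
    #    should fire the "payments-backend-unreachable" alert within 2 minutes.
    experiment = {
        "target": "payments-backend",
        "fault": "blackhole",
        "duration_seconds": 300,
    }

    # 2. Inject the fault.
    resp = requests.post(CHAOS_API, json=experiment, timeout=10)
    resp.raise_for_status()

    # 3. Watch for the expected alert instead of waiting for a real incident.
    deadline = time.time() + 120
    fired = False
    while time.time() < deadline and not fired:
        alerts = requests.get(ALERTS_API, params={"status": "firing"}, timeout=10).json()
        fired = any(a["name"] == "payments-backend-unreachable" for a in alerts)
        time.sleep(10)

    # 4. Record the outcome so the game day produces a concrete finding.
    print("alert fired as expected" if fired
          else "GAP: monitor never fired -- fix this before a real outage")

if __name__ == "__main__":
    run_experiment()
```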
For me, this has a lot of parallels with, at least I assume, what firefighters do.
Firefighters, I guess they don't just wait for the next big fire and then, what do we do? We
haven't done this in three years and then things fail, but they actually practice all the time.
And obviously they don't put houses on fire where people live in. They don't do it in production, but they actually train and exercise.
Now, this brings me to... I mean, you say that, but I grew up in a place with volunteer firefighters. So you could actually donate your house to the fire department and they would do a practice burn. If you had a house or a property that was in disrepair, they would absolutely do that, whether it was a tax write-off or a donation or however that part of it worked. But I have seen many practice fires, because in a small community we have nothing else to do. So you kind of follow the fire department out, with the six-pack and the... right, you just take your lawn chairs and go watch the house burn. Super weird, but it definitely happens. And where I live now, the fire department uses the culvert across the street from me to practice with the tower ladder. So they all roll up with the big fire truck and distract me for an afternoon, practicing using the tower ladder, which is fascinating. So everybody has to practice, and you're right, like
our incident response process
and a lot of these other things are really built off of
emergency response sort of in the real world.
And do you think we should start a business with
donate your production environment for people?
Yes, there you go.
Oh, that'd be hysterical.
We have this whole thing, we're going to sunset.
Or like, here's my AWS keys.
Do whatever you want for the next hours.
You'd be amazed how many Bitcoin miners you can get loaded up in an hour.
But yeah.
You know, it's interesting on that idea, Andy.
I remember when I first met Mark Tomlinson, and this relates to this idea very specifically.
He was working down at the Microsoft Labs in one of the Carolinas, wherever that one is, and they invited us
down to load up our system and do a performance test in a large lab with their experts looking
at everything.
But it's almost as if you could do chaos testing as a service.
You spin up an instance of your thing, there'll be a strike team of chaos engineers who have all these,
you know, lots of chaos experience will come in and just really, you get the team in, you
guys ready, and then you just chaos the crap out of it and make it a fun exercise for the
whole team.
That could be a side business there.
Chaos for hire.
Yeah.
It's like a hitman that you can hire.
I can't call it Captain Chaos.
That was Dom DeLuise in the Cannonball Run.
I think he was, but anyway,
you have some kind of creative chaos.
Anyway, that's my track.
But I mean, to that point, that would be real.
I think people would have a heck of a lot of fun with that.
You know?
Yeah.
So, anyone out there who is looking for a new entrepreneurial idea?
There you go.
Free ideas.
Yeah.
Free ideas.
Hey, Mandi, one question that always comes up
when we talk about chaos engineering,
do you really enforce chaos in production
or do you do it in a pre-production environment?
What's your take on it?
You can do it anywhere, really.
And for certain kinds of what we call chaos practices
or the kind of experiments that you can do
with a chaos framework,
you really want to be running at least some of them
is shift that left, right?
Things like, how does this library respond
when the backend doesn't respond within the tolerance?
What does it do?
And doing that sort of testing super early
in the development cycle,
so you're not putting something into production
that's way out of your tolerance,
and then discovering that when you inject your fault there,
that there's a big problem.
So certain kinds of tests work real well,
pushing them into the dev cycle really early
so that they're part of your initial cycle of tests.
So you can say to the developer,
hey, this is outside of our operational parameters.
Take another look at this component.
And then once you get into a larger environment,
if you have... not everyone does have a really rigorously maintained
staging environment, but you can absolutely be doing things there before it gets to production.
Not everyone has the time or resources to have an exhaustive staging environment. Hopefully they have
something, but you just never know. But you can be testing, running these kinds of
tests in any environment, really. But I think some people we've talked to in the past,
Andy, right, they've advocated for chaos in production, right? Which I think...
That's going to tell you the most.
Yeah, yeah. But at the same time, I think even just based on what you're saying with the shift-left idea, which I think is very, very sound advice, you probably wouldn't want to encourage organizations that are just starting to experiment with chaos to start in production, because probably everything will fall apart.
Well, if they haven't been thinking about it.
Right, right. So maybe you work out some of the bigger things
and production is for some of the more edge cases.
I don't know.
I don't know, Andy or Mandy,
if you all have ideas on that
because it would seem like people would be like,
yeah, let's do chaos testing
and let's start right in production
where you know everything's going to fall apart
if you're not prepared.
So maybe don't start in production, I think.
Yeah.
Your goal is reliability, right?
Yeah.
So you know we can't tack reliability on at the end of the dev cycle.
We know it has to be built in from the very beginning.
We have to be thinking about it as we're planning our features.
So we want to be able to exercise some of those reliability components as early
as possible. So putting some of your chaos testing, your reliability measurements,
earlier dev cycle is going to help you do that. For folks that are just starting out,
that's like going through the data center and pulling cables. You have no idea what's going
to happen if you haven't really been thinking about planning intentionally for that kind of
reliability. And not everyone does, until maybe they've had something happen. You know, it's kind of a response to a catastrophe or a major outage or some other kind of incident, and then somebody freaks out and says, we have to do this now, and tries to jump in with both feet into production.
And that's, isn't that human nature?
That you always push things out until actually something terrible happens.
Absolutely.
You could have avoided it by doing something else.
And then, yeah.
Balancing your risk and resources.
Exactly.
One of the things,
and this was in the last recording we had, on the last Pure Performance
podcast,
we also talked a little bit about chaos engineering.
But one of the terms we brought up, and this came up in yet another podcast with Ana Medina from Gremlin.
She called it test-driven operations, which I thought was actually pretty nice. Because if you think about test-driven development, you write code and you write the tests first to make sure that everything is functionally correct.
But test-driven operations is you're actually taking the chaos engineering as an opportunity to validate if your operations works as expected.
So do you get your telemetry? Do you get your alerts? Do the right people get notified? Do you get the root cause information when you're actually inflicting certain things? And then how fast can people react?
So test-driven operations, I thought, was always a really cool term,
and I think most of the credit goes to Ana for that term.
But what do you think about it?
Yeah, I think it's super interesting, right?
Because like part of it too,
like are all the things that you've mentioned,
plus are the folks who are responding,
do they have the right access to solve the problem, right?
Like that can be an issue as well.
Do they know how to find the documentation on the service? Is it actually something that's public, or has someone decided to hide it behind some kind of sign-in? And all those other components that you don't want to be worrying about if you're actually having a real incident or real outage. You want to make sure that that path is as smooth as possible before anything bad happens. So yeah, love that.
Yeah. You also... I mean, Brian, I don't remember his name, but we had, in the very
early days of our podcast, and we talked about this, a guy from AWS. And he, I think, brought up
the concept of not only bringing chaos to your system, but also bringing chaos to the people.
Meaning, for instance, sending people home that are essential to incident response or
taking away their laptops, and see how the organization itself reacts to a chaotic
situation that is then even more chaotic, because the key people that you
normally rely on are also
not there because what if you're getting sick
and you cannot just call in?
I think that was Brent from the DevOps
book, the Phoenix Project.
It was in Britain that solved
everything.
Yeah.
I mean, that's
something that you have to
let people know that's going to happen.
Like that can be really passive aggressive.
If you haven't, like, warned people.
then hey, you know what?
We're going to practice this without Mike
because Mike's going out on paternity leave
at the end of the month.
Like having that kind of talk with your team,
that's like, we have this key person,
they're only human,
so something might happen. And like, it's,
you don't really want to get dark with the, like the bus factor things, but like
people on vacation, they have kids, they have other parts of their life. Right. So having a
plan that says, all right, we've been relying on this person. And part of being a good manager of
an operations team is recognizing that that's going on. And if the tooling
that you're using is telling you that, hey, you know what, this one responder is responding
to like 80% of your incidents, then maybe you need to sit down and work with your team on broadening
that out and skilling up, so that, you know, you're not overburdening this one person. That can lead to
burnout, it can lead to attrition; all those kinds of things come into it too.
But yeah, practicing without the person you always call is definitely a good idea because they've got life.
This is a great metric that you just mentioned.
The number of responders you have,
who is kind of the top responders?
Is this a metric that you keep track of?
And then kind of with this,
you actually see where your hotspots are?
Yeah, there's a bunch of telemetry in the PagerDuty platform now that gives you team health and number of incidents out of hours and hours on incidents and all that kind of stuff. So you can start to get
a picture for the aggregate health of the team. And it's a socio-technical system that is your team and the services that they run together, right?
So that you can see, hey, you've got a lot of things that are going on after hours.
There's a lot of things that are coming to this one particular responder with a lot of escalations into this one person because somebody somewhere feels like they're the only person who knows what's going on here.
So it's time to take a look at that.
And it has to be intentional.
It's not one of those things that will improve passively.
It's something that managers need to take on
if they have an on-call team that they're responsible for
to sort of take a look at those metrics,
pull up the dashboards,
and work through what's going on with the team there.
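As a rough illustration of the kind of signal being described here, a small Python sketch that derives responder concentration and out-of-hours load from a plain list of incident records. The record fields and thresholds are assumptions for the example; the real telemetry would come from your incident-management platform's reports or API.

```python
from collections import Counter
from datetime import datetime

# Hypothetical incident records -- in practice these come from your
# incident-management platform, not a hard-coded list.
incidents = [
    {"responder": "mike", "acknowledged_at": "2022-05-02T02:14:00"},
    {"responder": "mike", "acknowledged_at": "2022-05-03T14:40:00"},
    {"responder": "sara", "acknowledged_at": "2022-05-04T23:05:00"},
    {"responder": "mike", "acknowledged_at": "2022-05-05T09:12:00"},
]

BUSINESS_HOURS = range(9, 18)  # assume 09:00-17:59 local time counts as "in hours"

by_responder = Counter(i["responder"] for i in incidents)
top, top_count = by_responder.most_common(1)[0]
out_of_hours = sum(
    1 for i in incidents
    if datetime.fromisoformat(i["acknowledged_at"]).hour not in BUSINESS_HOURS
)

print(f"out-of-hours incidents: {out_of_hours}/{len(incidents)}")
share = top_count / len(incidents)
print(f"top responder: {top} handles {share:.0%} of incidents")
if share > 0.5:
    # One person carrying most of the load is the burnout / attrition signal
    # discussed above -- a candidate topic for the next game day.
    print("-> consider broadening on-call skills before this person burns out")
```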
Hey, coming back to the game days,
which was one of your key talking points at the conference,
for, I'm sure there's a lot of organizations
that come to you, right?
Come to PagerDuty or come to you for advice,
and they say, hey, we've never done this before.
What do you typically suggest?
What do people start with?
Do they run a game day once a year,
once a quarter, once a month?
What happens in the game day?
Can you, like, especially for the listeners
now that are new to this
and they would like to start practicing,
what are the kind of the,
what's the core concept of a game day
and how often do you run it
and who should be involved?
Yeah, that's a lot. Okay. So game days are an intentionally scheduled pre-agreed upon sort of date and time where you are going to inject some kind of fault into the system.
And you may know in advance, hey, we're going to test this one particular thing. So we're
going to pull out this backend and see what happens. Or it might be more generic. We have
this new feature. We're going to test around it and see what happens. But you plan it ahead
so that the team knows to be ready, to be looking, to make sure they're seeing the alerts that they
need to see. And maybe you have a checklist of alerts that come through and telemetry and monitoring and make sure all that stuff works and that's all fine.
For teams that are new to it,
one of the good things about game days
is that they can be very small and contained
or they can be very large.
And the sort of the legends, I guess,
around the origins of this stuff is like crazy people
going through and pulling cables in the data center and good luck, guess what happened and go from there. But with
the modern chaos tooling that you can have, you can be very specific about the components that
you're going to inject fault into because you've got little agents that are mostly going to run on
your systems and take care of that for you. But you can then be very focused in the game day that
you're running. So what I mentioned in my talk is sort of how we do failure Fridays at PagerDuty.
And any team can have their own failure Friday. And they may not even run it on a Friday,
just depending on what their schedule is. But like one team, if they've shipped a new feature,
it's in production, but maybe it's under a dark flag or something like that, and they want to do some testing around it and inject some fault and be very conscious of what's going on there. They can have their own sort of mini game day to practice on, right? And just kind of declare, Wednesday at noon Pacific we're going to be exercising this widget that lives in the ecosystem,
and this is what we're going to do. And then they can use that to learn from
and improve their reliability. So for folks that are just starting, you can start with a small
component. You might start with something that is relatively stable, that you know a lot about.
So you can not only practice the practice of game day, but also practice your tooling and make sure that you know the edges and the horizon for what's going on there.
But it gives you an anchor to say, well, we understand this particular component really well.
We don't really understand our tooling and our workflow yet.
So that's the part that we're going to be working on. One of the things that you want to get good at as you practice more and more game days
or these kinds of components
is setting your expectations in advance.
In some of the chaos engineering literature,
you'll see this as hypotheses, right?
You're setting a hypothesis.
We're going to test this piece of the component,
these edge cases, these kinds of behaviors,
and be very specific about it
so that you're getting the most out of where you think your reliability is stopped.
So I can say, well, I have this one backend dependency that is not as reliable as I'd
like it to be.
So we've done some defensive coding in our widget.
And now our game day, we're going to test that. So we're going to black hole that dependency,
see what gets flexed in our widget
and make sure that the reliability
that we wanted out of our stuff is now okay,
regardless of what that crazy backend is doing.
So even if it goes completely offline,
we can still deliver what we need to do.
So you can be very detailed and very specific
about the things that you want to exercise, the things that you want to learn, and the goals that you're going to set then for the
reliability of the components that you're working on. Because they're going to be different. Like
distributed systems are wild. So like any given widget might have completely different specs on
it and different expectations and different edge cases for reliability.
So you can flex all of those things with the modern tooling, which is great, but can be overwhelming, definitely.
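A black-hole experiment like the one described above is really exercising the defensive coding around the unreliable dependency. Here is a minimal sketch of what that defensive path might look like, assuming a hypothetical backend URL, a 500 ms tolerance, and an empty-list fallback; the names and values are illustrative, not prescriptive.

```python
import requests

RECOMMENDATIONS_URL = "https://flaky-backend.example.internal/recommendations"  # hypothetical dependency
FALLBACK = []  # a safe default the widget can render without the backend

def get_recommendations(user_id: str) -> list:
    """Return recommendations, degrading gracefully if the backend is slow or gone.

    Black-holing RECOMMENDATIONS_URL during a game day should exercise exactly
    this path: the request times out (or fails) and the caller gets FALLBACK
    instead of an unhandled exception.
    """
    try:
        resp = requests.get(
            RECOMMENDATIONS_URL,
            params={"user": user_id},
            timeout=0.5,  # our stated tolerance for this dependency
        )
        resp.raise_for_status()
        return resp.json()
    except requests.RequestException:
        # Dependency unreachable or too slow: serve the degraded experience.
        return FALLBACK
```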
So Brian and I, we have a big background in performance engineering, load testing, performance
testing.
I would assume that because you mentioned in the beginning, most organizations don't have, let's say, a production environment
where you can just run chaos or a good-sized staging environment
that is under load because of some shadow traffic.
Does it mean you always have to have at least some type of load testing?
And does this mean you also get in the performance engineers
and the site reliability engineers?
Because otherwise, if you just pull a cable, that's like, you know, simulate a black hole
or simulate latency between two components, that's an unrealistic scenario if you don't
really have also realistic load on the system itself.
So how do you deal from a performance engineer, like getting actually load on the system?
Is this... how does this typically work? Do you find organizations that have load tests, or do you then, in preparation of this, actually have to write load tests or also try to generate some load? Because otherwise it would be a meaningless test.
Yeah, of all the stuff that we sort of run up against, I think
some kind of load generation is probably the thing we're most likely to find
in an environment already.
I think in many cases,
that's pretty well understood
as far as we need to generate enough load
to exercise these components.
Where that gets a little bit more sophisticated
is what does that load look like?
Do your users behave differently
if they're logged in versus not logged in?
Do you need to simulate that?
Is it different if the users are coming in
from mobile versus desktop?
And do you need to simulate that?
And being more specific about that mix
versus we're just going to throw
an anonymized version of yesterday's load
at whatever the old school stuff is.
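As a sketch of what a slightly more realistic load mix could look like in code: instead of replaying yesterday's traffic, weight requests across logged-in versus anonymous and mobile versus desktop profiles. The target URL, weights, headers, and pacing are assumptions for illustration, not a recommendation of any specific load-testing tool.

```python
import random
import time
import requests

TARGET = "https://staging.example.internal/home"  # hypothetical system under test

# Assumed traffic mix -- derive the real weights from your production analytics.
USER_MIX = [
    ({"logged_in": True,  "agent": "mobile"},  0.35),
    ({"logged_in": True,  "agent": "desktop"}, 0.25),
    ({"logged_in": False, "agent": "mobile"},  0.25),
    ({"logged_in": False, "agent": "desktop"}, 0.15),
]

def one_request() -> None:
    # Pick a user profile according to the weighted mix.
    profile = random.choices([p for p, _ in USER_MIX],
                             weights=[w for _, w in USER_MIX])[0]
    headers = {"User-Agent": "mobile-app/1.0" if profile["agent"] == "mobile" else "Mozilla/5.0"}
    cookies = {"session": "fake-session-token"} if profile["logged_in"] else {}
    requests.get(TARGET, headers=headers, cookies=cookies, timeout=5)

if __name__ == "__main__":
    # Very naive pacing: roughly 10 requests/second for one minute.
    for _ in range(600):
        one_request()
        time.sleep(0.1)
```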
So what this tells me, it seems that,
because we run into quite a lot of organizations
that don't have enough load tests.
And for me, this means like almost a maturity level, right?
It might be, yeah.
So that means people only start thinking
about chaos engineering if they actually have the basics
of performance engineering covered,
because otherwise chaos engineering is a too big step
and a too far leap to take.
So that's great to hear.
But you brought up another great use case or scenario now,
because obviously if you have a chaotic situation,
your end users will change their behavior,
which means you also need to simulate
that type of changed behavior. And for that,
again, you need to make a hypothesis... like, you need to have a hypothesis, hypotheses, whatever.
You know what I mean. That works, that works. And then to also change the load patterns, yeah.
That's... yeah, like thundering herd can be a real issue for some environments.
Back when I was a sysadmin, we had globally distributed load balancing.
And if something happened in one pod and it completely fell down, all the traffic moved to the other pods.
And then those pods would go down and the original pod would recover and the traffic would swing back.
And you could just watch the loops just go and go and go until you got the whole thing quiesced. And it was just absolutely bonkers, and hopefully nobody has to do that anymore. But yeah.
Yeah. And do you find... sorry, with Andy's hypotheses...
That's the plural, that's the plural version of it, I think: hypotheses. Do you find organizations have the ability,
based on watching those experiences in production, to put together good concepts of how users react?
Or I would imagine most people have no idea what users are doing when the proverbial poop hits the fan, right? So... and see, Andy, I kept it clean, unlike you. So it just seems like it's a lot of guesswork unless... I mean, do you find customers who actually do study that, and how rare is that?
Yeah, they actually do. Like, that hypothesis is a guess, right? Like, that's part of what you're
going to discover, uh, potentially in some of these instances and some of the things that you're going to be doing.
So, you know, especially if you get the opportunity to do some of your chaos testing in production to be able to say, all right, this is the module that we're currently targeting.
We're going to turn it off and see what happens, right?
That kind of thing.
We'll turn it off for five minutes and see what users do.
Do they continuously rage click, right?
Like, is that what we're looking for?
They're reloading frantically to find that thing.
Or do they not even notice it's not there?
Like, those are all the kinds of things that you can sort of pick out from what happens
if you have the ability to test these things in production.
It's funny, you know, you mentioned they don't even notice it's not there, because that's
a great feature to get rid of then.
Yeah.
Well, yeah, that's one of the things you find, right?
Like if you're spending all your time trying to increase the reliability of this widget
that no one uses, you are wasting your resources.
So knock it off and put them where the things are that people care about.
Or maybe you didn't have the right monitoring in place from the start, because then you would already know that nobody's using it
and you would never have a game day around it.
Hey, quick question.
Also, one of the things that I've kept hearing
is they said, well, this is all unrealistic.
Nobody would pull cables.
Nobody would do X, Y, Z, right?
So my question to you now is,
A, are there unrealistic scenarios that you would not do as a start?
Because they might really be very, not unrealistic, but maybe uncommon and less probable.
Are there certain things you would start with?
Are there certain chaos scenarios where you say, these are the two, three that we always start with because they are the most likely to happen.
And therefore, you want to make sure that you're testing against those?
I think it depends on the application, and what you know about it already. And
in my experience, with the things that I've run and the things that I've worked on, the
most common things are going to be unreliable backends.
Like you're going to have something that you depend on and it either belongs to another vendor or it's in another data center or something weird happens or that team doesn't update and breaks things.
And just the nature of distributed elements in the environment, it's not as reliable as you need it to be. Then there's other stuff that's easier to do.
Slow DNS, slow response times from other places,
long running database queries that are really easy to practice with
that are good places to start too.
But it depends on the architecture of your environment.
If you're using a lot of caching,
maybe long running database queries aren't the biggest deal for you. So it just kind of depends on what's already built into
the environment that you're working on. But yeah.
What about, like, other scenarios, like disgruntled employee scenarios? Does that come into play here?
I haven't seen that as something out of the package, but that's kind of interesting.
So I remember way back, with one of the other load testing tools, I don't think it impacted production, but when a major buyout happened, somebody
went and deleted the knowledge base, which isn't quite necessarily
It's just data gone in one way.
But I don't even know how common it is for disgruntled employees
to actually take revenge on systems, because obviously that's going to open
them to a lot of legal issues.
But yeah, I was curious if there's a disgruntled employee scenario
in chaos testing, which.
Not that I've seen.
That's super interesting, though.
But like that's a question for your DR planning, I think more than your chaos testing.
Well, but that actually would explain the pulling the cables.
I mean, that's something where you say, hey,
if you say pulling the cables is not realistic, well,
if you have a disgruntled employee,
then maybe they go crazy and they do something stupid
like that.
Maybe.
Hopefully, they don't have access to the data center,
but who knows? At least it's a fun theme for a game day.
Yeah.
If nothing else.
Corporate espionage.
Yeah.
Cool.
I really like the scenarios.
Unreliable backend makes a lot of sense.
Slow DNS, long running database queries.
I guess this is stuff that you can inflict through latency, right?
So there's different tools out there that allow you to increase latency.
I guess the black hole example, I think you call it black hole, right?
Where you basically say nobody, like I cannot connect to that system.
You can't connect to that, yeah.
Yeah.
Technically, how do you do it?
Do you just configure routing or make table changes?
How does this typically work?
Yeah, it depends.
Like if you are subscribed to an actual chaos tool,
so there's a handful available,
some open source solutions as well.
There'll be often an agent that works on the host
that will get in the middle of the traffic
and either pull the traffic off the wire
and dev null it or whatever from that perspective
so that it never gets up the stack into the application
and blocks it that way to sort of mimic those behaviors or puts delays on it,
slows it down, that kind of scenario.
So that you're not actually making live changes to the built environment.
You are really just getting in the middle of those inner machine conversations
and inflicting the chaos there.
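To illustrate the "get in the middle of the traffic" idea, here is a toy TCP proxy in Python that either drops a fraction of connections or adds latency before forwarding them to the real backend. Real chaos tools do this with host agents and far more care; the port, backend address, and probabilities here are made-up values for a sketch.

```python
import random
import socket
import threading
import time

# Toy latency/fault proxy: point a client at LISTEN_PORT instead of the real
# backend, and the proxy delays or drops traffic on the way through.
LISTEN_PORT = 9000
BACKEND = ("backend.example.internal", 5432)  # hypothetical dependency
ADDED_LATENCY_S = 0.8      # delay injected per connection
DROP_PROBABILITY = 0.2     # fraction of connections to refuse entirely

def pipe(src: socket.socket, dst: socket.socket) -> None:
    # Copy bytes one way until the source closes, then close the destination.
    try:
        while data := src.recv(4096):
            dst.sendall(data)
    except OSError:
        pass
    finally:
        dst.close()

def handle(client: socket.socket) -> None:
    if random.random() < DROP_PROBABILITY:
        client.close()            # drop the connection entirely: a crude stand-in for a black hole
        return
    time.sleep(ADDED_LATENCY_S)   # latency injection before we even reach the backend
    upstream = socket.create_connection(BACKEND)
    threading.Thread(target=pipe, args=(client, upstream), daemon=True).start()
    threading.Thread(target=pipe, args=(upstream, client), daemon=True).start()

def main() -> None:
    server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    server.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    server.bind(("0.0.0.0", LISTEN_PORT))
    server.listen()
    while True:
        conn, _ = server.accept()
        threading.Thread(target=handle, args=(conn,), daemon=True).start()

if __name__ == "__main__":
    main()
```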
Hey, and how many organizations did you work with
then not just using these game days to figure out where the weak spots are
and learn on how to manually react to it,
but how many are in a state where they
already start to test their auto remediation, like really figuring out, hey, we actually
have some auto remediation in place, meaning our monitoring detects a problem and is then
triggering a script to, I don't know, fix a certain thing like redirecting traffic to
a different system or something like this.
Do you have an idea of how many organizations are that mature to actually have all the remediation
scripts in place that they test with this?
It feels like that stuff's really just getting started, or I will say maybe just getting
restarted.
Because those are the kinds of things that kind of have to get rebuilt every time you
change an infrastructure paradigm.
It's like, you kind of understood what was going on when we all had virtual machines
and powerful load balancers, and what had to happen there.
And then you have like containers and what you can do there, redeploying a container,
moving that load around.
Now that we're in, folks are starting to move more loads into something like Kubernetes.
There's like another set of solutions that need to get built around those components and what decisions folks want to make about expanding a pod or moving
a pod or whatever they're going to do with it. So I think we're kind of in a place where the folks
that are leading the charge into Kubernetes are just getting that stuff built. And folks that are
more well understood, or I don't want to use the term legacy or vintage, but like more established practices are already well on their way with a lot of that stuff.
So it really varies on and it's one of those ironic things where like the old stuff might have the best auto remediation because they know all they have to do is go in and poke that thing and it's fine versus what kind of sophisticated solution do we want to leverage in our newer environments?
So it widely varies for auto remediation around different kinds of infrastructure and it varies for different kinds of runtimes.
Some of them are better at being restarted and doing a hot restart and those kinds of things too.
Yeah, or systems taking... like, as you mentioned Kubernetes, right? Kubernetes obviously has some built-in mechanisms
already to recycle pods and do things like that. Yeah, I just came off a call, and there
the topic of auto-remediation was also a big thing. And the use cases that they came up with are
the classical ones, like cleaning disks when the disk is full, or database slowness
because reindexes are needed, and stuff like this.
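For the classic disk-full case, an auto-remediation script can be as small as the sketch below: a monitoring alert triggers it (via webhook, runbook automation, or similar), it checks disk usage, and it purges stale temp files instead of paging someone. The mount point, directory, retention window, and threshold are assumptions, not recommendations.

```python
import shutil
import time
from pathlib import Path

# Assumed values -- tune to your environment; deleting files is destructive,
# so scope the directory carefully.
MOUNT = "/"
TEMP_DIR = Path("/var/tmp/app-cache")
MAX_AGE_SECONDS = 7 * 24 * 3600   # keep one week of cache
USAGE_THRESHOLD = 0.90            # act when the disk is 90% full

def disk_usage_fraction(path: str) -> float:
    usage = shutil.disk_usage(path)
    return usage.used / usage.total

def purge_old_files() -> int:
    removed = 0
    cutoff = time.time() - MAX_AGE_SECONDS
    for f in TEMP_DIR.glob("**/*"):
        if f.is_file() and f.stat().st_mtime < cutoff:
            f.unlink()
            removed += 1
    return removed

if __name__ == "__main__":
    before = disk_usage_fraction(MOUNT)
    if before >= USAGE_THRESHOLD:
        count = purge_old_files()
        after = disk_usage_fraction(MOUNT)
        print(f"disk at {before:.0%}, removed {count} stale files, now {after:.0%}")
    else:
        print(f"disk at {before:.0%}, nothing to do")
```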
Oh, yeah.
The new triggers and things like that.
The classic things that we've kind of been working on,
like there were issues like 15 and 20 years ago
are still kind of under the hood there being issues.
Yeah, exactly.
Same thing with those performance patterns, right,
Andy? Same old ones in new ways.
It is what it is, yeah.
Hey, now, Mandi, let me ask you, kind of getting to the end,
but so, from your world, what you see from a PagerDuty perspective,
where people are using your product and your services
to really then handle incident response.
Like you're routing the traffic to the right people, right, obviously.
I guess your goal is obviously to reduce MTTR, mean time to respond, mean time to repair.
I really like what you mentioned, what you said earlier, that you have some statistics
or like there's certain metrics in your platform that already give you some indication
how you can then kind of plan your game day.
Is there anything else that we can learn
from especially that part of incident response
on how to improve system architectures
and systems in general?
Like, do you see certain patterns in organizations
where you say,
why are they doing this?
I don't know, anything else from the statistics from the KPIs?
That's kind of an interesting question because, like,
a lot of it does come down to using your game days to give your folks
a place to practice the stressful part of the incident
and like, getting to reducing your MTTR in some environments is going to come down to
making sure your engineers aren't freaking out, right, and not being able to think straight. So it's also helping them with being comfortable,
actually responding to the problem and recognizing that, you know, it's okay,
we're going to work through this. We're going to figure out what's going on, make sure that
we're looking at the right alerts and they're telling us the right things and we have the
right access and getting that
sort of muscle memory around all of that stuff. And all of that helps improve reliability over
time too. Because the worst thing you can end up having there is you've got maybe someone who's
new on the team. They don't know where they're looking. They don't know where to find things
in either PagerDuty or in your wiki or wherever they are. And they're kind of just lost and staring like deer in headlights.
Like, what do I do?
And they kind of spin for a few minutes and then escalate.
And like, you don't want to get into that kind of cycle.
So the whole practice also gives you a way to make sure they know where things are
and lower that barrier for them to be able to triage things,
to know where the auto-remediation scripts even are.
Are they attached to the platform?
Are they on a bastion host?
Where are they?
Like all that kind of practice that comes into it
so that when something does happen,
there's less of a barrier to success for that whole thing.
Because it's a lot of really complex environments
out there, and you can't just drop someone in the middle and they're going to figure it out. You really have to lead them through a lot of it so that they understand the context.
So it's practice, practice, practice is what I hear. Just like the volunteer firefighters.
Absolutely.
Just like anything.
Besides computer performance,
I come from a stage performance,
either plays in high school or music.
Oh, yeah.
And it's like you practice and practice and practice.
And then when you go on stage for the first time,
you're nervous and all that.
But after a few times, you get used to it. And I remember even a fun short story back when I was in middle school.
I think it was eighth grade.
We were doing a performance, and the other kid realized he left the prop in the prop room.
So we're in front of the stage, in front of the camera, just the two of us.
We were doing Oklahoma!, and there are so many things to say about doing that play with an all-white school. But he left the
prop in the thing. So he's like, I know just what you need, I'll be right back. And he runs off stage
and suddenly it's me in front of the curtain, in front of the whole... you know, all the parents. I'm like, crap. And I look at the director, and she's freaking out. So I'm like, all right, I gotta do something. So I just start, like, pretending to talk to myself and complain. But because I was comfortable enough from all those practices, from the dress rehearsals and all, I was like, all right, I got to act here.
But to your point, you know, and not to make this about me, because that's what I just did there, the more you go through those game days, the more comfortable you'll be.
You'll also be able to find... like, is there a focus on people who have a lot of pressure, and how you can work with those people? Is part of game day identifying the people who panic, and figuring out some way to work with them to get them over that panic? Because I think
some people are just prone to that, right?
Absolutely. And like, sometimes you might... breathing exercises are a good one, you know. But also, like, recognizing that unless you are actually in a medical environment, no one's going to die. And taking out some of the external stresses that folks sometimes have about decision making, or do I escalate this, by having, like, a third party there who's going to help you make those decisions, the incident commander, as well as kicking out people who aren't helpful, for lack of a better word, right? Like, it doesn't help your responders to have someone there screaming at them to work faster, right? That is counterproductive.
So having a practice around focusing on the people who are solving the problem,
making sure they have the space to work through the troubleshooting and debugging that they're doing
and, you know, being very predictable about what that workflow is going to look like. And like you bring up working with like plays and stuff for me,
it's marching band, right? Like marching band,
you've got like 150 kids on the field and everybody's got a different
instrument and there's flags and batons and like crazy stuff flying around.
And like you practice all summer, right?
Like you learn the music, you think at the beginning,
like here's these 12
pieces I have to learn. Oh my gosh, this is crazy. But by the end of August, like you have most of
them memorized, you know, your steps, you're on your right foot and your left foot and your turn
and all this stuff. And even for like junior high and high school kids, like it seems crazy when you
start out, but like by the end of the football season, you're a pro. So it's the same, same kind
of thing. You want to practice this to the point where like you're a pro,
you know what the music is, you know what your steps are,
and part of them are going to be in the wiki,
part of them in this other platform,
and here's how I get to where I need to go.
And giving everyone the game days give you the opportunity to say to the
newest person on your team, hey, man, here's how we learn how to do this.
And here's how we count our steps off and here's how we play our music and,
and get all that stuff going. So.
Awesome.
Shout out to all the band nerds out there. Yeah.
My wife hates it when I say I wasn't, I wasn't in the marching band.
I was in the drum line.
Oh yeah. Yeah.
We'll allow it.
Hey, Mandi, in conclusio, or kind of like to close it off...
I know, right? Hypothesis, conclusio, what else? I learned my Spanish this morning because I'm on Duolingo.
Silencio.
Silencio, exactly.
No, but for kind of closing words,
if you want to give one piece of advice for people that are not yet sure
is chaos engineering the right thing for them,
should they do a game day,
give them one last reason to say
this is why you want to do it.
And maybe just repeat something you've already said.
But this is kind of your moment to say, do it because of this.
Yeah.
I mean, you're going to do it because you're going to improve.
You want to improve the experience of your product
for your users and for your customers.
You're after the most reliable system you can build, right?
And part of that is trying to get ahead of the bad stuff,
the weird stuff that might happen when you get into production.
And that's the Wild West.
There's crazy stuff out there, man.
Yeah.
And with this, you not only make your product better
and your service better,
but also your work-life balance better, I guess.
Absolutely.
Because that's really cool.
Hey, Mandi, we saw each other at DevOpsDays Raleigh.
Are there any other appearances coming up?
Do you have any other conference talks where people can see you?
I will be at GlueCon, which is May 17th, 18th, 19th, I think, right?
Then we have PagerDuty Summit, which is three hybrid events running in San Francisco, Sydney, and London. I will be in person in San Francisco and London, and those are in June. And yeah, I don't know after that yet. Like, August is pretty quiet, so I might actually be home.
Well, you gotta practice, right?
I know, right?
Very cool. Brian, any final words from you?
No, nothing. I mean, the only other thing I'd add, the only other reason I'd add to do the chaos, to do the game days, is probably because they're a lot of fun, right? You talk about making your job better; that's gonna be really fun, and you're gonna learn. I mean, everything that Mandi said, you're gonna get out of it, but you're also gonna have fun at work, right? Which is... so, most people, like, if you can find ways to make your job fun, go for it. And it's going to have a fantastic payoff for the organization. So if the higher-ups are looking, seeing people doing this crazy thing and having fun, they're like, what's going on? Like, oh no, no, no. Look, we're doing something good, right?
So anyhow.
Mandi, thank you so, so much.
Andy, did you have any final words?
Or if that's it?
No, for me, that's it.
I definitely have some additional great arguments on why chaos engineering,
and especially I really like the common scenarios
and some of the metrics you told me.
I think that's really nice.
It helps me in my conversations when people ask me about it.
What's the reason for it?
What's the benefit?
No, I'm really happy that you were here.
And I'm pretty sure this was not the last time we saw each other.
We should chaos engineer the recording.
I'm kidding.
No, please don't.
We get that all the time.
That was when we were using those other platforms.
So anyway, Mandi, thank you so very much for being on today. We really appreciate it, and we love these topics. These are the real fun ones; even talking about it is fun.
Yeah, definitely. That would be fun to do, chaos engineering. It's a fun topic to talk about.
So thanks to all of our listeners, as always. If you have any comments or questions,
Pure underscore DT on Twitter, or Pure Performance at dynatrace.com.
Mandi, I believe you'll probably send us some links on how people can follow and/or interact
with you on various platforms, if you have any.
Yeah, I'm on Twitter as lnxchk. I'm also on Twitch on the PagerDuty channel a couple days a week.
So yeah, we'll include those in the show notes.
Yeah, and we look forward to... good luck with all your travels. And yeah, thank you. Hopefully we'll talk to you soon.
All right, guys, take care. Thank you. Bye.