PurePerformance - Getting Started with Chaos Engineering through Game Days with Mandi Walls

Episode Date: May 30, 2022

How do you plan for unplanned work, such as fixing systems when they unexpectedly break in production? Just like firefighters, the best approach is to practice those situations so that you are better prepared when they happen. In this episode we have Mandi Walls, DevOps Advocate at PagerDuty, explain why she loves Game Days where she is "practicing for the weird things that might happen". Prior to her current role she worked for Chef and AOL, picking up a lot of the things she is now advocating for. In our conversation Mandi (@lnxchk) gives us insights into how to best prepare and run game days, shares her thoughts on what good chaos scenarios are (unreliable backends, slow DNS, ...) and which health metrics (team health, number of incidents out of hours, ...) to look at in your current incident response to figure out what a good game day scenario actually is.

Mandi on LinkedIn: https://www.linkedin.com/in/mandiwalls/

In our talk we mentioned a couple of resources – here they are:
Mandi's talk at DevOpsDays Raleigh: https://devopsdays.org/events/2022-raleigh/program/mandi-walls
Ops Guides: https://www.pagerduty.com/ops-guides/

Transcript
Starting point is 00:00:00 It's time for Pure Performance! Get your stopwatches ready, it's time for Pure Performance with Andy Grabner and Brian Wilson. Hello everybody and welcome to another episode of Pure Performance. My name is Brian Wilson and as always I have with me my lovely co-host Andy Grabner. Andy, how are you doing today? I'm good. I was trying to make faces to kind of get you off your spiel, but I guess I didn't succeed. I saw you. You kept your cool. Yeah, I kept my cool, and maybe it just also means you weren't being funny enough, I don't know. Yeah, I guess so, right? And you need to
Starting point is 00:00:53 try better. You need to improve that comedy performance of yours. All right, well, the thing is, I'm not getting paid for being funny, right? Yeah, because you'd be poor. I know. So I'd make a lot of money in bad dad jokes, though. But see, I'm not even going to try. See, my favorite thing lately between you and I is when we start talking, and in the back of my head I think I'm going to think of a segue, and it never happens. It doesn't come, Andy. Yeah, I was about to say... I don't know what to say that was funny. I always think of a segue, because the first company I worked for that was related to performance was Segue Software back then. Yeah, I think they already... what was the name of that? What it was... Segue Software. Was this... Segue. But the actual, the instance,
Starting point is 00:01:47 it wasn't like how... it's like, you had Mercury, you had LoadRunner, WinRunner and all. Segue. Oh, Silk Performer. Silk Performer was the tool. Yeah, yeah. Or Silk Test, Silk Performer. And yeah.
Starting point is 00:01:56 But you know what? I don't think our listeners like to keep listening to us stroll down memory lane. Exactly. From Andy to Mandi. Mandi Walls, everyone. I knew you were going to use that. I peppered it on you.
Starting point is 00:02:07 I peppered it on you. At least I'll take a little credit. I know. It was your idea. But now, Mandi, welcome to the show. Hey, thanks, guys. It's nice to be here. Thanks for not running away after listening to us for three or four minutes.
Starting point is 00:02:21 I'm still here. Mandi, for those listeners who have never heard of you, you may want to introduce yourself: who you are, what you do, why you think you're on the show. Yeah. So thanks for having me. I am Mandi Walls. I'm a DevOps advocate at PagerDuty. And I'm a recovering systems administrator. That was my original role in technology. And then I spent a long time at
Starting point is 00:02:46 Chef Software doing automation and infrastructure as code and cool stuff like that in the DevOps world. And yeah, most recently for PagerDuty, we've been working on a series of talks about helping our customers deal with all the stuff that comes up when you put software into production: what happens, and how to manage it, and how to get better at it. So yeah. So isn't the solution just to not put anything in production? Don't attach anything to the internet, just don't, right? Just run it all local, all local, because it always works on my machine. So there was a really... I don't know if people are familiar with Tim and Eric, the comedy duo, who do this really weird
Starting point is 00:03:27 stuff, but they had a fun early internet type of video, and the internet came on a CD-ROM, and you could shop for clothes on the CD-ROM. And then when you wanted to buy one, you would print it out and mail it in to the company. But that was like the internet. No, it was the internet. But anyway, to your point, if you keep it all turned off and it's self-contained,
Starting point is 00:03:44 what's going to go wrong? Nothing. Nothing, no problems, nothing to talk about anymore. But fortunately there are problems out there, and there are solutions to problems, or at least approaches for how to make problems less painful, right? And the way we actually got to know each other... I think I've followed your work in the past, and I'm pretty sure I've actually seen you prior to the last time I saw you, which was like four weeks ago when we were both speaking at DevOpsDays in Raleigh. And you were, I think, presenting in the morning, and your talk was Plan for Unplanned Work: Game Days with Chaos Engineering.
Starting point is 00:04:20 Yeah. And that was really cool. Now, can you enlighten the listeners on what this all means? Because planning for unplanned work seems really cool because it kind of sounds contradictory. Yeah. And then also the concept of game days with chaos engineering. We had people and guests in the past to talk about chaos engineering, but I would love to hear from you. What's the big thing that people need to know about? Yeah, awesome. So we refer to unplanned work as like, obviously the opposite of planned work. So your planned work is like the sprints that you're working on, the things you're putting into your products, the features that you're developing sort of intentionally. And unplanned work is like all the weird stuff that happens when you put software into production and it like meets real users and they do strange things or you get like whatever weird requests come in off the Internet just because.
Starting point is 00:05:15 And you're just not sure what's going to happen. And all the other stuff that can sort of break or wiggle around, be kind of weird in distributed environments. So thinking about unplanned work as your incidents and alerts and what happens when something goes wrong and like the system breaks in some way and we kind of lump all that stuff together. And planning for that is part of the practice that we stress with our customers in that to get good at responding to those weird things that happen, you got to practice, but like practicing for weird things
Starting point is 00:05:55 requires more weird things to happen and we don't want that. So we started leveraging more chaos engineering practice, which is sort of fault injection, intentionally putting weird things into the system in a way that we know they're going to happen. We know where they're going to go so that we can test, hey, did our monitors work? Did our alerts go off appropriately?
Starting point is 00:06:17 Did we page the right people when we poked the system in this way? Did all the stuff that we expected to happen happen? Or was there something weird and wonky that didn't do what we expected it to do? So we can improve our response. We can improve our monitors. We can improve the overall reliability of all these systems. So it all sort of hangs together in a bigger practice. For me, this has a lot of parallels with, at least I assume, what firefighters do.
Starting point is 00:06:48 Firefighters, I guess they don't just wait for the next big fire and then go, what do we do? We haven't done this in three years, and then things fail. They actually practice all the time. And obviously they don't put houses on fire where people live in; they don't do it in production, but they actually train and exercise. Now this brings me to... I mean, you say that, but I grew up in a place with volunteer firefighters, so you could actually donate your house to the fire department and they would do a practice burn. If you had a house or a property that was in disrepair, they would absolutely do that, whether it was a tax write-off
Starting point is 00:07:25 or a donation or however that part of it worked. But I have seen many practice fires, because in a small community we have nothing else to do, so you kind of follow the fire department out, with the six-pack and everything, and you just take your lawn chairs and go watch the house burn. Super weird, but it definitely happens. And where I live now, the fire department uses the culvert across the street from me to practice with the tower ladder, so they all roll up with the big fire truck and distract me for an afternoon practicing using the tower ladder, which is fascinating. So everybody has to practice, and you're right, our incident response process
Starting point is 00:08:05 and a lot of these other things are really built off of emergency response, sort of, in the real world. And do you think we should start a business where people can donate their production environment? Yes, there you go. Oh, that'd be hysterical. We have this whole thing we're going to sunset. Or like, here's my AWS keys.
Starting point is 00:08:27 Do whatever you want for the next hours. You'd be amazed how many Bitcoin miners you can get loaded up in an hour. But yeah. You know, it's interesting on that idea, Andy. I remember when I first met Mark Tomlinson, and this relates to this idea very specifically. He was working down at the Microsoft Labs in one of the Carolinas, wherever that one is, and they invited us down to load up our system and do a performance test in a large lab with their experts looking at everything.
Starting point is 00:08:55 But it's almost as if you could do chaos testing as a service. You spin up an instance of your thing, and there'll be a strike team of chaos engineers who have, you know, lots of chaos experience, who come in and just really... you get the team in, you guys ready, and then you just chaos the crap out of it and make it a fun exercise for the whole team. That could be a side business there. Chaos for hire. Yeah.
Starting point is 00:09:22 It's like a hitman that you can hire. You could call it Captain Chaos. That was Dom DeLuise in the Cannonball Run, I think he was. But anyway, you have some kind of creative chaos. Anyway, that's my track. But I mean, to that point, that could be real. I think people would have a heck of a lot of fun with that.
Starting point is 00:09:34 You know? Yeah. So, anyone out there looking for a new entrepreneurial idea? There you go. Free ideas. Yeah. Free ideas. Hey, Mandi, one question that always comes up
Starting point is 00:09:42 when we talk about chaos engineering, do you really enforce chaos in production or do you do it in a pre-production environment? What's your take on it? You can do it anywhere, really. And for certain kinds of what we call chaos practices or the kind of experiments that you can do with a chaos framework,
Starting point is 00:10:02 you really want to be running at least some of them... is shift that left, right? Things like, how does this library respond when the backend doesn't respond within the tolerance? What does it do? And doing that sort of testing super early in the development cycle,
Starting point is 00:10:18 so you're not putting something into production that's way out of your tolerance, and then discovering that when you inject your fault there, that there's a big problem. So certain kinds of tests work real well, pushing them into the dev cycle really early so that they're part of your initial cycle of tests. So you can say to the developer,
Starting point is 00:10:37 hey, this is outside of our operational parameters. Take another look at this component. And then once you get into a larger environment, if you have... not everyone does have a really rigorously maintained staging environment, but you can absolutely be doing things there before it gets to production. Not everyone has the time or resources to have an exhaustive staging environment. Hopefully they have something, but you just never know. But you can be testing, running these kinds of tests, in any environment, really. But I think some people we've talked to in the past,
Starting point is 00:11:18 Andy, right, they've advocated for chaos in production, right? Which I think... That's going to tell you the most. Yeah, yeah. But at the same time, I think even just based on what you're saying with the shift-left idea, which I think is very, very sound advice, you probably wouldn't want to encourage organizations that are just starting to experiment with chaos to start in production, because probably everything will fall apart if they haven't been thinking about it.
Starting point is 00:11:52 Right, right. So maybe you work out some of the bigger things, and production is for some of the more edge cases. I don't know, Andy or Mandi, if you all have ideas on that, because it would seem like people would be like, yeah, let's do chaos testing and let's start right in production, where you know everything's going to fall apart if you're not prepared. So maybe don't start in production, I think.
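To make the shift-left idea concrete: Mandi's example of checking how a component behaves when a backend blows past its response tolerance can be written as an ordinary unit test long before anything reaches production. Below is a minimal, stdlib-only Python sketch; the function name, the half-second tolerance, and the deliberately slow fake backend are illustrative assumptions, not anything from the episode or from PagerDuty.

```python
import threading
import time
import unittest
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer

TOLERANCE_SECONDS = 0.5  # illustrative operational parameter


def fetch_profile(url: str) -> dict:
    """Hypothetical client code under test: call the backend, fall back on timeout."""
    try:
        with urllib.request.urlopen(url, timeout=TOLERANCE_SECONDS) as resp:
            return {"source": "backend", "body": resp.read().decode()}
    except OSError:
        # Covers URLError and socket timeouts: a slow dependency must not break the feature.
        return {"source": "fallback", "body": "{}"}


class SlowBackendHandler(BaseHTTPRequestHandler):
    """Fake dependency that always answers slower than the tolerance."""

    def do_GET(self):
        time.sleep(TOLERANCE_SECONDS * 3)
        self.send_response(200)
        self.end_headers()
        self.wfile.write(b"{}")

    def log_message(self, *args):
        pass  # keep test output quiet


class QuietHTTPServer(HTTPServer):
    def handle_error(self, request, client_address):
        pass  # ignore broken-pipe noise after the client gives up


class ShiftLeftChaosTest(unittest.TestCase):
    def test_falls_back_when_backend_is_too_slow(self):
        server = QuietHTTPServer(("127.0.0.1", 0), SlowBackendHandler)
        threading.Thread(target=server.serve_forever, daemon=True).start()
        try:
            result = fetch_profile(f"http://127.0.0.1:{server.server_port}/profile")
            self.assertEqual(result["source"], "fallback")
        finally:
            server.shutdown()


if __name__ == "__main__":
    unittest.main()
```

A test like this fails the build the moment someone removes the defensive timeout, which is exactly the "outside our operational parameters" conversation Mandi describes having with the developer.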
Starting point is 00:12:06 Yeah. Your goal is reliability, right? Yeah. So you know we can't tack reliability on at the end of the dev cycle. We know it has to be built in from the very beginning. We have to be thinking about it as we're planning our features. So we want to be able to exercise some of those reliability components as early as possible. So putting some of your chaos testing, your reliability measurements,
Starting point is 00:12:30 earlier in the dev cycle is going to help you do that. For folks that are just starting out, that's like going through the data center and pulling cables. You have no idea what's going to happen if you haven't really been thinking about planning intentionally for that kind of reliability, and not everyone does until maybe they've had something happen. You know, it's kind of a response to a catastrophe or a major outage or some other kind of incident, and then somebody freaks out and says we have to do this now and tries to jump in with both feet into production. And that's... isn't that human nature? That you always push things out until actually something terrible happens.
Starting point is 00:13:12 Absolutely. You could have avoided it by doing something else. And then, yeah. Balancing your risk and resources. Exactly. One of the things, and this was in the last recording we had, on the last Pure Performance podcast.
Starting point is 00:13:29 We also talked a little bit about chaos engineering. But one of the terms we brought up, and this came up in yet another podcast with Ana Medina from Gremlin: she called it test-driven operations, which I thought was actually pretty nice. Because if you think about test-driven development, you write code and you write the tests first to make sure that everything is functionally correct. But test-driven operations is where you're actually taking chaos engineering as an opportunity to validate whether your operations work as expected. So do you get your telemetry? Do you get your alerts? Do the right people get notified? Do you get the root cause information when you're actually inflicting certain things? And then how fast can people react? So test-driven operations, I thought, was always a really cool term, and I think most of the credit goes to Ana for that term.
Starting point is 00:14:25 But what do you think about it? Yeah, I think it's super interesting, right? Because part of it, too, are all the things that you've mentioned, plus: the folks who are responding, do they have the right access to solve the problem, right? That can be an issue as well. Do they know how to find the documentation on the service? Is it actually something that's public, or has someone decided to hide it behind some kind of sign-in? And all those other components that you don't want to be worrying about if you're actually having a real incident or real outage. You want to make sure that that path is as smooth as possible before anything bad happens. So yeah, love that.
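The test-driven-operations framing can be pictured as a checklist the game day runs mechanically: write the operational expectations down first, then let the injected fault prove them. Here is a rough Python sketch of that shape only; every probe below is a stub you would wire to your own monitoring, paging, and documentation systems, not a real API.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class OperationalCheck:
    """One expectation the injected fault is supposed to prove out."""
    name: str
    passed: Callable[[], bool]


def run_game_day_checks(checks: List[OperationalCheck]) -> bool:
    """Walk the checklist during the game day and report each result."""
    all_ok = True
    for check in checks:
        ok = check.passed()
        all_ok = all_ok and ok
        print(f"[{'PASS' if ok else 'FAIL'}] {check.name}")
    return all_ok


# Stub probes: in a real game day these would query your monitoring, paging,
# and documentation systems. Nothing below is a real API.
def alert_fired_within_two_minutes() -> bool:
    return True  # replace with a query against your alerting tool


def right_on_call_engineer_was_paged() -> bool:
    return True  # replace with a lookup in your paging platform


def responder_has_access_to_affected_system() -> bool:
    return True  # replace with an access check


def runbook_link_resolves_without_sign_in() -> bool:
    return True  # replace with an HTTP check against the runbook URL


if __name__ == "__main__":
    checklist = [
        OperationalCheck("alert fired within 2 minutes", alert_fired_within_two_minutes),
        OperationalCheck("right on-call engineer was paged", right_on_call_engineer_was_paged),
        OperationalCheck("responder has access to the affected system", responder_has_access_to_affected_system),
        OperationalCheck("runbook link resolves without a hidden sign-in", runbook_link_resolves_without_sign_in),
    ]
    run_game_day_checks(checklist)
```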
Starting point is 00:15:06 Yeah. You also... I mean, Brian, I don't remember his name, but we had, in the very early days of our podcast, and we talked about this, a guy from AWS, and he, I think, brought up the concept of not only bringing chaos to your system, but also bringing chaos to the people. Meaning, for instance, sending people home that are essential to incident response, or taking away their laptops, and seeing how the organization itself reacts to a chaotic situation that is then even more chaotic, because the key people that you normally rely on are also not there. Because what if you're getting sick
Starting point is 00:15:50 and you cannot just call in? I think that was Brent from the DevOps book, The Phoenix Project. It was Brent that solved everything. Yeah. I mean, that's something that you have to
Starting point is 00:16:04 let people know that's going to happen. That can be really passive-aggressive if you haven't. It's more like, hey, you know what? We're going to practice this without Mike, because Mike's going out on paternity leave at the end of the month. Having that kind of talk with your team,
Starting point is 00:16:19 that's like, we have this key person, they're only human, so something might happen. And you don't really want to get dark with the bus factor things, but people go on vacation, they have kids, they have other parts of their life, right? So having a plan that says, all right, we've been relying on this person... And part of being a good manager of an operations team is recognizing that that's going on. And if the tooling that you're using is telling you that, hey, you know what, this one responder is responding to
Starting point is 00:16:50 like 80% of your incidents, maybe you need to sit down and work with your team on broadening that out and skilling up, so that, you know, you're not overburdening this one person. That can lead to burnout, it can lead to attrition, all those kinds of things come into it too. But yeah, practicing without the person you always call is definitely a good idea, because they've got a life. This is a great metric that you just mentioned: the number of responders you have, and who the top responders are. Is this a metric that you keep track of?
Starting point is 00:17:21 And then kind of with this, you actually see where your hotspots are? Yeah, there's a bunch of telemetry in the PagerDuty platform now that gives you team health and number of incidents out of hours and hours on incidents and all that kind of stuff, so you can start to get a picture of the aggregate health of the team. And it's a socio-technical system, that is, your team and the services that they run, together, right? So that you can see, hey, you've got a lot of things that are going on after hours, there are a lot of things that are coming to this one particular responder, with a lot of escalations to this one person, because somebody somewhere feels like they're the only person who knows what's going on here. So it's time to take a look at that. And it has to be intentional.
Starting point is 00:18:13 It's not one of those things that will improve passively. It's something that managers need to take on if they have an on-call team that they're responsible for to sort of take a look at those metrics, pull up the dashboards, and work through what's going on with the team there. Hey, coming back to the game days, which was one of your key talking points at the conference,
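The team-health signals described here (incidents out of hours, one responder absorbing most of the load) are easy to reason about even outside the PagerDuty platform. The following is a rough Python sketch over an assumed list of incident records; the field names, business-hours window, and threshold are made up for illustration.

```python
from collections import Counter
from datetime import datetime
from typing import Dict, List

# Assumed record shape: {"responder": "...", "started": "<ISO-8601 timestamp>"}
Incident = Dict[str, str]


def out_of_hours_count(incidents: List[Incident], start_hour: int = 9, end_hour: int = 18) -> int:
    """Count incidents that started outside an assumed business-hours window."""
    return sum(
        1
        for inc in incidents
        if not (start_hour <= datetime.fromisoformat(inc["started"]).hour < end_hour)
    )


def top_responder_share(incidents: List[Incident]) -> float:
    """Fraction of incidents handled by the single busiest responder."""
    if not incidents:
        return 0.0
    busiest = Counter(inc["responder"] for inc in incidents).most_common(1)[0][1]
    return busiest / len(incidents)


if __name__ == "__main__":
    sample = [
        {"responder": "mike", "started": "2022-05-02T02:14:00"},
        {"responder": "mike", "started": "2022-05-03T11:30:00"},
        {"responder": "mike", "started": "2022-05-07T23:05:00"},
        {"responder": "ana", "started": "2022-05-09T15:45:00"},
    ]
    print("out-of-hours incidents:", out_of_hours_count(sample))
    # 0.5 is an arbitrary illustrative threshold, not a PagerDuty recommendation.
    if top_responder_share(sample) > 0.5:
        print("one responder is absorbing most incidents: broaden the rotation and skill up the team")
```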
Starting point is 00:18:36 for, I'm sure there's a lot of organizations that come to you, right? Come to PagerDuty or come to you for advice, and they say, hey, we've never done this before. What do you typically suggest? What do people start with? Do they run a game day once a year, once a quarter, once a month?
Starting point is 00:18:52 What happens in the game day? Can you, like, especially for the listeners now that are new to this and they would like to start practicing, what are the kind of the, what's the core concept of a game day and how often do you run it and who should be involved?
Starting point is 00:19:14 Yeah, that's a lot. Okay. So game days are an intentionally scheduled pre-agreed upon sort of date and time where you are going to inject some kind of fault into the system. And you may know in advance, hey, we're going to test this one particular thing. So we're going to pull out this backend and see what happens. Or it might be more generic. We have this new feature. We're going to test around it and see what happens. But you plan it ahead so that the team knows to be ready, to be looking, to make sure they're seeing the alerts that they need to see. And maybe you have a checklist of alerts that come through and telemetry and monitoring and make sure all that stuff works and that's all fine. For teams that are new to it, one of the good things about game days
Starting point is 00:19:52 is that they can be very small and contained or they can be very large. And the sort of the legends, I guess, around the origins of this stuff is like crazy people going through and pulling cables in the data center and good luck, guess what happened and go from there. But with the modern chaos tooling that you can have, you can be very specific about the components that you're going to inject fault into because you've got little agents that are mostly going to run on your systems and take care of that for you. But you can then be very focused in the game day that
Starting point is 00:20:25 you're running. So what I mentioned in my talk is sort of how we do Failure Fridays at PagerDuty. And any team can have their own Failure Friday, and they may not even run it on a Friday, just depending on what their schedule is. But like one team, if they've shipped a new feature, it's in production, but maybe it's under a dark flag or something like that, and they want to do some testing around it and inject some fault and be very conscious of what's going on there, they can have their own sort of mini game day to practice on, right? And just kind of declare, Wednesday at noon Pacific we're going to be exercising this widget that lives in the ecosystem,
Starting point is 00:21:05 and this is what we're going to do. And then they can use that to learn from and improve their reliability. So for folks that are just starting, you can start with a small component. You might start with something that is relatively stable, that you know a lot about. So you can not only practice the practice of game day, but also practice your tooling and make sure that you know the edges and the horizon for what's going on there. But it gives you an anchor to say, well, we understand this particular component really well. We don't really understand our tooling and our workflow yet. So that's the part that we're going to be working on. One of the things that you want to get good at as you practice more and more game days or these kinds of components
Starting point is 00:21:47 is setting your expectations in advance. In some of the chaos engineering literature, you'll see this as hypotheses, right? You're setting a hypothesis. We're going to test this piece of the component, these edge cases, these kinds of behaviors, and be very specific about it so that you're getting the most out of where you think your reliability is stopped.
Starting point is 00:22:11 So I can say, well, I have this one backend dependency that is not as reliable as I'd like it to be. So we've done some defensive coding in our widget. And now our game day, we're going to test that. So we're going to black hole that dependency, see what gets flexed in our widget and make sure that the reliability that we wanted out of our stuff is now okay, regardless of what that crazy backend is doing.
Starting point is 00:22:37 So even if it goes completely offline, we can still deliver what we need to do. So you can be very detailed and very specific about the things that you want to exercise, the things that you want to do. So you can be very detailed and very specific about the things that you want to exercise, the things that you want to learn, and the goals that you're going to set then for the reliability of the components that you're working on. Because they're going to be different. Like distributed systems are wild. So like any given widget might have completely different specs on it and different expectations and different edge cases for reliability.
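Mandi's example hypothesis, black-hole the flaky dependency and verify the widget still delivers, can be phrased as a steady-state probe that runs before and during the fault. A hedged Python sketch of that structure follows; the endpoint, latency budget, and the black_hole stand-in are illustrative, and the actual injection would come from whatever chaos tooling you use.

```python
import time
import urllib.request
from contextlib import contextmanager

WIDGET_URL = "http://localhost:8080/widget"  # illustrative endpoint
LATENCY_BUDGET_SECONDS = 1.0                 # illustrative reliability goal


def steady_state_holds() -> bool:
    """Hypothesis probe: the widget answers with HTTP 200 inside its latency budget."""
    start = time.monotonic()
    try:
        with urllib.request.urlopen(WIDGET_URL, timeout=LATENCY_BUDGET_SECONDS) as resp:
            ok = resp.status == 200
    except OSError:
        return False
    return ok and (time.monotonic() - start) <= LATENCY_BUDGET_SECONDS


@contextmanager
def black_hole(dependency_host: str):
    """Stand-in for your chaos tooling: drop traffic to the dependency, then restore it.
    How that is actually done (agent, firewall rule, service-mesh policy) depends on the tool."""
    print(f"injecting fault: black-holing {dependency_host}")
    try:
        yield
    finally:
        print(f"rolling back: restoring traffic to {dependency_host}")


if __name__ == "__main__":
    assert steady_state_holds(), "system is not healthy to begin with; don't run the experiment"
    with black_hole("flaky-backend.internal"):
        hypothesis_held = steady_state_holds()
    print("hypothesis held: widget survives the black hole" if hypothesis_held
          else "hypothesis failed: widget leans too hard on the backend")
```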
Starting point is 00:23:11 So you can flex all of those things with the modern tooling, which is great, but can be overwhelming, definitely. So Brian and I, we have a big background in performance engineering, load testing, performance testing. I would assume that because you mentioned in the beginning, most organizations don't have, let's say, a production environment where you can just run chaos or a good-sized staging environment that is under load because of some shadow traffic. Does it mean you always have to have at least some type of load testing? And does this mean you also get in the performance engineers
Starting point is 00:23:44 and the site reliability engineers? Because otherwise, if you just pull a cable to, you know, simulate a black hole, or simulate latency between two components, that's an unrealistic scenario if you don't also have realistic load on the system itself. So how do you deal, from a performance engineering perspective, with actually getting load on the system? How does this typically work? Do you find organizations that have load tests, or do you then, in preparation of
Starting point is 00:24:10 this, actually have to write load tests or also try to generate some load? Because otherwise it would be a meaningless test. Yeah, of all the stuff that we sort of run up against, I think some kind of load generation is probably the thing we're most likely to find in an environment already. I think in many cases that's pretty well understood, as far as we need to generate enough load to exercise these components.
Starting point is 00:24:38 Where that gets a little bit more sophisticated is what does that load look like? Do your users behave differently if they're logged in versus not logged in? Do you need to simulate that? Is it different if the users are coming in from mobile versus desktop? And do you need to simulate that?
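As a hedged illustration of the load-shape point: with an open-source load tool such as Locust you can weight logged-in versus anonymous traffic and vary the client type, rather than replaying one flat profile. The paths, weights, and headers below are assumptions for the sketch, not a recommendation for any particular system.

```python
from locust import HttpUser, task, between

MOBILE_UA = "Mozilla/5.0 (iPhone; CPU iPhone OS 15_0 like Mac OS X)"  # illustrative


class AnonymousBrowser(HttpUser):
    """Majority of traffic: not logged in, mostly browsing."""
    weight = 3
    wait_time = between(1, 5)

    @task(4)
    def view_catalog(self):
        self.client.get("/catalog")

    @task(1)
    def view_catalog_from_mobile(self):
        self.client.get("/catalog", headers={"User-Agent": MOBILE_UA})


class LoggedInShopper(HttpUser):
    """Smaller share of traffic: logs in once, then exercises account pages."""
    weight = 1
    wait_time = between(1, 3)

    def on_start(self):
        # Hypothetical login endpoint and demo credentials.
        self.client.post("/login", json={"user": "demo", "password": "demo"})

    @task
    def view_orders(self):
        self.client.get("/account/orders")
```

You would point this at a test host with something like `locust -f load_mix.py --host https://staging.example.com` (the filename and host are placeholders).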
Starting point is 00:24:54 And being more specific about that mix, versus we're just going to throw an anonymized version of yesterday's load at it, whatever the old-school stuff is. So what this tells me... it seems that, because we run into quite a lot of organizations that don't have enough load tests, for me this is almost like a maturity level, right?
Starting point is 00:25:17 It might be, yeah. So that means people only start thinking about chaos engineering if they actually have the basics of performance engineering covered, because otherwise chaos engineering is a too big step and a too far leap to take. So that's great to hear. But you brought up another great use case or scenario now,
Starting point is 00:25:35 because obviously if you have a chaotic situation, your end users will change their behavior, which means you also need to simulate that type of changed behavior, and for that again you need to make a hypothesis... like, you need to have a hypothesis. Hypothesis, hypotheses, whatever, you know what I mean. That works, that works. And then to also change the load patterns. Yeah, like thundering herd can be a real issue for some environments. Back when I was a sysadmin, we had globally distributed load balancing.
Starting point is 00:26:11 And if something happened in one pod and it completely fell down, all the traffic moved to the other pods. And then those pods would go down, and the original pod would recover, and the traffic would swing back. And you could just watch the loops go and go and go until you got the whole thing quiesced, and it was just absolutely bonkers. And hopefully nobody has to do that anymore, but yeah. Yeah. And do you find... sorry, with Andy's hypotheses, that's the plural version, I think, of hypothesis. Do you find organizations have the ability, based on watching those experiences in production, to put together good concepts of how users react? Or I would imagine most people have no idea what users are doing when the proverbial poop hits the fan, right? So, and see, Andy, I kept it clean, unlike you. So it just seems like it's a lot of guesswork unless, I mean,
Starting point is 00:27:12 do you find customers who actually do study that, and how rare is that? Yeah, they actually do. Like, that hypothesis is a guess, right? That's part of what you're going to discover, potentially, in some of these instances and some of the things that you're going to be doing. So, you know, especially if you get the opportunity to do some of your chaos testing in production, to be able to say, all right, this is the module that we're currently targeting, we're going to turn it off and see what happens, right? That kind of thing. We'll turn it off for five minutes and see what users do. Do they continuously rage click, right?
Starting point is 00:27:48 Like, is that what we're looking for? They're reloading frantically to find that thing. Or do they not even notice it's not there? Those are all the kinds of things that you can sort of pick out from what happens, if you have the ability to test these things in production. It's funny, you know, you mentioned they don't even notice it's not there, because that's a great feature to get rid of then. Yeah.
Starting point is 00:28:07 Well, yeah, that's one of the things you find, right? If you're spending all your time trying to increase the reliability of this widget that no one uses, you are wasting your resources. So knock it off and put them where the things are that people care about. Or you didn't have the right monitoring in place from the start, because otherwise you would already know that nobody's using it and you would never have a game day around it. Hey, quick question. Also, one of the things that I've kept hearing
Starting point is 00:28:33 is they said, well, this is all unrealistic. Nobody would pull cables. Nobody would do X, Y, Z, right? So my question to you now is, A, are there unrealistic scenarios that you would not do as a start? Because they might really be very, not unrealistic, but maybe uncommon and less probable. Are there certain things you would start with? Are there certain chaos scenarios where you say, these are the two, three that we always start with because they are the most likely to happen.
Starting point is 00:29:03 And therefore, you want to make sure that you're testing against those. I think it depends on the application and what you know about it already. And in my experience, with the things that I've run and the things that I've worked on, the most common things are going to be unreliable backends. You're going to have something that you depend on, and it either belongs to another vendor, or it's in another data center, or something weird happens, or that team doesn't update and breaks things. And just by the nature of distributed elements in the environment, it's not as reliable as you need it to be. There's other stuff, things that are easier to do: slow DNS, slow response times from other places, long-running database queries that are really easy to practice with
Starting point is 00:29:55 that are good places to start too. But it depends on the architecture of your environment. If you're using a lot of caching, maybe long-running database queries aren't the biggest deal for you. So it just kind of depends on what's already built into the environment that you're working on. But yeah. What about other scenarios, like disgruntled employee scenarios? Does that come into play here? I haven't seen that as something out of the package, but that's kind of interesting. So I remember way back, with one of the other load testing tools, I don't think it impacted production, but when a major buyout happened, somebody
Starting point is 00:30:33 went and deleted the whole knowledge base database, which isn't quite necessarily production, it's just data gone in one way. But I don't even know how common it is for disgruntled employees to actually take revenge on systems, because obviously that's going to open them up to a lot of legal issues. But yeah, I was curious if there's a disgruntled employee scenario in chaos testing.
Starting point is 00:30:56 Not that I've seen. That's super interesting, though. But like that's a question for your DR planning, I think more than your chaos testing. Well, but that actually would explain the pulling the cables. I mean, that's something where you say, hey, if you say pulling the cables is not realistic, well, if you have a disgruntled employee, then maybe they go crazy and they do something stupid
Starting point is 00:31:19 like that. Maybe. Hopefully, they don't have access to the data center, but who knows? At least it's a fun theme for a game day. Yeah. If nothing else. Corporate espionage. Yeah.
Starting point is 00:31:30 Cool. I really like the scenarios. Unreliable backend makes a lot of sense. Slow DNS, long-running database queries. I guess this is stuff that you can inflict through latency, right? So there's different tools out there that allow you to increase latency.
Starting point is 00:31:38 I guess the black hole example, I think you call it black hole, right?
Starting point is 00:31:53 Where you basically say, like, I cannot connect to that system. You can't connect to that, yeah. Yeah. Technically, how do you do it? Do you just configure routing or make changes? How does this typically work? Yeah, it depends. Like, if you are subscribed to an actual chaos tool,
Starting point is 00:32:08 so there's a handful available, some open source solutions as well. There'll be often an agent that works on the host that will get in the middle of the traffic and either pull the traffic off the wire and dev null it or whatever from that perspective so that it never gets up the stack into the application and blocks it that way to sort of mimic those behaviors or puts delays on it,
Starting point is 00:32:34 slows it down, that kind of scenario. So that you're not actually making live changes to the built environment. You are really just getting in the middle of those inter-machine conversations and inflicting the chaos there.
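For the curious, one common low-level way an agent (or a do-it-yourself game day) adds delay or drops traffic on a Linux host is `tc netem` plus `iptables`. A hedged sketch follows, assuming a Linux box, root privileges, and an interface named eth0; real chaos tools wrap this kind of thing in safety checks and automatic rollback.

```python
import subprocess
import time


def run(cmd: list) -> None:
    print("+", " ".join(cmd))
    subprocess.run(cmd, check=True)


def add_latency(interface: str, delay_ms: int) -> None:
    """Delay all egress on the interface: the slow-DNS / slow-backend style of fault."""
    run(["tc", "qdisc", "add", "dev", interface, "root", "netem", "delay", f"{delay_ms}ms"])


def remove_latency(interface: str) -> None:
    run(["tc", "qdisc", "del", "dev", interface, "root", "netem"])


def black_hole(dest_ip: str) -> None:
    """Drop all outbound packets to one dependency: the 'black hole' scenario."""
    run(["iptables", "-A", "OUTPUT", "-d", dest_ip, "-j", "DROP"])


def restore(dest_ip: str) -> None:
    run(["iptables", "-D", "OUTPUT", "-d", dest_ip, "-j", "DROP"])


if __name__ == "__main__":
    # A five-minute latency experiment on eth0; always roll back in a finally block.
    add_latency("eth0", 300)
    try:
        time.sleep(300)
    finally:
        remove_latency("eth0")
```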
Starting point is 00:33:05 already start to test their auto remediation, like really figuring out, hey, we actually have some auto remediation in place, meaning our monitoring detects a problem and is then triggering a script to, I don't know, fix a certain thing like redirecting traffic to a different system or something like this. Do you have an idea of how many organizations are that mature to actually have all the remediation scripts in place that they test with this? It feels like that stuff's really just getting started, or I will say maybe just getting restarted.
Starting point is 00:33:37 Because those are the kinds of things that kind of have to get rebuilt every time you change an infrastructure paradigm. It's like, you kind of understood what was going on when we all had virtual machines and powerful load balancers and what had to happen there. And then you have containers and what you can do there, redeploying a container, moving that load around. Now folks are starting to move more loads into something like Kubernetes, and there's another set of solutions that need to get built around those components, and what decisions folks want to make about expanding a pod or moving
Starting point is 00:34:12 a pod or whatever they're going to do with it. So I think we're kind of in a place where the folks that are leading the charge into Kubernetes are just getting that stuff built. And folks that have more well-understood... I don't want to use the term legacy or vintage, but more established practices, are already well on their way with a lot of that stuff. So it really varies, and it's one of those ironic things where the old stuff might have the best auto-remediation, because they know all they have to do is go in and poke that thing and it's fine, versus what kind of sophisticated solution do we want to leverage in our newer environments? So it widely varies for auto-remediation around different kinds of infrastructure, and it varies for different kinds of runtimes. Some of them are better at being restarted and doing a hot restart and those kinds of things too. Yeah, or systems taking... like, as you mentioned, Kubernetes, right? Kubernetes obviously has some built-in mechanisms already to recycle pods and do things like that. Yeah, I just came off a call, and there
Starting point is 00:35:16 the topic of auto-remediation was also a big thing, and the use cases that they came up with are the classical ones, like cleaning disks when the disk is full, or database slowness because reindexes are needed, and stuff like this. Oh, yeah. The new triggers and things like that. The classic things that we've kind of been working on, like, the things that were issues 15 and 20 years ago are still kind of under the hood there being issues.
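The disk-full case mentioned here is the classic auto-remediation starter: the monitor detects the condition and triggers a script instead of paging a human at 3 a.m. A hedged, stdlib-only Python sketch follows; the paths, threshold, and retention window are illustrative assumptions.

```python
import shutil
import time
from pathlib import Path

DISK_USAGE_THRESHOLD = 0.90       # act when the volume is 90% full (illustrative)
LOG_DIR = Path("/var/log/myapp")  # hypothetical application log directory
MAX_AGE_DAYS = 14                 # illustrative retention window


def disk_usage_fraction(path: Path) -> float:
    usage = shutil.disk_usage(path)
    return usage.used / usage.total


def clean_old_logs(log_dir: Path, max_age_days: int) -> int:
    """Delete rotated log files older than the retention window; return how many were removed."""
    cutoff = time.time() - max_age_days * 86400
    removed = 0
    for f in log_dir.glob("*.log.*"):
        if f.is_file() and f.stat().st_mtime < cutoff:
            f.unlink()
            removed += 1
    return removed


def remediate() -> None:
    """Entry point a monitoring webhook or cron job would call when the disk-full alert fires."""
    if disk_usage_fraction(LOG_DIR) < DISK_USAGE_THRESHOLD:
        print("disk usage below threshold, nothing to do")
        return
    removed = clean_old_logs(LOG_DIR, MAX_AGE_DAYS)
    print(f"removed {removed} old log files; usage now {disk_usage_fraction(LOG_DIR):.0%}")


if __name__ == "__main__":
    remediate()
```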
Starting point is 00:35:41 Yeah, exactly. Same thing with those performance patterns, right, Andy? Same old ones in new ways. It is what it is, yeah. Hey, now, Mandi, let me ask you, kind of getting to the end, but from your world, what you see from a PagerDuty perspective, where people are using your product and your services to really then handle incident response.
Starting point is 00:36:05 Like, you're routing the traffic to the right people, right, obviously. I guess your goal is obviously to reduce MTTR, mean time to respond, mean time to repair. I really like what you mentioned, what you said earlier, that you have some statistics, or there are certain metrics in your platform, that already give you some indication of how you can then kind of plan your game day. Is there anything else that we can learn from, especially that part of incident response, on how to improve system architectures
Starting point is 00:36:41 and systems in general? Like, do you see certain patterns in organizations where you say, why are they doing this? I don't know, anything else from the statistics from the KPIs? That's kind of an interesting question because, like, a lot of it does come down to using your game days to give your folks a place to practice the stressful part of the incident
Starting point is 00:37:08 and getting to reducing your MTTR in some environments is going to come down to making sure your engineers aren't freaking out, right, and not being able to think straight. So it's also helping them with being comfortable actually responding to the problem and recognizing that, you know, it's okay, we're going to work through this. We're going to figure out what's going on, make sure that we're looking at the right alerts and they're telling us the right things and we have the right access, and getting that sort of muscle memory around all of that stuff. And all of that helps improve reliability over time too. Because the worst thing you can end up having there is you've got maybe someone who's
Starting point is 00:37:56 new on the team. They don't know where they're looking. They don't know where to find things in either PagerDuty or in your wiki or wherever they are. And they're kind of just lost and staring like deer in headlights. Like, what do I do? And they kind of spin for a few minutes and then escalate. And like, you don't want to get into that kind of cycle. So the whole practice also gives you a way to make sure they know where things are and lower that barrier for them to be able to triage things, to know where the auto-remediation scripts even are.
Starting point is 00:38:29 Are they attached to the platform? Are they on a bastion host? Where are they? All that kind of practice comes into it, so that when something does happen, there's less of a barrier to success for that whole thing. Because there are a lot of really complex environments out there, and you can't just drop someone in the middle and expect they're going to figure it
Starting point is 00:38:50 out. You really have to lead them through a lot of it so that they understand the context. So it's practice, practice, practice is what I hear, just like the volunteer firefighters. Absolutely. Just like anything. Besides computer performance, I come from stage performance, either plays in high school or music. Oh, yeah.
Starting point is 00:39:16 And it's like, you practice and practice and practice. And then when you go on stage for the first time, you're nervous and all that. But after a few times, you get used to it. And I remember even a fun short story back when I was in middle school, I think it was eighth grade. We were doing a performance, and the other kid realized he left the prop in the prop room. So we're in front of the stage, in front of the curtain, just the two of us. He's like, I know just what... we were doing Oklahoma,
Starting point is 00:39:38 and I was... so many things to do in that play with an all-white school. But he left the prop in the thing, so he's like, I know just what you need, I'll be right back. And he runs off stage, and suddenly it's me in front of the curtain, in front of the whole, you know, all the parents. I'm like, crap. And I look at the director, and she's freaking out. So I'm like, all right, I gotta do something. So I just start pretending to talk to myself and complain. But because I was comfortable enough from all those practices, from the dress rehearsals and all, I was like, all right, I gotta act here. But to your point, you know, and not to make this about me, because that's what I just did there.
Starting point is 00:40:14 You know, the more you go through those game days, the more comfortable you'll be. You'll also be able to find... like, is there a focus on people who have a lot of pressure and how you can work with those people? Is part of game day identifying the people who panic, and figuring out some way to work with them to get them over that panic? Because I think some people are just prone to that. Right. Absolutely. And, like, sometimes... breathing exercises are a good one, you know, but also recognizing that unless you are actually in a medical environment, no one's going to die. Taking out some of the external stresses that folks sometimes have about decision making, or do I escalate this, or having a third party there who's going to help you make those decisions, like the incident commander, as well as kicking out people who aren't helpful, for lack of a
Starting point is 00:41:19 better word, right? Like it doesn't help your responders to have someone there screaming at them to work faster, right? Like that is counterproductive. So having a practice around focusing on the people who are solving the problem, making sure they have the space to work through the troubleshooting and debugging that they're doing and, you know, being very predictable about what that workflow is going to look like. And like you bring up working with like plays and stuff for me, it's marching band, right? Like marching band, you've got like 150 kids on the field and everybody's got a different instrument and there's flags and batons and like crazy stuff flying around.
Starting point is 00:41:57 And like you practice all summer, right? Like you learn the music, you think at the beginning, like here's these 12 pieces I have to learn. Oh my gosh, this is crazy. But by the end of August, like you have most of them memorized, you know, your steps, you're on your right foot and your left foot and your turn and all this stuff. And even for like junior high and high school kids, like it seems crazy when you start out, but like by the end of the football season, you're a pro. So it's the same, same kind of thing. You want to practice this to the point where like you're a pro,
Starting point is 00:42:27 you know what the music is, you know what your steps are, and part of them are going to be in the wiki, part of them in this other platform, and here's how I get to where I need to go. And giving everyone the game days give you the opportunity to say to the newest person on your team, hey, man, here's how we learn how to do this. And here's how we count our steps off and here's how we play our music and, and get all that stuff going. So.
Starting point is 00:42:52 Awesome. Shout out to all the band nerds out there. Yeah. My wife hates it when I say I wasn't in the marching band, I was in the drum line. Oh yeah. Yeah. We'll allow it. Hey, Mandi, in conclusio, or kind of like to close it off... I know, right? Hypothesis, conclusio, what else? I learned my Spanish this morning because I'm on Duolingo. Silencio.
Starting point is 00:43:27 Silencio, exactly. No, but for kind of closing words, if you want to give one piece of advice for people that are not yet sure is chaos engineering the right thing for them, should they do a game day, give them one last reason to say this is why you want to do it. And maybe just repeat something you've already said.
Starting point is 00:43:46 But this is kind of your moment to say, do it because of this. Yeah. I mean, you're going to do it because you're going to improve. You want to improve the experience of your product for your users and for your customers. You're after the most reliable system you can build, right? And part of that is trying to get ahead of the bad stuff, the weird stuff that might happen when you get into production.
Starting point is 00:44:12 And that's the Wild West. There's crazy stuff out there, man. Yeah. And with this, you not only make your product better and your service better, but also your work-life balance better, I guess. Absolutely. Because that's really cool.
Starting point is 00:44:26 Hey, Mandi, we saw each other at DevOpsDays Raleigh. Are there any other appearances coming up? Do you have any other conference talks where people can see you? I will be at GlueCon, which is May 17th, 18th, 19th, I think, right? Then we have PagerDuty Summit, which is three hybrid events running in San Francisco, Sydney and London. I will be in person in San Francisco and London, and those are in June. And yeah, I don't know after that yet. Like, August is pretty quiet, so I might actually be home. Well, you gotta practice, right? I know, right? Very cool. Brian, any final words from you? No, nothing. I mean, the
Starting point is 00:45:11 only other thing I'd add, the only other reason I'd add to do the chaos, to do the game days, is probably because they're a lot of fun, right? You talk about making your job better, like, that's going to be really fun and you're going to learn. I mean, everything that Mandi said, you're going to get out of it, but you're also going to have fun at work, right? Which is... most people, like, if you can find ways to make your job fun, go for it. And it's going to have a fantastic payoff for the organization. So if the higher-ups are looking, seeing people doing this crazy thing and having fun, they're like, what's going on? Like, oh no, no, no, look, we're doing something good, right?
Starting point is 00:45:47 So anyhow. Mandi, thank you so, so much. Andy, did you have any final words, or is that it? No, for me, that's it. I definitely have some additional great arguments for why chaos engineering, and especially, I really like the common scenarios and some of the metrics you told me about.
Starting point is 00:46:04 I think that's really nice. It helps me in my conversations when people ask me about it. What's the reason for it? What's the benefit? No, I'm really happy that you were here. And I'm pretty sure this was not the last time we saw each other. We should chaos engineer the recording. I'm kidding.
Starting point is 00:46:20 No, please don't. We get that all the time. That was when we were using those other platforms. So anyway, Mandi, thank you so very much for being on today. We really appreciate it, and we love these topics. These are the real fun ones, even just talking about it is fun. Yeah, definitely. That would be fun, to do chaos engineering. It's a fun topic to talk about. So thanks to all of our listeners as always. If you have any comments or questions, Pure underscore DT on Twitter, or Pure Performance at dynatrace.com.
Starting point is 00:46:52 Mandi, I believe you'll probably send us some links on how people can follow and/or interact with you on various platforms, if you have any. Yeah, I'm on Twitter as lnxchk. I'm also on Twitch on the PagerDuty channel a couple days a week. So yeah, we'll include those in the show notes. Yeah, and we look forward to... good luck with all your travels, and yeah, thank you. Hopefully we'll talk to you soon. All right, guys, take care. Thank you. Bye.
