PurePerformance - Why you should look into Chaos Engineering with Ana Medina

Episode Date: November 16, 2020

Daylight saving time can bring chaos to systems, such as rogue processes consuming CPU or memory, and thereby impact your critical systems. The question is: how do your systems react to this chaos? How can you test for it? And how can you make your systems more resilient against it?

In this episode we talk with Ana Margarita Medina, Chaos Engineer at Gremlin. In her previous job, Ana (@Ana_M_Medina) was a Site Reliability Engineer at Uber, where she helped cope with the "chaos" of New Year's Eve and Halloween. Ana gives us great insights into the discipline of Chaos Engineering: it's really about running controlled experiments, and anyone with an interest in contributing to more resilient systems can get started.

Here are the additional links we promised during the recording: Drift into Failure, the Chaos Engineering Community, and Chaos Engineering: System Resiliency in Practice.

https://www.linkedin.com/in/anammedina/
https://twitter.com/Ana_M_Medina
https://eng.uber.com/nye/
https://www.amazon.com/Drift-into-Failure-Sidney-Dekker/dp/1409422216
https://www.gremlin.com/community/
https://www.amazon.com/Chaos-Engineering-System-Resiliency-Practice/dp/1492043869

Transcript
Starting point is 00:00:00 It's time for Pure Performance! Get your stopwatches ready, it's time for Pure Performance with Andy Grabner and Brian Wilson. Hello everybody and welcome to another episode of Pure Performance. My name is Brian Wilson and as always I have with me my wonderful co-host Andy Grabner. Andy, how are you doing? I am fantastic, knowing that this is our third attempt to get this recording going, and now it's even better than the first and the second attempt. Might even be our fifth attempt. And I think, you know what, I just have to call out Ringer on this.
Starting point is 00:00:47 I don't know what the heck's going on, but first of all, it looks like you're not using any APM, at least in the front end. So Ringer, give us a call, dynatrace.com, we'll help you out. But also, Ringer, please, maybe you should listen to this episode, because this is the kind of stuff you need to be looking at, possibly, because I think we just found a great scenario for the subject of our topic today. Anyhow, since we've lost so much time, Andy, let's go right on into it. Right on. I think you're talking about a little chaos that we just experienced.
Starting point is 00:01:21 Yes. At least on our end, from the end user perspective of using that service. Chaos engineering is an awesome topic and we've been bringing this up a couple of times in the last couple of months. And I'm very glad that we have a very cool partner with us today, a talking partner, a guest. Hola, Ana Margarita.
Starting point is 00:01:44 Talking partner. Talking partner. Is that a guest? I don't know. Hola. I'm just laughing. Let me look up a synonym for guest. Oh, talking partner. Okay. So first of all, right, English is not my first language, it's late in the day, and that's what it is. So anyway, just to make sure that this is going to be continuing as it is right now, as fun as it is: Hola, Ana Margarita Medina. ¿Cómo estás?
Starting point is 00:02:10 Hola, ¿cómo estás? Muy bien. Muchas gracias por tenerme hoy. Perfecto. Creo que... Bienvenido. Bienvenido. Exactly.
Starting point is 00:02:19 That's why you probably have no clue what we're talking about. So let's switch back to... No, I do. I took Spanish in school. Come on. Oh, you did? But I'm not fluent. But you had like an elementary conversation right there.
Starting point is 00:02:30 I could follow that one. That's true. I took Spanish... But you didn't... Go on. I took Spanish on Duolingo. So the one thing I always laugh about with colleagues or friends who went through the U.S. school system of taking Spanish as a language, or any foreign language,
Starting point is 00:02:46 you take the lessons and they play a tape, and this was back in the tape days, and the person would be like, Hola, me llamo Miguel. ¿Cómo estás? And then you hear someone speak Spanish, and it's 30 times faster than that. Like, I can't follow. How am I supposed to learn this? Anyway, that's chaotic. That was a very chaotic situation as well. See what I did there?
Starting point is 00:03:11 So, Ana, welcome to the show. Thanks for being our talking partner, quote unquote, our guest. Ana, maybe you want to introduce yourself? Yes. Thank you once again for having me. I'm very excited to be here and very excited for the conversation we're about to have. My name is Ana Margarita Medina. I'm a senior chaos engineer at Gremlin. I've been with this company for about two years and seven months, focusing on helping our customers and creating a chaos engineering community. So I do a lot of public speaking. I lead our educational program.
Starting point is 00:03:46 I help out customers to get started and run some experiments for them to understand why reliability really needs to matter. Prior to doing Gremlin work, I was a site reliability engineer at Uber. That is actually how I got started in chaos engineering. They were a huge microservice shop, and they really had to make sure that everything in their bare-metal data centers was going to be up and running for their two largest holidays, New Year's Eve and Halloween. And then every day in between, they also had to make sure that they were up and running across their data centers. And prior to doing that, I did a lot of different software engineering. I got started early on in my career with just front-end technology.
Starting point is 00:04:36 So I built a lot of JavaScript, HTML, and CSS websites, then transitioned over to some back end, and somehow ended up doing iOS and Android applications before getting into systems and the SRE space. So very excited to share a little bit of all my knowledge across all the stacks of the space today. Yeah, that's awesome. You have a great history of stuff going on, but I have to say where you're at now is probably my favorite area. If I wasn't a sales engineer, I'd probably be wanting to go into the chaos engineering realm. But I don't think I'll ever leave being a sales engineer. Once you get in, you really can't get out. It's interesting because I kind of feel the same way. When I was leaving Uber, I had transitioned from site reliability engineer to software engineer.
Starting point is 00:05:24 And as I was looking for my next gig, it was like, what company actually attracts me? What does the mission statement of these companies make me want to go do? I didn't want to go to big tech or a lot of the other companies that were recruiting me. And then this company came about. They had gotten started a couple of years before I joined, and I was like, yes, this makes sense. Chaos engineering, I back it a hundred percent: build a platform that lets folks understand why this matters and makes it easy for them to get started. Very cool. Hey, I mean, before we go into the topic, I definitely learned something new already. I never thought that
Starting point is 00:06:03 Halloween is the second largest day, or one of the two largest traffic days, for Uber. But I guess. Yeah. I think it's such an interesting story to tell, because everyone is allowed to have a different peak traffic day. It doesn't have to be the same one for everyone. And that's kind of when it clicked for me; I had just gotten started in the space. Yeah, this makes sense. We provide a service for people that are out drinking or hanging out, or don't really want to take care of their car.
Starting point is 00:06:36 And they're going to be just on the road. It really, really makes sense. And then when you start seeing what the trends were when the company started, leading up to the Halloweens that I saw, it was really interesting, because Halloween, I think, saw some of the largest outages that happened prior to them getting started in the reliability engineering space. Because it was like, oh, our infrastructure is literally on fire on this day, and surge is going on, and all of this. And I think it's really interesting to realize that sometimes Halloween is going to be busier than New Year's, and some years New Year's was busier than Halloween, because it all depended on that weekend. But it's really neat because you get to realize: how do you operate systems of such scale that provide so many rides and services and all of this? It's really, really, really cool learning.
Starting point is 00:07:32 I love being able to look back on that experience and realize how much I grew as an engineer in just the two years that I was there. Hey, Ana. So for the people that may not have followed what we've been doing together in the past, the two of us, we've done some work around chaos engineering and Dynatrace, or Gremlin and Dynatrace. We did a so-called performance clinic, a webinar. So in case people are interested in seeing you, quote unquote, live, or recorded on YouTube, however you want to call it, check out the webinar we did. I thought it was really interesting, where you gave a great
Starting point is 00:08:09 overview of chaos engineering, also more on your history, and then how chaos experiments look, how this all works, and how people can get started. I really want to focus today on some of the things that may interest people in moving into this direction. Just as you said, this is an interesting field. Brian keeps repeating, this is a great field, I would like to be a chaos engineer if I wasn't a sales engineer. So can you tell us a little bit about chaos engineering, what it takes to become a chaos engineer, and what chaos engineering actually means? Like a quick 101 on what it is and how to get started. Yes, totally. So first, anyone can be a chaos engineer. You don't really need to
Starting point is 00:08:59 have to go and change your entire career just to start doing chaos engineering or start getting into this way of thinking about your systems and your software as you build it so first starting off with that one-on-one discussion it's it's very much about defining what chaos engineering is like we look at the term and we think chaotic and we're just like breaking things or like anything like that but the purpose of chaos engineering is that it gets created as a science where you then use the scientific method to be thoughtful and plan out experiments that allow for you to understand your system more. But the purpose of it is that you want to make them more resilient, more reliable.
Starting point is 00:09:43 So the end goal is that you just want to them more resilient, more reliable. So the end goal is that you just want to learn more, you want to do better. So that is like really, really neat. And then when we start looking at it, a lot of people kind of get turned off by the term chaos engineering just because of what it could mean. But the practice itself is about planning it out. Like, this is not about just randomly running a chaos engineer experiment without letting anyone know and bringing down production or anything like that. You actually want to tell people that you're running experiments. It's not about being that bad person in your team and just doing this. And with that, too, is that you also have to think about the experiment that you're running.
Starting point is 00:10:25 One of the largest things is that you can say, hey, what happens if my entire data center of the West Coast goes down, but I still have my East Coast data center? That is like a really large experiment if that is your first experiment. You want to start out maybe a little bit small. How about what happens if this one server in this cluster that I have that has 100 servers goes down? Let's start out really small. Let's not go and attack all of our fleet.
Starting point is 00:10:55 So what Chaos Engineering has is that it has some terminology. We have blast radius. Blast radius is what you're attacking in an experiment. So whether that is one host, 10% of your fleet, one Kubernetes cluster, one pod, one deployment. And then you also have magnitude. Magnitude is how intense the experiment that you're running is. So whether you're starting out by increasing CPU, memory, IO, don't inject a hundred percent of increase. Go ahead and start small. Start with just what happens if I spike CPU 10%. And then after you see the results of that, go ahead and make it 15, 20%. And the same thing goes for all the other attacks as like injecting latency, dropping packets and such. It's all about like science you start small you don't just want to
Starting point is 00:11:46 go full blast on it and the other thing is that you always have abort conditions abort conditions are those things that are going to cost you to halt that experiment and i think this is one of those nice topics that we really talked about when we did our our dynatrace gremlin webinar that abort conditions are those things that sometimes are really geared towards your KPIs or what really matters to the business. So that could be like your SLA just kind of not really working out, like just breaking it to maybe your latency spikes up, or you see a traffic rate drop down, or you see that your customer's error rate just goes up a lot. Maybe it's HTTP 400, 500 errors on your front end.
Starting point is 00:12:30 So all these little things, you actually have to define them before you start the experiment. And that's why when you're planning out the chaos engineering experiments, you actually end up learning so much that sometimes you don't even have to execute it to realize like, oh, no, this is actually not going to run. Let me go make my system more reliable. And then I can actually pick up this conversation again and get ready to execute it. And the other one is that conversation of people saying, thinking that chaos engineering, you can only learn things if you're running in production. That is a really bad myth. Like go ahead and do chaos engineering and testing, QA, staging, whatever other pre-plot environments you have around. You're going to learn so much about your systems then.
Starting point is 00:13:15 And sometimes what you learn, you can actually apply to your production systems. At Gremlin, we ran a chaos engineer experiment on our monitoring tooling and staging and we learned that we needed to make our dashboards better and like change some other stuff we changed our staging dashboards and all of a sudden that same change that we put in there gets applied to our production dashboard so all of a sudden we're making our production operations a lot more reliable and trustworthy but we never had to go and go implement these chaos engineering experiment in production to learn that. That's a great point because I agree with you, right?
Starting point is 00:13:55 I mean, a lot of people think, well, are you crazy? You're doing this in production and you're breaking things and then you're impacting our business. So I think the first thing is important to know that you have these abort conditions in case you really do this in production. But from a pre-production perspective, and I brought up this term prior to us starting the recording,
Starting point is 00:14:18 what you were basically explaining is kind of what I kind of, I kind of, you know, name it test-driven operations because basically you want to test drive what, or test-driven or test drive, test-driven operations. You want to figure out, are you, is the system behaving,
Starting point is 00:14:40 or how does the system behave right now in a pre-prod environment? And what would that mean to operations? You don't want to start in operations first. You want to first test it out in a pre-prod environment. And I think that you were doing this with your monitoring tools is actually great because you want to actually know, does your monitoring actually give you all the data that you need?
Starting point is 00:15:01 Do you have the right dashboards? Do the right people get notified and alerted in case there is a problem? And testing this out pre-prod obviously makes much more sense because you don't want to realize that, hey, we don't have the data that we need. We didn't get alerted on an actual problem in production. And then you're scrambling and running around like crazy, like chaos monkeys. Exactly. Yes. Yeah. I think there's a lot of great
Starting point is 00:15:25 points in that just that opening um and i gotta agree the idea testing your monitoring is is is really awesome but it also the idea of pushing it pre-prod to just any kind of testing pre-production right because if you think about it at least the way i think is that everything should be treated like code no matter what you're doing. You put it out and you test it and you make sure everything's working as expected before you put it in production. You find out things that might be completely flawed about your approach before you get to production. Just because we have fast, nimble systems in production doesn't mean the old cliche of production is my new QA. Of course there's going to be situations you can't test for real world conditions, right? You can test as well
Starting point is 00:16:09 as you can, but you know, sometimes we see, you know, people put something to production and I'm like, would you have put a new bit of code to production without checking it first? Or I'm going to drop a new framework in, let's drop it, you know, treat it that way. The one joke I needed to make though, because I didn't want to interrupt you, but it was phenomenally well said that chaos engineering is about planned and purposeful, putting bad things out and making bad things happen in a planned and purposeful way. Because I think if that were not the case, then we could probably call almost every developer a chaos engineer. Yes. You actually bring two big points, like treating everything like code. One of the things that I love saying on that chaos engineering space is that you're testing your people, you're testing your processes, you're testing your configuration.
Starting point is 00:17:05 You're going through the entire cycle of what development and operations is. So like kind of like what Andy said, like test driven operations, it's something that like it's, it's the term that should get be getting picked up more as an actual like industry practice. And for many reasons, like, I think because of the pandemic, you know, like reliability is starting to pop up a little bit more, the systems are being used more and we see the incidents are spiking up a lot more. A lot of new engineers are coming on call that had never met their team.
Starting point is 00:17:32 They've never learned how to be on call. So that is something that's really, really interesting. But the other point that y'all made was of like making sure that we do start off with monitoring as a good point of chaos engineering. One of the other resources that we plug when we have those conversations is actually looking at the site reliability engineering like hierarchy. It's that little triangle and the first one is monitoring. And then when you start going up that cone, you see incident response, you see postmortems, you see testing and release procedures, capacity planning, development products.
Starting point is 00:18:11 So one of the things that we say when we look at that hierarchy, we say, go ahead and start by validating that your monitoring is set up properly. Then go ahead and run some experiments to test out your incident response. Go and look at your postmortems, recreate some of those outages, make sure that if those conditions were to happen again, you're actually resilient to them. Then go see how your release procedures and testing is actually going on. Then you can bring in chaos engineering to your capacity planning. That was actually something that got done at Uber very much for the Halloweens and the New Years. And then really bringing it up to the product of like, how early on can we actually start doing chaos engineering? One of the neat things at Gremlin is that we also do progressive
Starting point is 00:18:56 delivery. So we use feature flags. So by using feature flags and chaos engineering together, we see that we have this really amazing value that gets really, really hard to do where you now have your new features are hidden between feature flags. So we can turn on that feature flag. And then those accounts that are between in the feature flag that are testing out a new product or something new that we're launching, we go ahead and we run chaos engineering experiments on that. And we're able to really nail down that scope and that blast radius is really small. But when we're launching new features that are consuming more resources, that are introducing more complexity into our software, that is where you really, really want
Starting point is 00:19:42 to make sure to run them because you're adding things and now you're about to unleash it to the rest of your servers and you don't know how that's going to behave when you're getting new traffic and new users coming in. So you want to make sure that you're so ready for launch and so reliable by the time you launch and that you've already thought about some of these things that can go wrong. Very cool.
Starting point is 00:20:03 Very well said. Yeah. We both paused. I was like, yeah, yeah. I think there was nothing it was just like yeah makes sense um so uh and i'm coming back to my my my question in the beginning um if somebody what i mean i know you said you were a size reliability engineer at uber and then now you're working as a chaos engineer at Gremlin. Have you, for people that are listening in and say, yeah, this is really cool, I want to get started, do you have any recommendations on some literature that you want to look at
Starting point is 00:20:36 or some talks from certain folks that you want to listen to? Obviously, your talks that you've been doing over the last couple of months and years. But is there anything, like, you know you know basically is there a great book like we always reference when we talk about continuous delivery or devops we always reference a couple of books whether it's from jess humble whether it's from jean kim is there anything equivalent on the chaos engineering sites so we have a pretty good tutorial like not tutorial sorry we have a really good blog that talks about the history and the principles of chaos engineering it's a 15 minute read it's over on grumlin.com slash community that's usually what i point folks to um there is also a book called drift into failure that it talks about how do you actually understand
Starting point is 00:21:26 your complex systems adrian cockcroft really recommends this one as a starting point like how do you really think about filling your systems a complexity for you to actually understand that the last thing you want to do as an engineer ever is to drift into failure um and that really touches base on chaos engineering very, very much where you're able to be like, I already saw this issue happen. I saw this incident happen. How do I make sure that I take all those learnings
Starting point is 00:21:53 so I don't drift to failure in 10 months, two days or anything like that? And in terms of like literature, I know that Nora Jones and Casey Rosenthal recently wrote a book around chaos engineering. I haven't gotten a chance to read it, but it talks about how they started with chaos engineering at Netflix and have started to see other companies kind of like pick up on it. And that's it's a neat thing because a lot of folks kind of say chaos engineering is brand new. This is a buzzword. There's a marketing term.
Starting point is 00:22:26 Like, why are we reinventing all these things? And it's kind of interesting because, yes, there is a little bit of that. There is a little bit of that new word is being created and it's being used in a way that people are like, wait, I've been doing this before. Like, what do you mean? mean. So we see that chaos engineering gets coined when Netflix has to move over to AWS and really think about how do they make sure that their systems are scaling globally and they coin chaos engineering by open sourcing also chaos monkey. So we look at that and we see that as like where the chaos engineering industry kind of really, really got started. But when we look back, I think six years before that over at Amazon, we have Jesse Robbins that is literally just unplugging data centers and running game days in order to have their engineers really think about what happens if something fails in our systems. And that was just kind of part of what the culture was there. You really want to be customer focused and you want to prepare for failure, but they never created a term for it. So sometimes you actually might be doing chaos engineering, but the managers are sitting in with their team and doing
Starting point is 00:23:45 tabletop exercises of like, let's pull up our architecture diagram and let's talk about what happens if this third-party dependency that we have actually has an incident, or maybe that latency is 200 milliseconds. Would that actually cost any timeouts? Are we actually testing for that? So sometimes it's about the way that you look at it. You know what, too? It's funny. We went through the same struggle with when you're talking about terms, right? About chaos, these concepts have been around.
Starting point is 00:24:16 And Andy, I don't know how often you run into it, but I know quite often people go, oh, we're really interested in observability. And every time I hear people talking about observability, initially I'd be like, come on, really? Because it's been around similar chaos like forever. If we take a look back at what the better APM tools did, right? We've been doing observability or at least the pillars of observability since almost the beginning. And suddenly it becomes, I think, very similar, an open-source DIY project for a lot of companies,
Starting point is 00:24:48 and it became a big thing. So instead of getting upset when people talk about observability now, I just really sit back and get really happy because for my whole life in performance testing and stuff, it's always been trying to get people to pay attention to it. And what I've come to realize is, no, this is a really good thing because it means people are really starting to take it seriously. But it's that similar thing where you look back and like, okay, hey, you know, I'm just happy it's taken off. I think you bring an interesting point. Like, you just want people to care about it.
Starting point is 00:25:17 Like, call it whatever you want, but please care about this. And I actually do think that with observability, kind of similar similar where it's like the dev didn't care. The dev was like, it runs on my machine. But that was it. And I think that kind of happens a lot also in this chaos engineering space where it was like, that was a mentality of the dev. I'm not going to be on call for my code. There's somebody else's problem. I don't have to care for it.
Starting point is 00:25:42 We've seen that we've shifted left. A lot of folks are in DevOps, site reliability engineering, and we see that people now have to care, like whether upper management makes them, whether their systems are so complex that they actually need to take notice, whether they're actually going to networking events and picking up all these new industry trends or just watching social media. they're starting to realize that, no, like I could actually still be a developer and be conscious of these things. Like chaos engineering doesn't have to live within ops or site reliability engineering. You can have a developer that actually says, no, I actually want to make sure that I don't
Starting point is 00:26:20 have any memory leaks in my code, or I want to make sure that I'm thinking about what happens if any of these HTTP calls that I'm building into my Java application actually get handled properly. Like, how do I make sure that I'm actually injecting the proper chaos into the software? And I think, too, it goes beyond just the wants and desires of people who are the developers and teams like this. I think businesses are finally starting to understand. In the past, whether it's problems due to chaos or problems because people weren't testing or doing their observability, there was always the idea, we'll just throw people at the problem and have everybody pull all-nighters until we get this problem fixed. And that's a heck of a lot easier than revamping our process. But I don't think it to come work at your organization, you have people in a position just like you were, where you said, where do I want to work? I have a lot of great skill and experience, and I can either go to this place where they're
Starting point is 00:27:32 doing the stuff the old-fashioned way, and I'm going to be pulling all-nighters and pulling all-weekenders, or I can go to a place that's looking forward. And so it's becoming also a tool to bring in talent so that they can stay competitive. Oh, totally. You nailed down things that really speak to me. It's like people are starting to realize that the cost of downtime is really expensive. You may have been someone that didn't calculate it or your business just wasn't sharing. But we also see that this pandemic really made a lot of people go virtual that didn't really want to be a hundred percent virtual on like online driven operations of their companies um so i think that is getting more expensive but
Starting point is 00:28:11 then when it comes to pager fatigue and burnout um there was a lot of that that happened at uber that like really positioned me for me to leave like i felt really burned out and a lot of it was that culture of ops and cyber liability engineering was working night, like late nights and weekends in order to kind of like play catch up. So I do think that the industry is starting to be a little bit more sensitive to that, like, oh, these are actually humans that I hired, not little robots, like, y'all need food and sleep and rest and vacation, hold on, hold on. Like you're having to actually tell engineers, no, like there's a work-life balance. And I mean, especially in this pandemic too,
Starting point is 00:28:50 like it's harder when you have kids and you have like schooling and all this. And I feel bad for any parent that's listening. Thank you for taking some time to take some learning on your free time. I know this pandemic is really hard for them, but I think it's that it's like people didn't realize how the harm that they were causing in their industries by pushing their
Starting point is 00:29:11 engineers to burnout constantly. And we see that a lot in operations. We, we try to make this culture a little bit better, I think in site reliability engineering, but those that haven't gone through that transformation, they still feel that pain. And, and with that is like talking about it, you know, like burnout happens to everyone. And like, you get, you should be having those conversations internally in your organizations, like, whether it's just one person or with your manager and being able to like, talk to them, like, no, this, this application is really paging me a lot. Like, it's a little too much. Like, no, this application is really paging me a lot. Like, it's a little too much. Like, what can we do to make better? And it actually segues to getting started with chaos engineering.
Starting point is 00:29:51 We do say that sometimes you actually do want to look at those applications and high severity incidents that are going on, like the things that you should run to and try to figure out how you can actually do chaos engineering on because you'll be able to actually really see the fruits of what chaos engineering can do so you you might just want to go look at what are the departments that are having the most amount of incidents like maybe there's a service in there that you can actually just focus for like a quarter and just trying to make more reliable, maybe even less time. And then once you make that one more reliable, you will see that maybe those engineers have less pages and then they can actually start supporting other teams
Starting point is 00:30:34 and such and like really scaling out such operations. Yeah, and I took a couple of notes and I have a couple of questions that go in kind of two directions now. And I think I want to start with the first one that says, are there any parallels of chaos engineering of things that have been done in other industries? Meaning if I think back on continuous delivery,
Starting point is 00:30:59 we always draw a lot of analogies to the automobile industry with their, with the way they optimized and automated the delivery of cars. We also draw a lot of analogies to lean management when we talk about DevOps and kind of streamlining our processes. From a chaos engineering perspective, are there any other industries that have done this maybe many, many years already,
Starting point is 00:31:24 but just like you know in different industries in a different way but still kind of enforcing chaos and if so is there anything we can learn from them so i know for sure other industries have i feel like i've not done my research in order to be like hey these are all the industries and this is what they learned like 20 years ago but like some things two things kind of come to mind um you mentioned the automobile automobile industry english is not my first language uh so it's some words are really hard like that one um and and with that it's like you you have to do so much testing and preparing to get that car into anyone's hands. And with that, like it comes out safety testing that like, how does it do with the mileage and gas and all this, but you can kind of think about it when
Starting point is 00:32:12 they do those dummy testings of the airbags and stuff where like, what is the worst thing that can happen to this car? We're going to go ahead and do it and see what those can do when you put all those conditions in place what do we learn and how do we make it better and i think that airbag example is like something really concrete for folks to kind of think of like we're gonna not have the the little dummy person inside the car like wear a seat belt um we're going to have the car going 100 miles per hour we're going to have more weight in the car like what are all these conditions that can happen like what let's just throw it at it and aviation also of course
Starting point is 00:32:51 had to go through this i think aviation is one that the reliability resilience engineering community continues like going back to as like there is a lot to learn from from how pilots kind of get trained to how is it that we learn from the black boxes that are left behind when incidents do happen in aviation. And I mean, the cost of an incident in in an airplane is insanely high because we're talking about lives and you can't really put a money sign to that whatsoever. And the other one that is a little bit touchy due to the current situation, it's that like vaccines where you're injecting something harmful into your system in order to build immunity. So those are kind of like other parallels that we do see that bringing in chaos into a system is only going to make it better because you now get to actually see how this system will behave and whether that's a person um a vaccine or anything like that you do get to to learn from it yeah um then a follow-up on this because i remember when we did the the
Starting point is 00:33:59 webinar we had a couple of questions in the end and i remember one question that was asked i was saying well you were just doing artificial you're making a lot of artificial assumptions about the chaos that you're enforcing right this is not so i think somebody said this is not realistic and why would you ever test something that is completely unrealistic and and i wanted to know, what is your response to that? Well, it's like unrealistic in whose mind? Like, because we're building on software and systems that are so brittle, anything can happen. And it can be like, yeah, there is no way that 50% of my data center is going to catch on fire. But we know Murphy's's law we know that everything can fail will fail so you have to kind of like always go back to that like anything that can go
Starting point is 00:34:52 wrong will go wrong so when you start shifting your mind into that you actually start learning that you can be like oh no wait like I can actually prepare myself for this type of failure and like data centers catching on fire to me was something interesting because it was something that you kind of just assume that you have enough coolers, you have enough people on site that could easily put out a fire. But learning from many companies that data centers caught on fire and like those engineers, like those are some really interesting stories or even like looking back at like no one in 2017 would have thought that us east one for s3 buckets on aws would go down like that is just something that's like oh no like they're aws they're 100 reliable like no but you kind of have to be like pessimistic and like anything that can go wrong will go wrong so how do i make sure that
Starting point is 00:35:46 when and if that happens i am so ready for it like bring it on and that could be like making sure you're testing all of your vendors all your dependencies to what happens if our main cloud provider goes down do we have a failover for it like when was the last time we executed it when is it that um like things can happen where maybe you also kind of want to know what happens if your lead sre goes on vacation for a week like just make him not respond like make them not respond to emails for for a day like what type of chaos can that kind of like provide? Yeah, I remember, Brian, when we had Adrian Hornsby on,
Starting point is 00:36:29 I think he also talked about social chaos engineering, like taking the laptops away from people or as you said, don't allow them to answer emails or something like that, yeah. I think he said he sends them on a week vacation when they come in.
Starting point is 00:36:45 But Andy, I got to say, to that argument that you proposed, I think that's a completely irrelevant argument nowadays. Because anytime someone brings that up, all you got to do is respond and say, hey, 2020. I mean, for real, it's so ridiculous. If you were to take 2020, everything that's gone on this year, put it into a book or try to put it into a movie treatment and go to anybody, they would say that's the most outlandish thing. No one would ever believe that. This is like the most over-the-top ridiculous, like even more crazy than a Jerry Bruckheimer movie, right? And it happened. So when you have an argument, you're like, oh, well, that won't happen.
Starting point is 00:37:20 Dude, come on. Just watch. Just give me a year. That's a very good point. Hey, the other direction, the question that I want to make sure you answer. So we have a lot of listeners, and I think, Brian, you can agree,
Starting point is 00:37:37 we have a lot of listeners that have a background in performance engineering. So I assume a lot of our listeners do some type of performance testing as part of the day job. And if they are now interested in, hey, this chaos engineering is actually pretty cool. Can we give them some ideas on while they're still mainly focusing on their regular job, performance engineering, running performance tests, performance analysis? Are there any, let's say, entry-level chaos experiments that people that they can say hey you know what next time i run a performance test i am trying out this little trick here to learn that this podcast like i don't know you
Starting point is 00:38:10 tell me so what is it what what can people start doing so the hello world of chaos engineering to me is in just injecting a cpu spike into your system because it's something that it's easy to think about. It's easy to do. Like you can use a tool to do it. You can just build a little script that like increases the CPU. And that could actually show some interesting things about your system.
Starting point is 00:38:37 Some other folks use the shutting down a server, shutting down an instance as the other hello world. How is it that maybe you have an application that's running across a cluster, just various different servers,
Starting point is 00:38:49 if you were to lose one of those nodes, one of those hosts, when that server goes down, like you have less resources. So does a new host come up in order to offload that? Or does this application now have less capacity to run on, but all of a sudden you still have the same heavy application on it? So sometimes that is one of them. But there's also that part that you kind of get to do any chaos engineering experiment
Starting point is 00:39:19 alongside a load test. With Uber, they actually did that, like leading up to that Halloween, they did just huge amounts of training with the team that was going to be on call for all that Halloween weekend. And they were running like failovers three times a week, like full, like from one data center to the other, completely moving all of our applications. And what they were doing is that they had,
Starting point is 00:39:46 we had built an internal tool for chaos engineering called UDestroy. And we had also built another internal tool for load testing, Hailstorm. So by using Hailstorm and UDestroy together, they were like really trying to think about how everything kind of happens. And that's like another place
Starting point is 00:40:04 that you kind of get to put those two together. But I would definitely say that if you're just trying to really get started, a CPU experiment could be sometimes like the fastest way to think about it. And if anyone wants something a little bit more advanced, you could always do that. Like what happens if I just lose access
Starting point is 00:40:23 to this dependency that this application uses? Is it that every single call that I have in my application that's trying to reach the server that is not responsive? How do all those logs kind of like ramp up into my system that can make that performance really, really slow down and really have a bad customer impact? Hey, I want to ask you a question about that then. So a friend of our show, and I don't want to say co-founder, but inspiration for our show, Mark Tomlinson, we've had conversations about testing things like microservices and testing scaling systems.
Starting point is 00:41:01 And it almost sounds like the testing of a scaling system would fall under the realm of chaos engineering. But let me find that, you know, when you're an expert in this area, so I wanted to run this by you. So let's say you're doing some performance tests. The idea being if you're running on a scaling system, let's say you have Kubernetes set up or something else where if our CPU hits 70%, scale up another instance. So the concept is, besides just testing performance, test your scaling. Make sure that's working and doing that.
Starting point is 00:41:30 Now, does that fall under just another hello world example? Would something as simple as that even fall under, hey, I'm starting to expand into chaos testing by pushing my system non-traditionally? In a performance test, you're usually doing a set load. But here I'm going to run my test. So it pushes the system into doing its scaling and seeing if it actually is doing what it's supposed to do. Oh my God. Yes. So much. Yes. Because that's a really simple one, I think, right? For any performance tester,
Starting point is 00:41:56 we love just throwing load and trying to break a system. Yeah. And that is the perfect moment to start on it because then that's what I mean. Like CPU to me is something that like anyone can kind of really easily on it because like then that's what i mean like cpu to me is something that like anyone can kind of really easily think about like whether it's your own server at home or your personal laptop it's it's something usable and i do a lot of work in the cloud native space specifically kubernetes and when last year i got a chance to speak at kubecon and we did we sat down and there's websites that have all these Kubernetes failure stories. We sat down and we read postmortems and we were just kind of like giving points and like, was this a DNS outage?
Starting point is 00:42:33 Was this a latency, like a scaling issue, a cloud provider issue? We ended up seeing that like 50% of those incidents were scaling issues. They did not set up auto scaling, whether it was horizontal or vertical or on the cloud. They just didn't even do that. And the thing too is that nobody really thinks that you need to do that because you kind of just buy into the promises of platforms
Starting point is 00:42:57 such as Kubernetes of like, everything's going to be up and running. This thing scales. That's just kind of like what the pitch is. And you don't really kind of like ask questions like what does it mean by scale like do i need to set this up what do i need to give it resource limits do i actually need to make sure that it scales properly and it's scaling at the proper pace because sometimes like i i do i've done like chaos engineering on all sorts of cloud providers and monitoring tools.
Starting point is 00:43:27 And that interesting one is like, all right, I set up my cluster. It's going to auto scale when CPU hits 60% so that when it actually has more traffic coming onto it, we have enough time for the server to ramp up. Well, how long does it take AWS to spin up that EC2 instance? How long does it take your primary Kubernetes node to pick up that new instance and bring it into the cluster? How long does it take your monitoring tool to detect this new instance and be like,
Starting point is 00:43:57 hey, you, you're reporting. Let me bring you in. Like all these things take so much time. And that's kind of when we really do talk about that complexity of our systems now. It just gets larger and larger, and you just have abstraction layers around it, and you need to go back and really think about what does it all take? What are all these granular steps that get me to my end goal of my system being reliable? So yes, go ahead and try to figure out where is that breaking point for your system to make sure
Starting point is 00:44:25 that things are scaling properly and make sure it releases once it's done yeah this is brian i'm not sure how you feel about this but every time i we're making it through these episodes it's just fascinating to learn so many new cool things it's really cool you know i like to think of it andy is i was thinking about it as we were starting the show to learn so many new cool things. It's really cool. You know, I like to think of it, Andy, I was thinking about it as we were starting the show, one of the reasons why I think I love the idea of chaos testing so much, or chaos engineering, is, you know, in the performance side of the world,
Starting point is 00:44:56 we're trying to simulate a load to break a system. Well, when you go into the chaos side, you're trying to simulate a breakage instead of the load, right, under the load. And so it's it's very very much related but we love at least i used to love breaking the system like i wouldn't purposely go out and do it but like when i would get a you know a decent simulation and things would break that's when i would get excited because now we have something to do and the idea chaos engineer is like we're going to design a breakage that's but i don't know that's really awesome it sounds sounds so fun. You know, by the way, I did some,
Starting point is 00:45:25 um, some research while we were talking on the different links that you, we mentioned over the course of this podcast, like the drift into failure, the books, uh, the Gremlin community page, the,
Starting point is 00:45:36 the chaos engineering book, uh, from Nora Jones. And then also I found the Uber blog on New York, uh, New Year's Eve and um and Halloween so that's really cool I will make sure to post them on the proceedings of the podcast Anna is there anything else I know we've been you know as with many topics that we're all passionate about we
Starting point is 00:45:59 could probably talk forever but um kind of getting to a closing of this episode, is there anything else you want to make sure our listeners know if they want to get into this field of chaos engineering, or if they want to have a conversation maybe with the leadership about the importance of chaos engineering? Is there anything else we want to add at the end of this episode? I think I just definitely want to add very much on that. It's really easy to get started. It's really not intimidating. And if you're still trying to figure out how to pitch it to your manager, to your upper leadership, it's that you're just preparing for
Starting point is 00:46:37 those moments that matter. You want to make sure that you're reliable for your Halloween, for your New Year's, like whatever that peak traffic event is, or just for day-to-day operations too, especially in this virtual world of the pandemic that we have for who knows how much longer, you want to make sure that things are kind of like running smooth for them. And it's like not that hard to get started. Like I do work for a vendor, so we do have a tool that does make it simple. But you can easily get started in chaos engineering just by running some tabletop exercises with your team. So you don't need to do this, like reengineering or like buying things
Starting point is 00:47:16 or bringing in some some tooling in, you can actually start having those conversations of what happens to my systems, if my server goes down, if I inject CPU here, if my memory spikes up here, maybe I forget to rotate my logs and my disk fills up, what happens? All these little hypothetical resource layer, network layer conversations you can have without implementing the experiment and you still learn so much. Of course, you want to go ahead and start that experiment in your pre-testing environments and get to the point that you have them automated in production at the end of the day. So it is a little bit of a really long road, but that's kind of like what reliability is about. It's like, it's not just a press a button and all of a sudden you
Starting point is 00:48:01 have 99.999 availability. you you had to go through many other nines before you bring up a good point too and i think that the idea of the conversation about it not just with the team but if from what i've heard learned from some of the people who've been on talking about chaos engineering in the past is part of the exercise whenever you're going to take do a chaos experiment is before you do anything, first, obviously, you outline what you're going to break, but then everybody has to hypothesize what they think is going to happen, what they expect to happen.
Starting point is 00:48:38 I'm just curious if that's the way you see it as well, but the whole discussion part is a huge part of it, is this idea of it's a scientific experiment. We're going to, we're going to do something, we're going to hypothesize on what's going to happen, what we expect to happen based on our designs, and then test it to make sure that it fulfills that. Yeah. And this is like, it's such a cool space to be in because of that. It's not, it's not that hard to actually just get started about talking about hypothesis. And then you then start realizing that everyone just has a different mental model of a system and everyone's hypothesis is going to be different.
Starting point is 00:49:11 Whether it's because you've been at that organization five years, because you're a new grad, because you've used this technology before, like you're bringing in different knowledge and you now get to share it among each other. And that's going to have an interesting conversation. You're sharing knowledge. You're building up your team.
Starting point is 00:49:27 But then you also up-level any new person that's in that organization, whether it's an intern, a new grad, someone returning to the workforce that doesn't know what cloud native is. You're really getting a chance to teach them about your organization, your systems, technology, and really kind of really get started back into it all. So I think it's really valuable. And I think at the end of the day, like one of the largest things that chaos engineering is about is about learning. And you're already having incidents and you're not learning from them sometimes. So how is it that you're spending all this money already? You already invested in this incident.
Starting point is 00:50:06 This downtime was expensive. Why don't you go ahead and learn from it by doing little portions of dissecting that post-mortem, looking at the conditions that happened, running some tabletop exercises around that, and eventually get to the point of recreating that incident with those conditions and see what happens to your system. Very well said. Awesome cool hey um i you know we should probably you know schedule another podcast with you in a couple of months and talk about you know what you know once
Starting point is 00:50:38 this pandemic is over hopefully we'll maybe at one maybe at one point after all this chaos is over we can actually get together and meet in person and then have another conversation that we record on chaos engineering. But normally I'll do a little recap on what I learned. We call it the, you know, I'm summarizing what I learned. There's so many things I learned today. It's hard to summarize. But I still, Brian, do you want to summon the Summoner Raider?
Starting point is 00:51:06 I have a couple of things to say. Do it now. Yes, summon the Summoner Raider. We usually put an Arnold Schwarzenegger quote in there. Because the whole thing is, you know, Andy's Austrian, Arnold's Austrian. The Austrian-English accent sounds very much like Arnold Schwarzenegger. So it just became the Summonermarator instead of the Terminator. For anybody who's wondering about the history and the etymology of the term, there you all go.
Starting point is 00:51:31 It'll be a quiz next week. So a couple of things that I want to repeat or summarize. I think it's very important to learn about our systems. Every system is different, So we want to understand how our systems behave, how they expect to behave, but also potentially what can go wrong.
Starting point is 00:51:50 And I think that's a great opportunity to figure out how do the systems actually behave if things don't go as planned. I also like what you said in the beginning, start small, go big, meaning start with a small experiment,
Starting point is 00:52:04 maybe just turning off a host or turning off a service instance or even just a little CPU spike. I also liked the terminology that you kind of gave us an overview of the blast radius. So what's kind of the impact of your experiments and the magnitude of the experiment.
Starting point is 00:52:23 Also very important when you run experiments that you always have the ability to abort an experiment so they need the abort conditions i think we also learned that everyone can think about chaos engineering in their own line of work right now like we mentioned the performance engineers that may want to just you know as you said run the test but then maybe add a little c CPU spike and see how the system behaves. I think it did a great job in giving a lot of reference material. So we'll definitely make sure
Starting point is 00:52:53 that we have all of these links available. And it seems there's a couple of very easy Hello World examples that everybody can try out. I know there's more advanced use cases that are out there that are especially brought in by tools like Gremlin that you can then use and run your experiments on. I also think that it's important that we think about chaos engineering holistically. It's not just a technology thing,
Starting point is 00:53:26 like enforcing chaos on technology, but also on the humans. I think that's also very important, what I learned. And yeah, after all, I'm just very excited about this topic. It's really cool. And as Brian said, right, if you wouldn't have the chance, there would definitely be a cool a cool future
Starting point is 00:53:46 job to be in yeah yeah yes definitely continue learning it's a good skill to have and i think any form of making sure that you're always thinking about what is the worst thing that can happen in life in the year at work in my system you should you should definitely try to to pick that up as a good exercise. Awesome. Well, Ana, thank you very much for being on. We really love this topic, as you can see. And you just have so much experience and great knowledge about this. So we really appreciate you sharing that with us and to our listeners.
Starting point is 00:54:23 Also, thank you to our listeners for finding what's coming up to about an hour to listen to this. I know it's probably a lot harder to find time now that no one's commuting, but we really appreciate you being our audience and hope you're all doing well. Andy, thanks again for helping make this possible. I really appreciate you as well. And I say that with a little sarcasm, but I really mean it. And yeah, if anybody has any other questions, comments,
Starting point is 00:54:44 you can reach out at pure underscore DT on Twitter, or you can send us an old-fashioned email at pure underscore performance, right? At dynatrace.com. I think it's pure performance, one word, yeah. Well, people can try it out. Let's enforce some chaos on our mail server. Try it and see what what happens see what happens
Starting point is 00:55:07 see if you get a bounce um anna do you have any um place you like people to follow you at all linkedin or twitter or anything yes i'm available on all social medias you can just google anna margarita medina and you'll find a lot of my stuff. You get to find me on Twitter easily as Anna underscore M underscore Medina. Twitter is probably the best way to get a hold of me. So just reach out there. And if you're interested in learning more about chaos engineering, shoot a message. I'm always happy to send out more resources, jump on a call and share a little bit more about my experience. Gremlin does have a freemium model. So if you're interested in getting a quick start within chaos engineering, go ahead and check Gremlin out and always around for any questions too. So whether it's about getting started in
Starting point is 00:55:53 site reliability engineering, transitioning from somewhere else in the software engineering space to operations and chaos engineering, feel free to shoot a message out. And I would like to thank Brian and Andy for having me today. Super fun conversation. I missed a little chaos that 2020 brought us and our little platform issues earlier. It's part of the fun of getting to be in the space. Excellent. Thank you. And be sure to check out, we'll have a bunch of those links
Starting point is 00:56:20 down in the show description on the Spreaker page. So make sure to check those out. Or as Ana said, reach out to her directly. Thank you again so much, Ana. Thanks for taking the time today. Muchas gracias. Adios.
